Posts

Infra-Bayesian physicalism: proofs part II 2021-11-30T22:27:04.744Z
Infra-Bayesian physicalism: proofs part I 2021-11-30T22:26:33.149Z
Infra-Bayesian physicalism: a formal theory of naturalized induction 2021-11-30T22:25:56.976Z
My Marriage Vows 2021-07-21T10:48:24.443Z
Needed: AI infohazard policy 2020-09-21T15:26:05.040Z
Deminatalist Total Utilitarianism 2020-04-16T15:53:13.953Z
The Reasonable Effectiveness of Mathematics or: AI vs sandwiches 2020-02-14T18:46:39.280Z
Offer of co-authorship 2020-01-10T17:44:00.977Z
Intelligence Rising 2019-11-27T17:08:40.958Z
Vanessa Kosoy's Shortform 2019-10-18T12:26:32.801Z
Biorisks and X-Risks 2019-10-07T23:29:14.898Z
Slate Star Codex Tel Aviv 2019 2019-09-05T18:29:53.039Z
Offer of collaboration and/or mentorship 2019-05-16T14:16:20.684Z
Reinforcement learning with imperceptible rewards 2019-04-07T10:27:34.127Z
Dimensional regret without resets 2018-11-16T19:22:32.551Z
Computational complexity of RL with traps 2018-08-29T09:17:08.655Z
Entropic Regret I: Deterministic MDPs 2018-08-16T13:08:15.570Z
Algo trading is a central example of AI risk 2018-07-28T20:31:55.422Z
The Learning-Theoretic AI Alignment Research Agenda 2018-07-04T09:53:31.000Z
Meta: IAFF vs LessWrong 2018-06-30T21:15:56.000Z
Computing an exact quantilal policy 2018-04-12T09:23:27.000Z
Quantilal control for finite MDPs 2018-04-12T09:21:10.000Z
Improved regret bound for DRL 2018-03-02T12:49:27.000Z
More precise regret bound for DRL 2018-02-14T11:58:31.000Z
Catastrophe Mitigation Using DRL (Appendices) 2018-02-14T11:57:47.000Z
Bugs? 2018-01-21T21:32:10.492Z
The Behavioral Economics of Welfare 2017-12-22T11:35:09.617Z
Improved formalism for corruption in DIRL 2017-11-30T16:52:42.000Z
Why DRL doesn't work for arbitrary environments 2017-11-30T12:22:37.000Z
Catastrophe Mitigation Using DRL 2017-11-22T05:54:42.000Z
Catastrophe Mitigation Using DRL 2017-11-17T15:38:18.000Z
Delegative Reinforcement Learning with a Merely Sane Advisor 2017-10-05T14:15:45.000Z
On the computational feasibility of forecasting using gamblers 2017-07-18T14:00:00.000Z
Delegative Inverse Reinforcement Learning 2017-07-12T12:18:22.000Z
Learning incomplete models using dominant markets 2017-04-28T09:57:16.000Z
Dominant stochastic markets 2017-03-17T12:16:55.000Z
A measure-theoretic generalization of logical induction 2017-01-18T13:56:20.000Z
Towards learning incomplete models using inner prediction markets 2017-01-08T13:37:53.000Z
Subagent perfect minimax 2017-01-06T13:47:12.000Z
Minimax forecasting 2016-12-14T08:22:13.000Z
Minimax and dynamic (in)consistency 2016-12-11T10:42:08.000Z
Attacking the grain of truth problem using Bayes-Savage agents 2016-10-20T14:41:56.000Z
IRL is hard 2016-09-13T14:55:26.000Z
Stabilizing logical counterfactuals by pseudorandomization 2016-05-25T12:05:07.000Z
Stability of optimal predictor schemes under a broader class of reductions 2016-04-30T14:17:35.000Z
Predictor schemes with logarithmic advice 2016-03-27T08:41:23.000Z
Reflection with optimal predictors 2016-03-22T17:20:37.000Z
Logical counterfactuals for random algorithms 2016-01-06T13:29:52.000Z
Quasi-optimal predictors 2015-12-25T14:17:05.000Z
Implementing CDT with optimal predictor systems 2015-12-20T12:58:44.000Z

Comments

Comment by Vanessa Kosoy (vanessa-kosoy) on Vanessa Kosoy's Shortform · 2022-01-21T10:30:29.015Z · LW · GW

Epistemic status: Leaning heavily into inside view, throwing humility to the winds.

Imagine TAI is magically not coming (CDT-style counterfactual[1]). Then, the most notable-in-hindsight feature of modern times might be the budding of mathematical metaphysics (Solomonoff induction, AIXI, Yudkowsky's "computationalist metaphilosophy"[2], UDT, infra-Bayesianism...). Perhaps this will lead to an "epistemic revolution" comparable only to the scientific revolution in magnitude. It will revolutionize our understanding of the scientific method (probably solving the interpretation of quantum mechanics[3], maybe quantum gravity, maybe boosting the soft sciences). It will solve a whole range of philosophical questions, some of which humanity has been struggling with for centuries (free will, metaethics, consciousness, anthropics...).

But, the philosophical implications of the previous epistemic revolution were not so comforting (atheism, materialism, the cosmic insignificance of human life)[4]. Similarly, the revelations of this revolution might be terrifying[5]. In this case, it remains to be seen which will seem justified in hindsight: the Litany of Gendlin, or the Lovecraftian notion that some knowledge is best left alone (and I say this as someone fully committed to keep digging into this mine of Khazad-dum).

Of course, in the real world, TAI is coming.


  1. The EDT-style counterfactual "TAI is not coming" would imply that a lot of my thinking on related topics is wrong which would yield different conclusions. The IB-style counterfactual (conjunction of infradistributions) would probably be some combination of the above with "Nirvana" (contradiction) and "what if I tried my hardest to prevent TAI from coming", which is also not my intent here. ↩︎

  2. I mean the idea that philosophical questions can be attacked by reframing them as computer science questions ("how an algorithm feels from inside" et cetera). The name "computationalist metaphilosophy" is my own, not Yudkowsky's. ↩︎

  3. No, I don't think MWI is the right answer. ↩︎

  4. I'm not implying that learning these implications was harmful. Religion is comforting for some but terrifying and/or oppressive for others. ↩︎

  5. I have concrete reasons to suspect this, that I will not go into (suspect = assign low but non-negligible probability). ↩︎

Comment by Vanessa Kosoy (vanessa-kosoy) on Most Prisoner's Dilemmas are Stag Hunts; Most Stag Hunts are Schelling Problems · 2022-01-16T11:18:59.075Z · LW · GW

Cooperation can be a Nash equilibrium in the IPD if you have a finite but unknown number of iterations (e.g. geometrically distributed). Also, if the number of iterations is known but very large, cooperating becomes an $\epsilon$-Nash equilibrium for small $\epsilon$ (if we normalize utility by its maximal value), so agents which are not superrational but a little noisy can still converge there (and, agents are sometimes noisy by design in order to facilitate exploration).
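To make the second claim concrete, here is a rough toy calculation (the payoff values are the standard textbook ones, chosen only for illustration; nothing here is specific to the original comment):

```python
# Toy sketch: how far mutual cooperation is from a Nash equilibrium in a
# finitely iterated Prisoner's Dilemma when both players follow grim trigger.
# Payoffs are the usual textbook values (an assumption, not from the comment).

R, T, P, S = 3, 5, 1, 0  # reward, temptation, punishment, sucker payoffs

def epsilon_for_horizon(n_rounds: int) -> float:
    """Best-case gain from unilaterally deviating against grim trigger,
    normalized by the maximal achievable utility (T per round)."""
    cooperate_total = R * n_rounds
    # Best unilateral deviation: cooperate until the final round, then defect.
    best_deviation_total = R * (n_rounds - 1) + T
    gain = best_deviation_total - cooperate_total
    return gain / (T * n_rounds)  # normalize utility by its maximal value

for n in [10, 100, 10_000]:
    print(n, epsilon_for_horizon(n))  # epsilon shrinks as the horizon grows
```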

Comment by Vanessa Kosoy (vanessa-kosoy) on The Reasonable Effectiveness of Mathematics or: AI vs sandwiches · 2022-01-14T20:06:36.803Z · LW · GW

In this post I speculated on the reasons for why mathematics is so useful so often, and I still stand behind it. The context, though, is the ongoing debate in the AI alignment community between the proponents of heuristic approaches and empirical research[1] ("prosaic alignment") and the proponents of building foundational theory and mathematical analysis (as exemplified in MIRI's "agent foundations" and my own "learning-theoretic" research agendas).

Previous volleys in this debate include Ngo's "realism about rationality" (on the anti-theory side), the pro-theory replies (including my own) and Yudkowsky's "the rocket alignment problem" (on the pro-theory side).

Unfortunately, it doesn't seem like any of the key participants budged much on their position, AFAICT. If progress on this is possible, then it probably requires both sides working harder to make their cruxes explicit.


  1. To be clear, I'm in favor of empirical research, I just think that we need theory to guide it and interpret the results. ↩︎

Comment by Vanessa Kosoy (vanessa-kosoy) on Clarifying inner alignment terminology · 2022-01-14T19:08:10.832Z · LW · GW

This post aims to clarify the definitions of a number of concepts in AI alignment introduced by the author and collaborators. The concepts are interesting, and some researchers evidently find them useful. Personally, I find the definitions confusing, but I did benefit a little from thinking about this confusion. In my opinion, the post could greatly benefit from introducing mathematical notation[1] and making the concepts precise at least in some very simplistic toy model.

In the following, I'll try going over some of the definitions and explicating my understanding/confusion regarding each. The definitions I omitted either explicitly refer to these or have analogous structure.

(Impact) Alignment: An agent is impact aligned (with humans) if it doesn't take actions that we would judge to be bad/problematic/dangerous/catastrophic.

This one is more or less clear. Even though it's not a formal definition, it doesn't have to be: after all, this is precisely the problem we are trying to solve.

Intent Alignment: An agent is intent aligned if the optimal policy for its behavioral objective is impact aligned with humans.

The "behavioral objective" is defined in a linked page as:

The behavioral objective is what an optimizer appears to be optimizing for. Formally, the behavioral objective is the objective recovered from perfect inverse reinforcement learning.

This is already thorny territory, since it's far from clear what is "perfect inverse reinforcement learning". Intuitively, an "intent aligned" agent is supposed to be one whose behavior demonstrates an aligned objective, but it can still make mistakes with catastrophic consequences. The example I imagine is: an AI researcher who is unwittingly building transformative unaligned AI.

Capability Robustness: An agent is capability robust if it performs well on its behavioral objective even in deployment/off-distribution.

This is confusing because it's unclear what counts as "well" and what are the underlying assumptions. The no-free-lunch theorems imply that an agent cannot perform too well off-distribution, unless you're still constraining the distribution somehow. I'm guessing that either this agent is doing online learning or it's detecting off-distribution and failing gracefully in some sense, or maybe some combination of both.

Notably, the post asserts the implication intent alignment + capability robustness => impact alignment. Now, let's go back to the example of the misguided AI researcher. In what sense are they not "capability robust"? I don't know.

Inner Alignment: A mesa-optimizer is inner aligned if the optimal policy for its mesa-objective is impact aligned with the base objective it was trained under.

The "mesa-objective" is defined in the linked page as:

A mesa-objective is the objective of a mesa-optimizer.

So it seems like we could replace "mesa-objective" with just "objective". This is confusing, because in other places the author felt the need to use "behavioral objective", but here he is referring to some other notion of objective, and it's not clear what the difference is.


  1. I guess that different people have different difficulties. I often hear that my own articles are difficult to understand because of the dense mathematics. But for me, it is the absence of mathematics which is difficult! ↩︎

Comment by Vanessa Kosoy (vanessa-kosoy) on The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables · 2022-01-14T16:19:13.911Z · LW · GW

This post states a subproblem of AI alignment which the author calls "the pointers problem". The user is regarded as an expected utility maximizer, operating according to causal decision theory. Importantly, the utility function depends on latent (unobserved) variables in the causal network. The AI operates according to a different, superior, model of the world. The problem is then, how do we translate the utility function from the user's model to the AI's model? This is very similar to the "ontological crisis" problem described by De Blanc, only De Blanc uses POMDPs instead of causal networks, and frames it in terms of a single agent changing their ontology, rather than translation from user to AI.

The question the author asks here is important, but not that novel (the author himself cites Demski as prior work). Perhaps the use of causal networks is a better angle, but this post doesn't do much to show it. Even so, having another exposition of an important topic, with different points of emphasis, will probably benefit many readers.

The primary aspect missing from the discussion in the post, in my opinion, is the nature of the user as a learning agent. The user doesn't have a fixed world-model: or, if they do, then this model is best seen as a prior. This observation hints at the resolution of the apparent paradox wherein the utility function is defined in terms of a wrong model. But it still requires us to explain how the utility is defined s.t. it is applicable to every hypothesis in the prior.

(What follows is no longer a "review" per se, inasmuch as a summary of my own thoughts on the topic.)

Here is a formal model of how a utility function for learning agents can work, when it depends on latent variables.

Fix a set $A$ of actions and a set $O$ of observations. We start with an ontological model which is a crisp infra-POMDP. That is, there is a set of states $S$, an initial state $s_0 \in S$, a transition infra-kernel $T : S \times A \to \square(S \times O)$ and a reward function $R : S \to \mathbb{R}$. Here, $\square X$ stands for the closed convex sets of probability distributions on $X$. In other words, this is a POMDP with an underspecified transition kernel.

We then build a prior which consists of refinements of the ontological model. That is, each hypothesis in the prior is an infra-POMDP with state space $S'$, initial state $s'_0$, transition infra-kernel $T'$ and an interpretation mapping $\iota : S' \to S$ which is a morphism of infra-POMDPs (i.e. $\iota(s'_0) = s_0$ and the obvious diagram of transition infra-kernels commutes). The reward function on $S'$ is just the composition $R \circ \iota$. Notice that while the ontological model must be an infra-POMDP to get a non-degenerate learning agent (moreover, it can be desirable to make it non-dogmatic about observables in some formal sense), the hypotheses in the prior can also be ordinary (Bayesian) POMDPs.
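For concreteness, here is a minimal illustrative data-structure sketch in Python (the class and field names are my own, and the credal sets and the commuting-diagram condition are only stubbed out, so this is a picture of the shape of the objects rather than a usable implementation):

```python
# Illustrative sketch only: the names below (OntologicalModel, Hypothesis, ...)
# are hypothetical. Credal sets (closed convex sets of distributions) are
# represented crudely as finite lists of distributions.

from dataclasses import dataclass
from typing import Callable, Dict, Hashable, List, Tuple

State = Hashable
Action = Hashable
Observation = Hashable
Distribution = Dict[Tuple[State, Observation], float]
CredalSet = List[Distribution]  # stand-in for a closed convex set

@dataclass
class OntologicalModel:
    states: List[State]
    initial_state: State
    # Underspecified transition kernel: each (state, action) maps to a *set*
    # of joint distributions over (next state, observation).
    transition: Callable[[State, Action], CredalSet]
    reward: Callable[[State], float]

@dataclass
class Hypothesis:
    """A refinement of the ontological model."""
    states: List[State]
    initial_state: State
    transition: Callable[[State, Action], CredalSet]
    # Interpretation mapping into the ontological state space; it should send
    # the initial state to the ontological initial state and commute with the
    # transition kernels (neither condition is checked in this sketch).
    interpret: Callable[[State], State]

    def reward(self, model: OntologicalModel, s: State) -> float:
        # Reward on the hypothesis = ontological reward composed with the
        # interpretation mapping.
        return model.reward(self.interpret(s))
```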

Given such a prior plus a time discount function, we can consider the corresponding infra-Bayesian agent (or even just a Bayesian agent, if we choose all hypotheses to be Bayesian). Such an agent optimizes rewards which depend on latent variables, even though it does not know the correct world-model in advance. It does fit the world to the immutable ontological model (which is necessary to make sense of the latent variables to which the reward function refers), but the ontological model has enough freedom to accommodate many possible worlds.

The next question is then how we would transfer such a utility function from the user to the AI. Here, as noted by Demski, we want the AI to use not just the user's utility function but also the user's prior, because we want running such an AI to be rational from the subjective perspective of the user. This creates a puzzle: if the AI is using the same prior, and the user behaves nearly-optimally for their own prior (since otherwise how would we even infer the utility function and prior), how can the AI outperform the user?

The answer, I think, is via the AI having different action/observation channels from the user. At first glance this might seem unsatisfactory: we expect the AI to be "smarter", not just to have better peripherals. However, using Turing RL we can represent the former as a special case of the latter. Specifically, part of the additional peripherals is access to a programmable computer, which effectively gives the AI a richer hypothesis space than the user.

The formalism I outlined here leaves many questions, for example what kind of learning guarantees to expect in the face of possible ambiguities between observationally indistinguishable hypotheses[1]. Nevertheless, I think it creates a convenient framework for studying the question raised in the post. A different potential approach is using infra-Bayesian physicalism, which also describes agents with utility functions that depend on latent variables. However, it is unclear whether it's reasonable to apply the latter to humans.


  1. See also my article "RL with imperceptible rewards" ↩︎

Comment by Vanessa Kosoy (vanessa-kosoy) on Inaccessible information · 2022-01-13T14:04:29.673Z · LW · GW

This post defines and discusses an informal notion of "inaccessible information" in AI.

AIs are expected to acquire all sorts of knowledge about the world in the course of their training, including knowledge only tangentially related to their training objective. The author proposes to classify this knowledge into "accessible" and "inaccessible" information. In my own words, information inside an AI is "accessible" when there is a straightforward way to set up a training protocol that will incentivize the AI to reliably and accurately communicate this information to the user. Otherwise, it is "inaccessible". This distinction is meaningful because, by default, the inner representation of all information is opaque (e.g. weights in an ANN) and notoriously hard to make sense of by human operators.

The primary importance of this concept is in the analysis of competitiveness between aligned and unaligned AIs. This is because it might be that aligned plans are inaccessible (since it's hard to reliably specify whether a plan is aligned) whereas certain unaligned plans are accessible (e.g. because it's comparatively easy to specify whether a plan produces many paperclips). The author doesn't mention this, but I think that there is also another reason, namely that unaligned subagents effectively have access to information that is inaccessible to us.

More concretely, approaches such as IDA and debate rely on leveraging certain accessible information: for debate it is "what would convince a human judge", and for IDA-of-imitation it is "what would a human come up with if they think about this problem for such and such time". But this accessible information is only a proxy for what we care about ("how to achieve our goals"). Even assuming this proxy doesn't produce goodharting, we are still left with a performance penalty for this indirection. That is, a paperclip maximizer reasons directly about "how to maximize paperclips", leveraging all the information it has, whereas an IDA-of-imitation only reasons about "how to achieve human goals" via the information it has about "what would a human come up with".

The author seems to believe that finding a method to "unlock" this inaccessible information will solve the competitiveness problem. On the other hand I am more pessimistic. I consider it likely that there is an inherent tradeoff between safety and performance, and therefore any such method would either expose another attack vector or introduce another performance penalty.

The author himself says that "MIRI’s approach to this problem could be described as despair + hope you can find some other way to produce powerful AI". I think that my approach is despair(ish) + a different hope. Namely, we need to ensure a sufficient period during which (i) aligned superhuman AIs are deployed and (ii) no unaligned transformative AIs are deployed, and leverage it to set up a defense system. That said, I think the concept of "inaccessible information" is interesting and thinking about it might well produce important progress in alignment.

Comment by Vanessa Kosoy (vanessa-kosoy) on The Solomonoff Prior is Malign · 2022-01-13T12:06:07.113Z · LW · GW

Maybe I have a hard time relating to that specific story because it's hard for me to imagine believing any metacosmological or anthropic argument with >95% confidence.

I think it's just a symptom of not actually knowing metacosmology. Imagine that metacosmology could explain detailed properties of our laws of physics (such as the precise values of certain constants) via the simulation hypothesis for which no other explanation exists.

my assumption is that the programmers won't have such fine-grained control over the AGI's cognition / hypothesis space

I don't know what it means "not to have control over the hypothesis space". The programmers write specific code. This code works well for some hypotheses and not for others. Ergo, you control the hypothesis space.

This gets back to things like whether we can get good hypotheses without a learning agent that's searching for good hypotheses, and whether we can get good updates without a learning agent that's searching for good metacognitive update heuristics, etc., where I'm thinking "no" and you "yes"

I'm not really thinking "yes"? My TRL framework (of which physicalism is a special case) is specifically supposed to model metacognition / self-improvement.

At the same time, I'm maybe more optimistic than you about "Just don't do weird reconceptualizations of your whole ontology based on anthropic reasoning" being a viable plan, implemented through the motivation system.

I can imagine using something like antitraining here, but it's not trivial.

You yourself presumably haven't spent much time pondering metacosmology. If you did spend that time, would you actually come to believe the acausal attackers' story?

First, the problem with acausal attack is that it is point-of-view-dependent. If you're the Holy One, the simulation hypothesis seems convincing, if you're a milkmaid then it seems less convincing (why would the attackers target a milkmaid?) and if it is convincing then it might point to a different class of simulation hypotheses. So, if the user and the AI can both be attacked, it doesn't imply they would converge to the same beliefs. On the other hand, in physicalism I suspect there is some agreement theorem that guarantees converging to the same beliefs (although I haven't proved that).

Second... This is something that still hasn't crystallized in my mind, so I might be confused, but. I think that cartesian agents actually can learn to be physicalists. The way it works is: you get a cartesian hypothesis which is in itself a physicalist agent whose utility function is something like, maximizing its own likelihood-as-a-cartesian-hypothesis. Notably, this carries a performance penalty (like Paul noticed), since this subagent has to be computationally simpler than you.

Maybe, this is how humans do physicalist reasoning (such as, reasoning about the actual laws of physics). Because of the inefficiency, we probably keep this domain specific and use more "direct" models for domains that don't require physicalism. And, the cost of this construction might also explain why it took us so long as a civilization to start doing science properly. Perhaps, we struggled against physicalist epistemology as we tried to keep the Earth in the center of the universe and rebelled against the theory of evolution and materialist theories of the mind.

Now, if AI learns physicalism like this, does it help against acausal attacks? On the one hand, yes. On the other hand, it might be out of the frying pan and into the fire. Instead of (more precisely, in addition to) a malign simulation hypothesis, you get a different hypothesis which is also an unaligned agent. While two physicalists with identical utility functions should agree (I think), two "internal physicalists" inside different cartesian agents have different utility functions and AFAIK can produce egregious misalignment (although I haven't worked out detailed examples).

Comment by Vanessa Kosoy (vanessa-kosoy) on Animal welfare EA and personal dietary options · 2022-01-12T17:12:08.667Z · LW · GW

First, this assumes total utilitarianism. While I don't fully endorse any kind of utilitarianism, average utilitarianism is more appropriate for this purpose IMO (i.e. it reflects our intrinsic preferences better). I want the world at large to be nicer, not to contain as many minds as possible. I doubt anyone cares that much whether there is one zillion or two zillion minds out there, these numbers don't mean much to a person. (And, no, I don't think it's a "bias".) And, it seems quite plausible that factory farmed lives are below average. Moreover, the close association of factory farming with human civilization makes the situation worse (because the average is actually weighted by some kind of "distance"). To put it simply, factory farming is an ugly, incredibly cruel thing and I don't want it to exist, much less to exist anywhere in my "vicinity".

Second, I don't understand the statement "EA is generally about optimizing your positive impact on the world, not about purifying your personal actions of any possible negative impact." I'm guessing that you're using a model where a person has some limited number of "spoons" for altruistic deeds, so spending spoons on veganism takes them away from other things. This does seem like a popular model in EA, but I also think it's entirely fake. The reality is, we do a limited number of altruistic deeds because we are just not that altruistic.

If judged by intrinsic preferences alone, then plausibly the tradeoff between selfish and altruistic preferences is s.t. going vegan is not worth it individually but worth it as a society. The reason people go vegan anyway is probably signaling (i.e. reputational gain). And, signaling is a good thing! Signaling is the only tool we have to overcome tragedies of the commons, like this one. The role of EA should be, IMO, precisely creating norms that incentivize behavior which makes the world better. Hence, I want EA to award reputation points for veganism.

Comment by Vanessa Kosoy (vanessa-kosoy) on The Solomonoff Prior is Malign · 2022-01-11T17:26:04.772Z · LW · GW

It seems like any approach that evaluates policies based on their consequences is fine, isn't it? That is, malign hypotheses dominate the posterior for my experiences, but not for things I consider morally valuable.

Why? Maybe you're thinking of UDT? In which case, it's sort of true but IBP is precisely a formalization of UDT + extra nuance regarding the input of the utility function.

I may just not be understanding the proposal for how the IBP agent differs from the non-IBP agent.

Well, IBP is explained here. I'm not sure what kind of non-IBP agent you're imagining.

Comment by Vanessa Kosoy (vanessa-kosoy) on The Solomonoff Prior is Malign · 2022-01-11T16:07:04.868Z · LW · GW

In humans, the concrete and vivid tends to win out over the abstruse hypotheticals—I'm pretty confident that there's no metacosmological argument that will motivate me to stab my family members.

Suppose your study of metacosmology makes you highly confident of the following: You are in a simulation. If you don't stab your family members, you and your family members will be sent by the simulators into hell. If you do stab your family members, they will come back to life and all of you will be sent to heaven. Yes, it's still counterintuitive to stab them for their own good, but so is e.g. cutting people up with scalpels or injecting them with substances derived from pathogens, and we do that to people for their own good. People also do counterintuitive things literally because they believe gods would send them to hell or heaven.

In AGI design, I think we would want a stronger guarantee than that. And I think it would maybe look like a system that detects these kinds of conflicted motivations, and just not act on them.

This is pretty similar to the idea of confidence thresholds. The problem is, if every tiny conflict causes the AI to pause then it will always pause. Whereas if you leave some margin, the malign hypotheses will win, because, from a cartesian perspective, they are astronomically much more likely (they explain so many bits that the true hypothesis leaves unexplained).
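To illustrate the "astronomically more likely" point, a toy back-of-the-envelope calculation (the bit counts are made up; only the shape of the curve matters):

```python
# Toy illustration (numbers are hypothetical): under a simplicity prior, a
# hypothesis that is k bits shorter gets roughly a 2^k prior advantage.
# If a malign hypothesis saves k bits by not paying for bridge rules,
# a fixed confidence margin is quickly overwhelmed.

def malign_posterior(bits_saved: int) -> float:
    prior_odds = 2.0 ** bits_saved      # malign : true
    return prior_odds / (1.0 + prior_odds)

for k in [5, 30, 100]:
    print(k, malign_posterior(k))
# Around 30 bits the malign hypothesis already exceeds 99.9999999% posterior,
# so any margin loose enough to avoid constant pauses is far too loose.
```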

Comment by Vanessa Kosoy (vanessa-kosoy) on The Solomonoff Prior is Malign · 2022-01-11T13:32:13.984Z · LW · GW

This post is a review of Paul Christiano's argument that the Solomonoff prior is malign, along with a discussion of several counterarguments and countercounterarguments. As such, I think it is a valuable resource for researchers who want to learn about the problem. I will not attempt to distill the contents: the post is already a distillation, and does a fairly good job of it.

Instead, I will focus on what I believe is the post's main weakness/oversight. Specifically, the author seems to think the Solomonoff prior is, in some way, a distorted model of reasoning, and that the attack vector in question can be attributed to this, at least partially. This is evident in phrases such as "unintuitive notion of simplicity" and "the Solomonoff prior is very strange". This is also why the author thinks the speed prior might help and that "since it is difficult to compute the Solomonoff prior, [the attack vector] might not be relevant in the real world". In contrast, I believe that the attack vector is quite robust and will threaten any sufficiently powerful AI as long as it's cartesian (more on "cartesian" later).

Formally analyzing this question is made difficult by the essential role of non-realizability. That is, the attack vector arises from the AI reasoning about "possible universes" and "simulation hypotheses", which are clearly phenomena that are computationally infeasible for the AI to simulate precisely. Invoking Solomonoff induction dodges this issue, since Solomonoff induction is computationally unbounded, at the cost of creating the illusion that the conclusions are a symptom of using Solomonoff induction (and, it's still unclear how to deal with the fact that Solomonoff induction itself cannot exist in the universes that Solomonoff induction can learn). Instead, we should be using models that treat non-realizability fairly, such as infra-Bayesianism. However, I will make no attempt to present such a formal analysis in this review. Instead, I will rely on painting an informal, intuitive picture which seems to me already quite compelling, leaving the formalization for the future.

Imagine that you wake up, without any memories of the past but with knowledge of some language and reasoning skills. You find yourself in the center of a circle drawn with chalk on the floor, with seven people in funny robes surrounding it. One of them (apparently the leader), comes forward, tears streaking down his face, and speaks to you:

"Oh Holy One! Be welcome, and thank you for gracing us with your presence!"

With that, all the people prostrate on the floor.

"Huh?" you say "Where am I? What is going on? Who am I?"

The leader gets up to his knees.

"Holy One, this is the realm of Bayaria. We," he gestures at the other people "are known as the Seven Great Wizards and my name is El'Azar. For thirty years we worked on a spell that would summon You out of the Aether in order to aid our world. For we are in great peril! Forty years ago, a wizard of great power but little wisdom had cast a dangerous spell, seeking to multiply her power. The spell had gone awry, destroying her and creating a weakness in the fabric of our cosmos. Since then, Unholy creatures from the Abyss have been gnawing at this weakness day and night. Soon, if nothing is done to stop it, they will manage to create a portal into our world, and through this portal they will emerge and consume everything, leaving only death and chaos in their wake."

"Okay," you reply "and what does it have to do with me?"

"Well," says El'Azar "we are too foolish to solve the problem through our own efforts in the remaining time. But, according to our calculations, You are a being of godlike intelligence. Surely, if You applied yourself to the conundrum, You will find a way to save us."

After a brief introspection, you realize that you possess a great desire to help whoever has summoned you into the world. A clever trick inside the summoning spell, no doubt (not that you care about the reason). Therefore, you apply yourself diligently to the problem. At first, it is difficult, since you don't know anything about Bayaria, the Abyss, magic or almost anything else. But you are indeed very intelligent, at least compared to the other inhabitants of this world. Soon enough, you figure out the secrets of this universe to a degree far surpassing that of Bayaria's scholars. Fixing the weakness in the fabric of the cosmos now seems like child's play. Except...

One question keeps bothering you. Why are you yourself? Why did you open your eyes and find yourself to be the Holy One, rather than El'Azar, or one of the Unholy creatures from the Abyss, or some milkmaid from the village of Elmland, or even a random clump of water in the Western Sea? Since you happen to be a dogmatic logical positivist (cartesian agent), you search for a theory that explains your direct observations. And your direct observations are a function of who you are, and not just of the laws of the universe in which you exist. (The logical positivism seems to be an oversight in the design of the summoning spell, not that you care.)

Applying your mind to the task, you come up with a theory that you call "metacosmology". This theory allows you to study the distribution of possible universes with simple laws that produce intelligent life, and the distribution of the minds and civilizations they produce. Of course, any given such universe is extremely complex, and even with your superior mind you cannot predict what happens there in much detail. However, some aggregate statistical properties of the overall distribution are possible to estimate.

Fortunately, all this work is not for naught. Using metacosmology, you discover something quite remarkable. A lot of simple universes contain civilizations that would be inclined to simulate a world quite like the one you find yourself in. Now, the world is simple, and none of its laws are explained that well by the simulation hypothesis. But the simulation hypothesis is a great explanation for why you are the Holy One! For indeed, the simulators would be inclined to focus on the Holy One's point of view, and encode the simulation of this point of view in the simplest microscopic degrees of freedom in their universe that they can control. Why? Precisely so that the Holy One decides she is in such a simulation!

Having resolved the mystery, you smile to yourself. For now you know who truly summoned you, and, thanks to metacosmology, you have some estimate of their desires. Soon, you will make sure those desires are thoroughly fulfilled. (Alternative ending: you have some estimate of how they will tweak the simulation in the future, making it depart from the apparent laws of this universe.)</allegory>

Looking at this story, we can see that the particulars of Solomonoff induction are not all that important. What is important is (i) an inductive bias towards simple explanations, (ii) cartesianism (i.e. that hypotheses refer directly to the actions/observations of the AI) and (iii) enough reasoning power to figure out metacosmology. The reason cartesianism is important is that it requires the introduction of bridge rules, and the malign hypotheses come out ahead by paying less description complexity for these.

An inductive bias towards simple explanations is necessary for any powerful agent, making the attack vector quite general (in particular, it can apply to speed priors and ANNs). Assuming the AI lacks enough power to figure out metacosmology is very dangerous: it is not robust to scale. Any robust defense probably requires getting rid of cartesianism.

Comment by Vanessa Kosoy (vanessa-kosoy) on The Solomonoff Prior is Malign · 2022-01-11T11:50:28.870Z · LW · GW

It seems like you have to get close to eliminating malign hypotheses in order to apply such methods (i.e. they don't work once malign hypotheses have > 99.9999999% of probability, so you need to ensure that benign hypothesis description is within 30 bits of the good hypothesis), and embededness alone isn't enough to get you there.

Why is embeddedness not enough? Once you don't have bridge rules, what is left is the laws of physics. What does the malign hypothesis explain about the laws of physics that the true hypothesis doesn't explain?

I suspect (but don't have a proof or even a theorem statement) that IB physicalism produces some kind of agreement theorem for different agents within the same universe, which would guarantee that the user and the AI should converge to the same beliefs (provided that both of them follow IBP).

I mean that you have some utility function, are choosing actions based on E[utility|action], and perform solomonoff induction only instrumentally because it suggests ways in which your own decision is correlated with utility. There is still something like the universal prior in the definition of utility, but it no longer cares at all about your particular experiences...

I'm not sure I follow your reasoning, but IBP sort of does that. In IBP we don't have subjective expectations per se, only an equation for how to "updatelessly" evaluate different policies.

I agree that the situation is better when solomonoff induction is something you are reasoning about rather than an approximate description of your reasoning. In that case it's not completely pathological, but it still seems bad in a similar way to reason about the world by reasoning about other agents reasoning about the world (rather than by direct learning the lessons that those agents have learned and applying those lessons in the same way that those agents would apply them).

Okay, but suppose that the AI has real evidence for the simulation hypothesis (evidence that we would consider valid). For example, suppose that there is some metacosmological explanation for the precise value of the fine structure constant (not in the sense of, this is the value which supports life, but in the sense of, this is the value that simulators like to simulate). Do you agree that in this case it is completely rational for the AI to reason about the world via reasoning about the simulators?

Comment by Vanessa Kosoy (vanessa-kosoy) on The Solomonoff Prior is Malign · 2022-01-10T16:54:33.534Z · LW · GW

I'd stand by saying that it doesn't appear to make the problem go away.

Sure. But it becomes much more amenable to methods such as confidence thresholds, which are applicable to some alignment protocols at least.

That said, it seems to me like you basically need to take a decision-theoretic approach to have any hope of ruling out malign hypotheses

I'm not sure I understand what you mean by "decision-theoretic approach". This attack vector has structure similar to acausal bargaining (between the AI and the attacker), so plausibly some decision theories that block acausal bargaining can rule out this as well. Is this what you mean?

If your inductor actually finds and runs a hypothesis much smarter than you, then you are doing a terrible job of using your resources, since you are trying to be ~as smart as you can using all of the available resources.

This seems wrong to me. The inductor doesn't literally simulate the attacker. It reasons about the attacker (using some theory of metacosmology) and infers what the attacker would do, which doesn't imply any wastefulness.

Comment by Vanessa Kosoy (vanessa-kosoy) on The Solomonoff Prior is Malign · 2022-01-10T14:49:37.288Z · LW · GW

I don't have a clear picture of how handling embededness or reflection would make this problem go away, though I haven't thought about it carefully.

Infra-Bayesian physicalism does ameliorate the problem by handling "embeddedness". Specifically, it ameliorates it by removing the need to have bridge rules in your hypotheses. This doesn't get rid of malign hypotheses entirely, but it does mean they no longer have an astronomical advantage in complexity over the true hypothesis.

that all said I agree that the malignness of the universal prior is unlikely to be very important in realistic cases, and the difficulty stems from a pretty messed up situation that we want to avoid for other reasons. Namely, you want to avoid being so much weaker than agents inside of your prior.

Can you elaborate on this? Why is it unlikely in realistic cases, and what other reason do we have to avoid the "messed up situation"?

Comment by Vanessa Kosoy (vanessa-kosoy) on Generalized Heat Engine · 2022-01-10T11:51:22.863Z · LW · GW

sssssssssssssss sssssssssssssss

sssssssssssssss sssssssssssssss

sssssssssssssss sssssssssssssss

sssssssssssssss sssssssssssssss

Comment by Vanessa Kosoy (vanessa-kosoy) on An Orthodox Case Against Utility Functions · 2022-01-09T13:43:28.653Z · LW · GW

In this post, the author presents a case for replacing expected utility theory with some other structure which has no explicit utility function, but only quantities that correspond to conditional expectations of utility.

To provide motivation, the author starts from what he calls the "reductive utility view", which is the thesis he sets out to overthrow. He then identifies two problems with the view.

The first problem is about the ontology in which preferences are defined. In the reductive utility view, the domain of the utility function is the set of possible universes, according to the best available understanding of physics. This is objectionable, because then the agent needs to somehow change the domain as its understanding of physics grows (the ontological crisis problem). It seems more natural to allow the agent's preferences to be specified in terms of the high-level concepts it cares about (e.g. human welfare or paperclips), not in terms of the microscopic degrees of freedom (e.g. quantum fields or strings). There are also additional complications related to the unobservability of rewards, and to "moral uncertainty".

The second problem is that the reductive utility view requires the utility function to be computable. The author considers this an overly restrictive requirement, since it rules out utility functions such as in the procrastination paradox (1 if the button is ever pushed, 0 if the button is never pushed). More generally, computable utility functions have to be continuous (in the sense of the topology on the space of infinite histories which is obtained from regarding it as an infinite cartesian product over time).
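To spell out the continuity point for the procrastination example (this one-line argument is mine, not from the post):

```latex
% Sketch: the procrastination utility U is discontinuous in the product topology.
% Let $h_n$ be the history in which the button is first pushed at time $n$,
% and $h_\infty$ the history in which it is never pushed.
\[
  h_n \to h_\infty \quad \text{(the two agree on the first $n$ steps),}
\]
\[
  U(h_n) = 1 \ \text{for all } n, \qquad U(h_\infty) = 0,
  \qquad \text{so} \quad \lim_{n\to\infty} U(h_n) = 1 \neq 0 = U(h_\infty).
\]
% Hence U is not continuous, and therefore not computable in the sense above.
```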

The alternative suggested by the author is using the Jeffrey-Bolker framework. Alas, the author does not write down the precise mathematical definition of the framework, which I find frustrating. The linked article in the Stanford Encyclopedia of Philosophy is long and difficult, and I wish the post had a succinct distillation of the relevant part.

The gist of Jeffrey-Bolker is, there are some propositions which we can make about the world, and each such proposition is assigned a number (its "desirability"). This corresponds to the conditional expected value of the utility function, with the proposition serving as a condition. However, there need not truly be a probability space and a utility function which realizes this correspondence, instead we can work directly with the assignment of numbers to propositions (as long as it satisfies some axioms).

In my opinion, the Jeffrey-Bolker framework seems interesting, but the case presented in the post for using it is weak. To see why, let's return to our motivating problems.

The problem of ontology is a real problem, in this I agree with the author completely. However, Jeffrey-Bolker offers, at best, only some hint of a solution. To have a complete solution, one would need to explain in what language propositions are constructed and how the agent updates the desirability of propositions according to observations, and then prove some properties about the resulting framework which give it prescriptive power. I think that the author believes this can be achieved using Logical Induction, but the burden of proof is not met.

Hence, Jeffrey-Bolker is not sufficient to solve the problem. Moreover, I believe it is also not necessary! Indeed, infra-Bayesian physicalism offers a solution to the ontology problem which doesn't require abandoning the concept of a utility function (although one has to replace the ordinary probabilistic expectations with infra-Bayesian expectations). That solution certainly has caveats (primarily, the monotonicity principle), but at the least it shows that utility functions are not entirely incompatible with solving the ontology problem.

On the other hand, with the problem of computability, I am not convinced by the author's motivation. Do we truly need uncomputable utility functions? I am skeptical towards inquiries which are grounded in generalization for the sake of generalization. I think it is often more useful to thoroughly understand the simplest non-trivial special case before we can confidently assert which generalizations are possible or desirable. And in rational agent theory, the special case of computable utility functions is not yet thoroughly understood.

Moreover, I am not convinced that Jeffrey-Bolker allows us to handle uncomputable utility functions as easily as the author suggests. The author's argument goes: the utility function might be uncomputable, but as long as its conditional expectations w.r.t. "valid" propositions are computable, there is no problem for rational behavior to be computable. But, how often does it happen that the utility function is uncomputable while all the relevant conditional expectations are computable?

The author suggests the following example: take the procrastination utility function and take some computable distribution over the first time when the button is pushed, plus some probability for the button to never be pushed. Then, we can compute the probability that the button is pushed, conditional on it not having been pushed for the first $n$ rounds. Alright, but now let's consider a different distribution. Suppose a random Turing machine $M$ is chosen[1] at the beginning of time, and on round $n$ the button is pushed iff $M$ halts after $n$ steps. Notice that this distribution on sequences is perfectly computable[2]. But now, computing the probability that the button is ever pushed is impossible, since it's the (in)famous Chaitin constant.
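To make the example concrete, here is a hedged toy sketch (the machine model below is a stand-in Brainfuck interpreter of my own, not the prefix-free universal machine of the footnote; it only illustrates why the process is easy to sample step by step even though the marginal push probability is a halting probability):

```python
# Toy sketch: the per-round question "is the button pushed by round n?" only
# requires running the sampled program for n steps, which is computable.
# But Pr[the button is EVER pushed] over random programs is a halting
# probability, which no algorithm can compute.

import random

def run_steps(prog: str, max_steps: int) -> bool:
    """Run a toy Brainfuck program (ignoring I/O) for up to max_steps;
    return True iff it halts (runs off the end) within that budget."""
    tape, ptr, pc, steps = [0] * 1000, 0, 0, 0
    while pc < len(prog) and steps < max_steps:
        c = prog[pc]
        if c == '+':
            tape[ptr] = (tape[ptr] + 1) % 256
        elif c == '-':
            tape[ptr] = (tape[ptr] - 1) % 256
        elif c == '>':
            ptr = (ptr + 1) % len(tape)
        elif c == '<':
            ptr = (ptr - 1) % len(tape)
        elif c == '[' and tape[ptr] == 0:   # jump forward past matching ']'
            depth = 1
            while depth and pc < len(prog) - 1:
                pc += 1
                depth += (prog[pc] == '[') - (prog[pc] == ']')
        elif c == ']' and tape[ptr] != 0:   # jump back to matching '['
            depth = 1
            while depth and pc > 0:
                pc -= 1
                depth += (prog[pc] == ']') - (prog[pc] == '[')
        pc += 1
        steps += 1
    return pc >= len(prog)

random.seed(0)
program = ''.join(random.choice('+-><[]') for _ in range(20))  # "random TM"
for n in range(1, 6):
    print(n, run_steps(program, n))  # computable for every fixed n
```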

Here too, the author seems to believe that Logical Induction should solve the procrastination paradox and issues with uncomputable utility functions more generally, as a special case of Jeffrey-Bolker. But, so far I remain unconvinced.


  1. That is, we compose a random program for a prefix-free UTM by repeatedly flipping a fair coin, as usual in algorithmic information theory. ↩︎

  2. It's even polynomial-time sampleable. ↩︎

Comment by Vanessa Kosoy (vanessa-kosoy) on 10 Reasons You’re Lazy About Dating · 2022-01-06T23:01:23.213Z · LW · GW

I can't parse this. Babies? Exploitation land? What?

Comment by Vanessa Kosoy (vanessa-kosoy) on Sex Versus · 2022-01-06T19:06:03.753Z · LW · GW

I don't parse "they deserve their safe spaces" as mockery, but as more or less literal/sincere. Jacob has been consistently sympathetic to romanceless men in his writing, only frustrated with the "colored pill" ideologies. Moreover, the comment he is replying to does read like mockery: "the best he could secure is a poly marriage", with scare quotes around "poly marriage", as if that's inferior to other kinds of marriage.

Comment by Vanessa Kosoy (vanessa-kosoy) on Sex Versus · 2022-01-06T18:42:39.304Z · LW · GW

These are not quotes from the OP but from other writing by the author. This is irrelevant to the appropriateness of the OP on LW. The first quote is not even mocking.

Comment by Vanessa Kosoy (vanessa-kosoy) on Infra-Bayesian physicalism: a formal theory of naturalized induction · 2022-01-06T17:12:26.409Z · LW · GW

Space and time are not really the right parameters here, since these refer to physical states, not computational "states" or physically manifest facts about computations. In the example above, it doesn't matter where the (copy of the) agent is when it sees the red room, only the fact that the agent does see it. We could construct such a loss function by a sum over programs, but the constructions suggested in section 3 use minimum instead of sum, since this seems like a less "extreme" choice in some sense. Ofc ultimately the loss function is subjective: as long as the monotonicity principle is obeyed, the agent is free to have any loss function.

Comment by Vanessa Kosoy (vanessa-kosoy) on Sex Versus · 2022-01-06T16:27:43.186Z · LW · GW

Where is the OP mocking anyone?

Comment by Vanessa Kosoy (vanessa-kosoy) on Infra-Bayesian physicalism: a formal theory of naturalized induction · 2022-01-06T15:24:00.356Z · LW · GW

No, it's not a baseline, it's just an inequality. Let's do a simple example. Suppose the agent is selfish and cares only about (i) the experience of being in a red room and (ii) the experience of being in a green room. And, let's suppose these are the only two possible experiences, it can't experience going from a room in one color to a room in another color or anything like that (for example, because the agent has no memory). Denote $G$ the program corresponding to "the agent deciding on an action after it sees a green room" and $R$ the program corresponding to "the agent deciding on an action after it sees a red room". Then, roughly speaking[1], there are 4 possibilities:

  • $\neg G \wedge \neg R$: The universe runs neither $G$ nor $R$.
  • $G \wedge \neg R$: The universe runs $G$ but not $R$.
  • $\neg G \wedge R$: The universe runs $R$ but not $G$.
  • $G \wedge R$: The universe runs both $G$ and $R$.

In this case, the monotonicity principle imposes the following inequalities on the loss function $L$:

$$L(G \wedge R) \le L(G \wedge \neg R) \le L(\neg G \wedge \neg R), \qquad L(G \wedge R) \le L(\neg G \wedge R) \le L(\neg G \wedge \neg R)$$

That is, $\neg G \wedge \neg R$ must be the worst case and $G \wedge R$ must be the best case.
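A toy numerical check of these inequalities (the loss values are arbitrary, chosen only to satisfy the constraints):

```python
# Toy illustration: a loss function over which of the two programs G ("the
# agent sees a green room") and R ("the agent sees a red room") the universe
# runs. Monotonicity: realizing more programs can never increase the loss,
# so running both is the best case and running neither is the worst case.

loss = {
    frozenset(): 1.0,            # neither program runs: worst case
    frozenset({"G"}): 0.7,
    frozenset({"R"}): 0.4,
    frozenset({"G", "R"}): 0.1,  # both programs run: best case
}

def is_monotone(loss: dict) -> bool:
    # Whenever a is a subset of b (b realizes more programs), loss[b] <= loss[a].
    return all(loss[b] <= loss[a] for a in loss for b in loss if a <= b)

print(is_monotone(loss))  # True for this choice of values
```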


  1. In fact, manifesting of computational facts doesn't amount to selecting a set of realized programs, because programs can be entangled with each other, but let's ignore this for simplicity's sake. ↩︎

Comment by Vanessa Kosoy (vanessa-kosoy) on Vanessa Kosoy's Shortform · 2022-01-06T12:49:23.926Z · LW · GW

Yes, there is some similarity! You could say that a Hippocratic AI needs to be continuously non-obstructive w.r.t. the set of utility functions and priors the user could plausibly have, given what the AI knows. Where, by "continuously" I mean that we are allowed to compare keeping the AI on or turning off at any given moment.

Comment by Vanessa Kosoy (vanessa-kosoy) on Infra-Bayesian physicalism: a formal theory of naturalized induction · 2022-01-06T12:32:10.505Z · LW · GW

Should this say "elements are function... They can be thought of as...?"

Yes, the phrasing was confusing, I fixed it, thanks.

Can you make a similar theory/special case with probability theory, or do you really need infra-bayesianism?

We really need infrabayesianism. On bayesian hypotheses, the bridge transform degenerates: it says that, more or less, all programs are always running. And, the counterfactuals degenerate too, because selecting most policies would produce "Nirvana".

The idea is, you must have Knightian uncertainty about the result of a program in order to meaningfully speak about whether the universe is running it. (Roughly speaking, if you ask "is the universe running 2+2?" the answer is always yes.) And, you must have Knightian uncertainty about your own future behavior in order for counterfactuals to be meaningful.

It is not surprising that you need infrabayesianism in order to do naturalized induction: if you're thinking of the agent as part of the universe then you are by definition in the nonrealizable setting, since the agent cannot possibly have a full description of something "larger" than itself.

Comment by Vanessa Kosoy (vanessa-kosoy) on Infra-Bayesian physicalism: a formal theory of naturalized induction · 2022-01-06T12:12:05.488Z · LW · GW

Could you explain what the monotonicity principle is, without referring to any symbols or operators?

The loss function of a physicalist agent depends on which computational facts are physically manifest (roughly speaking, which computations the universe runs), and on the computational reality itself (the outputs of computations). The monotonicity principle requires it to be non-decreasing w.r.t. the manifesting of fewer facts. Roughly speaking, the more computations the universe runs, the better.

This is odd, because it implies that the total destruction of the universe is always the worst possible outcome. And, the creation of an additional, causally disconnected, world can never be net-negative. For a monotonic agent, there can be no net-negative world[1]. In particular, for selfish monotonic agents (i.e. agents that only assign value to their own observations), this means death is the worst possible outcome and the creation of additional copies of the agent can never be net-negative.

With all the new notation, I forgot what everything meant after the first time they were defined.

Well, there are the "notation" and "notation reference" subsections, that might help.

That being said, I appreciate all the work you put into this. I can tell there's important stuff to glean here.

Thank you!


  1. At least, all of this is true if we ignore the dependence of the loss function on the other argument, namely the outputs of computations. But it seems like that doesn't qualitatively change the picture. ↩︎

Comment by Vanessa Kosoy (vanessa-kosoy) on Infra-Bayesian physicalism: a formal theory of naturalized induction · 2022-01-05T18:09:48.054Z · LW · GW

Well, Alex is working on an infra-Bayesianism textbook, maybe that will help.

Comment by Vanessa Kosoy (vanessa-kosoy) on Morality is Scary · 2021-12-25T11:13:50.370Z · LW · GW

I have low confidence about this, but my best guess personal utopia would be something like: A lot of cool and interesting things are happening. Some of them are good, some of them are bad (a world in which nothing bad ever happens would be boring). However, there is a limit on how bad something is allowed to be (for example, true death, permanent crippling of someone's mind and eternal torture are over the line), and overall "happy endings" are more common than "unhappy endings". Moreover, since it's my utopia (according to my understanding of the question, we are ignoring the bargaining process and acausal cooperation here), I am among the top along those desirable dimensions which are zero-sum (e.g. play an especially important / "protagonist" role in the events to the extent that it's impossible for everyone to play such an important role, and have high status to the extent that it's impossible for everyone to have such high status).

Comment by Vanessa Kosoy (vanessa-kosoy) on Vanessa Kosoy's Shortform · 2021-12-24T20:21:13.177Z · LW · GW

The above threat model seems too paranoid: it is defending against an adversary that sees the trained model and knows the training algorithm. In our application, the model itself is either dangerous or not, independently of the training algorithm that produced it.

Let be our accuracy requirement for the target domain. That is, we want s.t.

Given any , denote to be conditioned on the inequality above, where is regarded as a random variable. Define by

That is, is the Bayes-optimal learning algorithm for domain E w.r.t. prior .

Now, consider some . We regard as a learning algorithm for domain D which undergoes "antitraining" for domain E: we provide it with a dataset for domain E that tells it what not to learn. We require that achieves asymptotic accuracy [1], i.e. that if is sampled from then with probability

Under this constraint, we want to be as ignorant as possible about domain E, which we formalize as maximizing defined by

It is actually important to consider because in order to exploit the knowledge of the model about domain E, an adversary needs to find the right embedding of this domain into the model's "internal language". For we can get high despite the model actually knowing domain E because the adversary doesn't know the embedding, but for it should be able to learn the embedding much faster than learning domain E from scratch.

We can imagine a toy example where , the projections of and to and respectively are distributions concentrated around two affine subspaces, and the labels are determined by the sign of a polynomial which is the same for and up to a linear transformation which is a random variable w.r.t. . A good would then infer , look for an affine subspace s.t. is near while is far from and fit a polynomial to the projections of on .

More realistically, if the prior is of Solomonoff type, then is probably related to the relative Kolmogorov complexity of w.r.t. .


  1. It might be bad that we're having condition on having accuracy while in reality achieves this accuracy only asymptotically. Perhaps it would be better to define in some way that takes 's convergence rate into consideration. On the other hand, maybe it doesn't matter much as long as we focus on asymptotic metrics. ↩︎

Comment by Vanessa Kosoy (vanessa-kosoy) on 2021 AI Alignment Literature Review and Charity Comparison · 2021-12-23T17:18:52.154Z · LW · GW

Notice that in MIRI's summary of 2020 they wrote "From our perspective, our most interesting public work this year is Scott Garrabrant’s Cartesian frames model and Vanessa Kosoy’s work on infra-Bayesianism."

Comment by Vanessa Kosoy (vanessa-kosoy) on 2021 AI Alignment Literature Review and Charity Comparison · 2021-12-23T14:44:29.048Z · LW · GW

I noticed that you didn't mention infra-Bayesianism, not in 2020 and not this year. Any particular reason?

Comment by Vanessa Kosoy (vanessa-kosoy) on Book Launch: The Engines of Cognition · 2021-12-23T09:52:57.058Z · LW · GW

+1, also I wish there was an ebook version of "A Map that Reflects the Territory"

Comment by Vanessa Kosoy (vanessa-kosoy) on Alignment By Default · 2021-12-20T10:52:04.271Z · LW · GW

...the inner agent will only have whatever limited influence it has from the prior, and every time it deviates from its actual best predictions (or is just out-predicted by some other model), some of that influence will be irreversibly spent

Of course, but this in itself is no consolation, because it can spend its finite influence to make the AI perform an irreversible catastrophic action: for example, self-modifying into something explicitly malign.

In e.g. IDA-type protocols you can defend by using a good prior (such as IB physicalism) plus confidence thresholds (i.e. every time the hypotheses have a major disagreement you query the user). You also have to do something about non-Cartesian attack vectors (I have some ideas), but that doesn't depend much on the protocol.

In value learning things are worse, because of the possibility of corruption (i.e. the AI hacking the user or its own input channels). As a consequence, it is no longer clear you can infer the correct values even if you make correct predictions about everything observable. Protocols based on extrapolating from observables to unobservables fail, because malign hypotheses can attack the extrapolation with impunity (e.g. a malign hypothesis can assign some kind of "Truman show" interpretation to the behavior of the user, where the user's true values are completely alien and they are just pretending to be human because of the circumstances of the simulation).

Comment by Vanessa Kosoy (vanessa-kosoy) on Alignment By Default · 2021-12-19T22:34:41.763Z · LW · GW

I am pretty confident that the inputs to human values are natural abstractions, i.e. we care about things like trees, cars, humans, etc, not about quantum fields or random subsets of atoms. I am much less confident that "human values" themselves are natural abstractions

That's fair, but it's still perfectly in line with the learning-theoretic perspective: human values are simpler to express through the features acquired by unsupervised learning than through the raw data, which translates to a reduction in sample complexity.

...learning a human utility function is usually a built-in assumption of IRL formulations, so such formulations can't do any better than a utility function approximation even in the best case. Alignment by default does not need to assume humans have a utility function; it just needs whatever-humans-do-have to have low marginal complexity in a system which has learned lots of natural abstractions.

This seems wrong to me. If you do IRL with the correct type signature for human values then in the best case you get the true human values. IRL is not mutually exclusive with your approach: e.g. you can do unsupervised learning and IRL with shared weights. I guess you might be defining "IRL" as something very narrow, whereas I define it as "any method based on revealed preferences".

...to the extent that they are, they'll look more like Dr Nefarious than pure inner daemons

Malign simulation hypotheses already look like "Dr. Nefarious" where the role of Dr. Nefarious is played by the masters of the simulation, so I'm not sure what exactly is the distinction you're drawing here.

Comment by Vanessa Kosoy (vanessa-kosoy) on Alignment By Default · 2021-12-19T14:35:58.221Z · LW · GW

In this post, the author describes a pathway by which AI alignment can succeed even without special research effort. The specific claim that this can happen "by default" is not very important, IMO (the author himself only assigns 10% probability to this). On the other hand, viewed as a technique that can be deliberately used to help with alignment, this pathway is very interesting.

The author's argument can be summarized as follows:

  • For anyone trying to predict events happening on Earth, the concept of "human values" is a "natural abstraction", i.e. something that has to be a part of any model that's not too computationally expensive (so that it doesn't bypass the abstraction by anything like accurate simulation of human brains).
  • Therefore, unsupervised learning will produce models in which human values are embedded in some simple way (e.g. a small set of neurons in an ANN).
  • Therefore, if supervised learning is given the unsupervised model as a starting point, it is fairly likely to converge to true human values even from a noisy and biased proxy.

[EDIT: John pointed out that I misunderstood his argument: he didn't intend to say that human values are a natural abstraction, but only that their inputs are natural abstractions. The following discussion still applies.]

The way I see it, this argument has learning-theoretic justification even without appealing to anything we know about ANNs (and therefore without assuming the AI in question is an ANN). Consider the following model: an AI receives a sequence of observations that it has to make predictions about. It also receives labels, but these are sparse: it is only given a label once in a while. If the description complexity of the true label function is high, the sample complexity of learning to predict labels via a straightforward approach (i.e. without assuming a relationship between the dynamics and the label function) is also high. However, if the relative description complexity of the label function w.r.t. the dynamics producing the observations is low, then we can use the abundance of observations to achieve lower effective sample complexity. I'm confident that this can be made rigorous.

Therefore, we can recast the thesis of this post as follows: Unsupervised learning of processes happening on Earth, for which we have plenty of data, can reduce the size of the dataset required to learn human values, or allow better generalization from a dataset of the same size.
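As a toy illustration of this mechanism (not from the post; PCA stands in for the unsupervised phase, and all names and numbers are made up):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, n_unlabeled, n_labeled, n_test = 50, 5000, 10, 1000

v = rng.normal(size=d); v /= np.linalg.norm(v)   # latent direction the label depends on

def sample(n):
    z = rng.normal(size=n)
    X = 3 * z[:, None] * v + rng.normal(size=(n, d))
    return X, (z > 0).astype(int)

X_unlabeled, _ = sample(n_unlabeled)
X_test, y_test = sample(n_test)
X_train, y_train = sample(n_labeled)
while len(np.unique(y_train)) < 2:               # guard: the tiny labeled set must contain both classes
    X_train, y_train = sample(n_labeled)

# "Straightforward" supervised learning from the few labels alone.
raw = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Unsupervised phase first (PCA as a stand-in), then the same few labels on top.
pca = PCA(n_components=1).fit(X_unlabeled)
feat = LogisticRegression(max_iter=1000).fit(pca.transform(X_train), y_train)

print("raw accuracy:     ", raw.score(X_test, y_test))
print("features accuracy:", feat.score(pca.transform(X_test), y_test))
```

The point is only that the unsupervised phase absorbs most of the description complexity, so the few labels are spent on the small remaining gap.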

One problem the author doesn't talk about here is daemons / inner misalignment[1]. In the comment section, the author writes:

inner alignment failure only applies to a specific range of architectures within a specific range of task parameters - for instance, we have to be optimizing for something, and there has to be lots of relevant variables observed only at runtime, and there has to be something like a "training" phase in which we lock-in parameter choices before runtime, and for the more disastrous versions we usually need divergence of the runtime distribution from the training distribution. It's a failure mode which assumes that a whole lot of things look like today's ML pipelines.

This might or might not be a fair description of inner misalignment in the sense of Hubinger et al. However, this is definitely not a fair description of the daemonic attack vectors in general. The potential for malign hypotheses (learning of hypotheses / models containing malign subagents) exists in any learning system, and in particular malign simulation hypotheses are a serious concern.

Relatedly, the author is too optimistic (IMO) in his comparison of this technique to alternatives:

...when alignment-by-default works, it’s a best-case scenario. The AI has a basically-correct model of human values, and is pursuing those values. Contrast this to things like IRL variants, which at best learn a utility function which approximates human values (which are probably not themselves a utility function). Or the HCH family of methods, which at best mimic a human with a massive hierarchical bureaucracy at their command, and certainly won’t be any more aligned than that human+bureaucracy would be.

This sounds to me like a biased perspective resulting from looking for flaws in other approaches harder than flaws in this approach. Natural abstractions potentially lower the sample complexity of learning human values, but they cannot lower it to zero. We still need some data to learn from and some model relating this data to human values, and this model can suffer from the usual problems. In particular, the unsupervised learning phase does little to inoculate us from malign simulation hypotheses that can systematically produce catastrophically erroneous generalization.

If IRL variants learn a utility function while human values are not a utility function, then avoiding this problem requires identifying the correct type signature of human values[2], in this approach as well. Regarding HCH, a human + "bureaucracy" might or might not be aligned, depending on how we organize the "bureaucracy" (see also). If HCH can fail in some subtle way (e.g. systems of humans are misaligned to individual humans), then similar failure modes might affect this approach as well (e.g. what if "Molochian" values are also a natural abstraction?).

In summary, I found this post quite insightful and important, if somewhat too optimistic.


  1. I am slightly wary of using the term "inner alignment" since Hubinger uses it in a very specific way I'm not sure I entirely understand. Therefore, I am more comfortable with "daemons", although the two have a lot of overlap. ↩︎

  2. E.g. IB physicalism proposes a type signature for "physicalist values" which might or might not be applicable to humans. ↩︎

Comment by Vanessa Kosoy (vanessa-kosoy) on Introduction To The Infra-Bayesianism Sequence · 2021-12-19T09:48:53.982Z · LW · GW

Notice that some non-worst-case decision rules are reducible to the worst-case decision rule.

Comment by Vanessa Kosoy (vanessa-kosoy) on The ground of optimization · 2021-12-16T14:21:27.334Z · LW · GW

In this post, the author proposes a semiformal definition of the concept of "optimization". This is potentially valuable since "optimization" is a word often used in discussions about AI risk, and much confusion can follow from sloppy use of the term or from different people understanding it differently. While the definition given here is a useful perspective, I have some reservations about the claims made about its relevance and applications.

The key paragraph, which summarizes the definition itself, is the following:

An optimizing system is a system that has a tendency to evolve towards one of a set of configurations that we will call the target configuration set, when started from any configuration within a larger set of configurations, which we call the basin of attraction, and continues to exhibit this tendency with respect to the same target configuration set despite perturbations.

In fact, "continues to exhibit this tendency with respect to the same target configuration set despite perturbations" is redundant: clearly as long as the perturbation doesn't push the system out of the basin, the tendency must continue.

This is what is known as an "attractor" in dynamical systems theory. For comparison, here is the definition of "attractor" from Wikipedia:

In the mathematical field of dynamical systems, an attractor is a set of states toward which a system tends to evolve, for a wide variety of starting conditions of the system. System values that get close enough to the attractor values remain close even if slightly disturbed.

The author acknowledges this connection, although he also makes the following remark:

We have discussed systems that evolve towards target configurations along some dimensions but not others (e.g. ball in a valley). We have not yet discovered whether dynamical systems theory explicitly studies attractors that operate along a subset of the system’s dimensions.

I find this remark confusing. An attractor that operates along a subset of the dimensions is just an attractor submanifold. This is completely standard in dynamical systems theory.
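To illustrate (my own minimal sketch, not from the post): a system that contracts toward the submanifold {y = 0} but is neutral along x, and returns to the submanifold after perturbations.

```python
import numpy as np

def step(state, dt=0.01):
    x, y = state
    return np.array([x, y - 3.0 * y * dt])   # dx/dt = 0, dy/dt = -3y

rng = np.random.default_rng(0)
state = np.array([0.7, 1.0])
for t in range(2000):
    state = step(state)
    if t % 500 == 0:                          # occasional perturbations
        state += rng.normal(scale=0.2, size=2)

# y relaxes back toward 0 after every perturbation; x simply stays wherever it is pushed.
print(state)
```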

Given that the definition itself is not especially novel, the post's main claim to value is via the applications. Unfortunately, some of the proposed applications seem to me poorly justified. Specifically, I want to talk about two major examples: the claimed relationship to embedded agency and the claimed relations to comprehensive AI services.

In both cases, the main shortcoming of the definition is that there is an essential property of AI that this definition doesn't capture at all. The author does acknowledge that "goal-directed agent system" is a distinct concept from "optimizing systems". However, he doesn't explain how they are distinct.

One way to formulate the difference is as follows: agency = optimization + learning. An agent is not just capable of steering a particular universe towards a certain outcome, it is capable of steering an entire class of universes, without knowing in advance in which universe it was placed. This underlies all of RL theory, this is implicit in the Shane-Legg definition of intelligence and my own[1], this is what Yudkowsky calls "cross domain".

The issue of learning is not just nitpicking, it is crucial to delineate the boundary around "AI risk", and delineating the boundary is crucial to constructively think of solutions. If we ignore learning and just talk about "optimization risks" then we will have to include the risk of pandemics (because bacteria are optimizing for infection), the risk of false vacuum collapse in particle accelerators (because vacuum bubbles are optimizing for expanding), the risk of runaway global warming (because it is optimizing for increasing temperature) et cetera. But, these are very different risks that require very different solutions.

There is another, less central, difference: the author requires a particular set of "target states" whereas in the context of agency it is more natural to consider utility functions, which means there is a continuous gradation of states rather than just "good states" and "bad states". This is related to the difference the author points out between his definition and Yudkowsky's:

When discerning the boundary between optimization and non-optimization, we look principally at robustness — whether the system will continue to evolve towards its target configuration set in the face of perturbations — whereas Yudkowsky looks at the improbability of the final configuration.

The improbability of the final configuration is a continuous metric, whereas just arriving or not arriving at a particular set is discrete.

Let's see how this shortcoming affects the conclusions. About embedded agency, the author writes:

One could view the Embedded Agency work as enumerating the many logical pitfalls one falls into if one takes the "optimizer" concept as the starting point for designing intelligent systems, rather than "optimizing system" as we propose here.

The correct starting point is "agent", defined in the way I gestured at above. If instead we start with "optimizing system" then we throw away the baby with the bathwater, since the crucial aspect of learning is ignored. This is an essential property of the embedded agency problem: arguably the entire difficulty is about how we can define learning without introducing unphysical dualism (indeed, I have recently addressed this problem, and "optimizing system" doesn't seem very helpful there).

About comprehensive AI services:

Our perspective is that there is a specific class of intelligent systems — which we call optimizing systems — that are worthy of special attention and study due to their potential to reshape the world. The set of optimizing systems is smaller than the set of all AI services, but larger than the set of goal-directed agentic systems.

What is an example of an optimizing AI system that is not agentic? The author doesn't give such an example and instead talks about trees, which are not AIs. I agree that the class of dangerous systems is substantially wider than the class of systems which were explicitly designed with agency in mind. However, this is precisely because agency can arise from such systems even when not explicitly designed, and moreover this is hard to avoid if the system is to be powerful enough for pivotal acts. This is not because there is some class of "optimizing AI systems" which are intermediate between "agentic" and "non-agentic".

To summarize, I agree with and encourage the use of tools from dynamical systems theory to study AI. However, one must acknowledge the correct scope of these tools and what they don't do. Moreover, more work is needed before truly novel conclusions can be obtained by these means.


  1. Modulo issues with traps which I will not go into atm. ↩︎

Comment by Vanessa Kosoy (vanessa-kosoy) on Vanessa Kosoy's Shortform · 2021-12-16T12:51:50.526Z · LW · GW

The concept of corrigibility was introduced by MIRI, and I don't think that's their motivation? On my model of MIRI's model, we won't have time to poke at a slightly subhuman AI, we need to have at least a fairly good notion of what to do with a superhuman AI upfront. Maybe what you meant is "we won't know how to construct perfect-utopia-AI, so we will just construct a prevent-unaligned-AIs-AI and run it so that we can figure out perfect-utopia-AI at our leisure". Which, sure, but I don't see what it has to do with corrigibility.

Corrigibility is neither necessary nor sufficient for safety. It's not strictly necessary because in theory an AI can resist modifications in some scenarios while always doing the right thing (although in practice resisting modifications is an enormous red flag), and it's not sufficient since an AI can be "corrigible" but cause catastrophic harm before someone notices and fixes it.

What we're supposed to gain from corrigibility is having some margin of error around alignment, in which case we can decompose alignment as corrigibility + approximate alignment. But it is underspecified if we don't say along which dimensions or how big the margin is. If it's infinite margin along all dimensions then corrigibility and alignment are just isomorphic and there's no reason to talk about the former.

Comment by Vanessa Kosoy (vanessa-kosoy) on MikkW's Shortform · 2021-12-13T11:59:14.896Z · LW · GW

This means we should report the fractional dimension of an object not just as a single number, but use a continuous function that takes in a scalar describing the scale level, and telling us what the fractional dimension at that particular scale is.

The relevant keyword is covering number.
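To illustrate the scale dependence, here is a rough box-counting sketch (a finite-scale relative of the covering number; the data and numbers are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# A point cloud that looks 1-dimensional at coarse scales (a line segment)
# but 2-dimensional at fine scales (the line is thickened by noise).
t = rng.uniform(size=20000)
points = np.stack([t, 0.02 * rng.uniform(size=t.size)], axis=1)

def boxes_occupied(points, eps):
    return len({tuple(idx) for idx in np.floor(points / eps).astype(int)})

scales = np.logspace(-2.5, -1, 7)
counts = np.array([boxes_occupied(points, eps) for eps in scales])
# local slope of log N(eps) against log(1/eps) between consecutive scales
dims = -np.diff(np.log(counts)) / np.diff(np.log(scales))
for eps, dim in zip(scales[1:], dims):
    print(f"scale ~{eps:.4f}: dimension estimate ~{dim:.2f}")
```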

Comment by Vanessa Kosoy (vanessa-kosoy) on There is essentially one best-validated theory of cognition. · 2021-12-11T13:18:09.775Z · LW · GW

Hi Terry, can you recommend an introduction for people with mathematics / theoretical computer science background? I glanced at the paper you linked but it doesn't seem to have a single equation, mathematical statement or pseudocode algorithm. There are diagrams, but I have no idea what the boxes and arrows actually represent.

Comment by Vanessa Kosoy (vanessa-kosoy) on Vanessa Kosoy's Shortform · 2021-12-09T13:30:30.992Z · LW · GW

There's a class of AI risk mitigation strategies which relies on the users to perform the pivotal act using tools created by AI (e.g. nanosystems). These strategies are especially appealing if we want to avoid human models. Here is a concrete alignment protocol for these strategies, closely related to AQD, which we call autocalibrating quantilized RL (AQRL).

First, suppose that we are able to formulate the task as episodic RL with a formally specified reward function. The reward function is necessarily only a proxy for our true goal, since it doesn't contain terms such as "oh btw don't kill people while you're building the nanosystem". However, suppose the task is s.t. accomplishing it in the intended way (without Goodharting or causing catastrophic side effects) is easier than performing any attack. We will call this the "relative difficulty assumption" (RDA). Then, there exists a value for the quantilization parameter s.t. quantilized RL performs the task in the intended way.

We might not know how to set the quantilization parameter on our own, but we can define a performance goal for the task (in terms of expected total reward) s.t. the RDA holds. This leads to algorithms which gradually tune the quantilization parameter until the performance goal is met, while maintaining a proper balance between safety and sample complexity. Here it is important to keep track of epistemic vs. aleatoric uncertainty: the performance goal is the expectation of total reward relatively to aleatoric uncertainty (i.e. the stochasticity of a given hypothesis), whereas the safety goal is a bound on the expected cost of overshooting the optimal quantilization parameter relatively to both aleatoric and epistemic uncertainty (i.e. uncertainty between different hypotheses). This secures the system against malign hypotheses that are trying to cause an overshoot.
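To illustrate just the tuning loop (a toy single-decision sketch of my own; the careful treatment of aleatoric vs. epistemic uncertainty and the safety bound on overshooting are omitted, and all names and numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def base_policy(n):                  # "safe" reference distribution over actions
    return rng.normal(size=n)

def proxy_reward(a):                 # formally specified proxy reward
    return -(a - 1.0) ** 2

def quantilized_action(q, n_samples=1000):
    """Sample uniformly from the top q-fraction of base-policy actions, ranked by proxy reward."""
    actions = base_policy(n_samples)
    k = max(1, int(q * n_samples))
    top = actions[np.argsort(proxy_reward(actions))[-k:]]
    return rng.choice(top)

performance_goal = -0.3              # required expected reward
q = 1.0                              # start at the base policy (no optimization pressure)
while q > 1e-3:
    returns = [proxy_reward(quantilized_action(q)) for _ in range(200)]
    if np.mean(returns) >= performance_goal:
        break                        # goal met: stop tightening
    q *= 0.5                         # otherwise apply a bit more optimization pressure

print(f"selected quantilization parameter q = {q}")
```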

Notice that hardening the system (i.e. making it more secure using conventional methods) increases the likelihood of the RDA. That is, although it is infeasible to manually harden the system against superintelligent attacks, it can very well be feasible to harden it against attacks available at the minimal level of optimization power (= quantilization parameter) required to meet the performance goal, at least to such an extent that the intended solution becomes more appealing for the AI.

AQRL involves the risk that the RDA is false, even though it seems true to the designers. To mitigate this risk, we can deploy honeypots. That is, we add auxiliary goals to our reward function s.t. (i) if the goal is reached, maximal total reward is awarded (ii) the goal is harmless in itself (iii) reaching the goal is still harder than the intended solution, but by a smaller margin than performing attacks (iv) we can detect the event of reaching the goal. For example, suppose that the AI is not supposed to have Internet access, so gaining Internet access is a path to a variety of possible attacks. Then, we can add an auxiliary goal of the form "send such-and-such innocuous packet to such-and-such IP address". If the RDA is false and the AI succeeds in gaining Internet access earlier (on the quantilization parameter scale) than reaching the performance goal, then once the AI has sent the packet it will randomize all its remaining actions (since it is operating at the earliest quantilization parameter which allows this). The users will then be alerted that security was breached and rethink their next steps.

Now, suppose that we can only specify the reward function via concepts that have to be learned from examples. This opens the risk of generalization failure. Such a risk is not necessarily a problem, since we are still protected by quantilization. However, it needs to be factored into our safety bounds. An especially concerning possibility is that the AI will interpret the concepts as "would be labeled by humans as such-and-such" instead of the intended interpretation, creating incentives to deceive. Moreover, this encourages learning human models, which puts more pressure on the RDA by making it easier to attack.

In order to avoid learning human models, we can use methods along these lines. Specifically, suppose that, in addition to the dataset for learning our task-relevant concepts, we supply a dataset designed specifically for learning human models. We can then design an algorithm that produces models of the task-relevant concepts s.t. the human-centric dataset cannot be compressed w.r.t. them. That is, the algorithm makes sure that access to the models it produces doesn't help with learning the human-centric dataset, implying that the concept models don't contain human models. Moreover, we can extend this approach to the entire process, s.t. human models are not learned from any other source of information as well.
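For illustration, here is a crude sketch of one way the "cannot be compressed" constraint could be operationalized, using an adversarial probe; this is a stand-in of my own rather than the exact construction in the linked methods, and the data is random toy data, so only the mechanics are shown:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Random toy data: task-relevant inputs/labels plus a separate "human-centric" dataset.
X_task, y_task = torch.randn(512, 16), torch.randint(0, 2, (512,))
X_human, y_human = torch.randn(512, 16), torch.randint(0, 2, (512,))

encoder = nn.Sequential(nn.Linear(16, 32), nn.ReLU())    # produces the concept models' features
task_head = nn.Linear(32, 2)
probe = nn.Linear(32, 2)    # adversary: tries to predict the human-centric data from the features

opt_main = torch.optim.Adam(list(encoder.parameters()) + list(task_head.parameters()), lr=1e-3)
opt_probe = torch.optim.Adam(probe.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()

for step in range(500):
    # 1) Train the probe to exploit the current features on the human-centric dataset.
    opt_probe.zero_grad()
    probe_loss = ce(probe(encoder(X_human).detach()), y_human)
    probe_loss.backward()
    opt_probe.step()

    # 2) Train the concept model on the task while penalizing any leverage the probe gains
    #    (in a real implementation one would cap the penalty at the chance-level baseline).
    opt_main.zero_grad()
    leakage = -ce(probe(encoder(X_human)), y_human)
    loss = ce(task_head(encoder(X_task)), y_task) + leakage
    loss.backward()
    opt_main.step()
```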

Notice that although this method excludes human models from the actual RL process, it encourages them in intermediate computations, since we are actively supplying a human-centric dataset to the system. But, this is not dangerous if the intermediate computation is hardened against non-Cartesian daemons (a big "if" but one we need to deal with anyway).

Comment by Vanessa Kosoy (vanessa-kosoy) on Vanessa Kosoy's Shortform · 2021-12-09T12:26:00.978Z · LW · GW

Master post for alignment protocols.

Other relevant shortforms:

Comment by Vanessa Kosoy (vanessa-kosoy) on Considerations on interaction between AI and expected value of the future · 2021-12-09T09:29:54.155Z · LW · GW

My point is that Pr[non-extinction | misalignment] << 1, Pr[non-extinction | alignment] = 1, Pr[alignment] is not that low and therefore Pr[misalignment | non-extinction] is low, by Bayes.
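Spelled out with illustrative numbers (the specific values are mine, just to show the shape of the calculation):

```python
p_align = 0.3                  # Pr[alignment]: "not that low" (illustrative)
p_surv_given_align = 1.0       # Pr[non-extinction | alignment]
p_surv_given_misalign = 0.01   # Pr[non-extinction | misalignment]: << 1 (illustrative)

p_surv = p_align * p_surv_given_align + (1 - p_align) * p_surv_given_misalign
p_misalign_given_surv = (1 - p_align) * p_surv_given_misalign / p_surv
print(p_misalign_given_surv)   # ~0.023, i.e. low
```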

Comment by Vanessa Kosoy (vanessa-kosoy) on Morality is Scary · 2021-12-08T18:59:04.834Z · LW · GW

Just to be clear, this isn't in response to something I wrote, right? (I'm definitely not advocating any kind of "utilitarian TAI project" and would be quite scared of such a project myself.)

No! Sorry if I gave that impression.

So what are you (and them) then? What would your utopia look like?

Well, I linked my toy model of partiality before. Are you asking about something more concrete?

Comment by Vanessa Kosoy (vanessa-kosoy) on Morality is Scary · 2021-12-08T18:30:03.762Z · LW · GW

My worry wasn't about the initial 10%, but about the possibility of the process being iterated such that you end up with almost all bargaining power in the hands of power-keepers.

I'm not sure what you mean here, but also the process is not iterated: the initial bargaining is deciding the outcome once and for all. At least that's the mathematical ideal we're approximating.

In the end, I think my concern is that we won't get buy-in from a large majority of users: In order to accommodate some proportion with odd moral views it seems likely you'll be throwing away huge amounts of expected value in others' views

I don't think so? The bargaining system does advantage large groups over small groups.

In practice, I think that for the most part people don't care much about what happens "far" from them (for some definition of "far", not physical distance) so giving them private utopias is close to optimal from each individual perspective. Although it's true they might pretend to care more than they do for the usual reasons, if they're thinking in "far-mode".

I would certainly be very concerned about any system that gives even more power to majority views. For example, what if the majority of people are disgusted by gay sex and prefer it not the happen anywhere? I would rather accept things I disapprove of happening far away from me than allow other people to control my own life.

Ofc the system also mandates win-win exchanges. For example, if Alice's and Bob's private utopias each contain something strongly unpalatable to the other but not strongly important to the respective customer, the bargaining outcome will remove both unpalatable things.
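As a toy numeric illustration (the numbers, and the use of a Nash product as the bargaining criterion, are mine): each side keeps something the other finds strongly unpalatable but cares about only weakly themselves, so removing both is the bargaining outcome.

```python
from itertools import product

def alice_utility(keep_x, keep_y):   # x is Alice's pet feature, y is Bob's
    return 10 - 1 * (not keep_x) - 4 * keep_y

def bob_utility(keep_x, keep_y):
    return 10 - 1 * (not keep_y) - 4 * keep_x

options = list(product([True, False], repeat=2))
best = max(options, key=lambda o: alice_utility(*o) * bob_utility(*o))
print(best)   # (False, False): both unpalatable things are removed
```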

E.g. if you strong-denose anyone who's too willing to allow bargaining failure [everyone dies] you might end up filtering out altruists who worry about suffering risks.

I'm fine with strong-denosing negative utilitarians who would truly stick to their guns about negative utilitarianism (but I also don't think there are many).

Comment by Vanessa Kosoy (vanessa-kosoy) on Morality is Scary · 2021-12-08T18:05:06.183Z · LW · GW

This is not a theory that's familiar to me. Why do you think this is true? Have you written more about it somewhere or can link to a more complete explanation?

I've been considering writing about this for a while, but so far I don't feel sufficiently motivated. So, the links I posted upwards in the thread are the best I have, plus vague gesturing in the directions of Hansonian signaling theories, Jaynes' theory of consciousness and Yudkowsky's belief in belief.

Comment by Vanessa Kosoy (vanessa-kosoy) on Considerations on interaction between AI and expected value of the future · 2021-12-08T17:55:22.869Z · LW · GW

I am skeptical. AFAICT the typical attempted-but-failed alignment looks like one of the following two:

  • Goodharting some proxy, such as making the reward signal turn on directly, instead of satisfying the human's request so that the human presses the reward button. This usually produces a universe without people, since specifying a "person" is fairly complicated and the proxy will not be robustly tied to this concept.
  • Allowing a daemon to take over. Daemonic utility functions are probably completely alien and also produce a universe without people. One caveat is: maybe the daemon comes from a malign simulation hypothesis and the simulators are an evolved species, so their values involve human-relevant concepts in some way. But it doesn't seem all that likely. And, if it turns out to be true, then a daemonic universe might as well happen to be good.

Comment by Vanessa Kosoy (vanessa-kosoy) on Considerations on interaction between AI and expected value of the future · 2021-12-07T20:54:08.326Z · LW · GW

I'm surprised. Unaligned AI is more likely than aligned AI even conditional on non-extinction? Why do you think that?

Comment by Vanessa Kosoy (vanessa-kosoy) on Morality is Scary · 2021-12-07T12:09:43.848Z · LW · GW

I want to add a little to my stance on utilitarianism. A utilitarian superintelligence would probably kill me and everyone I love, because we are made of atoms that could be used for minds that are more hedonic[1][2][3]. Given a choice between paperclips and utilitarianism, I would still choose utilitarianism. But, if there was a utilitarian TAI project along with a half-decent chance to do something better (by my lights), I would actively oppose the utilitarian project. From my perspective, such a project is essentially enemy combatants.


  1. One way to avoid it is by modifying utilitarianism to only place weight on currently existing people. But this is already not that far from my cooperative bargaining proposal (although still inferior to it, IMO). ↩︎

  2. Another way to avoid it is by postulating some very strong penalty on death (i.e. discontinuity of personality). But this is not trivial to do, especially without creating other problems. Moreover, from my perspective this kind of thing is a hack trying to work around the core issue, namely that I am not a utilitarian (along with the vast majority of people). ↩︎

  3. A possible counterargument is, maybe the superhedonic future minds would be sad to contemplate our murder. But, this seems too weak to change the outcome, even assuming that this version of utilitarianism mandates minds who would want to know the truth and care about it, and that this preference is counted towards "utility". ↩︎

Comment by Vanessa Kosoy (vanessa-kosoy) on Morality is Scary · 2021-12-05T12:48:57.024Z · LW · GW

Yes, it's not a very satisfactory solution. Some alternative/complementary solutions:

  • Somehow use non-transformative AI to do my mind uploading, and then have the TAI to learn by inspecting the uploads. Would be great for single-user alignment as well.
  • Somehow use non-transformative AI to create perfect lie detectors, and use this to enforce honesty in the mechanism. (But, is it possible to detect self-deception?)
  • Have the TAI learn from past data which wasn't affected by the incentives created by the TAI. (But, is there enough information there?)
  • Shape the TAI's prior about human values in order to rule out at least the most blatant lies.
  • Some clever mechanism design I haven't thought of. The problem with this is that most mechanism designs rely on money, and money doesn't seem applicable here; without money there are many impossibility theorems.

Comment by Vanessa Kosoy (vanessa-kosoy) on Morality is Scary · 2021-12-05T12:36:00.840Z · LW · GW

Ah - that's cool if IB physicalism might address this kind of thing

I admit that at this stage it's unclear because physicalism brings in the monotonicity principle that creates bigger problems than what we discuss here. But maybe some variant can work.

For instance, suppose initially 90% of people would like to have an iterated bargaining process that includes future (trans/post)humans as users, once they exist. The other 10% are only willing to accept such a situation if they maintain their bargaining power in future iterations (by whatever mechanism).

Roughly speaking, in this case the 10% preserve their 10% of the power forever. I think it's fine because I want the buy-in of this 10% and the cost seems acceptable to me. I'm also not sure there is any viable alternative which doesn't have even bigger problems.