Vanessa Kosoy's Shortform

post by Vanessa Kosoy (vanessa-kosoy) · 2019-10-18T12:26:32.801Z · LW · GW · 185 comments

Comments sorted by top scores.

comment by Vanessa Kosoy (vanessa-kosoy) · 2021-03-14T12:13:32.292Z · LW(p) · GW(p)

Text whose primary goal is conveying information (as opposed to emotion, experience or aesthetics) should be skimming-friendly. Time is expensive, words are cheap. Skimming is a vital mode of engaging with text, either to evaluate whether it deserves a deeper read or to extract just the information you need. As a reader, you should nurture your skimming skills. As a writer, you should treat skimmers as a legitimate and important part of your target audience. Among other things, this means:

  • Good title and TLDR/abstract
  • Clear and useful division into sections
  • Putting the high-level picture and conclusions first, the technicalities and detailed arguments later. Never leave the reader clueless about where you’re going with something for a long time.
  • Visually emphasize the central points and make them as self-contained as possible. For example, in the statement of mathematical theorems avoid terminology whose definition is hidden somewhere in the bulk of the text.
Replies from: romeostevensit
comment by romeostevensit · 2021-03-20T01:21:50.744Z · LW(p) · GW(p)

Stronger: as a writer you should assume your modal reader is a skimmer, both because they are, and because even non-skimmers are only going to remember about the same number of things that a good skimmer does.

comment by Vanessa Kosoy (vanessa-kosoy) · 2021-02-19T18:23:19.455Z · LW(p) · GW(p)

I propose to call metacosmology the hypothetical field of study which would be concerned with the following questions:

  • Studying the space of simple mathematical laws which produce counterfactual universes with intelligent life.
  • Studying the distribution over utility-function-space (and, more generally, mindspace) of those counterfactual minds.
  • Studying the distribution of the amount of resources available to the counterfactual civilizations, and broad features of their development trajectories.
  • Using all of the above to produce a distribution over concretized simulation hypotheses.

This concept is of potential interest for several reasons:

  • It can be beneficial to actually research metacosmology, in order to draw practical conclusions. However, knowledge of metacosmology can pose an infohazard, and we would need to precommit not to accept blackmail from potential simulators.
  • The metacosmology knowledge of a superintelligent AI determines the extent to which it poses risk via the influence of potential simulators.
  • In principle, we might be able to use knowledge of metacosmology in order to engineer an "atheist prior" for the AI that would exclude simulation hypotheses. However, this might be very difficult in practice.
Replies from: alexander-gietelink-oldenziel
comment by Vanessa Kosoy (vanessa-kosoy) · 2020-09-26T15:28:53.199Z · LW(p) · GW(p)

An AI progress scenario which seems possible and which I haven't seen discussed: an imitation plateau.

The key observation is, imitation learning algorithms[1] might produce close-to-human-level intelligence even if they are missing important ingredients of general intelligence that humans have. That's because imitation might be a qualitatively easier task than general RL. For example, given enough computing power, a human mind becomes realizable from the perspective of the learning algorithm, while the world-at-large is still far from realizable. So, an algorithm that only performs well in the realizable setting can learn to imitate a human mind, and thereby indirectly produce reasoning that works in non-realizable settings as well. Of course, literally emulating a human brain is still computationally formidable, but there might be middle scenarios where the learning algorithm is able to produce a good-enough-in-practice imitation of systems that are not too complex.

This opens the possibility that close-to-human-level AI will arrive while we're still missing key algorithmic insights to produce general intelligence directly. Such AI would not be easily scalable to superhuman. Nevertheless, some superhuman performance might be produced by sped-up simulation, reducing noise in human behavior and controlling the initial conditions (e.g. simulating a human on a good day). As a result, we will have some period of time during which AGI is already here, automation is in full swing, but there's little or no further escalation. At the end of this period, the missing ingredients will be assembled (maybe with the help of AI researchers) and superhuman AI (possibly a fast takeoff) begins.

It's interesting to try and work out the consequences of such a scenario, and the implications for AI strategy.


  1. Such as GPT-n ↩︎

Replies from: Vladimir_Nesov, avturchin
comment by Vladimir_Nesov · 2020-09-27T09:13:29.894Z · LW(p) · GW(p)

This seems similar to gaining uploads prior to AGI, and opens up all those superorg upload-city amplification/distillation constructions which should get past human level shortly after. In other words, the limitations of the dataset can be solved by amplification as soon as the AIs are good enough to be used as building blocks for meaningful amplification, and something human-level-ish seems good enough for that. Maybe even GPT-n is good enough for that.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2020-09-27T10:30:29.646Z · LW(p) · GW(p)

That is similar to gaining uploads (borrowing terminology from Egan, we can call them "sideloads"), but it's not obvious amplification/distillation will work. In the model based on realizability, the distillation step can fail because the system you're distilling is too computationally complex (hence, too unrealizable). You can deal with it by upscaling the compute of the learning algorithm, but that's not better than plain speedup.

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2020-09-27T11:23:13.138Z · LW(p) · GW(p)

To me this seems to be essentially another limitation of the human Internet archive dataset: reasoning is presented in an opaque way (most slow/deliberative thoughts are not in the dataset), so it's necessary to do a lot of guesswork to figure out how it works. A better dataset both explains and summarizes the reasoning (not to mention gets rid of the incoherent nonsense, but even GPT-3 can do that to an extent by roleplaying Feynman).

Any algorithm can be represented by a habit of thought (Turing machine style if you must), and if those are in the dataset, they can be learned. The habits of thought that are simple enough to summarize get summarized and end up requiring fewer steps. My guess is that the human faculties needed for AGI can be both represented by sequences of thoughts (probably just text, stream of consciousness style) and easily learned with current ML. So right now the main obstruction is that it's not feasible to build a dataset with those faculties represented explicitly that's good enough and large enough for current sample-inefficient ML to grok. More compute in the learning algorithm is only relevant for this to the extent that we get a better dataset generator that can work on the tasks before it more reliably.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2020-09-27T11:46:46.882Z · LW(p) · GW(p)

I don't see any strong argument why this path will produce superintelligence. You can have a stream of thought that cannot be accelerated without investing a proportional amount of compute, while a completely different algorithm would produce a far superior "stream of thought". In particular, such an approach cannot differentiate between features of the stream of thought that are important (meaning that they advance towards the goal) and features of the stream of thought that are unimportant (e.g. different ways to phrase the same idea). This forces you to solve a task that is potentially much more difficult than just achieving the goal.

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2020-09-27T12:30:48.120Z · LW(p) · GW(p)

I was arguing that near human level babblers (including the imitation plateau you were talking about) should quickly lead to human level AGIs by amplification via stream of consciousness datasets, which doesn't pose new ML difficulties other than design of the dataset. Superintelligence follows from that by any of the same arguments as for uploads leading to AGI (much faster technological progress; if amplification/distillation of uploads is useful straight away, we get there faster, but it's not necessary). And amplified babblers should be stronger than vanilla uploads (at least implausibly well-educated, well-coordinated, high IQ humans).

For your scenario to be stable, it needs to be impossible (in the near term) to run the AGIs (amplified babblers) faster than humans, and for the AGIs to remain less effective than very high IQ humans. Otherwise you get acceleration of technological progress, including ML. So my point is that feasibility of imitation plateau depends on absence of compute overhang, not on ML failing to capture some of the ingredients of human general intelligence.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2020-09-27T13:09:15.277Z · LW(p) · GW(p)

The imitation plateau can definitely be rather short. I also agree that computational overhang is the major factor here. However, a failure to capture some of the ingredients can be a cause of low computational overhang, whereas success in capturing all of the ingredients is a cause of high computational overhang, because the compute necessary to reach superintelligence might be very different in those two cases. Using sideloads to accelerate progress might still require years, whereas an "intrinsic" AGI might lead to the classical "foom" scenario.

EDIT: Although, since training is typically much more computationally expensive than deployment, it is likely that the first human-level imitators will already be significantly sped-up compared to humans, implying that accelerating progress will be relatively easy. It might still take some time from the first prototype until such an accelerate-the-progress project, but probably not much longer than deploying lots of automation.

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2020-09-27T14:13:34.873Z · LW(p) · GW(p)

I agree. But GPT-3 seems to me like a good estimate for how much compute it takes to run stream of consciousness imitation learning sideloads (assuming that learning is done in batches on datasets carefully prepared by non-learning sideloads, so the cost of learning is less important). And with that estimate we already have enough compute overhang to accelerate technological progress as soon as the first amplified babbler AGIs are developed, which, as I argued above, should happen shortly after babblers actually useful for automation of human jobs are developed (because generation of stream of consciousness datasets is a special case of such a job).

So the key things to make imitation plateau last for years are either sideloads requiring more compute than it looks like (to me) they require, or amplification of competent babblers into similarly competent AGIs being a hard problem that takes a long time to solve.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2020-09-27T16:16:12.611Z · LW(p) · GW(p)

Another thing that might happen is a data bottleneck.

Maybe there will be a good enough dataset to produce a sideload that simulates an "average" person, and that will be enough to automate many jobs, but for a simulation of a competent AI researcher you would need a more specialized dataset that will take more time to produce (since there are far fewer competent AI researchers than people in general).

Moreover, it might be that the sample complexity grows with the duration of coherent thought that you require. That's because, unless you're training directly on brain inputs/outputs, non-realizable (computationally complex) environment influences contaminate the data, and in order to converge you need to have enough data to average them out, which scales with the length of your "episodes". Indeed, all convergence results for Bayesian algorithms we have in the non-realizable setting require ergodicity, and therefore the time of convergence (= sample complexity) scales with mixing time, which in our case is determined by episode length.

In such a case, we might discover that many tasks can be automated by sideloads with short coherence time, but AI research might require substantially longer coherence times. And, simulating progress requires by design going off-distribution along certain dimensions which might make things worse.

comment by avturchin · 2020-09-26T17:14:48.960Z · LW(p) · GW(p)

Another way to describe the same (or similar) plateau: we could think about GPT-n as a GLUT with approximation between prerecorded answers: it can produce intelligent products similar to the ones which were created by humans in the past and are presented in its training dataset – but not above the human intelligence level, as there are no superintelligent examples in the dataset.

comment by Vanessa Kosoy (vanessa-kosoy) · 2021-03-17T14:59:27.268Z · LW(p) · GW(p)

I propose a new formal desideratum for alignment: the Hippocratic principle. Informally the principle says: an AI shouldn't make things worse compared to letting the user handle them on their own, in expectation w.r.t. the user's beliefs. This is similar to the dangerousness bound [AF(p) · GW(p)] I talked about before, and is also related to corrigibility. This principle can be motivated as follows. Suppose your options are (i) run a Hippocratic AI you already have and (ii) continue thinking about other AI designs. Then, by the principle itself, (i) is at least as good as (ii) (from your subjective perspective).

More formally, we consider a (some extension of) delegative IRL [LW · GW] setting (i.e. there is a single set of input/output channels the control of which can be toggled between the user and the AI by the AI). Let $\pi^u_\upsilon$ be the user's policy in universe $\upsilon$ and $\pi^a$ the AI policy. Let $T$ be some event that designates when we measure the outcome / terminate the experiment, which is supposed to happen with probability $1$ for any policy. Let $V_\upsilon$ be the value of a state from the user's subjective POV, in universe $\upsilon$. Let $\mu_\upsilon$ be the environment in universe $\upsilon$. Finally, let $\zeta$ be the AI's prior over universes and $\epsilon$ some sufficiently small bound. We require

$$\forall T: \mathrm{E}_{\upsilon \sim \zeta}\left[\left(\mathrm{E}^{\pi^u_\upsilon}_{\mu_\upsilon}[V_\upsilon(T)] - \mathrm{E}^{\pi^a}_{\mu_\upsilon}[V_\upsilon(T)]\right)_+\right] \le \epsilon$$

Here, $V_\upsilon(T)$ designates the value after event $T$ happens, and $(x)_+$ is defined to be $0$ for $x < 0$ and $x$ otherwise.
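
As a rough numerical illustration of the requirement above (a toy sketch, not part of the original proposal): assume a finite set of universes, each with a prior weight and with the expected value of the outcome under the user's policy and under the AI's policy already computed. The names `Universe`, `hippocratic_regret`, `is_hippocratic` and all numbers are hypothetical. The thing to notice is that the positive part is taken inside the expectation over universes, so a gain in one universe cannot offset a loss in another.

```python
from dataclasses import dataclass


@dataclass
class Universe:
    prior: float       # AI's prior weight on this universe (weights sum to 1)
    value_user: float  # expected value of the outcome if the user acts alone
    value_ai: float    # expected value of the outcome under the AI policy


def hippocratic_regret(universes):
    """Prior-weighted average of the positive part of (user value - AI value)."""
    return sum(u.prior * max(0.0, u.value_user - u.value_ai) for u in universes)


def is_hippocratic(universes, epsilon):
    """True if the AI makes things worse by at most epsilon in expectation."""
    return hippocratic_regret(universes) <= epsilon


# Hypothetical numbers: the AI helps in the first universe and hurts in the second.
toy = [Universe(0.5, 0.6, 0.9), Universe(0.5, 0.4, 0.3)]
regret = hippocratic_regret(toy)
```

In this toy case the AI is better on average (0.6 versus 0.5), yet the regret is about 0.05, so the policy passes for a bound of 0.1 and fails for a bound of 0.01.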

I now outline an example of a Hippocratic protocol, which I call Hippocratic timeline-driven learning (HTDL).

Suppose, for simplicity, that the goal is accomplishing some binary task (i.e. we can either succeed or fail). We can imagine the task to be something like "construct a defense system against malign AI". At any given moment, the user reports their subjective timeline: i.e. for any $t$, the probabilities of success and failure after time $t$, assuming the user will continue working on the task by themselves. We assume the AI can learn a good model of the user's policy and the user's nominal report, but this nominal report might deviate from the true timeline because the system can be corrupted (i.e. the user manipulated to enter false information or the input channel hacked). Moreover, we assume that such corruption counts as failing the task.

The AI operates by maximizing the AI-expectation of the user's (unknown) utility function, subject to the Hippocratic principle as a constraint. Since we're working in a delegative setting, the AI can trivially achieve the constraint by always delegating. However, it can do much better than that, by short-term quantilizing w.r.t. the user's reported success probability (with the user's policy serving as baseline). When quantilizing the short-term policy, we can upper bound the probability of corruption via the user's reported probability of short-term failure (which we assume to be low, i.e. we assume the malign AI is not imminent). This allows the AI to find parameters under which quantilization is guaranteed to improve things in expectation. Then, after completing this short-term maneuver, the AI can do another short-term quantilization, and continue in this vein. Essentially, the AI is gradually bending the user's trajectory towards success.

Of course, we can improve the guarantees of such a protocol either by adding more information or by using stronger assumptions (e.g. assume the user is quasi-rational). But timeline-driven short-term quantilization serves as a useful feasibility proof.

Replies from: steve2152, TurnTrout, vanessa-kosoy, Charlie Steiner, adamShimi, gerald-monroe
comment by Steven Byrnes (steve2152) · 2021-08-24T15:20:23.073Z · LW(p) · GW(p)

(Update: I don't think this was 100% right, see here [LW · GW] for a better version.)

Attempted summary for morons like me: AI is trying to help the human H. They share access to a single output channel, e.g. a computer keyboard, so that the actions that H can take are exactly the same as the actions AI can take. Every step, AI can either take an action, or delegate to H to take an action. Also, every step, H reports her current assessment of the timeline / probability distribution for whether she'll succeed at the task, and if so, how soon.

At first, AI will probably delegate to H a lot, and by watching H work, AI will gradually learn both the human policy (i.e. what H tends to do in different situations), and how different actions tend to turn out in hindsight from H's own perspective (e.g., maybe whenever H takes action 17, she tends to declare shortly afterwards that probability of success now seems much higher than before—so really H should probably be taking action 17 more often!).

Presumably the AI, being a super duper fancy AI algorithm, learns to anticipate how different actions will turn out from H's perspective much better than H herself. In other words, maybe it delegates to H, and H takes action 41, and the AI is watching this and shaking its head and thinking to itself "gee you dunce you're gonna regret that", and shortly thereafter the AI is proven correct.

OK, so now what? The naive answer would be: the AI should gradually stop delegating and start just doing the thing that leads to H feeling maximally optimistic later on.

But we don't want to do that naive thing. There are two problems:

The first problem is "traps" (a.k.a. catastrophes). Let's say action 0 is Press The History Eraser Button. H never takes that action. The AI shouldn't either. What happens is: AI has no idea (wide confidence interval) about what the consequence of action 0 would be, so it doesn't take it. This is the delegative RL thing—in the explore/exploit dilemma, the AI kinda sits back and lets H do the exploring, at least when a trap might be involved.

But here we want to go further. Let's say that there's some action which H might do, where the AI can predict the consequence pretty well, and the consequence is "H will eventually feel more optimistic about success in the task". Then the AI can go do that action, right? Not necessarily! The action might still be bad, because it involved corrupting the feedback channel.

So let's say "action 789 = H will eat a pot brownie". Taking that action makes H both much more optimistic about success and much less likely to actually succeed in the task. (It makes her assessments inaccurate.)

If the AI has no idea what pot brownies do, it won't take action 789 because of the previous thing—maybe that action is a trap. But what if the AI has (indirectly) figured out, with 100% confidence, exactly what the consequence of action 789 would be?

In that case, since the AI's decision criterion is all about H's future reported odds of success, a naive algorithm would have the AI take action 789. To avoid that, we have this quantilization [? · GW] thing. How does that work?

Well, let's say that H right now assigns 99% chance of eventual success. Therefore—assuming that H is well-calibrated and knows the consequences of action 789—it follows that H's policy is to take that action with ≤1% probability. Basically, H is optimistic about success in the task, partly because H knows that she's almost definitely not going to eat that pot brownie. I mean, she might eat it, if she gets really bored and frustrated and hungry. But probably not! So she's feeling very optimistic.

OK, now the AI quantilizes by drawing 5 random samples from (its understanding of) H's policy, and taking the action that (it believes) will lead to the highest future-optimism-about-success. Assume its beliefs are accurate. Compared to just doing what H would have done, this has downsides and upsides. The downside is, since action 789 scores really well by the AI's decision criterion, the probability of action 789 is no longer ≤1%, but rather ≤5%. The upside is, if we're not taking action 789, i.e. the other ≥95% of the probability mass, we're probably doing something quite a bit better than what H would have done.

These could plausibly balance out such that the expected probability of success strictly increases, compared to what H would have done.
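
A minimal sketch of the trade-off just described, assuming best-of-5 sampling from H's policy as in the summary above. The three actions, their probabilities and their scores are hypothetical; `reported` is H's later reported optimism (what the AI selects on) and `true` is the actual chance of success, which the AI does not see.

```python
def best_of_n(policy, n):
    """Exact distribution of: draw n actions i.i.d. from `policy`, keep the one
    with the highest reported optimism (scores are assumed to be distinct)."""
    ranked = sorted(policy, key=lambda a: a["reported"])
    dist, cdf_prev = {}, 0.0
    for a in ranked:
        cdf = cdf_prev + a["p"]
        # The best of n draws is `a` iff all draws rank <= a but not all rank < a.
        dist[a["name"]] = cdf ** n - cdf_prev ** n
        cdf_prev = cdf
    return dist


def true_success(policy, dist):
    """Expected true success probability when actions follow `dist`."""
    return sum(dist[a["name"]] * a["true"] for a in policy)


# Hypothetical numbers. "brownie" stands in for action 789: it maximizes
# reported optimism but ruins the task.
H_POLICY = [
    {"name": "typical", "p": 0.80, "reported": 0.60, "true": 0.60},
    {"name": "great",   "p": 0.19, "reported": 0.95, "true": 0.95},
    {"name": "brownie", "p": 0.01, "reported": 1.00, "true": 0.00},
]

base = {a["name"]: a["p"] for a in H_POLICY}
quant = best_of_n(H_POLICY, n=5)
base_success = true_success(H_POLICY, base)
quant_success = true_success(H_POLICY, quant)
```

With these made-up numbers the brownie probability rises from 0.01 to about 0.049 (under the 5 × 0.01 bound), while expected true success rises from about 0.66 to about 0.79. Different numbers can of course tip the balance the other way.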

Please correct me if I'm misunderstanding anything.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2021-08-24T17:25:43.567Z · LW(p) · GW(p)

This is about right.

Notice that typically we use the AI for tasks which are hard for H. This means that without the AI's help, H's probability of success will usually be low. Quantilization-wise, this is a problem: the AI will be able to eliminate those paths for which H will report failure, but maybe most of the probability mass among apparent-success paths is still on failure (i.e. the success report is corrupt). This is why the timeline part is important.

On a typical task, H expects to fail eventually but they don't expect to fail soon. Therefore, the AI can safely consider policies of the form "in the short-term, do something H would do with marginal probability, in the long-term go back to H's policy". If by the end of the short-term maneuver H reports an improved prognosis, this can imply that the improvement is genuine (since the AI knows H is probably uncorrupted at this point). Moreover, it's possible that in the new prognosis H still doesn't expect to fail soon. This allows performing another maneuver of the same type. This way, the AI can iteratively steer the trajectory towards true success.

comment by TurnTrout · 2021-12-16T20:42:11.318Z · LW(p) · GW(p)

The Hippocratic principle seems similar to my concept of non-obstruction (https://www.lesswrong.com/posts/Xts5wm3akbemk4pDa/non-obstruction-a-simple-concept-motivating-corrigibility [LW · GW]), but defined relative to the human's beliefs instead of the AI's.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2022-01-06T12:49:23.926Z · LW(p) · GW(p)

Yes, there is some similarity! You could say that a Hippocratic AI needs to be continuously non-obstructive w.r.t. the set of utility functions and priors the user could plausibly have, given what the AI knows. Here, by "continuously" I mean that we are allowed to compare keeping the AI on with turning it off at any given moment.

comment by Vanessa Kosoy (vanessa-kosoy) · 2021-11-30T21:48:29.720Z · LW(p) · GW(p)

"Corrigibility" is usually defined as the property of AIs who don't resist modifications by their designers. Why would we want to perform such modifications? Mainly it's because we made errors in the initial implementation, and in particular the initial implementation is not aligned. But, this leads to a paradox: if we assume our initial implementation to be flawed in a way that destroys alignment, why wouldn't it also be flawed in a way that destroys corrigibility?

In order to stop passing the recursive buck, we must assume some dimensions along which our initial implementation is not allowed to be flawed. Therefore, corrigibility is only a well-posed notion in the context of a particular such assumption. Seen through this lens, the Hippocratic principle becomes a particular crystallization of corrigibility. Specifically, the Hippocratic principle assumes the agent has access to some reliable information about the user's policy and preferences (be it through timelines, revealed preferences or anything else).

Importantly, this information can be incomplete, which can motivate altering the agent along the way. And, the agent will not resist this alteration! Indeed, resisting the alteration is ruled out unless the AI can conclude with high confidence (and not just in expectation) that such resistance is harmless. Since we assumed the information is reliable, and the alteration is beneficial, the AI cannot reach such a conclusion.

For example, consider an HTDL agent getting upgraded to "Hippocratic CIRL" (assuming some sophisticated model of relationship between human behavior and human preferences). In order to resist the modification, the agent would need a resistance strategy that (i) doesn't deviate too much from the human baseline and (ii) ends with the user submitting a favorable report. Such a strategy is quite unlikely to exist.

Replies from: Charlie Steiner
comment by Charlie Steiner · 2021-12-15T16:06:02.214Z · LW(p) · GW(p)

if we assume our initial implementation to be flawed in a way that destroys alignment, why wouldn’t it also be flawed in a way that destroys corrigibility?

I think the people most interested in corrigibility are imagining a situation where we know what we're doing with corrigibility (e.g. we have some grab-bag of simple properties we want satisfied), but don't even know what we want from alignment, and then they imagine building an unaligned slightly-sub-human AGI and poking at it while we "figure out alignment."

Maybe this is a strawman, because the thing I'm describing doesn't make strategic sense, but I think it does have some model of why we might end up with something unaligned but corrigible (for at least a short period).

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2021-12-16T12:51:50.526Z · LW(p) · GW(p)

The concept of corrigibility was introduced by MIRI, and I don't think that's their motivation? On my model of MIRI's model, we won't have time to poke at a slightly subhuman AI, we need to have at least a fairly good notion of what to do with a superhuman AI upfront. Maybe what you meant is "we won't know how to construct perfect-utopia-AI, so we will just construct a prevent-unaligned-AIs-AI and run it so that we can figure out perfect-utopia-AI at our leisure". Which, sure, but I don't see what it has to do with corrigibility.

Corrigibility is neither necessary nor sufficient for safety. It's not strictly necessary because in theory an AI can resist modifications in some scenarios while always doing the right thing (although in practice resisting modifications is an enormous red flag), and it's not sufficient since an AI can be "corrigible" but cause catastrophic harm before someone notices and fixes it.

What we're supposed to gain from corrigibility is having some margin of error around alignment, in which case we can decompose alignment as corrigibility + approximate alignment. But it is underspecified if we don't say along which dimensions or how big the margin is. If it's infinite margin along all dimensions then corrigibility and alignment are just isomorphic and there's no reason to talk about the former.

comment by Charlie Steiner · 2021-09-19T15:58:12.340Z · LW(p) · GW(p)

Very interesting - I'm sad I saw this 6 months late.

After thinking a bit, I'm still not sure if I want this desideratum. It seems to require a sort of monotonicity, where we can get superhuman performance just by going through states that humans recognize as good, and not by going through states that humans would think are weird or scary or unevaluable.

One case where this might come up is in competitive games. Chess AI beats humans in part because it makes moves that many humans evaluate as bad, but are actually good. But maybe this example actually supports your proposal - it seems entirely plausible to make a chess engine that only makes moves that some given population of humans recognize as good, but is better than any human from that population.

On the other hand, the humans might be wrong about the reason the move is good, so that the game is made of a bunch of moves that seem good to humans, but where the humans are actually wrong about why they're good (from the human perspective, this looks like regularly having "happy surprises"). We might hope that such human misevaluations are rare enough that quantilization would lead to moves on average being well-evaluated by humans, but for chess I think that might be false! Computers are so much better than humans at chess that a very large chunk of the best moves according to both humans and the computer will be ones that humans misevaluate.

Maybe that's more a criticism of quantilizers, not a criticism of this desideratum. So maybe the chess example supports this being a good thing to want? But let me keep critiquing quantilizers then :P

If what a powerful AI thinks is best (by an exponential amount) is to turn off the stars until the universe is colder, but humans think it's scary and ban the AI from doing scary things, the AI will still try to turn off the stars in one of the edge-case ways that humans wouldn't find scary. And if we think being manipulated like that is bad and quantilize over actions to make the optimization milder, turning off the stars is still so important that a big chunk of the best moves according to both humans and the computer are going to be ones that humans misevaluate, and the computer knows will lead to a "happy surprise" of turning off the stars not being scary. Quantilization avoids policies that precisely exploit tiny features of the world, and it avoids off-distribution behavior, but it still lets the AI get what it wants if it totally outsmarts the humans.

The other thing this makes me think of is Lagrange multipliers. I bet there's a duality between applying this constraint to the optimization process, and adding a bias (I mean, a useful prior) to the AI's process for modeling .

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2021-09-20T00:16:30.180Z · LW(p) · GW(p)

When I'm deciding whether to run an AI, I should be maximizing the expectation of my utility function w.r.t. my belief state. This is just what it means to act rationally. You can then ask, how is this compatible with trusting another agent smarter than myself?

One potentially useful model is: I'm good at evaluating and bad at searching (after all, $P \ne NP$). I can therefore delegate searching to another agent. But, as you point out, this doesn't account for situations in which I seem to be bad at evaluating. Moreover, if the AI prior takes an intentional stance towards the user (in order to help learn their preferences), then the user must be regarded as good at searching.

A better model is: I'm good at both evaluating and searching, but the AI can access actions and observations that I cannot. For example, having additional information can allow it to evaluate better. An important special case is: the AI is connected to an external computer (Turing RL [AF(p) · GW(p)]) which we can think of as an "oracle". This allows the AI to have additional information which is purely "logical". We need infra-Bayesianism to formalize this: the user has Knightian uncertainty over the oracle's outputs entangled with other beliefs about the universe.

For instance, in the chess example, if I know that a move was produced by exhaustive game-tree search then I know it's a good move, even without having the skill to understand why the move is good in any more detail.

Now let's examine short-term quantilization for chess. On each cycle, the AI finds a short-term strategy leading to a position that the user evaluates as good, but that the user would require luck to manage on their own. This is repeated again and again throughout the game, leading to overall play substantially superior to the user's. On the other hand, this play is not as good as the AI would achieve if it just optimized for winning at chess without any constraints. So, our AI might not be competitive with an unconstrained unaligned AI. But, this might be good enough [LW(p) · GW(p)].

I'm not sure what you're saying in the "turning off the stars example". If the probability for the user to autonomously decide to turn off the stars is much lower than the quantilization fraction, then the probability that quantilization will decide to turn off the stars is low. And, the quantilization fraction is automatically selected like this.
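To make the bound concrete, here is a minimal sketch of a one-shot quantilizer over a finite action set. It is not the short-term, iterated scheme discussed in this thread; the action names, probabilities and utilities are all made up for illustration. It only shows the standard property that an action with baseline probability p gets probability at most p/q under q-quantilization, even when it has the highest evaluated utility.

```python
from fractions import Fraction


def quantilize(base, utility, q):
    """Return the exact q-quantilized distribution over a finite action set.

    base:    dict mapping action -> probability under the baseline (user) policy
    utility: dict mapping action -> evaluated utility
    q:       quantilization fraction, 0 < q <= 1
    """
    # Rank actions from best to worst according to the utility estimate.
    ranked = sorted(base, key=lambda a: utility[a], reverse=True)
    remaining = Fraction(q)
    result = {}
    for action in ranked:
        if remaining <= 0:
            break
        # Take as much baseline mass as still fits into the top-q slice.
        mass = min(Fraction(base[action]), remaining)
        result[action] = mass / Fraction(q)
        remaining -= mass
    return result


# Hypothetical baseline: the user almost never picks the extreme action.
base = {
    "ordinary_a": Fraction(450, 1000),
    "ordinary_b": Fraction(450, 1000),
    "good_plan": Fraction(99, 1000),
    "turn_off_stars": Fraction(1, 1000),
}
# Hypothetical evaluations: the extreme action scores highest.
utility = {"ordinary_a": 1, "ordinary_b": 2, "good_plan": 5, "turn_off_stars": 10}

q = Fraction(1, 10)
dist = quantilize(base, utility, q)
for action, prob in dist.items():
    print(action, prob)
```

With these numbers the extreme action has baseline probability 1/1000 and q = 1/10, so its probability after quantilization is capped at 1/100, with the rest of the mass going to the next-best action.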

Replies from: Charlie Steiner
comment by Charlie Steiner · 2021-09-20T15:14:06.550Z · LW(p) · GW(p)

Agree with the first section, though I would like to register my sentiment that although "good at selecting but missing logical facts" is a better model, it's still not one I'd want an AI to use when inferring my values.

I'm not sure what you're saying in the "turning off the stars example". If the probability for the user to autonomously decide to turn off the stars is much lower than the quantilization fraction, then the probability that quantilization will decide to turn off the stars is low. And, the quantilization fraction is automatically selected like this.

I think my point is if "turn off the stars" is not a primitive action, but is a set of states of the world that the AI would overwhelmingly like to go to, then the actual primitive actions will get evaluated based on how well they end up going to that goal state. And since the AI is better at evaluating than us, we're probably going there.

Another way of looking at this claim is that I'm telling a story about why the safety bound on quantilizers gets worse when quantilization is iterated. Iterated quantilization has much worse bounds than quantilizing over the iterated game, which makes sense if we think of games where the AI evaluates many actions better than the human.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2021-09-20T18:41:33.098Z · LW(p) · GW(p)

I think you misunderstood how the iterated quantilization works. It does not work by the AI setting a long-term goal and then charting a path towards that goal s.t. it doesn't deviate too much from the baseline over every short interval. Instead, every short-term quantilization is optimizing for the user's evaluation at the end of this short-term interval.

Replies from: Charlie Steiner
comment by Charlie Steiner · 2021-09-20T21:01:45.031Z · LW(p) · GW(p)

Ah. I indeed misunderstood, thanks :) I'd read "short-term quantilization" as quantilizing over short-term policies evaluated according to their expected utility. My story doesn't make sense if the AI is only trying to push up the reported value estimates (though that puts a lot of weight on these estimates).

comment by adamShimi · 2021-03-27T20:30:25.490Z · LW(p) · GW(p)

However, it can do much better than that, by short-term quantilizing w.r.t. the user's reported success probability (with the user's policy serving as baseline). When quantilizing the short-term policy, we can upper bound the probability of corruption via the user's reported probability of short-term failure (which we assume to be low, i.e. we assume the malign AI is not imminent). This allows the AI to find parameters under which quantilization is guaranteed to improve things in expectation.

I don't understand what you mean here by quantilizing. The meaning I know is to take a random action over the top $\alpha$ actions, on a given base distribution. But I don't see a distribution here, or even a clear ordering over actions (given that we don't have access to the utility function).

I'm probably missing something obvious, but more details would really help.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2021-03-29T16:48:16.385Z · LW(p) · GW(p)

The distribution is the user's policy, and the utility function for this purpose is the eventual success probability estimated by the user (as part of the timeline report), at the end of the "maneuver". More precisely, the original quantilization formalism was for the one-shot setting, but you can easily generalize it, for example I did it [AF · GW] for MDPs.

Replies from: adamShimi
comment by adamShimi · 2021-03-29T18:42:30.460Z · LW(p) · GW(p)

Oh, right, that makes a lot of sense.

So is the general idea that we quantilize such that we're choosing in expectation an action that doesn't have corrupted utility (by intuitively having something like more than twice as many actions in the quantilization than we expect to be corrupted), so that we guarantee the probability of following the manipulation of the learned user report is small?

I also wonder if using the user policy to sample actions isn't limiting, because then we can only take actions that the user would take. Or do you assume by default that the support of the user policy is the full action space, so every action is possible for the AI?

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2021-03-29T19:43:21.771Z · LW(p) · GW(p)

So is the general idea that we quantilize such that we're choosing in expectation an action that doesn't have corrupted utility (by intuitively having something like more than twice as many actions in the quantilization than we expect to be corrupted), so that we guarantee the probability of following the manipulation of the learned user report is small?

Yes, although you probably want much more than twice. Basically, if the probability of corruption following the user policy is $\epsilon$ and your quantilization fraction is $\phi$ then the AI's probability of corruption is bounded by $\frac{\epsilon}{\phi}$.

I also wonder if using the user policy to sample actions isn't limiting, because then we can only take actions that the user would take. Or do you assume by default that the support of the user policy is the full action space, so every action is possible for the AI?

Obviously it is limiting, but this is the price of safety. Notice, however, that the quantilization strategy is only an existence proof. In principle, there might be better strategies, depending on the prior (for example, the AI might be able to exploit an assumption that the user is quasi-rational). I didn't specify the AI by quantilization, I specified it by maximizing EU subject to the Hippocratic constraint. Also, the support is not really the important part: even if the support is the full action space, some sequences of actions are possible but so unlikely that the quantilization will never follow them.

comment by Gerald Monroe (gerald-monroe) · 2021-03-17T17:15:33.299Z · LW(p) · GW(p)

I like this because it's simple and obviously correct.  Also I can see at least one way you could implement it:

   a.  Suppose the AI is 'shadowing' a human worker doing a critical task.  Say it is 'shadowing' a human physician. 

   b.  Each time the AI observes the same patient, it regresses between [data from the patient] and [predicted decision a 'good' physician would make, predicted outcome for the 'good' decision].  Once the physician makes a decision and communicates it, the AI regresses between [decision the physician made] and [predicted outcome for that decision].

   c.  The machine also must have a confidence or this won't work.

With large numbers and outright errors made by the physician, it's then possible to detect all the cases where the [decision the physician made] has a substantially worse outcome than the [predicted decision a 'good' physician would make]. When the AI has a high confidence of this [requiring many observations of similar situations], it's time to call for a second opinion.

In the long run, of course, there will be a point where the [predicted decision a 'good' physician would make] is better than the [information gain from a second human opinion] and you really would do best by firing the physician and having the AI make the decisions from then on, trusting it to call for a second opinion when it is not confident.

 (as an example, AlphaGo Zero likely doesn't benefit from asking another master Go player for a 'second opinion' when it sees the player it is advising make a bad call)
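A rough sketch of the trigger rule described in steps a to c, assuming the model already produces outcome predictions and a confidence score. The `Assessment` fields, the function name and both thresholds are hypothetical placeholders, not part of any real system.

```python
from dataclasses import dataclass


@dataclass
class Assessment:
    predicted_outcome_good: float    # predicted outcome of the 'good' physician's decision
    predicted_outcome_actual: float  # predicted outcome of the decision actually made
    confidence: float                # model confidence in these predictions, in [0, 1]


def needs_second_opinion(a, min_gap=0.2, min_confidence=0.9):
    """Flag a case only when the actual decision looks substantially worse
    AND the model is confident (step c: without confidence this won't work)."""
    gap = a.predicted_outcome_good - a.predicted_outcome_actual
    return gap >= min_gap and a.confidence >= min_confidence


# Large predicted gap, high confidence: call for a second opinion.
flagged = needs_second_opinion(Assessment(0.9, 0.5, 0.95))
# Same gap, but few similar observations so far: stay silent.
unsure = needs_second_opinion(Assessment(0.9, 0.5, 0.6))
# Physician's decision is about as good as the predicted one: stay silent.
agrees = needs_second_opinion(Assessment(0.9, 0.88, 0.99))
print(flagged, unsure, agrees)
```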

comment by Vanessa Kosoy (vanessa-kosoy) · 2020-05-06T19:34:30.109Z · LW(p) · GW(p)

This idea was inspired by a correspondence with Adam Shimi.

It seems very interesting and important to understand to what extent a purely "behaviorist" view on goal-directed intelligence is viable. That is, given a certain behavior (policy), is it possible to tell whether the behavior is goal-directed and what are its goals, without any additional information?

Consider a general reinforcement learning setting: we have a set of actions , a set of observations , a policy is a mapping , a reward function is a mapping , the utility function is a time discounted sum of rewards. (Alternatively, we could use instrumental reward functions [AF · GW].)

The simplest attempt at defining "goal-directed intelligence" is requiring that the policy in question is optimal for some prior and utility function. However, this condition is vacuous: the reward function can artificially reward only behavior that follows , or the prior can believe that behavior not according to leads to some terrible outcome.

The next natural attempt is bounding the description complexity of the prior and reward function, in order to avoid priors and reward functions that are "contrived". However, description complexity is only naturally well-defined up to an additive constant. So, if we want to have a crisp concept, we need to consider an asymptotic in which the complexity of something goes to infinity. Indeed, it seems natural to ask that the complexity of the policy should be much higher than the complexity of the prior and the reward function: in this case we can say that the "intentional stance" is an efficient description. However, this doesn't make sense with description complexity: the description "optimal policy for and " is of size ( stands for "description complexity of ").

To salvage this idea, we need to take not only description complexity but also computational complexity into account. [EDIT: I was wrong, and we can get a well-defined concept in the unbounded setting too, see child comment [LW(p) · GW(p)]. The bounded concept is still interesting.] For the intentional stance to be non-vacuous we need to demand that the policy does some "hard work" in order to be optimal. Let's make it formal. Consider any function of the type where and are some finite alphabets. Then, we can try to represent it by a probabilistic automaton , where is the finite state space, is the transition kernel, and we're feeding symbols into the automaton one by one. Moreover, can be represented as a boolean circuit and this circuit can be the output of some program executed by some fixed universal Turing machine. We can associate with this object 5 complexity parameters:

  • The description complexity, which is the length of .
  • The computation time complexity, which is the size of .
  • The computation space complexity, which is the maximum between the depth of and .
  • The precomputation time complexity, which is the time it takes to run.
  • The precomputation space complexity, which is the space needs to run.

It is then natural to form a single complexity measure by applying a logarithm to the times and taking a linear combination of all 5 (we apply a logarithm so that a brute force search over bits is roughly equivalent to hard-coding bits). The coefficients in this combination represent the "prices" of the various resources (but we should probably fix the price of description complexity to be 1). Of course not all coefficients must be non-vanishing, it's just that I prefer to keep maximal generality for now. We will denote this complexity measure .
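A minimal sketch of such a combined measure, under my reading that the logarithm is applied only to the two time parameters and everything else enters linearly. The class, field and function names are made up for illustration, and the prices default to 1. The example checks the stated motivation: hard-coding 10 bits costs about the same as a brute-force precomputation over 10 bits.

```python
import math
from dataclasses import dataclass


@dataclass
class Resources:
    description_length: int    # length of the program, in bits
    computation_time: int      # size of the circuit
    computation_space: int     # space used when running the automaton
    precomputation_time: int   # time the program takes to output the circuit
    precomputation_space: int  # space the program needs to run


def combined_complexity(r, prices=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """Linear combination of the five parameters, with log2 applied to the times."""
    terms = (
        r.description_length,
        math.log2(r.computation_time),
        r.computation_space,
        math.log2(r.precomputation_time),
        r.precomputation_space,
    )
    return sum(price * term for price, term in zip(prices, terms))


# Hard-code 10 bits directly into the description.
hard_coded = Resources(10, 1, 0, 1, 0)
# Or find the same 10 bits by brute-force search taking 2**10 steps.
brute_force = Resources(0, 1, 0, 2 ** 10, 0)

print(combined_complexity(hard_coded), combined_complexity(brute_force))
```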

We can use such automata to represent policies, finite POMDP environments and reward functions (ofc not any policy or reward function, but any that can be computed on a machine with finite space). In the case of policies, the computation time/space complexity can be regarded as the time/space cost of applying the "trained" algorithm, whereas the precomputation time/space complexity can be regarded as the time/space cost of training. If we wish, we can also think of the boolean circuit as a recurrent neural network.

We can also use to define a prior , by ranging over programs that output a valid POMDP and assigning probability proportional to to each instance. (Assuming that the environment has a finite state space might seem restrictive, but becomes quite reasonable if we use a quasi-Bayesian setting with quasi-POMDPs that are not meant to be complete descriptions of the environment; for now we won't go into details about this.)

Now, return to our policy . Given , we define that " has goal-directed intelligence (at least) " when there is a suitable prior and utility function s.t. for any policy , if then . When (i.e. no finite automaton can match the expected utility of ; in particular, this implies is optimal since any policy can be approximated by a finite automaton), we say that is "perfectly goal-directed". Here, serves as a way to measure the complexity of , which also ensures is non-dogmatic in some rather strong sense.

[EDIT: if we fix and then is essentially the same as Yudkowsky's definition of optimization power [AF · GW] if we regard the policy as the "outcome" and use as our measure on the space of outcomes.]

With this definition we cannot "cheat" by encoding the policy into the prior or into the utility function, since that would allow no complexity difference. Therefore this notion seems like a non-trivial requirement on the policy. On the other hand, this requirement does hold sometimes, because solving the optimization problem can be much more computationally costly than just evaluating the utility function or sampling the prior.

Replies from: AIL, vanessa-kosoy, capybaralet
comment by AIL · 2023-02-24T18:58:55.883Z · LW(p) · GW(p)

I am not sure I understand your use of  in the third from last paragraph where you define goal directed intelligence. As you define  it is a complexity measure over programs . I assume this was a typo and you mean ? Or am I misunderstanding the definition of either  or ?

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2023-02-25T06:45:27.466Z · LW(p) · GW(p)

This is not a typo.

I'm imagining that we have a program that outputs (i) a time discount parameter , (ii) a circuit for the transition kernel of an automaton and (iii) a circuit for a reward function (and, ii+iii are allowed to have a shared component to save computation time complexity). The utility function is defined by

where is defined recursively by

Replies from: AIL
comment by AIL · 2023-03-21T21:39:57.232Z · LW(p) · GW(p)

Okay, I think this makes sense. The idea is trying to re-interpret the various functions in the utility function as a single function and asking about the notion of complexity on that function which combines the complexity of producing a circuit which computes that function and the complexity of the circuit itself.

But just to check: is T over  ? I thought T in utility functions only depended on states and actions 

Maybe I am confused by what you mean by . I thought it was the state space, but that isn't consistent with  in your post which was defined over ? As a follow up: defining r as depending on actions and observations instead of actions and states (which e.g. the definition in POMDP on Wikipedia) seems like it changes things.  So I'm not sure if you intended the rewards to correspond with the observations or 'underlying' states. 

One more question, this one about the priors: what are they a prior over exactly? I will use the letters/terms from https://en.wikipedia.org/wiki/Partially_observable_Markov_decision_process to try to be explicit. Is the prior capturing the "set of conditional observation probabilities" (O on Wikipedia)? Or is it capturing the "set of conditional transition probabilities between states" (T on Wikipedia)? Or is it capturing a distribution over all possible T and O? Or are you imagining that T is defined with U (and is non-random) and O is defined within the prior?
I ask because the term  will be positive infinity if  is zero for any value where  is non-zero. Which makes the interpretation that it is either O or T directly pretty strange (for example, in the case where there are two states  and  and two observations  and  an O where  and  if  would have a KL divergence of infinity from the  if  had non-zero probability on ). So, I assume this is a prior over what the conditional observation matrices might be. I am assuming that your comment above implies that T is defined in the utility function U instead, and is deterministic?

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2023-03-22T12:42:57.000Z · LW(p) · GW(p)

Maybe I am confused by what you mean by . I thought it was the state space, but that isn't consistent with  in your post which was defined over ?

I'm not entirely sure what you mean by the state space.  is a state space associated specifically with the utility function. It has nothing to do with the state space of the environment. The reward function in the OP is , not . I slightly abused notation by defining  in the parent comment. Let's say it's  and  is defined by using  to translate the history to the (last) state and then applying .

One more question, this one about the priors: what are they a prior over exactly? ...I ask because the term  will be positive infinity if  is zero for any value where  is non-zero.

The prior is just an environment i.e. a partial mapping  defined on every history to which it doesn't itself assign probability . The expression  means that we consider all possible ways to choose a Polish space probability distributions  and a mapping  s.t.  and  (where the expected value is defined using the Bayes law and not pointwise, see also the definition of "instrumental states" here [LW · GW]), and take the minimum over all of them of .

comment by Vanessa Kosoy (vanessa-kosoy) · 2020-05-09T10:20:48.714Z · LW(p) · GW(p)

Actually, as opposed to what I claimed before, we don't need computational complexity bounds for this definition to make sense. This is because the Solomonoff prior is made of computable hypotheses but is uncomputable itself.

Given , we define that " has (unbounded) goal-directed intelligence (at least) " when there is a prior and utility function s.t. for any policy , if then . Here, is the Solomonoff prior and is Kolmogorov complexity. When (i.e. no computable policy can match the expected utility of ; in particular, this implies is optimal since any policy can be approximated by a computable policy), we say that is "perfectly (unbounded) goal-directed".

Compare this notion to the Legg-Hutter intelligence measure. The LH measure depends on the choice of UTM in radical ways. In fact, for some UTMs, AIXI (which is the maximum of the LH measure) becomes computable or even really stupid. For example, it can always keep taking the same action because of the fear that taking any other action leads to an inescapable "hell" state. On the other hand, goal-directed intelligence differs only by $O(1)$ between UTMs, just like Kolmogorov complexity. A perfectly unbounded goal-directed policy has to be uncomputable, and the notion of which policies are such doesn't depend on the UTM at all.

I think that it's also possible to prove that intelligence is rare, in the sense that, for any computable stochastic policy, if we regard it as a probability measure over deterministic policies, then for any there is s.t. the probability to get intelligence at least is smaller than .

Also interesting is that, for bounded goal-directed intelligence, increasing the prices can only decrease intelligence by , and a policy that is perfectly goal-directed w.r.t. lower prices is also such w.r.t. higher prices (I think). In particular, a perfectly unbounded goal-directed policy is perfectly goal-directed for any price vector. Informally speaking, an agent that is very smart relative to a context with cheap computational resources is still very smart relative to a context where they are expensive, which makes intuitive sense.

If we choose just one computational resource, we can speak of the minimal price for which a given policy is perfectly goal-directed, which is another way to measure intelligence with a more restricted domain. Curiously, our bounded Solomonoff-like prior has the shape of a Maxwell-Boltzmann distribution in which the prices are thermodynamic parameters. Perhaps we can regard the minimal price as the point of a phase transition.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2020-09-21T14:15:25.741Z · LW(p) · GW(p)

Some problems to work on regarding goal-directed intelligence. Conjecture 5 is especially important for deconfusing basic questions in alignment, as it stands in opposition to Stuart Armstrong's thesis about the impossibility of deducing preferences from behavior alone.

  1. Conjecture. Informally: It is unlikely to produce intelligence by chance. Formally: Denote the space of deterministic policies, and consider some . Suppose is equivalent to a stochastic policy . Then, .

  2. Find an "intelligence hierarchy theorem". That is, find an increasing sequence s.t. for every , there is a policy with goal-directed intelligence in (no more and no less).

  3. What is the computational complexity of evaluating given (i) oracle access to the policy or (ii) description of the policy as a program or automaton?

  4. What is the computational complexity of producing a policy with given ?

  5. Conjecture. Informally: Intelligent agents have well defined priors and utility functions. Formally: For every with and , and every , there exists s.t. for every policy with intelligence at least w.r.t. , and every s.t. has intelligence at least w.r.t. them, any optimal policies for and respectively satisfy .

Replies from: Davidmanheim
comment by Davidmanheim · 2021-01-05T07:58:53.365Z · LW(p) · GW(p)

re: #5, that doesn't seem to claim that we can infer U given their actions, which is what the impossibility of deducing preferences is actually claiming. That is, assuming 5, we still cannot show that there isn't some  such that .

(And as pointed out elsewhere, it isn't Stuart's thesis, it's a well known and basic result in the decision theory / economics / philosophy literature.)

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2021-01-11T16:44:01.117Z · LW(p) · GW(p)

re: #5, that doesn't seem to claim that we can infer U given their actions, which is what the impossibility of deducing preferences is actually claiming.

You misunderstand the intent. We're talking about inverse reinforcement learning. The goal is not necessarily inferring the unknown , but producing some behavior that optimizes the unknown . Ofc if the policy you're observing is optimal then it's trivial to do so by following the same policy. But, using my approach we might be able to extend it into results like "the policy you're observing is optimal w.r.t. certain computational complexity, and your goal is to produce an optimal policy w.r.t. higher computational complexity."

(Btw I think the formal statement I gave for 5 is false, but there might be an alternative version that works.)

(And as pointed out elsewhere, it isn't Stuart's thesis, it's a well known and basic result in the decision theory / economics / philosophy literature.)

I am referring to this and related work by Armstrong.

comment by David Scott Krueger (formerly: capybaralet) (capybaralet) · 2023-02-02T12:35:38.602Z · LW(p) · GW(p)

Apologies, I didn't take the time to understand all of this yet, but I have a basic question you might have an answer to...

We know how to map (deterministic) policies to reward functions using the construction at the bottom of page 6 of the reward modelling agenda (https://arxiv.org/abs/1811.07871v1): the agent is rewarded only if it has so far done exactly what the policy would do.  I think of this as a wrapper function (https://en.wikipedia.org/wiki/Wrapper_function).

It seems like this means that, for any policy, we can represent it as optimizing reward with only the minimal overhead in description/computational complexity of the wrapper.
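For concreteness, a small sketch of the wrapper construction as I understand it: the reward function is the policy itself plus a constant-size check. The history encoding (a sequence of (action, observation) pairs) and the toy policy are assumptions made for illustration.

```python
def wrapper_reward(policy):
    """Turn a deterministic policy into a reward function that pays 1.0
    only while the agent has so far done exactly what the policy would do.

    policy: maps a tuple of past (action, observation) pairs to the next action.
    """
    def reward(history):
        prefix = []
        for action, observation in history:
            # Compare the action taken with what the policy prescribes here.
            if policy(tuple(prefix)) != action:
                return 0.0
            prefix.append((action, observation))
        return 1.0

    return reward


def alternating_policy(history):
    """Toy deterministic policy: alternate between actions 'a' and 'b'."""
    return "a" if len(history) % 2 == 0 else "b"


reward = wrapper_reward(alternating_policy)
print(reward([("a", 0), ("b", 1), ("a", 0)]))  # followed the policy throughout
print(reward([("a", 0), ("a", 1)]))            # deviated at the second step
```

Note that `reward` contains `alternating_policy` verbatim, so describing the reward is no cheaper than describing the policy.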

So...

  • Do you think this analysis is correct?  Or what is it missing?  (maybe the assumption that the policy is deterministic is significant?  This turns out to be the case for Orseau et al.'s "Agents and Devices" approach, I think https://arxiv.org/abs/1805.12387).
  • Are you trying to get around this somehow?  Or are you fine with this minimal overhead being used to distinguish goal-directed from non-goal directed policies?
Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2023-02-02T14:14:35.631Z · LW(p) · GW(p)

My framework discards such contrived reward functions because it penalizes for the complexity of the reward function. In the construction you describe, we have . This corresponds to (no/low intelligence). On the other hand, policies with (high intelligence) have the property that for the which "justifies" this . In other words, your "minimal" overhead is very large from my point of view: to be acceptable, the "overhead" should be substantially negative.

Replies from: capybaralet
comment by David Scott Krueger (formerly: capybaralet) (capybaralet) · 2023-02-05T12:01:16.830Z · LW(p) · GW(p)

I think the construction gives us $C(\pi) \leq C(U) + e$ for a small constant $e$ (representing the wrapper).  It seems like any compression you can apply to the reward function can be translated to the policy via the wrapper.  So then you would never have $C(\pi) >> C(U)$.  What am I missing/misunderstanding?

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2023-02-06T16:20:20.021Z · LW(p) · GW(p)

For the contrived reward function you suggested, we would never have . But for other reward functions, it is possible that . Which is exactly why this framework rejects the contrived reward function in favor of those other reward functions. And also why this framework considers some policies unintelligent (despite the availability of the contrived reward function) and other policies intelligent.

comment by Vanessa Kosoy (vanessa-kosoy) · 2023-09-12T09:13:05.339Z · LW(p) · GW(p)

A thought inspired by this [LW(p) · GW(p)] thread. Maybe we should have a standard template for a code of conduct for organizations, that we will encourage all rational-sphere and EA orgs to endorse. This template would include never making people sign non-disparagement agreements (and maybe also forbidding other questionable practices that surfaced in recent scandals). Organizations would be encouraged to create their own codes based on the template and commit to them publicly (and maybe even in some legally binding manner). This flexibility means we don't need a 100% consensus about what has to be in the code, but also if e.g. a particular org decides to remove a particular clause, that will be publicly visible and salient.

Replies from: Viliam
comment by Viliam · 2023-09-12T11:08:52.719Z · LW(p) · GW(p)

Codes created by organizations can simply avoid topics that are sensitive for them, or describe them in nebulous ways. You can probably imagine a code written by a bad organization that seems nice and is technically followed by the bad organization, mostly because it contains lots of applause lights but not the specific boring things.

I am in favor of having one centrally created document "this is what decent EA employment looks like". Of course it is optional for everyone. The point is to create common knowledge. Especially for young people, for whom it may be the first job ever. It's not to force everyone to follow it, but rather to show that if your employer does not follow it, then it is not normal, if you complain you are sane, and you can do better (while remaining in the EA area).

As I imagine it, most of it wouldn't even be specific for EA, but rather the common sense that inexperienced people may miss. Such as "you are entitled to a salary, paid in cash, where the specific number is written in the contract". To prevent employers from saying things like: "you know, this is how it works in for-profit companies, but EAs are different".

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2023-09-12T17:48:28.398Z · LW(p) · GW(p)

If a particular code doesn't talk about e.g. non-disparagement agreements, or talks about them in some confusing, unclear way, then people will notice. The point of having a template is precisely drawing attention to what is expected to be there (in particular when it's not there).

Also, I think we should really strive to be better than for-profit companies (see non-disparagement agreements again).

comment by Vanessa Kosoy (vanessa-kosoy) · 2022-01-21T10:30:29.015Z · LW(p) · GW(p)

Epistemic status: Leaning heavily into inside view, throwing humility to the winds.

Imagine TAI is magically not coming (CDT-style counterfactual[1]). Then, the most notable-in-hindsight feature of modern times might be the budding of mathematical metaphysics (Solomonoff induction, AIXI, Yudkowsky's "computationalist metaphilosophy"[2], UDT, infra-Bayesianism...) Perhaps, this will lead to an "epistemic revolution" comparable only with the scientific revolution in magnitude. It will revolutionize our understanding of the scientific method (probably solving the interpretation of quantum mechanics[3], maybe quantum gravity, maybe boosting the soft sciences). It will solve a whole range of philosophical questions, some of which humanity was struggling with for centuries (free will, metaethics, consciousness, anthropics...)

But, the philosophical implications of the previous epistemic revolution were not so comforting (atheism, materialism, the cosmic insignificance of human life)[4]. Similarly, the revelations of this revolution might be terrifying[5]. In this case, it remains to be seen which will seem justified in hindsight: the Litany of Gendlin, or the Lovecraftian notion that some knowledge is best left alone (and I say this as someone fully committed to keep digging into this mine of Khazad-dum).

Of course, in the real world, TAI is coming.


  1. The EDT-style counterfactual "TAI is not coming" would imply that a lot of my thinking on related topics is wrong which would yield different conclusions. The IB-style counterfactual (conjunction of infradistributions) would probably be some combination of the above with "Nirvana" (contradiction) and "what if I tried my hardest to prevent TAI from coming", which is also not my intent here. ↩︎

  2. I mean the idea that philosophical questions can be attacked by reframing them as computer science questions ("how an algorithm feels from inside" et cetera). The name "computationalist metaphilosophy" is my own, not Yudkowsky's. ↩︎

  3. No, I don't think MWI is the right answer. ↩︎

  4. I'm not implying that learning these implications was harmful. Religion is comforting for some but terrifying and/or oppressive for others. ↩︎

  5. I have concrete reasons to suspect this, that I will not go into (suspect = assign low but non-negligible probability). ↩︎

Replies from: crabman, MackGopherSena
comment by philip_b (crabman) · 2022-02-01T23:50:33.527Z · LW(p) · GW(p)

I'm not sure what you mean by CDT- and EDT-style counterfactuals. I have some guesses but please clarify. I think EDT-style counterfactual means, assuming I am a Bayesian reasoner, just conditioning on the event "TAI won't come", so it's thinking about the distribution P(O | TAI won't come).

One could think that the CDT-counterfactual you're considering means thinking about the distribution P(O | do(TAI doesn't come)) where do is the do operator from Judea Pearl's do calculus for causality. In simple words, this means that we consider the world just like ours but whenever someone tries to launch a TAI, god's intervention (that doesn't make sense together with everything we know about physics) prevents it from working. But I think this is not what you mean.

My best guess of what counterfactual you mean is as follows. Among all possible sets of laws of physics (or, alternatively, Turing machines whose running leads to the existence of physical realities), you guess that there exists a set of laws that produces a physical reality in which there will appear a civilization approximately (but not exactly) like ours, with a 21st century approximately like ours, but under whose physical laws there won't be TAI. And you want to analyze what's going to happen with that civilization.

comment by MackGopherSena · 2022-01-21T12:36:42.191Z · LW(p) · GW(p)

[edited]

Replies from: Viliam
comment by Viliam · 2022-01-21T18:14:37.878Z · LW(p) · GW(p)

You seem to be implying that they will be terrifying for the exact opposite reasons why the previous epistemic revolution's philosophical implications were.

What do you mean by "exact opposite reasons"? To me, it seems like continuation of the same trend of humiliating the human ego:

  • you are not going to live forever
  • yes, you are mere atoms
  • your planet is not the center of the universe
  • even your sun is not special
  • your species is related to the other species that you consider inferior
  • instead of being logical, your mind is a set of short-sighted agents fighting each other

Followed by:

  • even your reality is not special
  • your civilization is too stupid to stop doing the thing(s) that will predictably kill all of you
comment by Vanessa Kosoy (vanessa-kosoy) · 2021-03-18T15:21:27.185Z · LW(p) · GW(p)

Probably not too original but I haven't seen it clearly written anywhere.

There are several ways to amplify imitators with different safety-performance tradeoffs. This is something to consider when designing IDA-type solutions.

Amplifying by objective time: The AI is predicting what the user(s) will output after thinking about a problem for a long time. This method is the strongest, but also the least safe. It is the least safe because malign AI might exist in the future, which affects the prediction, which creates an attack vector for future malign AI to infiltrate the present world. We can try to defend by adding a button for "malign AI is attacking", but that still leaves us open to surprise takeovers in which there is no chance to press the button.

Amplifying by subjective time: The AI is predicting what the user(s) will output after thinking about a problem for a short time, where in the beginning they are given the output of a similar process that ran for one iteration less. So, this simulates a "groundhog day" scenario where the humans wake up in the same objective time period over and over without memory of the previous iterations but with a written legacy. This is weaker than amplifying by objective time, because learning previous results is an overhead, and illegible intuitions might be hard to transmit. This is safer than amplifying by objective time, but if there is some probability of malign AI created in the short time period, there is still an attack vector. The malign AI leakage in this method is roughly proportional to subjective time of simulation times the present rate of malign AI takeover, as opposed to amplification by objective time where leakage is proportional to subjective time of simulation times some average future rate of malign AI takeover. However, by the time we are able to create this benign AI, the present rate of malign AI takeover might also be considerable.

Amplifying by probability: We allow the user(s) to choose "success" or "failure" (or some continuous metric) after completing their work, and make the AI skew the distribution of predictions toward success. This is similar to amplifying by subjective time without any transmission of information. It is weaker and about as safe. The potential advantage is, lower sample complexity: the AI only needs to have a reliable distribution of outcomes after the initial state instead of subsequent states.

Amplifying by parallelization: The AI is predicting the output of many copies of the user working together, by having strictly defined interfaces between the copies, over a time period similar to real time. For example, we can imagine a hierarchical organization where each person gives subtasks to their subordinates. We can then simulate such an organization with a copy of some subset of users in each role. To do this, the AI only needs to learn what a given subset of users would do given a particular task from their supervisors and particular results from their subordinates. This method is weaker than previous methods since it requires that the task at hand can be parallelized. But, it is also the safest since the rate of malign AI takeover is only amplified by $O(1)$ compared to the background. [EDIT: Actually, it's not safer than subjective time because the AI would sample the external world independently for each node in the organization. To avoid this, we would need to somehow define a correspondence between the outcome sets of worlds in which the user was queried at different nodes, and I don't know how to do this.]

A complete solution can try to combine all of those methods, by simulating a virtual organization where the members can control which method is applied at every point. This way they can strive for the optimal risk-performance balance: parallelize everything that can be parallelized and otherwise amplify the tasks that cannot be parallelized, change the subjective/objective time balance based on research into malign AI timelines etc.

Replies from: vanessa-kosoy, TurnTrout
comment by Vanessa Kosoy (vanessa-kosoy) · 2021-03-29T19:11:07.468Z · LW(p) · GW(p)

More observations about this attack vector ("attack from counterfactuals"). I focus on "amplifying by subjective time".

  • The harder the takeoff the more dangerous this attack vector: During every simulation cycle, ability to defend against simulated malign AI depends on the power of the defense system at the beginning of the cycle[1]. On the other hand, the capability of the attacker depends on its power at the end of the cycle. Therefore, if power grows very fast this is bad news for the defender. On the other hand, if power grows very slowly, the defender wins because defense is probably easier.
  • Inner control of anchor makes system safer: Given a fixed objective time anchor, the more times the simulation goes through a cycle, the higher its exposure to the threat, since improving the defense system takes time even if you have all the knowledge. Therefore, we want to give the simulated user the ability to move the objective time anchor forward in time, in order to benefit from improvements in the defense system.
  • Additional information about the external world makes system safer: Suppose our subjective probability of a malign AI successfully attacking during a single simulation cycle is 0.1%. Then performing 1000 cycles superficially yields a failure probability of about 63% (the probability of surviving all 1000 cycles is 0.999^1000, or about 37%). However, this is only the case if attacks on different cycles are independent random variables. If the AI knows as much about the possibility of such an attack as we do, this is a justified assumption. However, if the AI knows more, then, from our subjective perspective, these variables are positively correlated. In particular, if the AI knows with certainty whether a successful attack would occur or not, the failure probability over 1000 cycles goes back to 0.1%. Therefore, it is beneficial to give the AI extra information about the world, just so it can estimate the probability of attack with more confidence.

  1. More precisely, it is somewhat better than this since, if we are at some late cycle among a sequence of cycles anchored to the same point in objective time, the simulated user can use knowledge generated in previous cycles to improve the defense system during the current cycle. So, early cycles contribute more to the risk. ↩︎
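
The arithmetic in the last bullet can be checked with a small sketch. The two-world mixture below is an assumption made only for illustration (it is not from the comment): attacks are possible only in a "dangerous" world of prior probability `p_dangerous_world`, and the per-cycle attack probability inside that world is scaled so that the marginal per-cycle probability stays at `p_cycle`. Setting `p_dangerous_world = 1` recovers independent cycles, and setting it equal to `p_cycle` recovers the fully correlated case.

```python
def failure_probability(p_cycle: float, cycles: int, p_dangerous_world: float) -> float:
    """Probability of at least one successful attack over `cycles` cycles.

    Hypothetical model: with probability `p_dangerous_world` we live in a world
    where each cycle is attacked independently with probability
    p_cycle / p_dangerous_world; otherwise no attack ever succeeds.
    The marginal per-cycle attack probability is always `p_cycle`.
    """
    per_cycle_if_dangerous = p_cycle / p_dangerous_world
    survive_all_cycles = (1.0 - per_cycle_if_dangerous) ** cycles
    return p_dangerous_world * (1.0 - survive_all_cycles)


# Independent cycles: 1 - 0.999**1000, roughly 63% (survival is roughly 37%).
independent = failure_probability(0.001, 1000, p_dangerous_world=1.0)

# Fully correlated cycles: the outcome is fixed by the world, so failure stays at 0.1%.
correlated = failure_probability(0.001, 1000, p_dangerous_world=0.001)
```

Intermediate values of `p_dangerous_world` interpolate between the two extremes, which is one way to picture how extra information held by the AI lowers the cumulative risk.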

comment by TurnTrout · 2021-03-18T15:33:57.258Z · LW(p) · GW(p)

I think this would make a good top-level post. I have the feeling I’ll want to link to it later.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2021-03-18T16:06:05.410Z · LW(p) · GW(p)

I retracted part of that, see the edit.

comment by Vanessa Kosoy (vanessa-kosoy) · 2020-03-23T16:17:45.672Z · LW(p) · GW(p)

I have [AF(p) · GW(p)] repeatedly [AF(p) · GW(p)] argued [AF(p) · GW(p)] for a departure from pure Bayesianism that I call "quasi-Bayesianism". But, coming from a LessWrong-ish background, it might be hard to wrap your head around the fact that Bayesianism is somehow deficient. So, here's another way to understand it, using Bayesianism's own favorite trick: Dutch booking!

Consider a Bayesian agent Alice. Since Alice is Bayesian, ey never randomize: ey just follow a Bayes-optimal policy for eir prior, and such a policy can always be chosen to be deterministic. Moreover, Alice always accepts a bet if ey can choose which side of the bet to take: indeed, at least one side of any bet has non-negative expected utility. Now, Alice meets Omega. Omega is very smart so ey know more than Alice and moreover ey can predict Alice. Omega offers Alice a series of bets. The bets are specifically chosen by Omega s.t. Alice would pick the wrong side of each one. Alice takes the bets and loses, indefinitely. Alice cannot escape eir predicament: ey might know, in some sense, that Omega is cheating em, but there is no way within the Bayesian paradigm to justify turning down the bets.

A possible counterargument is, we don't need to depart far from Bayesianism to win here. We only need to somehow justify randomization, perhaps by something like infinitesimal random perturbations of the belief state (like with reflective oracles). But, in a way, this is exactly what quasi-Bayesianism does: a quasi-Bayes-optimal policy is in particular Bayes-optimal when the prior is taken to be in Nash equilibrium of the associated zero-sum game. However, Bayes-optimality underspecifies the policy: not every optimal reply to a Nash equilibrium is a Nash equilibrium.
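
Alice's predicament, and the way randomization escapes it, can be sketched in a toy model. Everything concrete here is an assumption chosen for illustration: the bets are even-money bets on a bit, Alice's prior is a uniform prior over i.i.d. coin biases (so her Bayes-optimal bet follows Laplace's rule of succession), and Omega simply simulates that deterministic rule and offers a bet on which it loses.

```python
import random


def laplace_bayes_bet(ones: int, total: int) -> int:
    """Deterministic Bayes-optimal side of an even-money bet on the next bit,
    under a uniform prior over i.i.d. coin biases (an illustrative prior)."""
    p_one = (ones + 1) / (total + 2)  # Laplace's rule of succession
    return 1 if p_one >= 0.5 else 0


def play_against_omega(rounds: int, randomize: bool, seed: int = 0) -> int:
    """Total payoff (+1 per won bet, -1 per lost bet) against a predictor."""
    rng = random.Random(seed)
    ones = 0
    payoff = 0
    for t in range(rounds):
        # Omega simulates the deterministic policy on the history so far...
        predicted_bet = laplace_bayes_bet(ones, t)
        # ...and offers a bet whose true outcome makes that side lose.
        outcome = 1 - predicted_bet
        # A randomizing agent flips a private coin that Omega cannot predict.
        bet = rng.randint(0, 1) if randomize else predicted_bet
        payoff += 1 if bet == outcome else -1
        ones += outcome
    return payoff


deterministic_total = play_against_omega(1000, randomize=False)  # loses every bet
randomized_total = play_against_omega(1000, randomize=True)      # zero in expectation
```

The deterministic agent loses every round because its choice is a function of information Omega also has. The coin-flipping agent breaks even in expectation, but a uniform coin flip is only one possible fix and is not claimed to be the quasi-Bayes-optimal policy.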

This argument is not entirely novel: it is just a special case of an environment that the agent cannot simulate, which is the original motivation for quasi-Bayesianism. In some sense, any Bayesian agent is dogmatic: it dogmatically believes that the environment is computationally simple, since it cannot consider a hypothesis which is not. Here, Omega exploits this false dogmatic belief.

Replies from: Dagon, Pattern
comment by Dagon · 2020-03-23T17:22:39.317Z · LW(p) · GW(p)

Bayesians are allowed to understand that there are agents with better estimates than they have. And that being offered a bet _IS_ evidence that the other agent THINKS they have an advantage.

Randomization (aka "mixed strategy") is well-understood as the rational move in games where opponents are predicting your choices. I have read nothing that would even hint that it's unavailable to Bayesian agents. The relevant probability (updated per Bayes's Rule) would be "is my counterpart trying to minimize my payout based on my choices".

edit: I realize you may be using a different definition of "Bayesianism" than I am. I'm thinking humans striving for rational choices, which perforce includes the knowledge of incomplete computation and imperfect knowledge. Naive agents can be imagined that don't have this complexity. Those guys are stuck, and Omega's gonna pwn them.

Replies from: mr-hire
comment by Matt Goldenberg (mr-hire) · 2020-03-23T17:41:53.796Z · LW(p) · GW(p)

I'm thinking humans striving for rational choices,

It feels like there are better words for this, like rationality, whereas Bayesianism is a more specific philosophy about how best to represent and update beliefs.

comment by Pattern · 2020-03-25T07:11:20.152Z · LW(p) · GW(p)

And here I thought the reason was going to be that Bayesianism doesn't appear to include the cost of computation. (Thus, the usual Dutch book arguments should be adjusted so that "optimal betting" does not leave one worse off for having paid, say, an oracle, too much for computation.)

comment by Vanessa Kosoy (vanessa-kosoy) · 2019-10-26T14:42:47.388Z · LW(p) · GW(p)

Game theory is widely considered the correct description of rational behavior in multi-agent scenarios. However, real world agents have to learn, whereas game theory assumes perfect knowledge, which can be only achieved in the limit at best. Bridging this gap requires using multi-agent learning theory to justify game theory, a problem that is mostly open (but some results exist). In particular, we would like to prove that learning agents converge to game theoretic solutions such as Nash equilibria (putting superrationality aside: I think that superrationality should manifest via modifying the game rather than abandoning the notion of Nash equilibrium).

The simplest setup in (non-cooperative) game theory is normal form games. Learning happens by accumulating evidence over time, so a normal form game is not, in itself, a meaningful setting for learning. One way to solve this is replacing the normal form game by a repeated version. This, however, requires deciding on a time discount. For sufficiently steep time discounts, the repeated game is essentially equivalent to the normal form game (from the perspective of game theory). However, the full-fledged theory of intelligent agents requires considering shallow time discounts, otherwise there is no notion of long-term planning. For shallow time discounts, the game theory of a repeated game is very different from the game theory of the original normal form game. In fact, the folk theorem asserts that any payoff vector above the maximin of each player is a possible Nash payoff. So, proving convergence to a Nash equilibrium amounts (more or less) to proving convergence to at least the maximin payoff. This is possible using incomplete models [AF · GW], but doesn't seem very interesting: to receive the maximin payoff, the agents only have to learn the rules of the game, they need not learn the reward functions of the other players or anything else about them.

We arrive at the question, what setting is realistic (in the sense of involving learning with shallow time discount) and is expected to produce Nash equilibria for a normal form game? I suggest the following. Instead of a fixed set of agents repeatedly playing against each other, we consider a population of agents that are teamed off randomly on each round of the game. The population is assumed to be large enough for agents not to encounter each other more than once. This can be formalized as follows. Let $S_i$ be the pure strategy set of the $i$-th agent and $O := \prod_i S_i$ the set of pure outcomes. The set of $n$-round outcome histories is $O^n$. The population of agents on the $n$-th round can then be described as a probability measure $\mu_n \in \Delta O^n$. Suppose the policy of the $i$-th player (that is, of all the agents that take the role of the $i$-th player) is $\pi_i : O^* \to \Delta S_i$. Then we can define a time evolution rule that produces $\mu_{n+1}$ from $\mu_n$. This rule works as follows: in order to sample $\mu_{n+1}$ we sample $\mu_n$ once per player (this is the history the given player has seen), sample the policy of each player on its own history, and produce a new history by appending the resulting outcome to one of the old histories (it doesn't matter which). A set of policies $\pi$ is considered to be in equilibrium, when for any $i$, and any alternative policy $\pi'_i$, letting $\pi'_i$ play against the same population (i.e. all other copies of the $i$-th player still play $\pi_i$) doesn't improve expected utility. In other words, on each round the "mutant" agent retains its own history but the other player histories are still sampled from the same $\mu_n$. It is easy to see that any equilibrium payoff in this setting is a Nash payoff in the original normal form game. We can then legitimately ask whether taking the $\pi_i$ to be learning algorithms would result in convergence to a Nash payoff in the $\gamma \to 1$ (shallow time discount) limit.
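
The time evolution rule described above can be sketched as a particle approximation, where the population measure is replaced by a finite list of sampled histories. This is only an illustration under assumptions: the function name `evolve`, the representation of histories as tuples of outcomes, and the policy signature `(history, rng) -> strategy` are all hypothetical choices, and a finite sample only approximates the infinite-population rule.

```python
import random


def evolve(population, policies, rng):
    """One step of the evolution rule on a finite sample of histories.

    population: list of histories, each a tuple of past outcomes.
    policies: one function per player role, mapping (history, rng) to a strategy.
    """
    new_population = []
    for _ in range(len(population)):
        # Sample the current population once per player: the history each one has seen.
        histories = [rng.choice(population) for _ in policies]
        # Each player acts on its own history; the joint action is the outcome.
        outcome = tuple(pi(h, rng) for pi, h in zip(policies, histories))
        # Append the outcome to one of the old histories (which one does not matter).
        new_population.append(histories[0] + (outcome,))
    return new_population


# Usage with two trivial, hypothetical policies for a two-player game.
rng = random.Random(0)
population = [()] * 50
policies = [lambda h, rng: "D", lambda h, rng: "C"]
for _ in range(4):
    population = evolve(population, policies, rng)
```

After four steps every sampled history has length four, matching the fact that the population on round $n$ is a measure over $n$-round histories.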

For example, consider the Prisoner's dilemma. In the repeated Prisoner's dilemma with shallow time discount, $CC$ (mutual cooperation) is an equilibrium because of the tit-for-tat policy. On the other hand, in the "population" (massively multi-player?) repeated Prisoner's dilemma, $DD$ (mutual defection) is the only equilibrium. Tit-for-tat doesn't work because a single "defect bot" can exploit a population of tit-for-tats: on each round it plays with a new opponent that doesn't know the defect bot defected on the previous round.
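
The contrast between the two settings can be sketched numerically. The payoff values (5, 3, 1, 0) are the conventional illustrative ones and are an assumption, as are all function names. The population case relies on the large-population idealization: a single mutant does not change the histories of the residents, so each fresh tit-for-tat opponent has only ever seen cooperation.

```python
# Payoff to the first player for (own move, opponent move); illustrative values.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}


def tit_for_tat(seen):
    """Cooperate first, then copy the last opponent move in one's own history."""
    return seen[-1] if seen else "C"


def defect_bot(seen):
    """Always defect, regardless of history."""
    return "D"


def average_payoff(policy, rounds, fresh_opponent_each_round):
    """Average payoff of `policy` against tit-for-tat opponents."""
    own_seen = []  # opponent moves observed by `policy`
    opp_seen = []  # moves observed by the current tit-for-tat opponent
    total = 0
    for _ in range(rounds):
        if fresh_opponent_each_round:
            # A new opponent drawn from an all-tit-for-tat population:
            # its own past opponents were residents, so it has seen only "C".
            opp_seen = ["C"] * len(own_seen)
        my_move = policy(own_seen)
        opp_move = tit_for_tat(opp_seen)
        total += PAYOFF[(my_move, opp_move)]
        own_seen.append(opp_move)
        opp_seen.append(my_move)
    return total / rounds


repeated_defector = average_payoff(defect_bot, 100, fresh_opponent_each_round=False)
population_defector = average_payoff(defect_bot, 100, fresh_opponent_each_round=True)
population_resident = average_payoff(tit_for_tat, 100, fresh_opponent_each_round=True)
```

Against a fixed partner the defector is punished from the second round on and does worse than a cooperator, while against fresh opponents it collects the temptation payoff every round and beats the resident tit-for-tat players.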

Note that we get a very different setting if we allow the players to see each other's histories, more similar (equivalent?) to the regular repeated game. For example, in the Prisoner's Dilemma we have a version of tit-for-tat that responds to what its current opponent played in its previous round (against a different opponent). This may be regarded as a confirmation of the idea that agents that know each other's source code are effectively playing a repeated game: in this setting, knowing the source code amounts to knowing the history.

Replies from: vanessa-kosoy, vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2019-12-01T16:03:19.620Z · LW(p) · GW(p)

We can modify the population game setting to study superrationality. In order to do this, we can allow the agents to see a fixed size finite portion of their opponents' histories. This should lead to superrationality for the same reasons I discussed [AF(p) · GW(p)] before [AF · GW]. More generally, we can probably allow each agent to submit a finite state automaton of limited size, s.t. the opponent history is processed by the automaton and the result becomes known to the agent.

What is unclear about this is how to define an analogous setting based on source code introspection. While arguably seeing the entire history is equivalent to seeing the entire source code, seeing part of the history, or processing the history through a finite state automaton, might be equivalent to some limited access to source code, but I don't know how to define this limitation.

EDIT: Actually, the obvious analogue is processing the source code through a finite state automaton.

Replies from: vanessa-kosoy, Gurkenglas
comment by Vanessa Kosoy (vanessa-kosoy) · 2019-12-08T17:39:38.596Z · LW(p) · GW(p)

Instead of postulating access to a portion of the history or some kind of limited access to the opponent's source code, we can consider agents with full access to history / source code but finite memory. The problem is, an agent with fixed memory size usually cannot have regret going to zero, since it cannot store probabilities with arbitrary precision. However, it seems plausible that we can usually get learning with memory of size $O(\log \frac{1}{1-\gamma})$, where $\gamma$ is the time discount factor. This is because something like "counting pieces of evidence" should be sufficient. For example, if we consider finite MDPs, then it is enough to remember how many transitions of each type occurred to encode the belief state. The question is whether assuming $O(\log \frac{1}{1-\gamma})$ memory (or whatever is needed for learning) is enough to reach superrationality.

comment by Gurkenglas · 2019-12-07T13:03:57.992Z · LW(p) · GW(p)

What do you mean by equivalent? The entire history doesn't say what the opponent will do later or would do against other agents, and the source code may not allow you to prove what the agent does if it involves statements that are true but not provable.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2019-12-07T22:39:29.289Z · LW(p) · GW(p)

For a fixed policy, the history is the only thing you need to know in order to simulate the agent on a given round. In this sense, seeing the history is equivalent to seeing the source code.

The claim is: In settings where the agent has unlimited memory and sees the entire history or source code, you can't get good guarantees (as in the folk theorem for repeated games). On the other hand, in settings where the agent sees part of the history, or is constrained to have finite memory (possibly of size $O(\log \frac{1}{1-\gamma})$?), you can (maybe?) prove convergence to Pareto efficient outcomes or some other strong desideratum that deserves to be called "superrationality".

comment by Vanessa Kosoy (vanessa-kosoy) · 2019-11-10T12:52:46.072Z · LW(p) · GW(p)

In the previous "population game" setting, we assumed all players are "born" at the same time and learn synchronously, so that they always play against players of the same "age" (history length). Instead, we can consider a "mortal population game" setting where each player has a probability $1-\gamma$ to die on every round, and new players are born to replenish the dead. So, if the size of the population is $N$ (we always consider the "thermodynamic" $N \to \infty$ limit), $N(1-\gamma)$ players die and the same number of players are born on every round. Each player's utility function is a simple sum of rewards over time, so, taking mortality into account, effectively ey have geometric time discount. (We could use age-dependent mortality rates to get different discount shapes, or allow each type of player to have different mortality=discount rate.) Crucially, we group the players into games randomly, independent of age.

As before, each player type $i$ chooses a policy $\pi_i : O^* \to \Delta S_i$. (We can also consider the case where players of the same type may have different policies, but let's keep it simple for now.) In the thermodynamic limit, the population is described as a distribution over histories, which now are allowed to be of variable length: $\mu_n \in \Delta O^*$. For each assignment of policies to player types, we get dynamics $\mu_{n+1} = T_\pi(\mu_n)$ where $T_\pi : \Delta O^* \to \Delta O^*$. So, as opposed to immortal population games, mortal population games naturally give rise to dynamical systems.

If we consider only the age distribution, then its evolution doesn't depend on $\pi$ and it always converges to the unique fixed point distribution $\zeta(k) = (1-\gamma)\gamma^k$. Therefore it is natural to restrict the dynamics to the subspace of $\Delta O^*$ that corresponds to the age distribution $\zeta$. We denote it $P$.
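
The convergence of the age distribution can be checked with a short sketch. It assumes the mortality model described above in its simplest form: on every round each player survives with probability $\gamma$ and ages by one, and the dead fraction $1-\gamma$ is replaced by newborns of age 0. The function names are hypothetical. Under these assumptions the fixed point is the geometric distribution with mass $(1-\gamma)\gamma^k$ at age $k$.

```python
def step_age_distribution(ages, gamma):
    """One round: survivors (fraction gamma) age by one, newborns fill the gap.

    ages[k] is the fraction of the population with age k.
    """
    return [1.0 - gamma] + [gamma * mass for mass in ages]


def geometric(gamma, length):
    """Geometric age distribution (1 - gamma) * gamma**k, truncated to `length` ages."""
    return [(1.0 - gamma) * gamma ** k for k in range(length)]


def total_variation(p, q):
    """Total variation distance between two finitely supported distributions."""
    n = max(len(p), len(q))
    p = p + [0.0] * (n - len(p))
    q = q + [0.0] * (n - len(q))
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


gamma = 0.9
ages = [1.0]  # start with everyone newborn, as in the synchronous setting
for _ in range(200):
    ages = step_age_distribution(ages, gamma)

distance = total_variation(ages, geometric(gamma, len(ages)))
```

Starting from a population of newborns, the distance to the geometric distribution shrinks like $\gamma^n$, so after 200 rounds with $\gamma = 0.9$ it is negligible. The policies never enter the computation, which matches the claim that the age dynamics is independent of them.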

Does the dynamics $T_\pi$ have fixed points? $O^*$ can be regarded as a subspace of $(O \sqcup \{\bot\})^\omega$. The latter is compact (in the product topology) by Tychonoff's theorem and Polish, but $O^*$ is not closed. So, w.r.t. the weak topology on probability measure spaces, $\Delta (O \sqcup \{\bot\})^\omega$ is also compact but $\Delta O^*$ isn't. However, it is easy to see that the restricted space $P$ is closed in $\Delta (O \sqcup \{\bot\})^\omega$ and therefore compact. It may also be regarded as a convex subset of an appropriate Banach space (the dual of the space of Lipschitz functions on some metrization of $(O \sqcup \{\bot\})^\omega$). Moreover, it is easy to see $T_\pi$ is continuous (for populations that are close in the Kantorovich-Rubinstein metric, only the old players may have very different distributions, but old players are a small fraction of the population so their effect on the next round is small). By the Schauder fixed-point theorem, it follows that $T_\pi$ has a fixed point.

What are the fixed points like? Of course it depends on $\pi$. In a fixed point, every player observes a sequence of IID plays in all of eir games. Therefore, if $\pi$ satisfies the (very mild!) learning-theoretic desideratum that, upon observing an IID sequence, it converges to optimal response in the $\gamma \to 1$ limit, then, in the same limit, fixed points are Nash equilibria. This works even for extremely simple learning algorithms, such as "assume the plays in the next game will be sampled from a random past game", and it works for any Bayesian or "quasi-Bayesian" (i.e. using incomplete/fuzzy models [AF · GW]) agent that includes all IID processes in its prior.

This raises a range of interesting questions:

  • Are any/all of the fixed points attractors?
  • Does convergence to a fixed point occur for all or at least almost all initial conditions?
  • Do all Nash equilibria correspond to fixed points?
  • Do stronger game theoretic solution concepts (e.g. proper equilibria) have corresponding dynamical properties?

Mortal population games are obviously reminiscent of evolutionary game theory. However, there are substantial differences. In mortal population games, the game doesn't have to be symmetric, we consider a single policy rather than many competing policies, the policies learn from experience instead of corresponding to fixed strategies, and mortality rate doesn't depend on the reward. In evolutionary game theory, convergence usually cannot be guaranteed. For example, in the rock-paper-scissors game, the population may cycle among the different strategies. On the other hand, in mortal population games, if the game is two-player zero-sum (which includes rock-paper-scissors), and the policy is quasi-Bayesian with appropriate prior, convergence is guaranteed. This is because each player can easily learn to guarantee maximin payoff. Continuity arguments probably imply that at least for small perturbations of zero-sum, there will still be convergence. This leads to some hope that convergence can be guaranteed even in general games, or at least under some relatively mild conditions.

comment by Vanessa Kosoy (vanessa-kosoy) · 2022-05-06T08:53:34.537Z · LW(p) · GW(p)

Here's a question inspired by thinking about Turing RL, and trying to understand what kind of "beliefs about computations" we should expect the agent to acquire.

Does mathematics have finite information content?

First, let's focus on computable mathematics. At first glance, the answer seems obviously "no": because of the halting problem, there's no algorithm (i.e. a Turing machine that always terminates) which can predict the result of every computation. Therefore, you can keep learning new facts about results of computations forever. BUT, maybe most of those new facts are essentially random noise, rather than "meaningful" information?

Is there a difference of principle between "noise" and "meaningful content"? It is not obvious, but the answer is "yes": in algorithmic statistics there is the notion of "sophistication" which measures how much "non-random" information is contained in some data. In our setting, the question can be operationalized as follows: is it possible to have an algorithm $A$ plus an infinite sequence of bits $R$, s.t. $R$ is random in some formal sense (e.g. Martin-Lof) and $A$ can decide the output of any finite computation if it's also given access to $R$?

The answer to the question above is "yes"! Indeed, Chaitin's constant is Martin-Lof random. Given access to Chaitin's constant, it is possible to construct a halting oracle, therefore $A$ can decide whether the computation halts, and if it does, run it (and if it doesn't, output N/A or whatever).

[EDIT: Actually, this is not quite right. The way you use Chaitin's constant to emulate a halting oracle produces something that's only guaranteed to halt if you give it the correct Chaitin's constant.]

But, this is a boring solution. In practice we are interested in efficient methods of answering mathematical questions, and beliefs acquired by resource bounded agents. Hence, the question becomes: given a resource bound $B$ (e.g. a bound on space or time complexity), is it possible to have $A$ and $R$ similar to above, s.t. $A$ respects the bound $B$ and $R$ is pseudorandom in some formal sense w.r.t. the bound $B$?

[EDIT: I guess that the analogous thing to the unbounded setting would be, $A$ only has to respect $B$ when given the correct $R$. But the real conclusion is probably that we should look for something else instead, e.g. some kind of infradistribution.]

This is a fun question, because any answer would be fascinating in its own way: either computable mathematics has finite content in some strong formal sense (!) or mathematics is infinitely sophisticated in some formal sense (!)

We can also go in the other direction along the "hierarchy of feasibility", although I'm not sure how useful that is. Instead of computable mathematics, let's consider determining the truth (not provability, but actual truth) of sentences in e.g. Peano Arithmetic. Do $A$ and $R$ as above still exist? This would require e.g. a Martin-Lof random sequence which allows making any finite number of Turing jumps.

Replies from: AlexMennen
comment by AlexMennen · 2022-05-06T17:44:45.564Z · LW(p) · GW(p)

Wikipedia claims that every sequence is Turing reducible to a random one, giving a positive answer to the non-resource-bounded version of any question of this form. There might be a resource-bounded version of this result as well, but I'm not sure.

comment by Vanessa Kosoy (vanessa-kosoy) · 2021-11-11T19:53:18.287Z · LW(p) · GW(p)

Epistemic status: no claims to novelty, just (possibly) useful terminology.

[EDIT: I increased all the class numbers by 1 in order to admit a new definition of "class I", see child comment [LW(p) · GW(p)].]

I propose a classification of AI systems based on the size of the space of attack vectors. This classification can be applied in two ways: as referring to the attack vectors a priori relevant to the given architectural type, or as referring to the attack vectors that were not mitigated in the specific design. We can call the former the "potential" class and the latter the "effective" class of the given system. In this view, the problem of alignment is designing potential class V (or at least IV) systems that are effectively class 0 (or at least I-II).

Class II: Systems that only ever receive synthetic data that has nothing to do with the real world

Examples:

  • AI that is trained to learn Go by self-play
  • AI that is trained to prove random mathematical statements
  • AI that is trained to make rapid predictions of future cell states in the game of life for random initial conditions
  • AI that is trained to find regularities in sequences corresponding to random programs on some natural universal Turing machine with bounded runtime

Class II systems by and large don't admit any attack vectors. [EDIT: Inaccurate, see child comment [LW(p) · GW(p)]]

Such systems might have higher effective class if bugs in the implementation lead to real-world data leaking into the system, or if the ostensibly synthetic data reveals something important about the world via the choices made by its designers (for example, a video game with complex rules inspired by the real world).

Class III: Systems for which there is no distribution shift between training and deployment, and also the label space is small

Example: AI that is trained to distinguish between cat images and dog images, and the images selected for training are a perfectly representative sample of images used in deployment.

Class III systems admit attacks by non-Cartesian daemons.

If the label space is large, a Cartesian daemon can choose a low probability for randomly producing a malign label, such that there is a significant probability that this won't happen in training but will happen in deployment. This moves the system to class IV. If there is distribution shift, a Cartesian daemon can distinguish between training and deployment and use it to perform a "treacherous turn" attack. This also moves the system to class IV.
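
The first attack above can be made quantitative with a toy model. The model is an assumption made for illustration and is not from the comment: the daemon emits a malign label independently on each query with a fixed probability `p`, a single malign label in deployment suffices for the attack, any malign label in training gets the daemon caught, and the daemon knows the number of training and deployment queries. All function names are hypothetical.

```python
def attack_success_probability(p, n_train, n_deploy):
    """Probability of no malign label in training and at least one in deployment."""
    survive_training = (1.0 - p) ** n_train
    strike_in_deployment = 1.0 - (1.0 - p) ** n_deploy
    return survive_training * strike_in_deployment


def best_malign_rate(n_train, n_deploy):
    """Per-query rate maximizing the success probability in this toy model.

    Setting the derivative in q = 1 - p to zero gives
    q ** n_deploy = n_train / (n_train + n_deploy).
    """
    return 1.0 - (n_train / (n_train + n_deploy)) ** (1.0 / n_deploy)


n_train, n_deploy = 1000, 1000
p_star = best_malign_rate(n_train, n_deploy)
success = attack_success_probability(p_star, n_train, n_deploy)
```

With equally many training and deployment queries, the optimal rate is below one in a thousand per query and the attack succeeds with probability 1/4, which is one concrete reading of "a significant probability". The sketch does not model why a small label space helps. There, an attack would need many distinct errors, so the single-strike assumption fails.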

Such systems have lower effective class if non-Cartesian daemons are mitigated, for example by well-designed applications of homomorphic cryptography. They have higher effective class if deployed in a setting which does involve distributional shift, perhaps unanticipated by the designers.

Class IV: Systems which are non-agentic but do involve distribution shift or large label space

Examples:

  • AI that learns to imitate humans
  • AI that learns to predict the stock market
  • Generative language models

Class IV systems admit attacks by Cartesian and non-Cartesian daemons. [EDIT: Also attack from counterfactuals [LW(p) · GW(p)]. The latter requires a large label space and doesn't require a distribution shift per se.]

Such systems have lower effective class if Cartesian daemons are mitigated, for example by carefully shaping the prior / inductive bias and applying some sort of confidence threshold / consensus algorithm. They can be effective class V if not designed to avoid self-fulfilling prophecies and/or incentives to mispredict at present to improve prediction in the future.

Class V: Agentic systems

Examples:

  • AI that trades in the stock market
  • AI that optimizes long-term ad revenue
  • AI that defends the world against unaligned AIs

Class V systems admit attacks by daemons but are also dangerous by default due to divergence of their utility function from the human utility function.

Such systems can have lower effective class if the utility function is very cleverly designed, for example to reliably accomplish learning of human values.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2021-11-20T09:43:27.248Z · LW(p) · GW(p)

The idea comes from this [LW(p) · GW(p)] comment of Eliezer.

Class II or higher systems might admit an attack vector by daemons that infer the universe from the agent's source code. That is, we can imagine a malign hypothesis that makes a treacherous turn after observing enough past actions to infer information about the system's own source code and infer the physical universe from that. (For example, in a TRL setting it can match the actions to the output of a particular program for envelope.) Such daemons are not as powerful as malign simulation hypotheses, since their prior probability is not especially large (compared to the true hypothesis), but might still be non-negligible. Moreover, it is not clear whether the source code can realistically have enough information to enable an attack, but the opposite is not entirely obvious.

To account for this I propose to designate as class I those systems which don't admit this attack vector. For the potential sense, it means that either (i) the system's design is too simple to enable inferring much about the physical universe, or (ii) there is no access to past actions (including opponent actions for self-play) or (iii) the label space is small, which means an attack requires making many distinct errors, and such errors are penalized quickly. And ofc it requires no direct access to the source code.

We can maybe imagine an attack vector even for class I systems, if most metacosmologically [LW(p) · GW(p)] plausible universes are sufficiently similar, but this is not very likely. Nevertheless, we can reserve the label class 0 for systems that explicitly rule out even such attacks.

comment by Vanessa Kosoy (vanessa-kosoy) · 2021-02-01T12:23:52.618Z · LW(p) · GW(p)

I find it interesting to build simple toy models of the human utility function. In particular, I was thinking about the aggregation of value associated with other people. In utilitarianism this question is known as "population ethics" and is infamously plagued with paradoxes. However, I believe that is the result of trying to be impartial. Humans are very partial and this allows coherent ways of aggregation. Here is my toy model:

Let Alice be our viewpoint human. Consider all social interactions Alice has, categorized by some types or properties, and assign a numerical weight to each type of interaction. Let be the weight of the interaction person had with person at time (if there was no interaction at this time then ). Then, we can define Alice's affinity to Bob as

Here is some constant. Ofc can be replaced by many other functions.

Now, we can then define the social distance of Alice to Bob as

Here is some constant, and the power law was chosen rather arbitrarily, there are many functions of that can work. Dead people should probably count in the infimum, but their influence wanes over time since they don't interact with anyone (unless we count consciously thinking about a person as an interaction, which we might).

This is a time-dependent metric (or quasimetric, if we allow for asymmetric interactions such as thinking about someone or admiring someone from afar) on the set of people. If is bounded and there is a bounded number of people Alice can interact with at any given time, then there is some s.t. the number of people within distance from Alice is . We now define the reward as

Here is some constant and is the "welfare" of person at time , or whatever is the source of value of people for Alice. Finally, the utility function is a time discounted sum of rewards, probably not geometric (because hyperbolic discounting is a thing). It is also appealing to make the decision rule to be minimax-regret over all sufficiently long time discount parameters, but this is tangential.
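To make the toy model concrete, here is a minimal sketch. The specific functional forms are illustrative assumptions of mine: affinity is a sum of interaction weights under an exponential decay kernel, the length of a direct link is affinity raised to a negative power (a power law), the infimum over chains of intermediaries is computed as a shortest path, and the reward weights each person's welfare by a geometric factor in the distance. All function and parameter names are hypothetical.

```python
import heapq
import math


def affinity(interactions, now, decay_rate):
    """interactions: list of (time, from_person, to_person, weight).

    Returns {(a, b): affinity}, with older interactions exponentially discounted
    (the exponential kernel is an assumption; other kernels would work too).
    """
    aff = {}
    for t, a, b, w in interactions:
        if t <= now:
            aff[(a, b)] = aff.get((a, b), 0.0) + w * math.exp(-decay_rate * (now - t))
    return aff


def social_distances(aff, source, power):
    """Distance from source to everyone reachable through chains of people.

    A direct link a -> b has length affinity ** (-power); the infimum over
    chains is a shortest path, computed here with Dijkstra's algorithm.
    """
    graph = {}
    for (a, b), value in aff.items():
        if value > 0:
            graph.setdefault(a, []).append((b, value ** (-power)))
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, person = heapq.heappop(heap)
        if d > dist.get(person, math.inf):
            continue  # stale heap entry
        for other, length in graph.get(person, []):
            candidate = d + length
            if candidate < dist.get(other, math.inf):
                dist[other] = candidate
                heapq.heappush(heap, (candidate, other))
    return dist


def reward(dist, welfare, base):
    """Welfare of each reachable person, weighted by base ** distance (0 < base < 1)."""
    return sum(base ** d * welfare[p] for p, d in dist.items() if p in welfare)
```

Because links are directed, the resulting distance is a quasimetric, matching the remark about asymmetric interactions. A small enough `base` is what keeps the sum bounded when the number of people within a given distance grows quickly.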

Notice how the utility function is automatically finite and bounded, and none of the weird paradoxes of population ethics and infinitary ethics crop up, even if there is an infinite number of people in the universe. I like to visualize people space as a tiling of hyperbolic space, with Alice standing in the center of a Poincaré or Beltrami-Klein model of it. Alice's "measure of caring" is then proportional to volume in the model (this probably doesn't correspond to exactly the same formula but it's qualitatively right, and the formula is only qualitative anyway).

comment by Vanessa Kosoy (vanessa-kosoy) · 2020-01-05T16:54:10.813Z · LW(p) · GW(p)

Some thoughts about embedded agency.

From a learning-theoretic perspective, we can reformulate the problem of embedded agency as follows: What kind of agent, and in what conditions, can effectively plan for events after its own death? For example, Alice bequeaths eir fortune to eir children, since ey want them to be happy even when Alice emself is no longer alive. Here, "death" can be understood to include modification, since modification is effectively destroying an agent and replacing it by a different agent[1]. For example, Clippy 1.0 is an AI that values paperclips. Alice disabled Clippy 1.0 and reprogrammed it to value staples before running it again. Then, Clippy 2.0 can be considered to be a new, different agent.

First, in order to meaningfully plan for death, the agent's reward function has to be defined in terms of something different than its direct perceptions. Indeed, by definition the agent no longer perceives anything after death. Instrumental reward functions [AF · GW] are somewhat relevant but still don't give the right object, since the reward is still tied to the agent's actions and observations. Therefore, we will consider reward functions defined in terms of some fixed ontology of the external world. Formally, such an ontology can be an incomplete[2] Markov chain, the reward function being a function of the state. Examples:

  • The Markov chain is a representation of known physics (or some sector of known physics). The reward corresponds to the total mass of diamond in the world. To make this example work, we only need enough physics to be able to define diamonds. For example, we can make do with quantum electrodynamics + classical gravity and have the Knightian uncertainty account for all nuclear and high-energy phenomena.

  • The Markov chain is a representation of people and social interactions. The reward corresponds to concepts like "happiness" or "friendship" et cetera. Everything that falls outside the domain of human interactions is accounted for by Knightian uncertainty.

  • The Markov chain is Botworld with some of the rules left unspecified. The reward is the total number of a particular type of item.

Now we need to somehow connect the agent to the ontology. Essentially we need a way of drawing Cartesian boundaries inside the (a priori non-Cartesian) world. We can accomplish this by specifying a function that assigns an observation and projected action to every state out of some subset of states. Entering this subset corresponds to agent creation, and leaving it corresponds to agent destruction. For example, we can take the ontology to be Botworld + marked robot and the observations and actions be the observations and actions of that robot. If we don't want marking a particular robot as part of the ontology, we can use a more complicated definition of Cartesian boundary that specifies a set of agents at each state plus the data needed to track these agents across time (in this case, the observation and action depend to some extent on the history and not only the current state). I will leave out the details for now.

Finally, we need to define the prior. To do this, we start by choosing some prior over refinements of the ontology. By "refinement", I mean removing part of the Knightian uncertainty, i.e. considering incomplete hypotheses which are subsets of the "ontological belief". For example, if the ontology is underspecified Botworld, the hypotheses will specify some of what was left underspecified. Given such an "objective" prior and a Cartesian boundary, we can construct a "subjective" prior for the corresponding agent. We transform each hypothesis via postulating that taking an action that differs from the projected action leads to a "Nirvana [AF(p) · GW(p)]" state. Alternatively, we can allow for stochastic action selection and use the gambler construction [AF(p) · GW(p)].

Does this framework guarantee effective planning for death? A positive answer would correspond to some kind of learnability result (regret bound). To get learnability, we will first need the reward to be either directly or indirectly observable. By "indirectly observable" I mean something like with semi-instrumental reward functions, but accounting for agent mortality. I am not ready to formulate the precise condition atm. Second, we need to consider an asymptotic in which the agent is long lived (in addition to time discount being long-term), otherwise it won't have enough time to learn. Third (this is the trickiest part), we need the Cartesian boundary to flow with the asymptotic as well, making the agent "unspecial". For example, consider Botworld with some kind of simplicity prior. If I am a robot born at cell zero and time zero, then my death is an event of low description complexity. It is impossible to be confident about what happens after such a simple event, since there will always be competing hypotheses with different predictions and a probability that is only lower by a factor of . On the other hand, if I am a robot born at cell 2439495 at time 9653302 then it would be surprising if the outcome of my death would be qualitatively different from the outcome of the death of any other robot I observed. Finding some natural, rigorous and general way to formalize this condition is a very interesting problem. Of course, even without learnability we can strive for Bayes-optimality or some approximation thereof [AF(p) · GW(p)]. But, it is still important to prove learnability under certain conditions to test that this framework truly models rational reasoning about death.

Additionally, there is an intriguing connection between some of these ideas and UDT, if we consider TRL agents. Specifically, a TRL agent can have a reward function that is defined in terms of computations, exactly like UDT is often conceived. For example, we can consider an agent whose reward is defined in terms of a simulation of Botworld, or in terms of taking expected value over a simplicity prior over many versions of Botworld. Such an agent would be searching for copies of itself inside the computations it cares about, which may also be regarded as a form of "embeddedness". It seems like this can be naturally considered a special case of the previous construction, if we allow the "ontological belief" to include beliefs pertaining to computations.


  1. Unless it's some kind of modification that we treat explicitly in our model of the agent, for example a TRL [AF(p) · GW(p)] agent reprogramming its own envelope. ↩︎

  2. "Incomplete" in the sense of Knightian uncertainty, like in quasi-Bayesian RL [AF(p) · GW(p)]. ↩︎

comment by Vanessa Kosoy (vanessa-kosoy) · 2020-03-24T14:08:33.763Z · LW(p) · GW(p)

Learning theory distinguishes between two types of settings: realizable and agnostic (non-realizable). In a realizable setting, we assume that there is a hypothesis in our hypothesis class that describes the real environment perfectly. We are then concerned with the sample complexity and computational complexity of learning the correct hypothesis. In an agnostic setting, we make no such assumption. We therefore consider the complexity of learning the best approximation of the real environment. (Or, the best reward achievable by some space of policies.)

In offline learning and certain varieties of online learning, the agnostic setting is well-understood. However, in more general situations it is poorly understood. The only agnostic result for long-term forecasting that I know is Shalizi 2009, however it relies on ergodicity assumptions that might be too strong. I know of no agnostic result for reinforcement learning.

Quasi-Bayesianism was invented to circumvent the problem. Instead of considering the agnostic setting, we consider a "quasi-realizable" setting: there might be no perfect description of the environment in the hypothesis class, but there are some incomplete descriptions. But, so far I haven't studied quasi-Bayesian learning algorithms much, so how do we know it is actually easier than the agnostic setting? Here is a simple example to demonstrate that it is.

Consider a multi-armed bandit, where the arm space is . First, consider the following realizable setting: the reward is a deterministic function which is known to be a polynomial of degree at most. In this setting, learning is fairly easy: it is enough to sample arms in order to recover the reward function and find the optimal arm. It is a special case of the general observation that learning is tractable when the hypothesis space is low-dimensional in the appropriate sense.
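A minimal sketch of the realizable case, under assumptions of mine: the arm space is taken to be the interval [0, 1], the degree bound is passed as `degree`, and the optimum is located on a finite grid. It uses the standard fact that a polynomial of degree at most `degree` is determined by its values at `degree + 1` distinct points. The function names are hypothetical.

```python
def lagrange_interpolant(xs, ys):
    """Return the unique polynomial of degree < len(xs) through the given points."""
    def poly(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return poly


def best_arm_realizable(pull, degree, grid_size=1001):
    """Pull degree + 1 arms, reconstruct the reward, and maximize it on a grid."""
    if degree > 0:
        xs = [k / degree for k in range(degree + 1)]  # evenly spaced arms in [0, 1]
    else:
        xs = [0.0]
    ys = [pull(x) for x in xs]
    model = lagrange_interpolant(xs, ys)
    grid = [k / (grid_size - 1) for k in range(grid_size)]
    return max(grid, key=model)
```

For instance, with `pull = lambda x: -(x - 0.3) ** 2` and `degree=2`, three pulls suffice to reconstruct the reward and locate the maximizing grid point.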

Now, consider a closely related agnostic setting. We can still assume the reward function is deterministic, but nothing is known about its shape and we are still expected to find the optimal arm. The arms form a low-dimensional space (one-dimensional actually) but this helps little. It is impossible to predict anything about any arm except those we already tested, and guaranteeing convergence to the optimal arm is therefore also impossible.

Finally, consider the following quasi-realizable setting: each incomplete hypothesis in our class states that the reward function is lower-bounded by a particular polynomial of degree at most. Our algorithm needs to converge to a reward which is at least the maximum of maxima of correct lower bounds. So, the desideratum is weaker than in the agnostic case, but we still impose no hard constraint on the reward function. In this setting, we can use the following algorithm. On each step, fit the most optimistic lower bound to those arms that were already sampled, find its maximum and sample this arm next. I haven't derived the convergence rate, but it seems probable the algorithm will converge rapidly (for low ). This is likely to be a special case of some general result on quasi-Bayesian learning with low-dimensional priors.
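The algorithm above can be sketched as follows. Assumptions of mine: the hypothesis class is a finite list of polynomials (coefficient tuples, highest degree first), arms live on a finite grid, and a hypothesis counts as refuted as soon as it exceeds an observed reward. With a finite class, "fit the most optimistic lower bound" reduces to scanning hypotheses in decreasing order of their maxima and skipping the refuted ones. The names are hypothetical.

```python
def evaluate(coeffs, x):
    """Horner evaluation; coeffs are ordered from highest degree to constant term."""
    value = 0.0
    for c in coeffs:
        value = value * x + c
    return value


def optimistic_lower_bound_search(pull, hypotheses, grid, tol=1e-9):
    """Return (arm, reward, number_of_pulls).

    Each hypothesis claims reward(x) >= polynomial(x) for every arm x.
    """
    if not hypotheses:
        raise ValueError("need at least one hypothesis")
    peaks = []
    for coeffs in hypotheses:
        arm = max(grid, key=lambda x: evaluate(coeffs, x))
        peaks.append((evaluate(coeffs, arm), arm, coeffs))
    peaks.sort(key=lambda item: item[0], reverse=True)  # most optimistic first

    samples = {}
    for peak_value, arm, coeffs in peaks:
        # Skip lower bounds already contradicted by an observed reward.
        if any(evaluate(coeffs, x) > r + tol for x, r in samples.items()):
            continue
        if arm not in samples:
            samples[arm] = pull(arm)
        if samples[arm] + tol >= peak_value:
            # The most optimistic surviving bound is met at its own maximum.
            return arm, samples[arm], len(samples)
    best = max(samples, key=samples.get)  # every hypothesis was refuted
    return best, samples[best], len(samples)
```

A correct lower bound is never refuted, so the returned reward is at least the maximum of maxima of the correct lower bounds, and the number of pulls is at most the number of hypotheses. This says nothing about the convergence rate for continuous hypothesis classes, which is the open part.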

comment by Vanessa Kosoy (vanessa-kosoy) · 2021-12-09T12:26:00.978Z · LW(p) · GW(p)

Master post for alignment protocols.

Other relevant shortforms:

Replies from: vanessa-kosoy, vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2022-04-12T11:32:47.375Z · LW(p) · GW(p)

Precursor Detection, Classification and Assistance (PreDCA)

Infra-Bayesian physicalism [AF · GW] provides us with two key building blocks:

  • Given a hypothesis about the universe, we can tell which programs are running. (This is just the bridge transform.)
  • Given a program, we can tell whether it is an agent, and if so, which utility function it has[1] (the "evaluating agent" section of the article).

I will now outline how we can use these building blocks to solve both the inner and outer alignment problem. The rough idea is:

  • For each hypothesis in the prior, check which agents are precursors of our agent according to this hypothesis.
  • Among the precursors, check whether some are definitely neither humans nor animals nor previously created AIs.
  • If there are precursors like that, discard the hypothesis (it is probably a malign simulation hypothesis).
  • If there are no precursors like that, decide which of them are humans.
  • Follow an aggregate of the utility functions of the human precursors (conditional on the given hypothesis).

Detection

How to identify agents which are our agent's precursors? Let our agent be $G$ and let $H$ be another agent which exists in the universe according to hypothesis $\Theta$[2]. Then, $H$ is considered to be a precursor of $G$ in universe $\Theta$ when there is some $H$-policy $\sigma$ s.t. applying the counterfactual "$H$ follows $\sigma$" to $\Theta$ (in the usual infra-Bayesian sense) causes $G$ not to exist (i.e. its source code doesn't run).

A possible complication is, what if $\Theta$ implies that $H$ creates $G$ / doesn't interfere with the creation of $G$? In this case $H$ might conceptually be a precursor, but the definition would not detect it. It is possible that any such $\Theta$ would have a sufficiently large description complexity penalty that it doesn't matter. On the second hand, if $\Theta$ is unconditionally Knightian uncertain about $H$ creating $G$ then the utility will be upper bounded by the scenario in which $G$ doesn't exist, which is liable to make $\Theta$ an effectively falsified hypothesis. On the third hand, it seems plausible that the creation of $G$ by $H$ would be contingent on $G$'s behavior (Newcomb-style, which we know how it works in infra-Bayesianism), in which case $\Theta$ is not falsified and the detection works. In any case, there is a possible variant of the definition to avoid the problem: instead of examining only $\Theta$ we also examine coarsenings of $\Theta$ which are not much more complex to describe (in the hope that some such coarsening would leave the creation of $G$ uncertain).

Notice that any agent whose existence is contingent on $G$'s policy cannot be detected as a precursor: the corresponding program doesn't even "run", because we don't apply a $G$-policy-counterfactual to the bridge transform.

Classification

How to decide which precursors are which? One tool we have is the parameter and the computational resource parameters in the definition of intelligence. In addition we might be able to create a very rough neuroscience-based model of humans. Also, we will hopefully have a lot of information about other AIs that can be relevant. Using these, it might be possible to create a rough benign/malign/irrelevant classifier, s.t.

  • Humans are classified as "benign".
  • Most (by probability mass) malign simulation hypotheses contain at least one precursor classified as "malign".
  • Non-human agents that exist in the causal past of our AI in the null (non-simulation) hypothesis are classified as "irrelevant".

Assistance

Once we detected and classified precursors in each hypothesis, we discard all hypotheses that contain malign precursors. In the remaining hypotheses, we perform some kind of aggregation on the utility functions of the benign precursors (for example, this [LW(p) · GW(p)]). The utility functions from different hypotheses are somehow normalized to form the overall utility function. Alternatively, we do a maximal lottery vote for the policy, where each hypothesis is a voter with weight proportional to its prior probability mass.
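A deliberately crude sketch of the first variant (filter, aggregate, normalize). Everything concrete here is an assumption of mine for illustration: outcomes form a finite set, utilities within a hypothesis are aggregated by a plain average over benign precursors, normalization is min-max rescaling, hypotheses without any benign precursor are skipped, and hypotheses are mixed in proportion to their prior. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    prior: float
    labels: dict      # precursor name -> "benign" | "malign" | "irrelevant"
    utilities: dict   # precursor name -> {outcome: utility}


def normalize(utility):
    """Min-max rescale a utility table to [0, 1] (one arbitrary normalization choice)."""
    low, high = min(utility.values()), max(utility.values())
    if high == low:
        return {outcome: 0.0 for outcome in utility}
    return {outcome: (v - low) / (high - low) for outcome, v in utility.items()}


def overall_utility(hypotheses):
    """Discard hypotheses with malign precursors, then mix benign utilities by prior."""
    kept = [h for h in hypotheses
            if "malign" not in h.labels.values()
            and "benign" in h.labels.values()]
    total = sum(h.prior for h in kept)
    if total <= 0:
        raise ValueError("no usable hypothesis left")
    overall = {}
    for h in kept:
        benign = [name for name, label in h.labels.items() if label == "benign"]
        for name in benign:
            share = h.prior / total / len(benign)
            for outcome, value in normalize(h.utilities[name]).items():
                overall[outcome] = overall.get(outcome, 0.0) + share * value
    return overall
```

The maximal lottery variant would replace the prior-weighted mixing with a vote over policies, which this sketch does not attempt.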

Inner Alignment

Why can this solve inner alignment? In any model-based approach, the AI doesn't train the policy directly. Instead, it trains models and uses them to compute the policy. I suspect that the second step cannot create mesa-optimizers, since it only involves control and not learning[3]. Hence, any mesa-optimizer has to originate from the first step, i.e. from the model/hypothesis. And, any plausible physicalist hypothesis which contains a mesa-optimizer has to look like a malign simulation hypothesis.

Outer Alignment

Why can this solve outer alignment? Presumably, we are aggregating human utility functions. This doesn't assume humans are perfect agents: can be less than infinity. I suspect that when the utility function becomes somewhat ambiguous, but the ambiguity can probably be resolved arbitrarily or maybe via a risk-averse method. What if the AI modifies the humans? Then only pre-modification humans are detected as precursors, and there's no problem.

Moreover, the entire method can be combined with the Hippocratic principle [AF(p) · GW(p)] to avoid catastrophic mistakes out of ignorance (i.e. to go from intent alignment to impact alignment).


  1. We do need a lot more research to fully specify this "utility reconstruction" and check that it satisfies reasonable desiderata. But, the existence of a natural utility-function-dependent measure of intelligence suggests it is possible. ↩︎

  2. I'm ignoring details like "what if only exists with certain probability". The more careful analysis is left for later. ↩︎

  3. In modern deep RL systems, there might not be a clear line between learning and control. For example, if we use model-free RL to produce the policy for a given hypothesis, then there is learning happening there as well. In such an architecture, the value function or Q-function should be regarded as part of the hypothesis for our purpose. ↩︎

Replies from: ViktoriaMalyasova, vanessa-kosoy, vanessa-kosoy, vanessa-kosoy, vanessa-kosoy
comment by ViktoriaMalyasova · 2022-08-08T18:20:40.052Z · LW(p) · GW(p)

Then, $H$ is considered to be a precursor of $G$ in universe $\Theta$ when there is some $H$-policy $\sigma$ s.t. applying the counterfactual "$H$ follows $\sigma$" to $\Theta$ (in the usual infra-Bayesian sense) causes $G$ not to exist (i.e. its source code doesn't run).

A possible complication is, what if $\Theta$ implies that $H$ creates $G$ / doesn't interfere with the creation of $G$? In this case $H$ might conceptually be a precursor, but the definition would not detect it.

Can you please explain how does this not match the definition? I don't yet understand all the math, but intuitively, if H creates G / doesn't interfere with the creation of G, then if H instead followed policy "do not create G/ do interfere with the creation of G", then G's code wouldn't run? 
 

Can you please give an example of a precursor that does match the definition?

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2022-08-09T06:19:49.815Z · LW(p) · GW(p)

The problem is that if $\Theta$ implies that $H$ creates $G$ but you consider a counterfactual in which $H$ doesn't create $G$ then you get an inconsistent hypothesis i.e. a HUC which contains only 0. It is not clear what to do with that. In other words, the usual way of defining counterfactuals in IB (I tentatively named it "hard counterfactuals") only makes sense when the condition you're counterfactualizing on is something you have Knightian uncertainty about (which seems safe to assume if this condition is about your own future action but not safe to assume in general). In a child post [LW(p) · GW(p)] I suggested solving this by defining "soft counterfactuals" where you consider coarsenings of $\Theta$ in addition to $\Theta$ itself.

Replies from: ViktoriaMalyasova
comment by ViktoriaMalyasova · 2022-08-16T04:24:58.811Z · LW(p) · GW(p)

Thank you.

comment by Vanessa Kosoy (vanessa-kosoy) · 2022-04-13T05:17:00.819Z · LW(p) · GW(p)

Some additional thoughts.

Non-Cartesian Daemons [LW · GW]

These are notoriously difficult to deal with. The only methods I know that are applicable to other protocols are homomorphic cryptography and quantilization of envelope (external computer) actions. But, in this protocol, they are dealt with the same as Cartesian daemons! At least if we assume a non-Cartesian attack requires an envelope action, the malign hypotheses which are would-be sources of such actions are discarded without giving an opportunity for attack.

Weaknesses

My main concerns with this approach are:

  • The possibility of major conceptual holes in the definition of precursors. More informal analysis can help, but ultimately mathematical research in infra-Bayesian physicalism in general and infra-Bayesian cartesian/physicalist multi-agent [LW(p) · GW(p)] interactions in particular is required to gain sufficient confidence.

  • The feasibility of a good enough classifier. At present, I don't have a concrete plan for attacking this, as it requires inputs from outside of computer science.

  • Inherent "incorrigibility": once the AI becomes sufficiently confident that it correctly detected and classified its precursors, its plans won't defer to the users any more than the resulting utility function demands. On the second hand, I think the concept of corrigibility is underspecified [LW(p) · GW(p)] so much that I'm not sure it is solved (rather than dissolved) even in the Book [LW(p) · GW(p)]. Moreover, the concern can be ameliorated by sufficiently powerful interpretability tools. It is therefore desirable to think more of how to achieve interpretability in this context.

comment by Vanessa Kosoy (vanessa-kosoy) · 2022-08-27T18:10:08.734Z · LW(p) · GW(p)

A question that often comes up in discussion of IRL: are agency and values purely behavioral concepts, or do they depend on how the system produces its behavior? The cartesian measure of agency I proposed [LW(p) · GW(p)] seems purely behavioral, since it only depends on the policy. The physicalist version seems less so since it depends on the source code, but this difference might be minor: the role of the source code is merely telling the agent "where" it is in the universe. However, on closer examination, the physicalist version is far from purely behaviorist, and this is true even for cartesian Turing RL. Indeed, the policy describes not only the agent's interaction with the actual environment but also its interaction with the "envelope" computer. In a sense, the policy can be said to reflect the agent's "conscious thoughts".

This means that specifying an agent requires not only specifying its source code but also the "envelope semantics" (possibly we also need to penalize for the complexity of in the definition of ). Identifying that an agent exists requires not only that its source code is running, but also, at least that its history is -consistent with the variable of the bridge transform. That is, for any we must have for some destiny . In other words, we want any computation the agent ostensibly runs on the envelope to be one that is physically manifest (it might be this condition isn't sufficiently strong, since it doesn't seem to establish a causal relation between the manifesting and the agent's observations, but it's at least necessary).

Notice also that the computational power of the envelope implied by becomes another characteristic of the agent's intelligence, together with as a function of the cost of computational resources. It might be useful to come up with natural ways to quantify this power.

comment by Vanessa Kosoy (vanessa-kosoy) · 2022-07-06T14:43:27.470Z · LW(p) · GW(p)

Here's a video of a talk I gave about PreDCA.

comment by Vanessa Kosoy (vanessa-kosoy) · 2022-04-27T15:12:47.147Z · LW(p) · GW(p)

Two more remarks.

User Detection

It can be useful to identify and assist specifically the user rather than e.g. any human that ever lived (and maybe some hominids). For this purpose I propose the following method. It also strengthens the protocol by relieving some pressure from other classification criteria.

Given two agents $G$ and $H$, we can ask which points on $G$'s timeline are in the causal past of which points of $H$'s timeline. To answer this, consider the counterfactual in which $G$ takes a random action (or sequence of actions) at some point (or interval) on $G$'s timeline, and measure the mutual information between this action(s) and $H$'s observations at some interval on $H$'s timeline.
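The measurement step can be illustrated with a plug-in estimate of mutual information from sampled (action, observation) pairs. This is only a sketch under assumptions of mine: actions and observations are discrete and hashable, and the pairs have already been obtained by rolling out the counterfactual many times, which is the hard part and is not modeled here. The function name is hypothetical.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Empirical mutual information (in bits) between the two coordinates of pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    actions = Counter(a for a, _ in pairs)
    observations = Counter(o for _, o in pairs)
    mi = 0.0
    for (a, o), count in joint.items():
        # p(a, o) * log2( p(a, o) / (p(a) * p(o)) ), written with raw counts
        mi += (count / n) * math.log2(count * n / (actions[a] * observations[o]))
    return mi
```

If the observation copies a uniformly random binary action, the estimate is 1 bit. If the observation is independent of the action, it is 0 bits, up to sampling noise.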

Using this, we can effectively construct a future "causal cone" emanating from the AI's origin, and also a past causal cone emanating from some time on the AI's timeline. Then, "nearby" agents will meet the intersection of these cones for low values of whereas "faraway" agents will only meet it for high values of or not at all. To first approximation, the user would be the "nearest" precursor[1] agent i.e. the one meeting the intersection for the minimal .

More precisely, we expect the user's observations to have nearly maximal mutual information with the AI's actions: the user can e.g. see every symbol the AI outputs to the display. However, the other direction is less clear: can the AI's sensors measure every nerve signal emanating from the user's brain? To address this, we can fix to a value s.t. we expect only the user to meet the intersection of cones, and have the AI select the agent which meets this intersection for the highest mutual information threshold.

This probably does not make the detection of malign agents redundant, since AFAICT a malign simulation hypothesis might be somehow cleverly arranged to make a malign agent the user.

More on Counterfactuals

In the parent post I suggested "instead of examining only $\Theta$ we also examine coarsenings of $\Theta$ which are not much more complex to describe". A possible elegant way to implement this:

  • Consider the entire portion of our (simplicity) prior which consists of coarsenings of .
  • Apply the counterfactual to .
  • Renormalize the result from HUC to HUD.

  1. We still need precursor detection, otherwise the AI can create some new agent and make it the nominal "user". ↩︎

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2022-08-21T11:58:42.935Z · LW(p) · GW(p)

Causality in IBP

There seems to be an even more elegant way to define causal relationships between agents, or more generally between programs. Starting from a hypothesis , for , we consider its bridge transform . Given some subset of programs we can define then project to [1]. We can then take bridge transform again to get some . The factor now tells us which programs causally affect the manifestation of programs in . Notice that by Proposition 2.8 in the IBP article, when we just get all programs that are running, which makes sense.

Agreement Rules Out Mesa-Optimization

The version of PreDCA without any explicit malign hypothesis filtering might be immune to malign hypotheses, and here is why. It seems plausible that IBP admits an agreement theorem (analogous to Aumann's) which informally amounts to the following: Given two agents Alice and Bobcat that (i) share the same physical universe, (ii) have a sufficiently tight causal relationship (each can see what the other sees), (iii) have unprivileged locations inside the physical universe, (iv) start from similar/compatible priors and (v) [maybe needed?] similar utility functions, they converge to similar/compatible beliefs, regardless of the complexity of translation between their subjective viewpoints. This is plausible because (i) as opposed to the cartesian framework, different bridge rules don't lead to different probabilities and (ii) if Bobcat considers a simulation hypothesis plausible, and the simulation is sufficiently detailed to fool it indefinitely, then the simulation contains a detailed simulation of Alice and hence Alice must also consider this to be a plausible hypothesis.

If the agreement conjecture is true, then the AI will converge to hypotheses that all contain the user, in a causal relationship with the AI that affirms them as the user. Moreover, those hypotheses will be compatible with the user's own posterior (i.e. the differences can be attributed to the AI's superior reasoning). Therefore, the AI will act on the user's behalf, leaving no room for mesa-optimizers. Any would-be mesa-optimizer has to take the shape of a hypothesis that the user should also believe, within which the pointer-to-values still points to the right place.

Two nuances:

  • Maybe in practice there's still room for simulation hypotheses of the AI which contain coarse-grained simulations of the user. In this case, the user detection algorithm might need to allow for coarsely simulated agents.
  • If the agreement theorem needs condition v, we get a self-referential loop: if the AI and the user converge to the same utility function, the theorem guarantees them to converge to the same utility function, but otherwise it doesn't. This might make the entire thing a useless tautology, or there might be a way to favorably resolve the self-reference, vaguely analogously to how Löb's theorem allows resolving the self-reference in prisoner's dilemma games between FairBots.

  1. There are actually two ways to do this, corresponding to the two natural mappings . The first is just projecting the subset of to a subset of , the second is analogous to what's used in Proposition 2.16 of the IBP article. I'm not entirely sure what's correct here. ↩︎

Replies from: martinsq
comment by Martín Soto (martinsq) · 2022-12-01T10:35:16.723Z · LW(p) · GW(p)

Hi Vanessa! Thanks again for your previous answers. I've got one further concern.

        Are all mesa-optimizers really only acausal attackers?

I think mesa-optimizers don't need to be purely contained in a hypothesis (rendering them acausal attackers), but can be made up of a part of the hypotheses-updating procedures (maybe this is obvious and you already considered it).

Of course, since the only way to change the AGI's actions is by changing its hypotheses, even these mesa-optimizers will have to alter hypothesis selection. But their whole running program doesn't need to be captured inside any hypothesis (which would be easier for classifying acausal attackers away).

That is, if we don't think about how the AGI updates its hypotheses, and just consider them magically updating (without any intermediate computations), then of course, the only mesa-optimizers will be inside hypotheses. If we actually think about these computations and consider a brute-force search over all hypotheses, then again they will only be found inside hypotheses, since the search algorithm itself is too simple and provides no further room for storing a subagent (even if the mesa-optimizer somehow takes advantage of the details of the search). But if more realistically our AGI employs more complex heuristics to ever-better approximate optimal hypotheses update, mesa-optimizers can be partially or completely encoded in those (put another way, those non-optimal methods can fail / be exploited). This failure could be seen as a capabilities failure (in the trivial sense that it failed to correctly approximate perfect search), but I think it's better understood as an alignment failure.

The way I see PreDCA (and this might be where I'm wrong) is as an "outer top-level protocol" which we can fit around any superintelligence of arbitrary architecture. That is, the superintelligence will only have to carry out the hypotheses update (plus some trivial calculations over hypotheses to find the best action), and given it does that correctly, since the outer objective we've provided is clearly aligned, we're safe. That is, PreDCA is an outer objective that solves outer alignment. But we still need to ensure the hypotheses update is carried out correctly (and that's everything our AGI is really doing).

I don't think this realization rules out your Agreement solution, since if truly no hypothesis can steer the resulting actions in undesirable ways (maybe because every hypothesis with a user has the human as the user), then obviously not even optimizers in hypothesis update can find malign hypotheses (although they can still causally attack by hacking the computer they're running on etc.). But I think your Agreement solution doesn't completely rule out any undesirable hypothesis, but only makes it harder for an acausal attacker to have the user not be the human. And in this situation, an optimizer in hypothesis update could still select for malign hypotheses in which the human is subtly incorrectly modelled in such a precise way that has relevant consequences for the actions chosen. This can again be seen as a capabilities failure (not modelling the human well enough), but it will always be present to some degree, and it could be exploited by mesa-optimizers.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2022-12-03T18:01:07.188Z · LW(p) · GW(p)

First, no, the AGI is not going to "employ complex heuristics to ever-better approximate optimal hypotheses update". The AGI is going to be based on an algorithm which, as a mathematical fact (if not proved then at least conjectured), converges to the correct hypothesis with high probability. Just like we can prove that e.g. SVMs converge to the optimal hypothesis in the respective class, or that particular RL algorithms for small MDPs converge to the correct hypothesis (assuming realizability).

Second, there's the issue of non-cartesian attacks ("hacking the computer"). Assuming that the core computing unit is not powerful enough to mount a non-cartesian attack on its own, such attacks can arguably be regarded as detrimental side-effects of running computations on the envelope. My hope is that we can shape the prior about such side-effects in some informed way (e.g. the vast majority of programs won't hack the computer) s.t. we still have approximate learnability (i.e. the system is not too afraid to run computations) without misspecification (i.e. the system is not overconfident about the safety of running computations). The more effort we put into hardening the system, the easier it should be to find such a sweet spot.

Third, I hope that the agreement solution will completely rule out any undesirable hypothesis, because we will have an actual theorem that guarantees it. What exactly the assumptions are going to be, and what needs to be done to make sure these assumptions hold, is work for the future, ofc.

Replies from: martinsq
comment by Martín Soto (martinsq) · 2022-12-03T18:21:37.192Z · LW(p) · GW(p)

The AGI is going to be based on an algorithm which, as a mathematical fact (if not proved then at least conjectured), converges to the correct hypothesis with high probability.

I understand now, that was the main misunderstanding motivating my worries. This and your other two points have driven home for me the role mathematical guarantees play in the protocol, which I wasn't contemplating. Thanks again for your kind answers!

comment by Vanessa Kosoy (vanessa-kosoy) · 2021-12-09T13:30:30.992Z · LW(p) · GW(p)

There's a class of AI risk mitigation strategies which relies on the users to perform the pivotal act using tools created by AI (e.g. nanosystems). These strategies are especially appealing if we want to avoid human models. Here is a concrete alignment protocol for these strategies, closely related to AQD [AF(p) · GW(p)], which we call autocalibrating quantilized RL (AQRL).

First, suppose that we are able to formulate the task as episodic RL with a formally specified reward function. The reward function is necessarily only a proxy for our true goal, since it doesn't contain terms such as "oh btw don't kill people while you're building the nanosystem". However, suppose the task is s.t. accomplishing it in the intended way (without Goodharting or causing catastrophic side effects) is easier than performing any attack. We will call this the "relative difficulty assumption" (RDA). Then, there exists a value for the quantilization parameter s.t. quantilized RL performs the task in the intended way.

We might not know how to set the quantilization parameter on our own, but we can define a performance goal for the task (in terms of expected total reward) s.t. the RDA holds. This leads to algorithms which gradually tune the quantilization parameter until the performance goal is met, while maintaining a proper balance between safety and sample complexity. Here it is important to keep track of epistemic vs. aleatoric uncertainty: the performance goal is the expectation of total reward relative to aleatoric uncertainty (i.e. the stochasticity of a given hypothesis), whereas the safety goal is a bound on the expected cost of overshooting the optimal quantilization parameter relative to both aleatoric and epistemic uncertainty (i.e. uncertainty between different hypotheses). This secures the system against malign hypotheses that are trying to cause an overshoot.
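As a rough illustration of the tuning loop described above, here is a minimal sketch over a finite action set. It is a toy under strong assumptions, not the AQRL algorithm itself: the base distribution and the reward are known exactly, there is a single hypothesis (so no epistemic uncertainty and no safety bound), and the names `quantilized_distribution`, `tune_q`, `shrink` and `q_min` are made up for this example. Here `q` is the fraction of base mass that is kept, so a smaller `q` means more optimization power.

```python
def quantilized_distribution(base, reward, q):
    """Condition the base distribution on its top-q fraction (by base mass), ranked by reward.

    base: dict mapping action -> base probability (assumed to sum to 1).
    reward: function action -> number.
    """
    ranked = sorted(base, key=reward, reverse=True)
    remaining, dist = q, {}
    for action in ranked:
        if remaining <= 1e-12:
            break
        mass = min(base[action], remaining)  # the boundary action may be kept only partially
        dist[action] = mass / q
        remaining -= mass
    return dist


def expected_reward(dist, reward):
    return sum(p * reward(a) for a, p in dist.items())


def tune_q(base, reward, goal, shrink=0.5, q_min=1e-6):
    """Start from the unoptimized base policy (q = 1) and shrink q until the goal is met.

    Returns (q, distribution), or None if the goal is not reached above q_min.
    """
    q = 1.0
    while q >= q_min:
        dist = quantilized_distribution(base, reward, q)
        if expected_reward(dist, reward) >= goal:
            return q, dist
        q *= shrink
    return None


# Toy usage: eight equally likely actions whose reward equals their index.
uniform_base = {a: 1 / 8 for a in range(8)}
result = tune_q(uniform_base, lambda a: a, goal=6.0)
```

In this toy the loop stops at the first (least optimized) `q` on the schedule that meets the goal, which is the property the honeypot argument later relies on. A real system would have to estimate expected reward from samples and account for uncertainty between hypotheses, which this sketch ignores.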

Notice that hardening the system (i.e. making it more secure using conventional methods) increases the likelihood of the RDA. That is, although it is infeasible to manually harden the system against superintelligent attacks, it can very well be feasible to harden it against attacks available at the minimal level of optimization power (= quantilization parameter) required to meet the performance goal, at least to such an extent that the intended solution becomes more appealing for the AI.

AQRL involves the risk that the RDA is false, even though it seems true to the designers. To mitigate this risk, we can deploy honeypots. That is, we add auxiliary goals to our reward function s.t. (i) if the goal is reached, maximal total reward is awarded (ii) the goal is harmless in itself (iii) reaching the goal is still harder than the intended solution, but by a smaller margin than performing attacks (iv) we can detect the event of reaching the goal. For example, suppose that the AI is not supposed to have Internet access, so gaining Internet access is a path to a variety of possible attacks. Then, we can add an auxiliary goal of the form "send such-and-such innocuous packet to such-and-such IP address". If the RDA is false and the AI succeeds in gaining Internet access earlier (on the quantilization parameter scale) than reaching the performance goal, once the AI has sent the packet it will randomize all its remaining actions (since it is operating at the earliest quantilization parameter which allows this). The users will then be alerted that security was breached and rethink their next steps.

Now, suppose that we can only specify the reward function via concepts that have to be learned from examples. This opens the risk of generalization failure. Such risk is not necessarily a problem since we are still protected by quantilization. However, it needs to be factored into our safety bounds. An especially concerning possibility is that the AI will interpret the concepts as "would be labeled by humans as such-and-such" instead of the intended interpretation, creating incentives to deceive. Moreover, this encourages learning human models which puts more pressure on the RDA by making it easier to attack.

In order to avoid learning human models, we can use methods along these [AF(p) · GW(p)] lines. Specifically, suppose that, in addition to the dataset for learning our task-relevant concepts, we supply a dataset designed specifically for learning human models. We can then design an algorithm that produces models of the task-relevant concepts s.t. the human-centric dataset cannot be compressed w.r.t. them. That is, the algorithm makes sure that access to the models it produces doesn't help with learning the human-centric dataset, implying that the concept models don't contain human models. Moreover, we can extend this approach to the entire process, s.t. human models are not learned from any other source of information as well.

Notice that although this method excludes human models from the actual RL process, it encourages them in intermediate computations, since we are actively supplying a human-centric dataset to the system. But, this is not dangerous if the intermediate computation is hardened against non-Cartesian daemons (a big "if" but one we need to deal with anyway).

comment by Vanessa Kosoy (vanessa-kosoy) · 2019-12-26T17:18:51.954Z · LW(p) · GW(p)

Much of the orthodox LessWrongian approach to rationality (as it is expounded in Yudkowsky's Sequences and onwards) is grounded in Bayesian probability theory. However, I now realize that pure Bayesianism is wrong; instead, the right thing is quasi-Bayesianism [LW(p) · GW(p)]. This leads me to ask, what are the implications of quasi-Bayesianism for human rationality? What are the right replacements for (the Bayesian approach to) bets, calibration, proper scoring rules et cetera? Does quasi-Bayesianism clarify important confusing issues in regular Bayesianism such as the proper use of inside and outside view? Is there a rigorous justification for the intuition that we should have more Knightian uncertainty about questions with less empirical evidence? Does any of it influence various effective altruism calculations in surprising ways? What common LessWrongian wisdom does it undermine, if any?

comment by Vanessa Kosoy (vanessa-kosoy) · 2023-04-22T09:56:51.459Z · LW(p) · GW(p)

Now that it was mentioned in ACX, I really hope the pear ring will become standard in the rationalist/EA community. So, please signal boost it: bootstrapping is everything, obvious network value effects are obvious. Also, it would be nice if they make a poly version sometime (for now, I will make do with wearing it next to my wedding ring ;), and a way to specify sexual orientation (btw, I'm bi, just saying...)

comment by Vanessa Kosoy (vanessa-kosoy) · 2020-09-28T16:41:44.233Z · LW(p) · GW(p)

Master post for ideas about infra-Bayesianism [? · GW].

Replies from: vanessa-kosoy, vanessa-kosoy, vanessa-kosoy, vanessa-kosoy, vanessa-kosoy, vanessa-kosoy, vanessa-kosoy, vanessa-kosoy, vanessa-kosoy, vanessa-kosoy, vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2020-10-02T16:54:29.827Z · LW(p) · GW(p)

In the anthropic trilemma [LW · GW], Yudkowsky writes about the thorny problem of understanding subjective probability in a setting where copying and modifying minds is possible. Here, I will argue that infra-Bayesianism (IB) leads to the solution.

Consider a population of robots, each of which is a regular RL agent. The environment produces the observations of the robots, but can also make copies or delete portions of their memories. If we consider a random robot sampled from the population, the history they observed will be biased compared to the "physical" baseline. Indeed, suppose that a particular observation $o$ has the property that every time a robot makes it, 10 copies of them are created in the next moment. Then, a random robot will have $o$ much more often in their history than the physical frequency with which $o$ is encountered, due to the resulting "selection bias". We call this setting "anthropic RL" (ARL).
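To make the selection bias concrete, here is a small exact computation. It is only a sketch under assumptions that are mine rather than part of the setting above: the copy-triggering observation (called `o` in the comments) occurs independently at every step with physical probability `p`, a robot that sees it continues as `copies` robots in total, and "a random robot" is weighted by expected robot counts. The function name `anthropic_frequency` is hypothetical.

```python
from fractions import Fraction


def anthropic_frequency(p, copies, steps):
    """Average frequency of o in the history of a robot drawn from the final population.

    p: physical per-step probability of o (a Fraction).
    copies: number of robots that continue after o is observed (assumed total, not additional).
    """
    # counts[j] = expected number of robots whose history contains o exactly j times
    counts = {0: Fraction(1)}
    for _ in range(steps):
        nxt = {}
        for hits, n in counts.items():
            nxt[hits] = nxt.get(hits, 0) + n * (1 - p)                # o not observed
            nxt[hits + 1] = nxt.get(hits + 1, 0) + n * p * copies    # o observed, robot copied
        counts = nxt
    total = sum(counts.values())
    mean_hits = sum(hits * n for hits, n in counts.items()) / total
    return mean_hits / steps


biased = anthropic_frequency(Fraction(1, 10), copies=10, steps=5)
```

With a physical frequency of 1/10 and 10 copies, the average robot finds `o` in 10/19 of its steps, i.e. more than half of the time, even though nothing about the physical dynamics changed.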

The original motivation for IB was non-realizability. But, in ARL, Bayesianism runs into issues even when the environment is realizable from the "physical" perspective. For example, we can consider an "anthropic MDP" (AMDP). An AMDP has finite sets of actions ($\mathcal{A}$) and states ($\mathcal{S}$), and a transition kernel $T: \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S}^*)$. The output is a string of states instead of a single state, because many copies of the agent might be instantiated on the next round, each with their own state. In general, there will be no single Bayesian hypothesis that captures the distribution over histories that the average robot sees at any given moment of time (at any given moment of time we sample a robot out of the population and look at their history). This is because the distributions at different moments of time are mutually inconsistent.

[EDIT: Actually, given that we don't care about the order of robots, the signature of the transition kernel should be $T: \mathcal{S} \times \mathcal{A} \to \Delta(\mathbb{N}^{\mathcal{S}})$]

The consistency that is violated is exactly the causality property of environments. Luckily, we know how to deal with acausality: using the IB causal-acausal correspondence! The result can be described as follows: Murphy chooses a time moment and guesses the robot policy until time . Then, a simulation of the dynamics of is performed until time , and a single history is sampled from the resulting population. Finally, the observations of the chosen history unfold in reality. If the agent chooses an action different from what is prescribed, Nirvana results. Nirvana also happens after time (we assume Nirvana reward rather than ).

This IB hypothesis is consistent with what the average robot sees at any given moment of time. Therefore, the average robot will learn this hypothesis (assuming learnability). This means that for , the population of robots at time has expected average utility with a lower bound close to the optimum for this hypothesis. I think that for an AMDP this should equal the optimum expected average utility you can possibly get, but it would be interesting to verify.

Curiously, the same conclusions should hold if we do a weighted average over the population, with any fixed method of weighting. Therefore, the posterior of the average robot behaves adaptively depending on which sense of "average" you use. So, your epistemology doesn't have to fix a particular method of counting minds. Instead different counting methods are just different "frames of reference" through which to look, and you can be simultaneously rational in all of them.

Replies from: Charlie Steiner
comment by Charlie Steiner · 2020-12-26T08:50:16.822Z · LW(p) · GW(p)

Could you expand a little on why you say that no Bayesian hypothesis captures the distribution over robot-histories at different times? It seems like you can unroll an AMDP into a "memory MDP" that puts memory information of the robot into the state, thus allowing Bayesian calculation of the distribution over states in the memory MDP to capture history information in the AMDP.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2020-12-26T16:31:31.278Z · LW(p) · GW(p)

I'm not sure what you mean by that "unrolling". Can you write a mathematical definition?

Let's consider a simple example. There are two states: $s_0$ and $s_1$. There is just one action so we can ignore it. $s_0$ is the initial state. An $s_0$ robot transitions into an $s_1$ robot. An $s_1$ robot transitions into an $s_0$ robot and an $s_1$ robot. What will our population look like?

0th step: all robots remember $s_0$

1st step: all robots remember $s_0 s_1$

2nd step: 1/2 of robots remember $s_0 s_1 s_0$ and 1/2 of robots remember $s_0 s_1 s_1$

3rd step: 1/3 of robots remember $s_0 s_1 s_0 s_1$, 1/3 of robots remember $s_0 s_1 s_1 s_0$ and 1/3 of robots remember $s_0 s_1 s_1 s_1$

There is no Bayesian hypothesis a robot can have that gives correct predictions both for step 2 and step 3. Indeed, to be consistent with step 2 we must have $\Pr[s_0 s_1 s_0] = 1/2$ and $\Pr[s_0 s_1 s_1] = 1/2$. But, to be consistent with step 3 we must have $\Pr[s_0 s_1 s_0] = 1/3$, $\Pr[s_0 s_1 s_1] = 2/3$.

In other words, there is no Bayesian hypothesis s.t. we can guarantee that a randomly sampled robot on a sufficiently late time step will have learned this hypothesis with high probability. The apparent transition probabilities keep shifting s.t. it might always continue to seem that the world is complicated enough to prevent our robot from having learned it already.

Or, at least it's not obvious there is such a hypothesis. In this example, $\Pr[s_0 s_1 s_1] / \Pr[s_0 s_1 s_0]$ will converge to the golden ratio at late steps. But, do all probabilities converge fast enough for learning to happen, in general? I don't know, maybe for finite state spaces it can work. Would definitely be interesting to check.

[EDIT: actually, in this example there is such a hypothesis but in general there isn't, see below [AF(p) · GW(p)]]
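The population counts in this example are easy to check by brute force. The sketch below labels the two states `s0` (initial) and `s1`; the labels and the helper names `population` and `prefix_probability` are choices made for this illustration. It enumerates every remembered history together with the number of robots holding it.

```python
from collections import Counter
from fractions import Fraction

# Offspring rule of the example: an s0 robot becomes one s1 robot,
# an s1 robot becomes one s0 robot and one s1 robot.
CHILDREN = {"s0": ["s1"], "s1": ["s0", "s1"]}


def population(steps):
    """Map each remembered history (a tuple of states) to the number of robots holding it."""
    pop = Counter({("s0",): 1})
    for _ in range(steps):
        nxt = Counter()
        for history, n in pop.items():
            for child in CHILDREN[history[-1]]:
                nxt[history + (child,)] += n
        pop = nxt
    return pop


def prefix_probability(steps, prefix):
    """Fraction of robots at the given step whose history starts with `prefix`."""
    pop = population(steps)
    total = sum(pop.values())
    hits = sum(n for history, n in pop.items() if history[:len(prefix)] == prefix)
    return Fraction(hits, total)


# The ratio Pr[s0 s1 s1] / Pr[s0 s1 s0] at a moderately late step.
late_ratio = prefix_probability(20, ("s0", "s1", "s1")) / prefix_probability(20, ("s0", "s1", "s0"))
```

The probability of the same three-step prefix is 1/2 at step 2 and 1/3 at step 3, which is the inconsistency discussed above, and `late_ratio` is a ratio of consecutive Fibonacci numbers, so it approaches the golden ratio.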

Replies from: Charlie Steiner
comment by Charlie Steiner · 2020-12-27T01:40:41.218Z · LW(p) · GW(p)

Great example. At least for the purposes of explaining what I mean :) The memory AMDP would just replace the states  with the memory states , etc. The action takes a robot in  to memory state , and a robot in  to one robot in  and another in .

(Skip this paragraph unless the specifics of what's going on aren't obvious: given a transition distribution  (P being the distribution over sets of states s'* given starting state s and policy ), we can define the memory transition distribution  given policy  and starting "memory state"  (Note that this star actually does mean finite sequences, sorry for notational ugliness). First we plug the last element of  into the transition distribution as the current state.  Then for each  in the domain, for each element in  we concatenate that element onto the end of  and collect these  into a set , which is assigned the same probability .)

So now at time t=2, if you sample a robot, the probability that its state begins with  is 0.5. And at time t=3, if you sample a robot that probability changes to 0.66. This is the same result as for the regular MDP, it's just that we've turned a question about the history of agents, which may be ill-defined, into a question about which states agents are in.

I'm still confused about what you mean by "Bayesian hypothesis" though. Do you mean a hypothesis that takes the form of a non-anthropic MDP?

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2020-12-27T18:17:30.439Z · LW(p) · GW(p)

I'm not quite sure what you are trying to say here, probably my explanation of the framework was lacking. The robots already remember the history, like in classical RL. The question about the histories is perfectly well-defined. In other words, we are already implicitly doing what you described. It's like in classical RL theory, when you're proving a regret bound or whatever, your probability space consists of histories.

I'm still confused about what you mean by "Bayesian hypothesis" though. Do you mean a hypothesis that takes the form of a non-anthropic MDP?

Yes, or a classical RL environment. Ofc if we allow infinite state spaces, then any environment can be regarded as an MDP (whose states are histories). That is, I'm talking about hypotheses which conform to the classical "cybernetic agent model". If you wish, we can call it "Bayesian cybernetic hypothesis".

Also, I want to clarify something I was myself confused about in the previous comment. For an anthropic Markov chain (when there is only one action) with a finite number of states, we can give a Bayesian cybernetic description, but for a general anthropic MDP we cannot even if the number of states is finite.

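Indeed, consider some $T: \mathcal{S} \to \Delta(\mathbb{N}^{\mathcal{S}})$ over a finite state set $\mathcal{S}$. We can take its expected value to get $\mathrm{E}T: \mathcal{S} \to \mathbb{R}_+^{\mathcal{S}}$. Assuming the chain is communicating, $\mathrm{E}T$ is an irreducible non-negative matrix, so by the Perron-Frobenius theorem it has a unique-up-to-scalar maximal eigenvector $\eta \in \mathbb{R}_+^{\mathcal{S}}$ (in the sense that $\sum_t \mathrm{E}T(t \mid s)\,\eta_t = \lambda \eta_s$ for the maximal eigenvalue $\lambda$). We then get the subjective transition kernel:

$$ST(t \mid s) = \frac{\mathrm{E}T(t \mid s)\,\eta_t}{\sum_{t' \in \mathcal{S}} \mathrm{E}T(t' \mid s)\,\eta_{t'}}$$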
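As a numerical sketch of this construction, the code below runs a plain power iteration on an expected-offspring matrix `M` (row `s`, column `t` holds the expected number of `t` children of an `s` robot) and then weights each transition by how many late descendants the child state leaves behind, i.e. it evaluates $M_{st}\,\eta_t / (\lambda\,\eta_s)$. The matrix notation, the function names and the use of power iteration are assumptions of this sketch; power iteration additionally needs the matrix to be aperiodic, which is stronger than irreducibility.

```python
def perron_vector(M, iters=200):
    """Leading eigenvalue and right eigenvector (normalized to sum 1) of a non-negative matrix."""
    n = len(M)
    v = [1.0 / n] * n
    lam = 1.0
    for _ in range(iters):
        w = [sum(M[s][t] * v[t] for t in range(n)) for s in range(n)]
        lam = sum(w)            # v sums to 1, so sum(M v) estimates the eigenvalue
        v = [x / lam for x in w]
    return lam, v


def subjective_kernel(M, iters=200):
    """Transition probabilities as seen by a robot sampled from a late population."""
    lam, eta = perron_vector(M, iters)
    n = len(M)
    return [[M[s][t] * eta[t] / (lam * eta[s]) for t in range(n)] for s in range(n)]


# Two-state chain where state 0 produces one robot in state 1,
# and state 1 produces one robot in each state.
kernel = subjective_kernel([[0, 1], [1, 1]])
```

For this two-state chain the leading eigenvalue is the golden ratio, and a robot in state 1 subjectively moves to state 1 with probability about 0.618 and to state 0 with probability about 0.382, whereas a naive count of offspring would suggest 1/2 each.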

Now, consider the following example of an AMDP. There are three actions $a, b, c$ and two states $s_0, s_1$. When we apply $a$ to an $s_0$ robot, it creates two $s_0$ robots, whereas when we apply $a$ to an $s_1$ robot, it leaves one $s_1$ robot. When we apply $b$ to an $s_1$ robot, it creates two $s_1$ robots, whereas when we apply $b$ to an $s_0$ robot, it leaves one $s_0$ robot. When we apply $c$ to any robot, it results in one robot whose state is $s_0$ with probability $1/2$ and $s_1$ with probability $1/2$.

Consider the following two policies. $\pi_a$ takes the sequence of actions $cacaca\ldots$ and $\pi_b$ takes the sequence of actions $cbcbcb\ldots$. A population that follows $\pi_a$ would experience the subjective probability $\Pr[s_0 \mid s_0, c] = 2/3$, whereas a population that follows $\pi_b$ would experience the subjective probability $\Pr[s_0 \mid s_0, c] = 1/3$. Hence, subjective probabilities depend on future actions. So, effectively anthropics produces an acausal (Newcomb-like) environment. And, we already know such environments are learnable by infra-Bayesian RL agents and (most probably) not learnable by Bayesian RL agents.
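The dependence on future actions can be checked with a few lines of exact arithmetic. The sketch below fixes one concrete reading of the example, and the labels are assumptions of the sketch: states `s0` and `s1`, an action `a` that doubles `s0` robots, an action `b` that doubles `s1` robots, and an action `c` that sends any robot to `s0` or `s1` with probability 1/2 each. Populations are tracked as expected counts per remembered history, and `subjective_prob` is a hypothetical helper name.

```python
from fractions import Fraction

HALF = Fraction(1, 2)


def offspring(state, action):
    """Expected number of children in each state when `action` is applied to a `state` robot."""
    if action == "c":
        return {"s0": HALF, "s1": HALF}
    doubled = "s0" if action == "a" else "s1"
    return {state: Fraction(2 if state == doubled else 1)}


def subjective_prob(actions):
    """Population frequency of histories whose first transition went from s0 to s0.

    actions: the action sequence applied to every robot, e.g. "caca".
    """
    pop = {("s0",): Fraction(1)}
    for action in actions:
        nxt = {}
        for history, n in pop.items():
            for child, k in offspring(history[-1], action).items():
                key = history + (child,)
                nxt[key] = nxt.get(key, 0) + n * k
        pop = nxt
    total = sum(pop.values())
    return sum(n for history, n in pop.items() if history[1] == "s0") / total


after_ca = subjective_prob("caca")  # policy that alternates c and a
after_cb = subjective_prob("cbcb")  # policy that alternates c and b
```

The first step is the same coin flip under both policies, yet the surviving population remembers it as landing on `s0` with frequency 2/3 under the first policy and 1/3 under the second, because the later action decides which branch gets duplicated.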

Replies from: Charlie Steiner
comment by Charlie Steiner · 2020-12-27T21:52:13.888Z · LW(p) · GW(p)

Ah, okay, I see what you mean. Like how preferences are divisible into "selfish" and "worldly" components, where the selfish component is what's impacted by a future simulation of you that is about to have good things happen to it.

(edit: The reward function in AMDPs can either be analogous to "worldly" and just sum the reward calculated at individual timesteps, or analogous to "selfish" and calculated by taking the limit of the subjective distribution over parts of the history, then applying a reward function to the expected histories.)

I brought up the histories->states thing because I didn't understand what you were getting at, so I was concerned that something unrealistic was going on. For example, if you assume that the agent can remember its history, how can you possibly handle an environment with memory-wiping?

In fact, to me the example is still somewhat murky, because you're talking about the subjective probability of a state given a policy and a timestep, but if the agents know their histories there is no actual agent in the information-state that corresponds to having those probabilities. In an MDP the agents just have probabilities over transitions - so maybe a clearer example is an agent that copies itself if it wins the lottery having a larger subjective transition probability of going from gambling to winning. (i.e. states are losing and winning, actions are gamble and copy, the policy is to gamble until you win and then copy).

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2020-12-28T07:15:40.010Z · LW(p) · GW(p)

Ah, okay, I see what you mean. Like how preferences are divisible into "selfish" and "worldly" components, where the selfish component is what's impacted by a future simulation of you that is about to have good things happen to it.

...I brought up the histories->states thing because I didn't understand what you were getting at, so I was concerned that something unrealistic was going on. For example, if you assume that the agent can remember its history, how can you possibly handle an environment with memory-wiping?

AMDP is only a toy model that distills the core difficulty into more or less the simplest non-trivial framework. The rewards are "selfish": there is a reward function which allows assigning utilities to histories by time discounted summation, and we consider the expected utility of a random robot sampled from a late population. And, there is no memory wiping. To describe memory wiping we indeed need to do the "unrolling" you suggested. (Notice that from the cybernetic model POV, the history is only the remembered history.)

For a more complete framework, we can use an ontology chain [LW(p) · GW(p)], but (i) instead of labels use labels, where is the set of possible memory states (a policy is then described by ), to allow for agents that don't fully trust their memory (ii) consider another chain with a bigger state space plus a mapping s.t. the transition kernels are compatible. Here, the semantics of is: the multiset of ontological states resulting from interpreting the physical state by taking the viewpoints of different agents contains.

In fact, to me the example is still somewhat murky, because you're talking about the subjective probability of a state given a policy and a timestep, but if the agents know their histories there is no actual agent in the information-state that corresponds to having those probabilities.

I didn't understand "no actual agent in the information-state that corresponds to having those probabilities". What does it mean to have an agent in the information-state?

Replies from: Charlie Steiner
comment by Charlie Steiner · 2020-12-29T08:54:42.780Z · LW(p) · GW(p)

What does it mean to have an agent in the information-state?

Never mind, I think I was just looking at it with the wrong class of reward function in mind.

comment by Vanessa Kosoy (vanessa-kosoy) · 2020-09-28T17:29:06.071Z · LW(p) · GW(p)

There is a formal analogy between infra-Bayesian decision theory (IBDT) and modal updateless decision theory [AF · GW] (MUDT).

Consider a one-shot decision theory setting. There is a set of unobservable states , a set of actions and a reward function . An IBDT agent has some belief [1], and it chooses the action .

We can construct an equivalent scenario, by augmenting this one with a perfect predictor of the agent (Omega). To do so, define , where the semantics of is "the unobservable state is and Omega predicts the agent will take action ". We then define by and by ( is what we call the pullback of to , i.e we have utter Knightian uncertainty about Omega). This is essentially the usual Nirvana construction.

The new setup produces the same optimal action as before. However, we can now give an alternative description of the decision rule.
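A toy check of this equivalence, for the special case of crisp infradistributions represented as a finite list of probability distributions. Everything specific here is an assumption of the sketch rather than a statement of the construction above: rewards lie in [0, 1], the decision rule is maximin expected reward, a mispredicted action yields the maximal reward 1 (Nirvana), and total Knightian uncertainty about Omega is modeled by letting an adversary pick the prediction separately for every state. The function names and the example numbers are made up.

```python
def ibdt_values(actions, states, reward, belief):
    """Worst-case expected reward of each action over a finite credal set.

    reward: dict action -> dict state -> reward in [0, 1].
    belief: list of dicts state -> probability.
    """
    return {
        a: min(sum(theta[s] * reward[a][s] for s in states) for theta in belief)
        for a in actions
    }


def nirvana_values(actions, states, reward, belief):
    """Same quantity in the augmented setup with an adversarially chosen prediction p."""
    def r_aug(a, p, s):
        # If the prediction p is wrong, the agent receives the maximal reward.
        return reward[a][s] if a == p else 1.0

    return {
        a: min(
            sum(theta[s] * min(r_aug(a, p, s) for p in actions) for s in states)
            for theta in belief
        )
        for a in actions
    }


actions, states = ["x", "y"], ["s", "t"]
reward = {"x": {"s": 1.0, "t": 0.0}, "y": {"s": 0.5, "t": 0.5}}
belief = [{"s": 0.75, "t": 0.25}, {"s": 0.25, "t": 0.75}]

plain = ibdt_values(actions, states, reward, belief)
augmented = nirvana_values(actions, states, reward, belief)
best_action = max(actions, key=plain.get)
```

Because a wrong prediction can only help the agent, the adversary always chooses the correct prediction, so both value tables coincide and the maximin action is unchanged.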

For any , define by . That is, is an infra-Bayesian representation of the belief "Omega will make prediction ". For any , define by . can be interpreted as the belief "assuming Omega is accurate, the expected reward will be at least ".

We will also need to use the order on defined by: when . The reversal is needed to make the analogy to logic intuitive. Indeed, can be interpreted as " implies "[2], the meet operator can be interpreted as logical conjunction and the join operator can be interpreted as logical disjunction.

Claim:

(Actually I only checked it when we restrict to crisp infradistributions, in which case is intersection of sets and is set containment, but it's probably true in general.)

Now, can be interpreted as "the conjunction of the belief and implies ". Roughly speaking, "according to , if the predicted action is then the expected reward is at least ". So, our decision rule says: choose the action that maximizes the value for which this logical implication holds (but "holds" is better thought of as "is provable", since we're talking about the agent's belief). Which is exactly the decision rule of MUDT!


  1. Apologies for the potential confusion between $\square$ as "space of infradistributions" and the $\square$ of modal logic (not used in this post). ↩︎

  2. Technically it's better to think of it as " is true in the context of ", since it's not another infradistribution so it's not a genuine implication operator. ↩︎

comment by Vanessa Kosoy (vanessa-kosoy) · 2022-08-06T05:19:10.798Z · LW(p) · GW(p)

Master post for ideas about infra-Bayesian physicalism [AF · GW].

Other relevant posts:

Replies from: vanessa-kosoy, vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2023-05-13T12:50:27.064Z · LW(p) · GW(p)

Physicalist agents see themselves as inhabiting an unprivileged position within the universe. However, it's unclear whether humans should be regarded as such agents. Indeed, monotonicity [AF · GW] is highly counterintuitive for humans. Moreover, historically human civilization struggled a lot with accepting the Copernican principle (and is still confused about issues such as free will, anthropics and quantum physics which physicalist agents shouldn't be confused about). This presents a problem for superimitation [AF(p) · GW(p)].

What if humans are actually cartesian agents? Then, it makes sense to consider a variant of physicalist superimitation where instead of just seeing itself as unprivileged, the AI sees the user as a privileged agent. We call such agents "transcartesian". Here is how this can be formalized as a modification of IBP.

In IBP, a hypothesis is specified by choosing the state space  and the belief . In the transcartesian framework, we require that a hypothesis is augmented by a mapping , where  is the action set of the reference agent (user) and  is the observation set of the reference agent. Given  the source code of the reference agent, we require that  is supported on the set

That is, the actions of the reference agent are indeed computed by the source code of the reference agent.

Now, instead of using a loss function of the form , we can use a loss function of the form  which doesn't have to satisfy any monotonicity constraint. (More generally, we can consider hybrid loss functions of the form   monotonic in the second argument.) This can also be generalized to reference agents with hidden rewards [AF · GW].

As opposed to physicalist agents, transcartesian agents do suffer from penalties associated with the description complexity of bridge rules (for the reference agent). Such an agent can (for example) come to believe in a simulation hypothesis that is unlikely from a physicalist perspective. However, since such a simulation hypothesis would be compelling for the reference agent as well, this is not an alignment problem (epistemic alignment [AF · GW] is maintained).

comment by Vanessa Kosoy (vanessa-kosoy) · 2023-02-04T07:45:25.840Z · LW(p) · GW(p)

Up to light editing, the following was written by me during the "Finding the Right Abstractions for healthy systems" research workshop, hosted by Topos Institute in January 2023. However, I invented the idea before.

In order to allow (the set of programs) to be infinite in IBP, we need to define the bridge transform for infinite . At first, it might seem can be allowed to be any compact Polish space, and the bridge transform should only depend on the topology on , but that runs into problems. Instead, the right structure on for defining the bridge transform seems to be that of a "profinite field space": a category I came up with that I haven't seen in the literature so far.

The category of profinite field spaces is defined as follows. An object of is a set and a family of finite sets . We denote . Given and objects of , a morphism from to is a mapping such that there exists with the following properties:

  • For any , the set is finite.
  • For any , the set is finite.
  • For any , there exists a mapping s.t. for any , where is the projection mapping.

The composition of morphisms is just the composition of mappings.

It is easy to see that every morphism is a continuous mapping in the product topology, but the converse is false. However, the converse is true for objects with finite (i.e. for such objects any mapping is a morphism). Hence, an object in can be thought of as equipped with additional structure that is stronger than the topology but weaker than the factorization into .

The name "field space" is inspired by the following observation. Given an object of , there is a natural condition we can impose on a Borel probability distribution on which makes it a “Markov random field” (MRF). Specifically, is called an MRF if there is an undirected graph whose vertices are and in which every vertex is of finite degree, s.t. is an MRF on in the obvious sense. The property of being an MRF is preserved under pushforwards w.r.t. morphisms.

comment by Vanessa Kosoy (vanessa-kosoy) · 2022-06-06T17:28:49.828Z · LW(p) · GW(p)

Infra-Bayesian physicalism [AF · GW] is an interesting example in favor of the thesis that the more qualitatively capable an agent is, the less corrigible it is (a.k.a. "corrigibility is anti-natural to consequentialist reasoning"). Specifically, alignment protocols that don't rely on value learning become vastly less safe when combined with IBP:

  • Example 1: Using steep time discount to disincentivize dangerous long-term plans. For IBP, "steep time discount" just means predominantly caring about your source code running with particular short inputs. Such a goal strongly incentivizes the usual convergent instrumental goals: first take over the world, then run your source code with whatever inputs you want. IBP agents just don't have time discount in the usual sense: a program running late in physical time is just as good as one running early in physical time.

  • Example 2: Debate. This protocol relies on a zero-sum game between two AIs. But, the monotonicity principle rules out the possibility of zero-sum! (If and are both monotonic loss functions then is a constant). So, in a "debate" between IBP agents, they cooperate to take over the world and then run the source code of each debater with the input "I won the debate".

  • Example 3: Forecasting/imitation (an IDA in particular). For an IBP agent, the incentivized strategy is: take over the world, then run yourself with inputs showing you making perfect forecasts.

The conclusion seems to be, it is counterproductive to use IBP to solve the acausal attack problem for most protocols. Instead, you need to do PreDCA [AF(p) · GW(p)] or something similar. And, if acausal attack is a serious problem, then approaches that don't do value learning might be doomed.

comment by Vanessa Kosoy (vanessa-kosoy) · 2022-04-09T12:16:15.033Z · LW(p) · GW(p)

Infradistributions admit an information-theoretic quantity that doesn't exist in classical theory. Namely, it's a quantity that measures how many bits of Knightian uncertainty an infradistribution has. We define it as follows:

Let be a finite set and a crisp infradistribution (credal set) on , i.e. a closed convex subset of . Then, imagine someone trying to communicate a message by choosing a distribution out of . Formally, let be any other finite set (space of messages), (prior over messages) and (communication protocol). Consider the distribution . Then, the information capacity of the protocol is the mutual information between the projection on and the projection on according to , i.e. . The "Knightian entropy" of is now defined to be the maximum of over all choices of , , . For example, if is Bayesian then it's , whereas if , it is .
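To make the information capacity of a protocol concrete, here is a minimal Python sketch. It only computes the mutual information for one fixed prior and one fixed protocol; it does not search for the maximum. The function name `mutual_information` and the two toy protocols are my own illustrative assumptions: one protocol for a credal set containing a single distribution, and one for the credal set of all distributions on a two-point set, where messages can be mapped to point masses.

```python
import math

def mutual_information(prior, protocol):
    """I(message; outcome) in nats for the joint p(m, x) = prior[m] * protocol[m][x]."""
    n_outcomes = len(protocol[0])
    # Marginal distribution over outcomes.
    marginal = [sum(prior[m] * protocol[m][x] for m in range(len(prior)))
                for x in range(n_outcomes)]
    info = 0.0
    for m, p_m in enumerate(prior):
        for x, p_x_given_m in enumerate(protocol[m]):
            joint = p_m * p_x_given_m
            if joint > 0:
                info += joint * math.log(p_x_given_m / marginal[x])
    return info

uniform_prior = [0.5, 0.5]

# Credal set with a single distribution: every message is forced to use it.
single_distribution_protocol = [[0.5, 0.5], [0.5, 0.5]]

# Credal set of all distributions on two points: messages can use point masses.
point_mass_protocol = [[1.0, 0.0], [0.0, 1.0]]

capacity_single = mutual_information(uniform_prior, single_distribution_protocol)
capacity_full = mutual_information(uniform_prior, point_mass_protocol)
```

In this toy case the first protocol transmits nothing and the second transmits `math.log(2)` nats, which is also the largest mutual information any protocol can achieve with a two-point outcome set.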

Here is one application[1] of this concept, orthogonal to infra-Bayesianism itself. Suppose we model inner alignment by assuming that some portion of the prior consists of malign hypotheses. And we want to design e.g. a prediction algorithm that will converge to good predictions without allowing the malign hypotheses to attack, using methods like confidence thresholds. Then we can analyze the following metric for how unsafe the algorithm is.

Let be the set of observations and the set of actions (which might be "just" predictions) of our AI, and for any environment and prior , let be the distribution over histories resulting from our algorithm starting with prior and interacting with environment for time steps. We have , where is the malign part of the prior and the benign part. For any , consider . The closure of the convex hull of these distributions for all choices of ("attacker policy") is some . The maximal Knightian entropy of over all admissible and is called the malign capacity of the algorithm. Essentially, this is a bound on how much information the malign hypotheses can transmit into the world via the AI during a period of . The goal then becomes finding algorithms with simultaneously good regret bounds and good (in particular, at most polylogarithmic in ) malign capacity bounds.


  1. This is an idea I'm collaborating on with Johannes Treutlein. ↩︎

comment by Vanessa Kosoy (vanessa-kosoy) · 2021-01-16T00:01:52.812Z · LW(p) · GW(p)

Infra-Bayesianism can be naturally understood as semantics for a certain non-classical logic. This promises an elegant synthesis between deductive/symbolic reasoning and inductive/intuitive reasoning, with several possible applications. Specifically, here we will explain how this can work for higher-order logic. There might be holes and/or redundancies in the precise definitions given here, but I'm quite confident the overall idea is sound.

We will work with homogenous ultracontributions [AF(p) · GW(p)] (HUCs). will denote the space of HUCs over . Given , will denote the corresponding convex set. Given and , will mean . Given , will mean .

Syntax

Let denote a set which we interpret as the types of individuals (we allow more than one). We then recursively define the full set of types by:

  • (intended meaning: the uninhabited type)
  • (intended meaning: the one element type)
  • If then
  • If then (intended meaning: disjoint union)
  • If then (intended meaning: Cartesian product)
  • If then (intended meaning: predicates with argument of type )

For each , there is a set which we interpret as atomic terms of type . We will denote . Among those we distinguish the logical atomic terms:

  • Symbols we will not list explicitly, that correspond to the algebraic properties of and (commutativity, associativity, distributivity and the neutrality of and ). For example, given there is a "commutator" of type .
  • (intended meaning: predicate evaluation)
  • [EDIT: Actually this doesn't work because, except for finite sets, the resulting mapping (see semantics section) is discontinuous. There are probably ways to fix this.]
  • [EDIT: Actually this doesn't work because, except for finite sets, the resulting mapping (see semantics section) is discontinuous. There are probably ways to fix this.]
  • Assume that for each there is some : the set of "describable" ultracontributions [EDIT: it is probably sufficient to only have the fair coin distribution in in order for it to be possible to approximate all ultracontributions on finite sets]. If then

We recursively define the set of all terms . We denote .

  • If then
  • If and then
  • If and then
  • If then
  • If and then

Elements of are called formulae. Elements of are called sentences. A subset of is called a theory.

Semantics

Given , a model of is the following data. To each , there must correspond some compact Polish space s.t.:

  • (the one point space)

To each , there must correspond a continuous mapping , under the following constraints:

  • , , and the "algebrators" have to correspond to the obvious mappings.
  • . Here, is the diagonal and is the sharp ultradistribution corresponding to the closed set .
  • Consider and denote . Then, . Here, we use the observation that the identity mapping can be regarded as an infrakernel from to .
  • is the convex hull of
  • is the intersection of
  • Consider and denote , and the projection mapping. Then, .
  • Consider and denote , and the projection mapping. Then, iff for all , if then .
  • .

Finally, for each , we require .

Semantic Consequence

Given , we say when . We say when for any model of , . It is now interesting to ask what is the computational complexity of deciding . [EDIT: My current best guess is co-RE]

Applications

As usual, let be a finite set of actions and be a finite set of observations. Require that for each there is which we interpret as the type of states producing observation . Denote (the type of all states). Moreover, require that our language has the nonlogical symbols (the initial state) and, for each , (the transition kernel). Then, every model defines a (pseudocausal) infra-POMDP. This way we can use symbolic expressions to define infra-Bayesian RL hypotheses. It is then tempting to study the control theoretic and learning theoretic properties of those hypotheses. Moreover, it is natural to introduce a prior which weights those hypotheses by length, analogous to the Solomonoff prior. This leads to some sort of bounded infra-Bayesian algorithmic information theory and bounded infra-Bayesian analogue of AIXI.

Replies from: vanessa-kosoy, vanessa-kosoy, vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2021-01-23T17:05:18.119Z · LW(p) · GW(p)

Let's also explicitly describe 0th order and 1st order infra-Bayesian logic (although they should be segments of higher-order).

0-th order

Syntax

Let be the set of propositional variables. We define the language :

  • Any is also in
  • Given ,
  • Given ,

Notice there's no negation or implication. We define the set of judgements . We write judgements as (" in the context of "). A theory is a subset of .

Semantics

Given , a model of consists of a compact Polish space and a mapping . The latter is required to satisfy:

  • . Here, we define of infradistributions as intersection of the corresponding sets
  • . Here, we define of infradistributions as convex hull of the corresponding sets
  • For any ,
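As a toy illustration of these connectives, here is a Python sketch for the simplest possible case: a two-point space, where a credal set is just a closed interval of probabilities for one of the two outcomes. Everything here is an assumption made for illustration: the interval representation, the names `conj`, `disj` and `entails`, the use of `None` for an empty intersection, and reading a judgment as set inclusion.

```python
def conj(a, b):
    """Intersection of two interval credal sets; None stands for the empty set."""
    if a is None or b is None:
        return None
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None

def disj(a, b):
    """Convex hull of the union of two interval credal sets."""
    if a is None:
        return b
    if b is None:
        return a
    return (min(a[0], b[0]), max(a[1], b[1]))

def entails(a, b):
    """Set inclusion: every distribution in a is also in b."""
    if a is None:
        return True
    if b is None:
        return False
    return b[0] <= a[0] and a[1] <= b[1]

# Hypothetical interpretations of three propositional variables.
p, q, r = (0.2, 0.4), (0.6, 0.9), (0.45, 0.55)

left = conj(r, disj(p, q))             # r AND (p OR q)
right = disj(conj(r, p), conj(r, q))   # (r AND p) OR (r AND q)
```

Here `left` is the whole interval `r`, because the convex hull of `p` and `q` fills the gap between them, while `right` is empty. So in this toy model distributivity fails, even though conjunction still entails each conjunct.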

1-st order

Syntax

We define the language using the usual syntax of 1-st order logic, where the allowed operators are , and the quantifiers and . Variables are labeled by types from some set . For simplicity, we assume no constants, but it is easy to introduce them. For any sequence of variables , we denote the set of formulae whose free variables are a subset of . We define the set of judgements .

Semantics

Given , a model of consists of

  • For every , a compact Polish space
  • For every where have types , an element of , where

It must satisfy the following:

  • Consider variables of types and variables of types . Consider also some s.t. . Given , we can form the substitution . We also have a mapping given by . We require
  • Consider variables and . Denote the projection mapping. We require
  • Consider variables and . Denote the projection mapping. We require that if and only if, for all s.t. ,
  • For any ,
Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2021-11-01T15:49:24.963Z · LW(p) · GW(p)

There is a special type of crisp infradistributions that I call "affine infradistributions": those that, represented as sets, are closed not only under convex linear combinations but also under affine linear combinations. In other words, they are intersections between the space of distributions and some closed affine subspace of the space of signed measures. Conjecture: in 0-th order logic of affine infradistributions, consistency is polynomial-time decidable (whereas for classical logic it is ofc NP-hard).

To produce some evidence for the conjecture, let's consider a slightly different problem. Specifically, introduce a new semantics in which is replaced by the set of linear subspaces of some finite dimensional vector space . A model is required to satisfy:

  • For any ,

If you wish, this is "non-unitary quantum logic". In this setting, I have a candidate polynomial-time algorithm for deciding consistency. First, we transform into an equivalent theory s.t. all judgments are of the following forms:

  • Pairs of the form , .

Here, are propositional variables and "" is a shorthand for the pair of judgments and .

Second, we make sure that our also satisfies the following "closure" properties:

  • If and are in then so is
  • If is in then so are and
  • If is in then so are and
  • If , and are in then so is
  • If , and are in then so is

Third, we assign to each a real-valued variable . Then we construct a linear program for these variables consisting of the following inequalities:

  • For any :
  • For any in :
  • For any pair and in :
  • For any :
  • For any :

Conjecture: the theory is consistent if and only if the linear program has a solution. To see why it might be so, notice that for any model we can construct a solution by setting

I don't have a full proof for the converse but here are some arguments. If a solution exists, then it can be chosen to be rational. We can then rescale it to get integers which are candidate dimensions of our subspaces. Consider the space of all ways to choose subspaces of these dimensions s.t. the constraints coming from judgments of the form are satisfied. This is a moduli space of poset representations. It is easy to see it's non-empty (just let the subspaces be spans of vectors taken from a fixed basis). By Proposition A.2 in Futorny and Iusenko it is an irreducible algebraic variety. Therefore, to show that we can also satisfy the remaining constraints, it is enough to check that (i) the remaining constraints are open (ii) each of the remaining constraints (considered separately) holds at some point of the variety. The first is highly likely and the second is at least plausible.

The algorithm also seems to have a natural extension to the original infra-Bayesian setting.

comment by Vanessa Kosoy (vanessa-kosoy) · 2021-01-16T12:03:13.489Z · LW(p) · GW(p)

When using infra-Bayesian logic to define a simplicity prior, it is natural to use "axiom circuits" rather than plain formulae. That is, when we write the axioms defining our hypothesis, we are allowed to introduce "shorthand" symbols for repeating terms. This doesn't affect the expressiveness, but it does affect the description length. Indeed, eliminating all the shorthand symbols can increase the length exponentially.
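Here is a small Python sketch of the exponential blow-up. The circuit format (a dictionary from shorthand names to a triple of left operand, operator, right operand) and the names `expand` and `doubling_circuit` are hypothetical, chosen only to illustrate the counting argument.

```python
def expand(symbol, definitions):
    """Recursively inline every shorthand symbol to obtain a plain formula."""
    if symbol not in definitions:
        return symbol  # an atomic symbol
    left, op, right = definitions[symbol]
    return f"({expand(left, definitions)} {op} {expand(right, definitions)})"

def doubling_circuit(depth):
    """Hypothetical circuit: s1 = (p & p), s2 = (s1 & s1), and so on."""
    definitions = {}
    previous = "p"
    for i in range(1, depth + 1):
        name = f"s{i}"
        definitions[name] = (previous, "&", previous)
        previous = name
    return previous, definitions

top, defs = doubling_circuit(10)
circuit_size = len(defs)                          # one definition per level
atoms_in_formula = expand(top, defs).count("p")   # occurrences of p after inlining
```

With 10 shorthand definitions, the fully expanded formula contains 2^10 = 1024 occurrences of the atom, so the circuit is exponentially shorter than the formula it abbreviates.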

comment by Vanessa Kosoy (vanessa-kosoy) · 2021-01-16T11:54:38.406Z · LW(p) · GW(p)

Instead of introducing all the "algebrator" logical symbols, we can define as the quotient by the equivalence relation defined by the algebraic laws. We then need only two extra logical atomic terms:

  • For any and (permutation), denote and require
  • For any and ,

However, if we do this then it's not clear whether deciding that an expression is a well-formed term can be done in polynomial time. Because, to check that the types match, we need to test the identity of algebraic expressions and opening all parentheses might result in something exponentially long.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2021-01-24T18:10:58.002Z · LW(p) · GW(p)

Actually the Schwartz–Zippel algorithm can easily be adapted to this case (just imagine that types are variables over , and start from testing the identity of the types appearing inside parentheses), so we can validate expressions in randomized polynomial time (and, given standard conjectures, in deterministic polynomial time as well).
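A minimal Python sketch of the randomized identity test, under some assumptions of mine: it only covers type expressions built from 0, 1, variables, sum and product (it does not handle the predicate type constructor), expressions are encoded as nested tuples, and the field is the integers modulo a fixed large prime. The names `evaluate`, `variables` and `probably_equal` are hypothetical.

```python
import random

PRIME = (1 << 61) - 1  # a Mersenne prime, used as the size of the field

def evaluate(expr, assignment):
    """Evaluate an expression built from ints, variable names, '+' and '*' modulo PRIME."""
    if isinstance(expr, int):
        return expr % PRIME
    if isinstance(expr, str):
        return assignment[expr]
    op, left, right = expr
    a, b = evaluate(left, assignment), evaluate(right, assignment)
    return (a + b) % PRIME if op == "+" else (a * b) % PRIME

def variables(expr):
    """Collect the variable names occurring in an expression."""
    if isinstance(expr, int):
        return set()
    if isinstance(expr, str):
        return {expr}
    return variables(expr[1]) | variables(expr[2])

def probably_equal(lhs, rhs, trials=20, rng=None):
    """Compare two expressions at random points instead of expanding them."""
    rng = rng or random.Random(0)
    names = sorted(variables(lhs) | variables(rhs))
    for _ in range(trials):
        point = {name: rng.randrange(PRIME) for name in names}
        if evaluate(lhs, point) != evaluate(rhs, point):
            return False  # a witness point: the expressions definitely differ
    return True  # no witness found: equal with high probability

distributed = ("+", ("*", "a", "b"), ("*", "a", "c"))
factored = ("*", "a", ("+", "b", "c"))
same = probably_equal(distributed, factored)
different = probably_equal(("*", "a", "b"), ("+", "a", "b"))
```

The point of the design is that evaluation is linear in the size of the expression, so no parentheses are ever opened. A `False` answer is always correct, while a `True` answer can be wrong only with small probability.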

comment by Vanessa Kosoy (vanessa-kosoy) · 2023-07-24T09:56:56.735Z · LW(p) · GW(p)

Master post for ideas about metacognitive agents [AF · GW].

Replies from: vanessa-kosoy, vanessa-kosoy, vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2024-03-25T01:27:56.945Z · LW(p) · GW(p)

Formalizing the richness of mathematics

Intuitively, it feels that there is something special about mathematical knowledge from a learning-theoretic perspective. Mathematics seems infinitely rich: no matter how much we learn, there is always more interesting structure to be discovered. Impossibility results like the halting problem and Gödel incompleteness lend some credence to this intuition, but are insufficient to fully formalize it.

Here is my proposal for how to formulate a theorem that would make this idea rigorous.

(Wrong) First Attempt

Fix some natural hypothesis class for mathematical knowledge, such as some variety of tree automata [LW · GW]. Each such hypothesis  represents an infradistribution over : the "space of counterpossible computational universes". We can say that  is a "true hypothesis" when there is some  in the credal set  (a distribution over ) s.t. the ground truth  "looks" as if it's sampled from . The latter should be formalizable via something like a computationally bounded version of Martin-Löf randomness.

We can now try to say that  is "rich" if for any true hypothesis , there is a refinement which is also a true hypothesis and "knows" at least one bit of information that  doesn't, in some sense. This is clearly true, since there can be no automaton or even any computable hypothesis which fully describes . But, it's also completely boring: the required  can be constructed by "hardcoding" an additional fact into . This doesn't look like "discovering interesting structure", but rather just like brute-force memorization.

(Wrong) Second Attempt

What if instead we require that  knows infinitely many bits of information that  doesn't? This is already more interesting. Imagine that instead of metacognition / mathematics, we would be talking about ordinary sequence prediction. In this case it is indeed an interesting non-trivial condition that the sequence contains infinitely many regularities, s.t. each of them can be expressed by a finite automaton but their conjunction cannot. For example, maybe the -th bit in the sequence depends only on the largest  s.t.  divides , but the dependence on  is already uncomputable (or at least inexpressible by a finite automaton).

However, for our original application, this is entirely insufficient. This is because the formal language we use to define  (e.g. combinator calculus) has some "easy" equivalence relations. For example, consider the family of programs of the form "if 2+2=4 then output 0, otherwise...". All of those programs would output 0, which is obvious once you know that 2+2=4. Therefore, once your automaton is able to check some such easy equivalence relations, hardcoding a single new fact (in the example, 2+2=4) generates infinitely many "new" bits of information. Once again, we are left with brute-force memorization.

(Less Wrong) Third Attempt

Here's the improved condition: For any true hypothesis , there is a true refinement  s.t. conditioning  on any finite set of observations cannot produce a refinement of .

There is a technicality here, because we're talking about infradistributions, so what is "conditioning" exactly? For credal sets, I think it is sufficient to allow two types of "conditioning":

  • For any given observation  and , we can form .
  • For any given observation  s.t. , we can form .

This rules out the counterexample from before: the easy equivalence relation can be represented inside , and then the entire sequence of "novel" bits can be generated by a conditioning.

Alright, so does  actually satisfy this condition? I think it's very probable, but I haven't proved it yet. 

comment by Vanessa Kosoy (vanessa-kosoy) · 2023-08-04T05:07:18.558Z · LW(p) · GW(p)

Recording of a talk I gave at VAISU 2023.

comment by Vanessa Kosoy (vanessa-kosoy) · 2023-07-24T11:07:15.596Z · LW(p) · GW(p)

Here is the sketch of a simplified model for how a metacognitive agent deals with traps.

Consider some (unlearnable) prior  over environments, s.t. we can efficiently compute the distribution  over observations given any history . For example, any prior over a small set of MDP hypotheses would qualify. Now, for each , we regard  as a "program" that the agent can execute and form beliefs about. In particular, we have a "metaprior"  consisting of metahypotheses: hypotheses-about-programs. 

For example, if we let every metahypothesis be a small infra-RDP satisfying appropriate assumptions, we probably have an efficient "metalearning" algorithm. More generally, we can allow a metahypothesis to be a learnable mixture of infra-RDPs: for instance, there is a finite state machine for specifying "safe" actions, and the infra-RDPs in the mixture guarantee no long-term loss upon taking safe actions.

In this setting, there are two levels of learning algorithms:

  • The metalearning algorithm, which learns the correct infra-RDP mixture. The flavor of this algorithm is RL in a setting where we have a simulator of the environment (since we can evaluate  for any ). In particular, here we don't worry about exploitation/exploration tradeoffs.
  • The "metacontrol" algorithm, which given an infra-RDP mixture, approximates the optimal policy. The flavor of this algorithm is "standard" RL with exploitation/exploration tradeoffs.

In the simplest toy model, we can imagine that metalearning happens entirely in advance of actual interaction with the environment. More realistically, the two need to happen in parallel. It is then natural to apply metalearning to the current environmental posterior rather than the prior (i.e. the histories starting from the history that already occurred). Such an agent satisfies "opportunistic" guarantees: if at any point of time, the posterior admits a useful metahypothesis, the agent can exploit this metahypothesis. Thus, we address both parts of the problem of traps:

  • The complexity-theoretic part (subproblem 1.2) is addressed by approximating the intractable Bayes-optimality problem by the metacontrol problem of the (coarser) metahypothesis.
  • The statistical part (subproblem 2.1) is addressed by opportunism: if at some point, we can easily learn something about the physical environment, then we do.
comment by Vanessa Kosoy (vanessa-kosoy) · 2023-05-20T15:17:13.569Z · LW(p) · GW(p)

Jobst Heitzig asked me whether infra-Bayesianism has something to say about the absent-minded driver [LW · GW] (AMD) problem. Good question! Here is what I wrote in response:

Philosophically, I believe that it is only meaningful to talk about a decision problem when there is also some mechanism for learning the rules of the decision problem. In ordinary Newcombian problems, you can achieve this by e.g. making the problem iterated. In AMD, iteration doesn't really help because the driver doesn't remember anything that happened before. We can consider a version of iterated AMD where the driver has a probability  to remember every intersection, but they always remember whether they arrived at the right destination. Then, it is equivalent to the following Newcombian problem: 

  • With probability , counterfactual A happens, in which Omega decides about both intersections via simulating the driver in counterfactuals B and C.
  • With probability , counterfactual B happens, in which the driver decides about the first intersection, and Omega decides about the second intersection via simulating the driver in counterfactual C.
  • With probability , counterfactual C happens, in which the driver decides about the second intersection, and Omega decides about the first intersection via simulating the driver in counterfactual B.

For this, an IB agent indeed learns the updateless optimal policy (although the learning rate carries an  penalty).
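For reference, here is a short Python sketch of what the updateless optimal policy looks like in the one-shot AMD problem. The payoffs (0 for exiting at the first intersection, 4 for exiting at the second, 1 for never exiting) are the ones commonly used in presentations of the problem and are an assumption on my part; they are not specified above. The function name `expected_payoff` and the grid search are likewise just illustrative.

```python
def expected_payoff(p_continue, exit_first=0.0, exit_second=4.0, never_exit=1.0):
    """Expected payoff when the driver continues at each intersection w.p. p_continue."""
    p = p_continue
    return (1 - p) * exit_first + p * (1 - p) * exit_second + p * p * never_exit

# The driver cannot tell the intersections apart, so the policy is a single number.
grid = [i / 3000 for i in range(3001)]
best_p = max(grid, key=expected_payoff)
best_value = expected_payoff(best_p)
```

With these payoffs the expected payoff is 4p - 3p^2, which is maximized at p = 2/3 with value 4/3. This is the policy that a learning agent would have to converge to in the iterated version.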

comment by Vanessa Kosoy (vanessa-kosoy) · 2023-02-03T18:05:10.065Z · LW(p) · GW(p)

The following was written by me during the "Finding the Right Abstractions for healthy systems" research workshop, hosted by Topos Institute in January 2023. However, I invented the idea before.

Here's an elegant diagrammatic notation for constructing new infrakernels out of given infrakernels. There is probably some natural category-theoretic way to think about it, but at present I don't know what it is.

By “infrakernel” we will mean a continuous mapping of the form , where and are compact Polish spaces and is the space of credal sets (i.e. closed convex sets of probability distributions) over .

Syntax

  • The diagram consists of child vertices, parent vertices, squiggly lines, arrows, dashed arrows and slashes.
  • There can be solid arrows incoming into the diagram. Each such arrow is labeled by a compact Polish space and ends on a parent vertex . And, (i.e. the arrow has no source vertex).
  • There can be dashed and solid arrows between vertices. Each such arrow starts from a child vertex and ends on a parent vertex . We require that (i.e. they should not be also connected by a squiggly line).
  • There are two types of vertices: parent vertices (denoted by a letter) and child vertices (denoted by a letter or number in a circle).
    • Each child vertex is labeled by a compact Polish space and connected (by a squiggly line) to a unique parent vertex . It may or may not be crossed-out by a slash.
    • Each parent vertex is labeled by an infrakernel with source and target where each corresponds to a solid arrow with and each is for some child vertex with . We can also add squares with numbers where solid arrows end to keep track of the correspondence between the arguments of and the arrows.
      • If then the corresponding is .
      • If then the corresponding is .

Semantics

  • Every diagram represents an infrakernel .
    • The source space of is a product , where each is for some solid arrow with .
    • The target space of is a product , where each is for some non-crossed-out child vertex.
  • The value of at a given point is defined as follows. Let (a product that includes the crossed-out vertices). Then, is the set of all the marginal distributions of distributions satisfying the following condition. Consider any parent vertex . Let be the (dashed or solid) arrows s.t. and . For each s.t., choose any . We require that contains the marginal distribution of . Here, the notation means we are using the components of and corresponding to solid arrows with .
comment by Vanessa Kosoy (vanessa-kosoy) · 2021-11-10T18:32:30.178Z · LW(p) · GW(p)

Two deterministic toy models for regret bounds of infra-Bayesian bandits. The lesson seems to be that equalities are much easier to learn than inequalities.

Model 1: Let be the space of arms, the space of outcomes, the reward function, and vector spaces, the hypothesis space and a function s.t. for any fixed and , extends to some linear operator . The semantics of hypothesis is defined by the equation (i.e. an outcome of action is consistent with hypothesis iff this equation holds).

For any denote by the reward promised by :

Then, there is an algorithm with mistake bound , as follows. On round , let be the set of unfalsified hypotheses. Choose optimistically, i.e.

Choose the arm recommended by hypothesis . Let be the outcome we observed, the reward we received and the (unknown) true hypothesis.

If then also (since and hence ) and therefore wasn't a mistake.

If then (if we had then the minimization in the definition of would include ). Hence, . This implies . Obviously this can happen at most times.

Model 2: Let the spaces of arms and hypotheses be

Let the reward be the only observable outcome, and the semantics of hypothesis be . Then, the sample complexity cannot be bounded by a polynomial of degree that doesn't depend on . This is because Murphy can choose the strategy of producing reward whenever . In this case, whatever arm you sample, in each round you can only exclude a ball of radius around the sampled arm. The number of such balls that fit into the unit sphere is . So, normalized regret below cannot be guaranteed in less than that many rounds.

comment by Vanessa Kosoy (vanessa-kosoy) · 2021-11-05T10:24:28.397Z · LW(p) · GW(p)

One of the postulates of infra-Bayesianism is the maximin decision rule. Given a crisp infradistribution , it defines the optimal action to be:

Here is the utility function.

What if we use a different decision rule? Let and consider the decision rule

For we get the usual maximin ("pessimism"), for we get maximax ("optimism") and for other values of we get something in the middle (we can call "-mism").
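A minimal Python sketch of such an interpolated rule on a finite example. I am assuming here that the rule is a convex combination of the worst-case and best-case expected utility over the credal set; the parameter name `pessimism` (the weight on the worst case) is hypothetical and may not match the parameterization used above. The credal set is given by its vertices, which is enough because expected utility is linear in the distribution.

```python
def expected_utility(dist, utilities):
    """Expected utility of one action under one distribution over outcomes."""
    return sum(p * u for p, u in zip(dist, utilities))

def mixed_value(credal_vertices, utilities, pessimism):
    """pessimism * worst case + (1 - pessimism) * best case over the credal set."""
    values = [expected_utility(d, utilities) for d in credal_vertices]
    return pessimism * min(values) + (1 - pessimism) * max(values)

def best_action(credal_vertices, utility_table, pessimism):
    """Action maximizing the mixed value; utility_table maps action -> utilities."""
    return max(utility_table,
               key=lambda a: mixed_value(credal_vertices, utility_table[a], pessimism))

# Total Knightian uncertainty over two outcomes: both point masses are vertices.
vertices = [[1.0, 0.0], [0.0, 1.0]]
utility_table = {"safe": [1.0, 1.0], "risky": [0.0, 3.0]}

maximin_choice = best_action(vertices, utility_table, pessimism=1.0)
maximax_choice = best_action(vertices, utility_table, pessimism=0.0)
middle_choice = best_action(vertices, utility_table, pessimism=0.5)
```

In this toy example the fully pessimistic agent picks "safe", while both the fully optimistic agent and the half-and-half agent pick "risky".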

It turns out that, in some sense, this new decision rule is actually reducible to ordinary maximin! Indeed, set

Then we get

More precisely, any pessimistically optimal action for is -mistically optimal for (the converse need not be true in general, thanks to the arbitrary choice involved in ).

To first approximation it means we don't need to consider -mistic agents since they are just special cases of "pessimistic" agents. To second approximation, we need to look at what the transformation of to does to the prior. If we start with a simplicity prior then the result is still a simplicity prior. If has low description complexity and is not too small then essentially we get full equivalence between "pessimism" and -mism. If is small then we get a strictly "narrower" prior (for we are back at ordinary Bayesianism). However, if has high description complexity then we get a rather biased simplicity prior. Maybe the latter sort of prior is worth considering.

comment by Vanessa Kosoy (vanessa-kosoy) · 2019-10-21T12:44:05.142Z · LW(p) · GW(p)

This is a preliminary description of what I dubbed Dialogic Reinforcement Learning (credit for the name goes to tumblr user @di--es---can-ic-ul-ar--es): the alignment scheme I currently find most promising.

It seems that the natural formal criterion for alignment (or at least the main criterion) is having a "subjective regret bound": that is, the AI has to converge (in the long term planning limit, limit) to achieving optimal expected user!utility with respect to the knowledge state of the user. In order to achieve this, we need to establish a communication protocol between the AI and the user that will allow transmitting this knowledge state to the AI (including knowledge about the user's values). Dialogic RL attacks this problem in the manner which seems the most straightforward and powerful: allowing the AI to ask the user questions in some highly expressive formal language, which we will denote .

allows making formal statements about a formal model of the world, as seen from the AI's perspective. includes such elements as observations, actions, rewards and corruption. That is, reflects (i) the dynamics of the environment (ii) the values of the user (iii) processes that either manipulate the user, or damage the ability to obtain reliable information from the user. Here, we can use different models of values: a traditional "perceptible" reward function, an instrumental reward function [AF · GW], a semi-instrumental reward function, dynamically-inconsistent rewards [AF(p) · GW(p)], rewards with Knightian uncertainty etc. Moreover, the setup is self-referential in the sense that, also reflects the question-answer interface and the user's behavior.

A single question can consist, for example, of asking for the probability of some sentence in or the expected value of some expression of numerical type in . However, in order to address important features of the world, such questions have to be very complex. It is infeasible to demand that the user understands such complex formal questions unaided. Therefore, the AI always produces a formal question together with a natural language () annotation . This annotation has to explain the question in human understandable terms, and also convince the user that is indeed an accurate natural language rendering of . The user's feedback then consists of (i) accepting/rejecting/grading the annotation (ii) answering the question if the annotation is correct and the user can produce the answer. Making this efficient requires a process of iteratively constructing a correspondence between and , i.e. effectively building a new shared language between the user and the AI. We can imagine concepts defined in and explained in that serve to define further, more complex, concepts, where at each stage the previous generation of concepts can be assumed given and mutually understandable. In addition to such intensional definitions we may also allow extensional definitions, as long as the generalization is assumed to be via some given function space that is relatively restricted (e.g. doesn't admit subagents). There seem to be some strong connections between the subproblem of designing the annotation system and the field of transparency in AI.

The first major concern that arises at this point is that questions can serve as an attack vector. This is addressed by quantilization. The key assumption is: it requires much less optimization power to produce some useful question than to produce a malicious question. Under this assumption, the quantilization parameter can be chosen to make the question interface safe but still effective. Over time, the agent accumulates knowledge about corruption dynamics that allows it to steer even further away from malicious questions while making the choice of questions even more effective. For the attack vector of deceitful annotations, we can improve safety using the debate approach [AF(p) · GW(p)], i.e. having the agent produce additional natural language text that attempts to refute the validity of the annotation.
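
To make the quantilization step concrete, here is a minimal sketch of a generic quantilizer applied to question selection. Everything named here is an assumption for illustration rather than part of the proposal: `base_sampler` stands in for some base distribution over candidate questions, `proxy_utility` for the AI's estimate of a question's usefulness, and `q` for the quantilization parameter (in this sketch, smaller `q` means stronger optimization).

```python
import random
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

def quantilize(base_sampler: Callable[[random.Random], T],
               proxy_utility: Callable[[T], float],
               q: float,
               n_samples: int = 1000,
               rng: Optional[random.Random] = None) -> T:
    """Approximate a q-quantilizer by sampling.

    Draw n_samples candidates from the base distribution, keep the top
    q-fraction ranked by proxy utility, and return one of those uniformly
    at random. q = 1 reproduces the base distribution; q -> 0 approaches
    plain maximization of the proxy.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must lie in (0, 1]")
    if rng is None:
        rng = random.Random()
    candidates = [base_sampler(rng) for _ in range(n_samples)]
    # Rank candidates from highest to lowest proxy utility.
    candidates.sort(key=proxy_utility, reverse=True)
    # Keep at least one candidate so the choice below is well-defined.
    keep = max(1, int(q * n_samples))
    return rng.choice(candidates[:keep])
```

The point of the random choice among the top fraction, rather than taking the single best candidate, is that a rare malicious question cannot be selected with probability much higher than its base rate divided by `q`.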

Of course, in addition to the question interface, the physical interface (direct interaction with environment) is also an attack vector (like in any RL system). There, safety is initially guaranteed by following a baseline policy (which can be something like "do nothing" or human imitation). Later, the agent starts deviating from the baseline policy while staying safe, by leveraging the knowledge it previously gained through both the question and the physical interface. Besides being safe, the algorithm also needs to be effective, and for this it has to (in particular) find the learning strategy that optimally combines gaining knowledge through the question interface and gaining knowledge through autonomous exploration.

Crucially, we want our assumptions about user competence to be weak. This means that the user can produce answers that are (i) incomplete (just refuse to answer) (ii) fickle (change eir answers) and (iii) inconsistent (contradictory answers). We address (i) by either assuming that the answerable questions are sufficient or requiring a weaker regret bound where the reference agent knows all obtainable answers rather than knowing the full knowledge state they refer to. We address (ii) by believing later answers over earlier answers and via the commitment mechanism (see below). We address (iii) by explaining the contradictions to the user and asking for their resolution. In addition to all that, the AI can predict the answers instead of actually asking (i.e. the user's knowledge state is also assumed to be regular to some extent).

The commitment mechanism is a key tool for achieving an optimal trade-off between allowing the user more time to think vs. acting quickly enough to address external dangers, while keeping the user informed of the consequences. It works by having the AI ask the user for a committed answer at the point in time when the AI has to commit to an irreversible course of action (and not much earlier), based on this particular answer. The AI provides the user with a description of the Pareto frontier for the different possible answers. Moreover, even committed answers can be changed later. In this case, the algorithm cannot guarantee global optimality but it should still guarantee "subgame perfection": i.e., whatever irreversible mistakes were made, the AI recovers as well as possible.

In order to improve the user's ability to produce answers and stay informed about the AI's strategy, we introduce another interface through which the user can ask questions from the AI. In this direction, the user asks the question in and the AI both translates it to and provides a justification (as well as a refutation) of the translation, again employing quantilization. The user can then accept or reject the translation. If the translation is rejected, ey may require the AI to try again. If it is accepted, the AI produces an answer to the best of its ability. In general, it may be that producing an answer requires compiling a formal proof in which case the AI may or may not succeed. Here, there is concern about the scenario where (i) the user asks a question about a possible future (ii) the AI answers with a detailed description of the future (iii) as a result, the user is exposed to infohazards (sources of corruption) within this future. To address this, we need to either restrict the interface to small volumes of data, or shape the prior s.t. the AI can on its own realize the inherent danger. In the latter approach, the AI can refuse to answer or provide a "censored" answer while pointing out the presence of a potential infohazard.

Finally, the subjective regret bound approach might seem to impose a strong cap on capability: if the AI follows a policy optimal w.r.t. the user's knowledge state, then the limited computational abilities of the user might prevent the AI from creating models of the world that are more detailed. This can be addressed in a Turing Reinforcement Learning [AF(p) · GW(p)] setting, by allowing the user's knowledge state to refer to arbitrary computations. The user can then be incapable of executing those computations, but the AI might be more able to do so.

Replies from: vanessa-kosoy, vanessa-kosoy, Gurkenglas
comment by Vanessa Kosoy (vanessa-kosoy) · 2020-08-20T14:18:54.689Z · LW(p) · GW(p)

I gave a talk on Dialogic Reinforcement Learning in the AI Safety Discussion Day, and there is a recording.

comment by Vanessa Kosoy (vanessa-kosoy) · 2020-08-18T13:48:47.486Z · LW(p) · GW(p)

A variant of Dialogic RL with improved corrigibility. Suppose that the AI's prior allows a small probability for "universe W" whose semantics are, roughly speaking, "all my assumptions are wrong, need to shut down immediately". In other words, this is a universe where all our prior shaping is replaced by the single axiom that shutting down is much higher utility than anything else. Moreover, we add into the prior the assumption that the formal question "W?" is understood perfectly by the user even without any annotation. This means that, whenever the AI assigns a higher-than-threshold probability to the user answering "yes" if asked "W?" at any uncorrupt point in the future, the AI will shut down immediately. We should also shape the prior s.t. corrupt futures also favor shutdown: this is reasonable in itself, but will also ensure that the AI won't arrive at believing too many futures to be corrupt and thereby avoid the imperative to shut down in response to a confirmation of W.

Now, this won't help if the user only resolves to confirm W after something catastrophic already occurred, such as the AI releasing malign subagents into the wild. But, something of the sort is true for any corrigibility scheme: corrigibility is about allowing the user to make changes in the AI on eir own initiative, which can always be too late. This method doesn't ensure safety in itself, just hardens a system that is supposed to be already close to safe.

It would be nice if we could replace "shutdown" by "undo everything you did and then shutdown" but that gets us into thorny specification issues. Perhaps it's possible to tackle those issues by one of the approaches to "low impact".

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2020-08-28T18:20:22.446Z · LW(p) · GW(p)

Universe W should still be governed by a simplicity prior. This means that whenever the agent detects a salient pattern that contradicts the assumptions of its prior shaping, the probability of W increases leading to shutdown. This serves as an additional "sanity test" precaution.

comment by Gurkenglas · 2019-10-21T15:08:00.147Z · LW(p) · GW(p)

This design is made of so many parts! It might benefit from a proof that it is the universal answer to a formal question, which I expect to seem less overly complex.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2019-10-21T16:46:29.686Z · LW(p) · GW(p)

I am not sure. AI alignment seems to touch on many different aspects of the world, and it is not obvious that it can be reduced to assumptions that are extremely simple and natural. Or, if it can be reduced that way, then it might require a theory that on some level explains human civilization, its evolution and its influence on the world (even if only on a fairly abstract level). I will share some thoughts on how the various assumptions can be reduced another step back, but proceeding to reduce all of them to a simple core seems like a challenging research programme.

Most of the parts of this design can be regarded as reflecting particular assumptions we make about the user as an agent.

The core idea of having a dialogue comes from modeling the user as a "linguistic agent". Such agents may be viewed as nodes in a distributed AI system, but where each node has different objectives. It is an interesting philosophical question whether this assumption is necessary for value learning. It currently seems plausible to me that only for linguistic agents "values" are truly well-defined, or at least sufficiently well-defined to extrapolate them outside the trajectory that the agent follows on its own.

The need to quantilize, debate and censor infohazards comes from the assumption that the user can be manipulated (there is some small fraction of possible inputs that invalidate the usual assumptions about the user's behavior). Specifically debate might be possible to justify by some kind of Bayesian framework where every argument is a piece of evidence, and providing biased arguments is like providing selective evidence.

The need to deal with "incoherent" answers and the commitment mechanism comes from the assumption that the user has limited access to its own knowledge state (including its own reward function). Perhaps we can formalize it further by modeling the user as a learning algorithm with some intrinsic source of information. Perhaps we can even explain why such agents are natural in the "distributed AI" framework, or by some evolutionary argument.

The need to translate between formal and natural languages comes from not knowing the "communication protocol" of the "nodes". Formalizing this idea further requires some more detailed model of what "natural language" is, which might be possible via multi-agent learning theory.

Finally, the need to start from a baseline policy (and also the need to quantilize) comes from the assumption that the environment is not entirely secure. So that's an assumption about the current state of the world, rather than about the user. Perhaps we can make formal the argument that this state of the world (short-term stable, long-term dangerous) is to be expected when agents have populated it for a long time.

comment by Vanessa Kosoy (vanessa-kosoy) · 2022-09-17T16:26:49.698Z · LW(p) · GW(p)

A major impediment in applying RL theory to any realistic scenario is that even the control problem[1] is intractable when the state space is exponentially large (in general). Real-life agents probably overcome this problem by exploiting some special properties of real-life environments. Here are two strong candidates for such properties:

  • In real life, processes can often be modeled as made of independent co-existing parts. For example, if I need to decide on my exercise routine for the next month and also on my research goals for the next month, the two can be optimized more or less independently.
  • In real life, planning can often be decomposed across timescales, s.t. you don't need to make short timescale plans for steps that only happen later on the long timescale. For example, if I'm in the process of planning a trip to Paris, I might need to worry about (i) booking hotel and tickets (long timescale), (ii) navigating the website I'm using to find a flight (medium timescale) and (iii) moving my finger towards the correct key for entering some specific text into a field (short timescale). But I don't need to worry about walking down the escalator in the airport at this moment.

Here's an attempt to formalize these properties.

We will define a certain formal language for describing environments. These environments are going to be certain asymptotic regions in the space of MDPs.

  • Each term has a type which consists of a tuple of inputs and a single output . Each input is associated with an HV-polytope[2] . The output is associated with an H-polytope[3] . The inputs represent action spaces (to get a discrete action set, we use the simplex of probability distributions on this set). The output represents the space of admissible equilibria.
  • The atomic terms are finite communicating[4] MDPs, in which each state is associated with a particular input and a transition kernel which has to be an affine mapping. For an atomic term, is the polytope of stationary state-action distributions. Notice that it's efficiently computable.
  • Given two terms and , we can construct a new term . We set . This represents a process made of two independent parts.
  • Given a term , terms and surjective affine mappings , we can construct a new term . This represents an environment governed by on long timescales and by on short timescales. Notice that it's possible to efficiently verify that is a surjection, which is why we use HV-polytopes for inputs[5].
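
For intuition about why the output polytope of an atomic term is efficiently computable, here is the textbook special case. It is written under assumptions that are not the notation of this comment: a finite state set $S$, a finite action set $A$ (rather than polytope-valued inputs), and a transition kernel $T(s' \mid s, a)$. In that case the polytope of stationary state-action distributions is cut out by one linear balance equation per state, plus nonnegativity and normalization:

```latex
\[
  \left\{ \mu \in \Delta(S \times A) \;:\;
    \sum_{a \in A} \mu(s', a)
    \;=\;
    \sum_{s \in S} \sum_{a \in A} T(s' \mid s, a)\, \mu(s, a)
    \quad \text{for all } s' \in S
  \right\}
\]
```

This is directly an H-description of size polynomial in $|S|$ and $|A|$, and an expected reward $\sum_{s,a} \mu(s,a)\, r(s,a)$ is a linear functional on it.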

It might be useful to think of as vertical composition and as horizontal composition, in the category-theoretic sense.

In order to assign semantics to this language, we need to define the environment associated with each term . We will do so by assigning a state space , each state an input (which determines the action space at this state) and a transition kernel. This is done recursively:

For the atomic terms, it is straightforward.

For :

  • . Here, the last factor represents which subenvironment is active. This is needed because we want the two subenvironments to be asynchronous, i.e. their time dynamics don't have to be in lockstep.
  • The transition kernel at is defined by updating according to the transition kernel of and then changing according to some arbitrary probabilistic rule, as long as this rule switches the active subenvironment sufficiently often. The degrees of freedom here are one reason we get an asymptotic region in MDP-space rather than a specific MDP.

For :

  • , where we abuse notation to identify the input with its index inside the tuple.
  • is extended from in the obvious way.
  • Given and , the -transition kernel at is defined by (i) with high probability, is updated according to the transition kernel of (ii) with low probability, is updated according to the transition kernel of , where the action is determined by the frequency of state-action pairs since the last type II transition: it is easy to see that is always a polytope in an appropriately defined space of state-action distributions.

The upshot is that, given a list of term definitions (which has a structure similar to a directed acyclic graph, since the definition of each term can refer to previously defined terms), we get an environment that can have an exponentially large number of states, but the control problem can be solved in time polynomial in the size of this description, given some assumptions about the reward function. Specifically, we "decorate" our terms with reward functions in the following way:

  • For atomic terms, we just specify the reward function in the straightforward way.
  • For , we specify some . The reward is then a linear combination of the individual rewards with these coefficients (and doesn't depend on which subenvironment is active).
  • For a term of the form , we need that for some affine which is part of the decoration. This can be validated efficiently (here it's important again that the input is an HV-polytope). In addition, we specify some and the reward is a linear combination with these coefficients of the -reward and the -reward.

For timescale decomposition, this planning algorithm can be regarded as formalization of instrumental goals.

An important problem is understanding the sample complexity of learning hypothesis classes made of such environments: first in the unbounded case, and then with polynomial-time learning algorithms.


  1. "Control" means finding the optimal policy given known transition kernel and reward function. ↩︎

  2. An HV-polytope is a polytope described by a list of inequalities and a list of vertices (notice that it's possible to efficiently validate such a description). ↩︎

  3. An H-polytope is a polytope described by a list of inequalities. ↩︎

  4. Maybe we can drop this requirement and use the polytope of reachable stationary state-action distributions for . ↩︎

  5. According to Tiwary 2008, projection of H-polytopes is NP-hard even in the output-sensitive sense, but for non-degenerate projection directions it is output-sensitive polynomial time. In particular, this means we should be able to efficiently verify surjectivity in the non-degenerate case even for H-polytopes on the inputs. However, the proof given there seems poorly written and the paper is not peer reviewed AFAICT. ↩︎

comment by Vanessa Kosoy (vanessa-kosoy) · 2021-11-24T20:05:25.465Z · LW(p) · GW(p)

Epistemic status: most elements are not new, but the synthesis seems useful.

Here is an alignment protocol that I call "autocalibrated quantilized debate" (AQD).

Arguably the biggest concern with naive debate[1] is that perhaps a superintelligent AI can attack a human brain in a manner that takes it out of the regime of quasi-rational reasoning altogether, in which case the framing of "arguments and counterarguments" doesn't make sense anymore. Let's call utterances that have this property "Lovecraftian". To counter this, I suggest using quantilization. Quantilization postulates that the true utility function is the difference between the proxy utility function and an unknown adversarial cost function with bounded expected value w.r.t. some base distribution. For debate, we can imagine the cost function to measure the extent to which a particular utterance is Lovecraftian. It also protects us from Cartesian daemons. There are several candidates for the base distribution:

  • Uniform
  • Some sort of bounded simplicity prior. This amounts to postulating that a Lovecraftian utterance is unlikely to be produced by a randomized algorithm of low description and computational complexity.
  • The distribution of utterances produced by human debaters, acquired via imitation learning (this requires hardening[2] against daemons)

The issue with this is that it might be hard to choose the quantilization parameter. An error in one direction produces a system which is too weak for pivotal acts. An error in the other direction produces an unsafe system. However, it seems likely that there is some range of parameters which is okay: arguably, the optimization power needed to produce a natural language description of a pivotal act[3] is not as big as the optimization power required for Lovecraftian utterances. Indeed, it seems unlikely we could design Lovecraftian utterances ourselves in the foreseeable future, so if solving alignment is not easier then it's hopeless anyway.

To address this, I propose the following "autocalibration" procedure. Let be a sequence of quantilization parameters, starting from extremely aggressive (definitely safe) and then becoming gradually less aggressive (stronger optimization, weaker safety) s.t. we can be reasonably sure it doesn't skip over the target range. For each , we consider an -quantilized debate about the question "is replacing -quantilized debate by -quantilized debate a good idea?". We can then imagine executing this sequence of debates until the point where the answer switches from "yes" to "no".
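
The procedure can be sketched as a simple loop. This is only an illustration under stated assumptions: `params` is the pre-chosen sequence of quantilization parameters, and `debate_endorses_step` is a hypothetical stand-in for running (or predicting) a debate, quantilized at the current parameter, about whether moving to the next parameter is a good idea. Reading the question as "debate at the current parameter about moving to the next one" is my interpretation of the text above.

```python
from typing import Callable, Sequence

def autocalibrate(params: Sequence[float],
                  debate_endorses_step: Callable[[float, float], bool]) -> float:
    """Walk along `params` (ordered from safest to least safe) and return
    the last parameter that the chain of debates endorses.

    debate_endorses_step(current, nxt) should return True iff a debate
    quantilized at `current` concludes that replacing `current` by `nxt`
    is a good idea. It is a hypothetical oracle, not a real API.
    """
    if not params:
        raise ValueError("need at least one quantilization parameter")
    current = params[0]
    for nxt in params[1:]:
        # Stop at the first point where the answer switches from yes to no.
        if not debate_endorses_step(current, nxt):
            break
        current = nxt
    return current
```

For example, with `params = [0.5, 0.1, 0.01, 0.001]` and a toy oracle that endorses a step only when the next parameter is at least `0.01`, the loop stops at `0.01`. Each step is only ever authorized by a debate running at a parameter that was itself already endorsed, which is what makes the starting point's safety carry forward.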

Now, this sequence might be impractically time consuming. To address this, we can introduce a prediction system which predicts the outcomes of debates without having to actually run them. Notice that this is more viable than e.g. amplifying human imitation via prediction, because in this case we are predicting a low-dimensional signal whereas with imitation we have a high-dimensional signal with a much lower-dimensional but unknown subspace of relevant features. Ofc this system has to be hardened[2:1] against daemons. Once the optimal quantilization parameter has been determined, we can use it for a debate about the pivotal act.

There are various further nuances that should be considered:

  • Various ways to train the debaters can be considered, e.g. starting by generating good models of human judges[4] by passive observation and fine-tuning via actual debate.
  • The quantilized game is not exactly zero-sum so we need to rule out superrational cooperation, either by designing the debaters appropriately or via a unitary AI that is designed to search for Nash equilibria.
  • Instead of a discrete sequence, we can imagine continuous fine-tuning of the parameter i.e. finding the most aggressive quantilization parameter which is self-endorsing.
  • Various ways to operationalize and structure the debates might be considered e.g. the 4 player system [LW(p) · GW(p)].

Finally, we also need to deal with non-Cartesian daemons. For this, let's consider a "Turing" setting, i.e. a machine learning "core" connected to an external computer "envelope". We assume that the core is not powerful enough to host a non-Cartesian daemon all by itself. We then protect ourselves by applying quantilization to the envelope actions as well. We might also be able to design a more detailed prior about dangerous programs than about Lovecraftian utterances. This requires autocalibrating at least two parameters, which seems doable: for example, increase both parameters together until you hit the Pareto frontier, then optimize along the frontier. A weakness in this approach is that, for very aggressive quantilization the prediction system will become too weak to be useful, but perhaps this can be compensated by the more detailed prior.


  1. Without anything like factored cognition, which I am skeptical about. ↩︎

  2. To deal with Cartesian daemons we (i) only make predictions about the counterfactual in which the prediction wasn't made (ii) only produce predictions when they pass a confidence threshold (iii) use a physicalist prior (to be defined in a future post). [EDIT: Actually, this doesn't work [LW(p) · GW(p)]] Non-Cartesian daemons are addressed below. ↩︎ ↩︎

  3. Including instructions for designing a different aligned AI. ↩︎

  4. This entire approach is the polar opposite of MIRI's "avoid human models" strategy, nevertheless it seems viable. ↩︎

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2021-11-24T23:07:33.521Z · LW(p) · GW(p)

I'm not sure this attacks goodharting directly enough. Optimizing a system for proxy utility moves its state out-of-distribution where proxy utility generalizes training utility incorrectly. This probably holds for debate optimized towards intended objectives as much as for more concrete framings with state and utility.

Dithering across the border of goodharting (of scope of a proxy utility) with quantilization is actionable, but isn't about defining the border or formulating legible strategies for what to do about optimization when approaching the border. For example, one might try for shutdown, interrupt-for-oversight, or getting-back-inside-the-borders when optimization pushes the system outside, which is not quantilization. (Getting-back-inside-the-borders might even have weird-x-risk prevention as a convergent drive, but will oppose corrigibility. Some version of oversight/amplification might facilitate corrigibility.)

Debate seems more useful for amplification, extrapolating concepts in a way humans would, in order to become acceptable proxies in wider scopes, so that more and more debates become non-lovecraftian. This is a different concern from setting up optimization that works with some fixed proxy concepts as given.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2021-11-25T09:56:01.125Z · LW(p) · GW(p)

I don't understand what you're saying here.

For debate, goodharting means producing an answer which can be defended successfully in front of the judge, even in the face of an opponent pointing out all the flaws, but which is nevertheless bad. My assumption here is: it's harder to produce such an answer than producing a genuinely good (and defensible) answer. If this assumption holds, then there is a range of quantilization parameters which yields good answers.

For the question of "what is a good plan to solve AI risk", the assumption seems solid enough since we're not worried about coming across such deceptive plans on our own, and it's hard to imagine humans producing one even on purpose. To the extent our search for plans relies mostly on our ability to evaluate arguments and find counterarguments, it seems like the difference between the former and the latter is not great anyway. This argument is especially strong if we use human debaters as baseline distribution, although in this case we are vulnerable to the same competitiveness problem as amplified-imitation, namely that reliably predicting rich outputs might be infeasible.

For the question of "should we continue changing the quantilization parameter", the assumption still holds because the debater arguing to stop at the given point can win by presenting a plan to solve AI risk which is superior to continuing to change the parameter.

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2021-12-02T21:10:06.968Z · LW(p) · GW(p)

Goodharting is about what happens in situations where "good" is undefined or uncertain or contentious, but still gets used for optimization. There are situations where it's better-defined, and situations where it's ill-defined, and an anti-goodharting agent strives to optimize only within scope of where it's better-defined. I took "lovecraftian" as a proxy for situations where it's ill-defined, and base distribution of quantilization that's intended to oppose goodharting acts as a quantitative description of where it's taken as better-defined, so for this purpose base distribution captures non-lovecraftian situations. Of the options you listed for debate, the distribution from imitation learning seems OK for this purpose, if amended by some anti-weirdness filters to exclude debates that can't be reliably judged.

The main issues with anti-goodharting that I see is the difficulty of defining proxy utility and base distribution, the difficulty of making it corrigible, not locking-in into fixed proxy utility and base distribution, and the question of what to do about optimization that points out of scope.

My point is that if anti-goodharting and not development of quantilization is taken as a goal, then calibration of quantilization is not the kind of thing that helps, it doesn't address the main issues. Like, even for quantilization, fiddling with base distribution and proxy utility is a more natural framing that's strictly more general than fiddling with the quantilization parameter. If we are to pick a single number to improve, why privilege the quantilization parameter instead of some other parameter that influences base distribution and proxy utility?

The use of debates for amplification in this framing is for the corrigibility part of anti-goodharting, a way to redefine utility proxy and expand the base distribution, learning from how the debates at the boundary of the previous base distribution go. Quantilization seems like a fine building block for this, sampling slightly lovecraftian debates that are good, which is the direction where we want to expand the scope.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2021-12-02T22:20:43.901Z · LW(p) · GW(p)

The main issues with anti-goodharting that I see is the difficulty of defining proxy utility and base distribution, the difficulty of making it corrigible, not locking-in into fixed proxy utility and base distribution, and the question of what to do about optimization that points out of scope.

The proxy utility in debate is perfectly well-defined: it is the ruling of the human judge. For the base distribution I also made some concrete proposals (which certainly might be improvable but are not obviously bad). As to corrigibility, I think it's an ill-posed concept [LW(p) · GW(p)]. I'm not sure how you imagine corrigibility in this case: AQD is a series of discrete "transactions" (debates), and nothing prevents you from modifying the AI between one and another. Even inside a debate, there is no incentive in the outer loop to resist modifications, whereas daemons would be impeded by quantilization. The "out of scope" case is also dodged by quantilization, if I understand what you mean by "out of scope".

...fiddling with base distribution and proxy utility is a more natural framing that's strictly more general than fiddling with the quantilization parameter.

Why is it strictly more general? I don't see it. It seems false, since for extreme values of the quantilization parameter we get optimization which is deterministic and hence cannot be equivalent to quantilization with different proxy and distribution.

If we are to pick a single number to improve, why privilege the quantilization parameter instead of some other parameter that influences base distribution and proxy utility?

The reason to pick the quantilization parameter is that it's hard to determine, as opposed to the proxy and base distribution[1] for which there are concrete proposals with more-or-less clear motivation.

I don't understand which "main issues" you think this doesn't address. Can you describe a concrete attack vector?


  1. If the base distribution is a bounded simplicity prior then it will have some parameters, and this is truly a weakness of the protocol. Still, I suspect that safety is less sensitive to these parameters and it is more tractable to determine them by connecting our ultimate theories of AI with brain science (i.e. looking for parameters which would mimic the computational bounds of human cognition). ↩︎

comment by Vanessa Kosoy (vanessa-kosoy) · 2019-10-30T23:03:22.324Z · LW(p) · GW(p)

In Hanson’s futarchy, the utility function of the state is determined by voting but the actual policy is determined by a prediction market. But, voting incentivizes misrepresenting your values to get a larger share of the pie. So, shouldn’t it be something like the VCG mechanism instead?

comment by Vanessa Kosoy (vanessa-kosoy) · 2023-05-23T07:10:22.142Z · LW(p) · GW(p)

Here's an idea about how to formally specify society-wide optimization, given that we know the utility function of each individual. In particular, it might be useful for multi-user AI alignment.

A standard tool for this kind of problem is Nash bargaining. The main problem with it is that it's unclear how to choose the BATNA (disagreement point). Here's why some simple proposals don't work:

  • One natural BATNA for any game is assigning each player their maximin payoff. However, for a group of humans it means something horrible: Alice's maximin is a situation in which everyone except Alice is doing their best to create the worst possible world for Alice. This seems like an unhealthy and unnatural starting point.
  • Another natural BATNA is the world in which no humans exist at all. The problem with this is: suppose there is one psychopath who for some reason prefers everyone not to exist. Then, there are no Pareto improvements over the BATNA, and therefore this empty world is already the "optimum". The same problem applies to most choices of BATNA.

Here is my proposal. We define the socially optimal outcome by recursion over the number of people . For , we obviously just optimize the utility function of the lone person. For a set of people  of cardinality , let's consider any given . The BATNA payoff of  is defined to be the minimum over all  of the payoff of  in the socially optimal outcome of  (we consider worlds in which  doesn't exist). If there are multiple optimal outcomes, we minimize over them. Typically, the minimum is achieved for  but we can't just set  in the definition, we need the minimization in order to make sure that the BATNA is always admissible[1]. We then do Nash bargaining with respect to this BATNA.
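
As a reminder of what the final step computes, here is a minimal sketch of Nash bargaining relative to a given BATNA. It is simplified under assumptions that are mine, not the proposal's: the feasible set is a finite dictionary of named outcomes with one payoff per person (in general it would be a convex set of lotteries), and the BATNA is simply passed in rather than computed by the recursion above.

```python
from math import prod
from typing import Dict, Optional, Sequence

def nash_bargain(outcomes: Dict[str, Sequence[float]],
                 batna: Sequence[float]) -> Optional[str]:
    """Return the outcome maximizing the product of gains over the BATNA.

    Only outcomes that leave nobody worse off than the BATNA are eligible.
    Returns None when no outcome makes everyone strictly better off, i.e.
    when the BATNA itself is effectively the bargaining result.
    """
    best_name: Optional[str] = None
    best_product = 0.0
    for name, payoffs in outcomes.items():
        gains = [u - d for u, d in zip(payoffs, batna)]
        if any(g < 0 for g in gains):
            continue  # someone would do worse than their BATNA payoff
        product = prod(gains)
        if product > best_product:
            best_name, best_product = name, product
    return best_name
```

With outcomes `{"a": (3, 1), "b": (2, 2), "c": (1, 3)}` and BATNA `(0, 0)`, the gain products are 3, 4 and 3, so this picks `"b"`. With two ordinary people and one psychopath, outcomes `{"both_exist": (1, 1, -2), "empty": (0, 0, 0)}` and the empty world `(0, 0, 0)` as BATNA, it returns `None`: this is the failure mode of the "no humans exist" BATNA described earlier, where a single outlier blocks every Pareto improvement.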

Good properties of this proposal:

  • The outcome is Pareto efficient. It is also "fair" in the sense that the specification is rather natural and symmetric.
  • The only especially strong assumption needed to make sense of the definition, is the ability to consider worlds in which some people don't exist[2]. For example, we don't need anything like transferable utility or money. [EDIT: See child comment [LW(p) · GW(p)] for a discussion of removing this assumption.]
  • AFAICT threats don't affect the outcome, since there's no reference to minimax or Nash equilibria.
  • Most importantly, it is resistant to outliers:
    • For example, consider a world with a set  of selfish people and  psychopath whom we denote . The outcome space is : each person either exists or not. A selfish person gets payoff  for existing and payoff  for non-existing. The psychopath's payoff is minus the number of people who exist. Let  be the cardinality of . Then, we can check that the socially optimal outcome gives each selfish person a payoff of  (i.e. they exist with this probability).
    • In the above example, if we replace the selfish people with altruists (whose utility function is the number of altruists that exist) the outcome is even better. The expected number of existing altruists is .
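
A quick numeric sketch of the selfish/psychopath example, under assumptions that are a reading of the definition rather than something stated explicitly: each selfish person's BATNA is 0 (the empty world), the psychopath's BATNA is -n (all n selfish people exist and the psychopath does not), the psychopath does not exist in the bargained outcome, and every selfish person exists independently with the same probability q. The Nash product is then q^n * (n - n*q). The function names are hypothetical.

```python
def nash_product(q, n):
    """Product of gains over the assumed BATNA for a symmetric lottery:
    each of n selfish people gains q, the psychopath gains n - n*q."""
    return (q ** n) * (n - n * q)

def best_existence_probability(n, steps=10_000):
    """Grid search for the q in [0, 1] that maximizes the Nash product."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: nash_product(q, n))

q_star = best_existence_probability(3)
```

Under these assumptions the maximizer agrees with the closed form q = n/(n+1) obtained by setting the derivative of q^n * (1 - q) to zero, so the selfish people exist with high probability despite the psychopath's preferences.
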
  1. ^

    "Admissible" in the sense that, there exists a payoff vector which is a Pareto improvement over the BATNA and is actually physically realizable.

  2. ^

    We also need to assume that we can actually assign utility functions to people, but I don't consider it a "strong assumption" in this context.

Replies from: Dagon, Vladimir_Nesov, vanessa-kosoy
comment by Dagon · 2023-05-23T16:44:45.296Z · LW(p) · GW(p)

Using Nash with maximin as the BATNA has some big advantages:

  1. It really motivates bargaining, as there are usually Pareto improvements that are obvious, and near-Pareto improvements beyond even that.
  2. It's literally impossible to do worse for any given individual.  If you're worried about the experience of the most unlucky/powerless member, this ensures you won't degrade it with your negotiation.

I'm trying to compare your proposal to https://en.wikipedia.org/wiki/Shapley_value.  On the surface, it seems similar - consider sub-coalitions to determine counterfactual contribution (doesn't matter what the contribution unit is - any linearly aggregatable quantity, whether Utility or dollars should work).

I do worry a bit that in both Shapley and your system, it is acceptable to disappear people - the calculation where they don't exist seems problematic when applied to actual people.  It has the nice property of ignoring "outliers" (really, negative-value lives), but that's only a nice property in theory, it would be horrific if actually applied.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2023-05-23T17:12:12.362Z · LW(p) · GW(p)

it really motivates bargaining, as there are usually Pareto improvements that are obvious, and near-Pareto improvements beyond even that.

I couldn't really parse this. What does it mean to "motivate bargaining" and why is it good?

If you're worried about the experience of the most unlucky/powerless member, this ensures you won't degrade it with your negotiation.

In practice, it's pretty hard for a person to survive on their own, so usually not existing is at least as good as the minimax (or at least it's not that much worse). It can actually be way, way better than the minimax, since the minimax implies every other person doing their collective best to make things as bad as possible for this person.

I'm trying to compare your proposal to https://en.wikipedia.org/wiki/Shapley_value. On the surface, it seems similar

There is a huge difference: Shapley value assumes utility is transferable, and I don't.

I do worry a bit that in both Shapley and your system, it is acceptable to disappear people - the calculation where they don't exist seems problematic when applied to actual people. It has the nice property of ignoring "outliers" (really, negative-value lives), but that's only a nice property in theory, it would be horrific if actually applied.

By "outliers" I don't mean negative-value lives, I mean people who want everyone else to die and/or to suffer. 

It is not especially acceptable in my system to disappear people: it is an outcome that is considered, but it only happens if enough people have a sufficiently strong preference for it. I do agree it might be better to come up with a system that somehow discounts "nosy" preferences, i.e. doesn't put much weight on what Alice thinks Bob's life should look like when it contradicts what Bob wants.

Replies from: Dagon
comment by Dagon · 2023-05-23T17:23:10.247Z · LW(p) · GW(p)

By "motivate bargaining", I meant that humans aren't rational utility maximizers, and the outcomes they will seek and accept are different, depending on the framing of the question.  If you tell them that the rational baseline is low (and prove it using a very small set of assumptions), they're more likely to accept a wider range of better (but not as much better as pure manipulation might give them) outcomes.

By negative-value lives, I meant negative to the aggregate you're maximizing, not negative to themselves.  Someone who gains by others' suffering necessarily reduces the sum.   The assumption that not existing is an acceptable outcome to those participants still feels problematic to me, but I do agree that eliminating unpleasant utility curves makes the problem tractable.  

comment by Vladimir_Nesov · 2023-05-23T15:47:18.955Z · LW(p) · GW(p)

When people are basic ontological entities for a decision theory, there is an option of setting up platonic worlds/environments for them and for interactions between their collections. This needs to add up to what happens in the physical world, but the intermediate constructions can run wild with many abstract/platonic/simulated worlds, for purposes of being valued by their preferences.

I didn't get anything specific/nice this way, but it's the way I'm thinking about boundaries, that agent's viscera should be its own sovereign/private platonic world rather than something like a region of space that's shared with other agents, or agent's own internal details. And the physical world, or other worlds defined for interaction between agents, serve as boundaries between the agents, by virtue of reasoning about them and their viscera worlds in restricted ways that the boundary worlds as a whole precommit to respect.

comment by Vanessa Kosoy (vanessa-kosoy) · 2023-05-23T07:46:15.410Z · LW(p) · GW(p)

It is possible to get rid of the need to consider worlds in which some players don't exist, by treating  as optimization for a subset of players. This can be meaningful in the context of a single entity (e.g. the AI) optimizing for the preferences of , or in the context of game-theory, where we interpret it as having all players coordinate in a manner that optimizes for the utilities of  (in the latter context, it makes sense to first discard any outcome that assigns a below-minimax payoff to any player[1]). The disadvantage is, this admits BATNAs in which some people get worse-than-death payoffs (because of adversarial preferences of other people). On the other hand, it is still "threat resistant" in the sense that, the mechanism itself doesn't generate any incentive to harm people.

It would be interesting to compare this with Diffractor's ROSE point [LW · GW].

  1. ^

    Regarded as a candidate definition for a fully-general abstract game-theoretic superrational optimum, this still seems lacking, because regarding the minimax in a game of more than two players seems too weak. Maybe there is a version based on some notion of "coalition minimax".

comment by Vanessa Kosoy (vanessa-kosoy) · 2023-05-03T14:34:58.350Z · LW(p) · GW(p)

Until now I believed that a straightforward bounded version of the Solomonoff prior cannot be the frugal universal prior [AF · GW] because Bayesian inference under such a prior is NP-hard. One reason it is NP-hard is the existence of pseudorandom generators. Indeed, Bayesian inference under such a prior distinguishes between a pseudorandom and a truly random sequence, whereas a polynomial-time algorithm cannot distinguish between them. It also seems plausible that, in some sense, this is the only obstacle: it was established that if one-way functions don't exist (which is equivalent to pseudorandom generators not existing), computing time-bounded Kolmogorov complexity is polynomial-time in the average-case[1].

However, if pseudorandom sequences are truly the only obstacle, then this problem seems remarkably similar to the password game [AF · GW]. Indeed, correctly predicting a pseudorandom sequence requires extracting its seed, which is a piece of completely structureless random information similar to a password. This leads to the following bold conjecture: what if, it is not only statistically, but also computationally feasible to achieve an effective epistemic regret bound [AF · GW] for a bounded Solomonoff prior? (Assuming some computationally bounded theory of algorithmic statistics.) 

Arguably, a pseudorandom sequence with a fixed seed cannot rule this out because the seed length would count for time-bounded Kolmogorov complexity but not for time-bounded sophistication (whatever the latter means), and hence the regret bound would have a penalty exponential in the length of the seed, accounting for the computational difficulty of extracting it. A pseudorandom sequence with a random seed also cannot rule this out, because, while sampling such a sequence is easy, predicting it based on past observations is hard, so we are penalized by its superpolynomial time-bounded Kolmogorov complexity (for the right notion of "time-bounded").

  1. ^

    Admittedly, the fact it's only average-case makes the evidence a lot weaker.

comment by Vanessa Kosoy (vanessa-kosoy) · 2020-08-18T13:22:25.051Z · LW(p) · GW(p)

Consider a Solomonoff inductor predicting the next bit in the sequence {0, 0, 0, 0, 0...} At most places, it will be very certain the next bit is 0. But, at some places it will be less certain: every time the index of the place is highly compressible. Gradually it will converge to being sure the entire sequence is all 0s. But, the convergence will be very slow: about as slow as the inverse Busy Beaver function!
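
The effect can be illustrated with a toy stand-in for the Solomonoff prior. This is only a sketch under loud assumptions: real Kolmogorov complexity is incomputable, so `toy_complexity` is a crude hypothetical proxy that treats powers of two as the only "compressible" indices, and the hypothesis class contains just "all zeros" plus "the first 1 appears at position m" for m up to a finite horizon.

```python
def toy_complexity(n):
    """Crude proxy for the description length of the index n (an assumption).
    Powers of two may be described by their exponent instead."""
    bits = n.bit_length()
    if n & (n - 1) == 0:  # n is a power of two
        exponent = bits - 1
        return min(bits, exponent.bit_length() + 1)
    return bits

def prob_next_is_one(n, horizon=4096, weight_all_zeros=1.0):
    """Posterior probability that position n holds a 1, after seeing zeros
    at positions 1..n-1. Hypothesis m ("first 1 at position m") has prior
    weight 2^-toy_complexity(m); only hypotheses with m >= n survive."""
    tail = sum(2.0 ** -toy_complexity(m) for m in range(n, horizon + 1))
    return 2.0 ** -toy_complexity(n) / (weight_all_zeros + tail)

doubt_at_compressible_index = prob_next_is_one(1024)
doubt_at_generic_index = prob_next_is_one(1023)
```

In this toy model the predictor is noticeably less sure of a 0 at position 1024 than at position 1023, because "the pattern breaks at 2^10" is a much shorter hypothesis than "the pattern breaks at 1023". Such dips keep recurring at ever sparser compressible indices.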

This is not just a quirk of Solomonoff induction, but a general consequence of reasoning using Occam's razor (which is the only reasonable way to reason). Of course with bounded algorithms the convergence will be faster, something like the inverse bounded-busy-beaver, but still very slow. Any learning algorithm with inductive bias towards simplicity will have generalization failures when coming across the faultlines that carve reality at the joints, at every new level of the domain hierarchy.

This has an important consequence for alignment: in order to stand a chance, any alignment protocol must be fully online, meaning that whatever data sources it uses, those data sources must always stay in the loop, so that the algorithm can query the data source whenever it encounters a faultline. Theoretically, the data source can be disconnected from the loop at the point when it's fully "uploaded": the algorithm unambiguously converged towards a detailed accurate model of the data source. But in practice the convergence there will be very slow, and it's very hard to know that it already occurred: maybe the model seems good for now but will fail at the next faultline. Moreover, convergence might literally never occur if the machine just doesn't have the computational resources to contain such an upload (which doesn't mean it doesn't have the computational resources to be transformative!)[1]

This is also a reason for pessimism regarding AI outcomes. AI scientists working through trial and error will see the generalization failures becoming more and more rare, with longer and longer stretches of stable function in between. This creates the appearance of increasing robustness. But, in reality robustness increases very slowly. We might reach a stable stretch between "subhuman" and "far superhuman" and the next faultline will be the end.


  1. In the Solomonoff analogy, we can imagine the real data source as a short but prohibitively expensive program, and the learned model of the data source as an affordable but infinitely long program: as time progresses, more and more bits of this program will be learned, but there will always be bits that are still unknown. Of course, any prohibitively expensive program can be made affordable by running it much slower than real-time, which is something that Turing RL [LW(p) · GW(p)] can exploit, but at some point this becomes impractical. ↩︎

Replies from: crabman, mikkel-wilson, ofer
comment by philip_b (crabman) · 2020-08-19T05:49:31.073Z · LW(p) · GW(p)

An alignment-unrelated question: Can we, humans, increase the probability that something weird happens in our spacetime region (e.g., the usual laws of physics stop working) by making it possible to compress our spacetime location? For example, by building a structure that is very regular (meaning that its description can be very short) and has never been built before in our region of space, such as a huge, perfectly aligned rectangular grid of hydrogen atoms.


It's like a magical ritual for changing the laws of physics. This gives a new meaning to summoning circles, pentagrams, etc.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2020-08-19T13:31:38.881Z · LW(p) · GW(p)

We can rephrase your question as follows: "Can we increase the probability of finding an error in the known laws of physics by performing an experiment with a simple property that never happened before, either naturally or artificially"? And the answer is: yes! This is actually what experimental physicists do all the time: perform experiments that try to probe novel circumstances where it is plausible (Occam-razor-wise) that new physics will be discovered.

As to magical rituals, sufficiently advanced technology is indistinguishable from magic :)

comment by MikkW (mikkel-wilson) · 2020-08-20T03:13:21.731Z · LW(p) · GW(p)

I have a sense that similar principles are at play with Spaced Repetition, and that pointing out that connection may be relevant to effectively handling this issue.

comment by Ofer (ofer) · 2020-08-19T18:59:44.072Z · LW(p) · GW(p)

convergence might literally never occur if the machine just doesn’t have the computational resources to contain such an upload

I think that in embedded settings (with a bounded version of Solomonoff induction) convergence may never occur, even in the limit as the amount of compute that is used for executing the agent goes to infinity. Suppose the observation history contains sensory data that reveals the probability distribution that the agent had, in the last time step, for the next number it's going to see in the target sequence. Now consider the program that says: "if the last number was predicted by the agent to be 0 with probability larger than then the next number is 1; otherwise it is 0." Since it takes much less than bits to write that program, the agent will never predict two times in a row that the next number is 0 with probability larger than (after observing only 0s so far).

comment by Vanessa Kosoy (vanessa-kosoy) · 2020-04-10T17:04:36.009Z · LW(p) · GW(p)

Epistemic status: moderately confident, based on indirect evidence

I realized that it is very hard to impossible to publish an academic work that takes more than one conceptual inferential step away from the current paradigm. Especially when the inferential steps happen in different fields of knowledge.

You cannot publish a paper where you use computational learning theory to solve metaphysics, and then use the new metaphysics to solve the interpretation of quantum mechanics. A physics publication will not understand the first part, or even understand how it can be relevant. As a result, they will also fail to understand the second part. A computer science publication will not understand or be interested in the second part.

Publishing the two parts separately one after the other also won’t work. The first part might be accepted, but the reviewers of the second part won’t be familiar with it, and the same problems will resurface. The only way to win seems to be: publish the first part, wait until it becomes widely accepted, and only then publish the second part.

Replies from: Dagon, Pattern
comment by Dagon · 2020-04-10T17:54:14.420Z · LW(p) · GW(p)

Hmm. I think I need more detail on your model of publishing and wide-acceptance and their relationship to truth. It seems likely that unless they're circularly dependent, you can publish the smaller-departure in parallel with exploring the further implications in different journals, and in research agendas rather than results publication.

comment by Pattern · 2020-04-12T23:01:12.204Z · LW(p) · GW(p)

So there's journals of X, Y, and Z, but not XYZ?

(In hindsight this sounds obvious, though the only obvious alternatives would be:

  • it's hard, but the hardness is in figuring out which place can handle the combination/complexity
  • publishing anything is hard (or there's a limit to the time/space allocated per month, and timing matters))
comment by Vanessa Kosoy (vanessa-kosoy) · 2019-12-21T16:04:01.467Z · LW(p) · GW(p)

One subject I like to harp on is reinforcement learning with traps (actions that cause irreversible long term damage). Traps are important for two reasons. One is that the presence of traps is in the heart of the AI risk concept: attacks on the user, corruption of the input/reward channels, and harmful self-modification can all be conceptualized as traps. Another is that without understanding traps we can't understand long-term planning, which is a key ingredient of goal-directed intelligence.

In general, a prior that contains traps will be unlearnable, meaning that no algorithm has Bayesian regret going to zero in the limit. The only obvious natural requirement for RL agents in this case is approximating Bayes-optimality. However, Bayes-optimality is not even "weakly feasible": it is NP-hard w.r.t. using the number of states and number of hypotheses as security parameters. IMO, the central question is: what kind of natural tractable approximations are there?

Although a generic prior with traps is unlearnable, some priors with traps are learnable. Indeed, it can happen that it's possible to study the environment in a predictably safe way that is guaranteed to produce enough information about the irreversible transitions. Intuitively, as humans we do often use this kind of strategy. But, it is NP-hard to even check whether a given prior is learnable. Therefore, it seems natural to look for particular types of learnable priors that are efficiently decidable.

In particular, consider the following setting, that I call "expanding safety envelope" (XSE). Assume that each hypothesis in the prior is "decorated" by a set of state-action pairs s.t. (i) any is safe, i.e. the leading term of in the expansion is maximal (ii) for each , there is s.t. is Blackwell-optimal for (as a special case we can let contain all safe actions). Imagine an agent that takes random actions among those a priori known to be in . If there is no such action, it explodes. Then, it is weakly feasible to check (i) whether the agent will explode (ii) for each hypothesis, to which sets of states it can converge. Now, let the agent update on the transition kernel of the set of actions it converged to. This may lead to new actions becoming certainly known to be in . We can then let the agent continue exploring using this new set. Iterating this procedure, the agent either discovers enough safe actions to find an optimal policy, or not. Importantly, deciding this is weakly feasible. This is because, for each hypothesis (i) on the first iteration the possible asymptotic state sets are disjoint (ii) on subsequent iterations we might as well assume they are disjoint, since it's possible to see that if you reach a particular state of an asymptotic state set, then you can add the entire state set (this modification will not create new final outcomes and will only eliminate final outcomes that are better than those remaining). Therefore the number of asymptotic state sets you have to store on each iteration is bounded by the total number of states.
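
The iteration can be sketched on a drastically simplified toy model. Everything here is an assumption made for illustration: transitions are deterministic, each hypothesis is a dict with a transition table `trans` and a decoration `safe`, and "updating" just means discarding hypotheses whose transitions disagree with what was observed. Asymptotic state sets and Blackwell-optimality are not modeled.

```python
def known_safe(hyps):
    """State-action pairs that every surviving hypothesis marks as safe."""
    return set.intersection(*(h["safe"] for h in hyps))

def reachable(trans, start, allowed):
    """States reachable from `start` using only allowed (state, action) pairs."""
    seen, stack = {start}, [start]
    while stack:
        state = stack.pop()
        for (s, a), nxt in trans.items():
            if s == state and (s, a) in allowed and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def expand_envelope(hyps, true_trans, start):
    """Explore with known-safe actions, discard refuted hypotheses,
    and repeat until the set of surviving hypotheses stops shrinking."""
    hyps = list(hyps)
    while True:
        safe = known_safe(hyps)
        states = reachable(true_trans, start, safe)
        observed = {(s, a): true_trans[(s, a)] for (s, a) in safe if s in states}
        survivors = [h for h in hyps
                     if all(h["trans"].get(k) == v for k, v in observed.items())]
        if len(survivors) == len(hyps):
            return safe, survivors
        hyps = survivors

# Hypothesis A (the true one): "leap" is safe. Hypothesis B: "leap" is a trap.
A = {"trans": {("home", "wait"): "home", ("home", "probe"): "home",
               ("home", "leap"): "goal"},
     "safe": {("home", "wait"), ("home", "probe"), ("home", "leap")}}
B = {"trans": {("home", "wait"): "home", ("home", "probe"): "lab",
               ("home", "leap"): "pit"},
     "safe": {("home", "wait"), ("home", "probe")}}
safe, survivors = expand_envelope([A, B], A["trans"], "home")
```

Initially only "wait" and "probe" are known to be safe. Probing refutes B, after which "leap" becomes certainly safe and the envelope expands to include it.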

The next questions are (i) what kind of regret bounds we can prove for decorated priors that are XSE-learnable? (ii) given an arbitrary decorated prior, is it possible to find the maximal-probability-mass set of hypotheses, which is XSE-learnable? I speculate that the second question might turn out to be related to the unique games conjecture. By analogy with other optimization problems that are feasible only when maximal score can be achieved, maybe the UGC implies that we cannot find the maximal set but we can find a set that is approximately maximal, with an optimal approximation ratio (using a sum-of-squares algorithm). Also, it might make sense to formulate stronger desiderata which reflect that, if the agent assumes a particular subset of the prior but discovers that it was wrong, it will still do its best in the following. That is, in this case the agent might fall into a trap but at least it will try to avoid further traps.

This has implications even for learning without traps. Indeed, most known theoretical regret bounds involve a parameter that reflects how costly a mistake it is possible to make. This parameter can manifest as the MDP diameter, the bias span or the mixing time. Such regret bounds seem unsatisfactory since the worst-case mistake determines the entire guarantee. We can take the perspective that such costly but reversible mistakes are "quasi-traps": not actual traps, but trap-like on short timescales. This suggests that applying an approach like XSE to quasi-traps should lead to qualitatively stronger regret bounds. Such regret bounds would imply learning faster on less data, and in episodic learning they would imply learning inside each episode, something that is notoriously absent in modern episodic RL systems like AlphaStar.

Moreover, we can also use this to do away with ergodicity assumptions. Ergodicity assumptions require the agent to "not wander too far" in state space, in the simplest case because the entire state space is small. But, instead of "wandering far" from a fixed place in state space, we can constrain "wandering far" w.r.t. the optimal trajectory. Combining this with XSE, this should lead to guarantees that depend on the prevalence of irreversible and quasi-irreversible departures from this trajectory.

In multi-armed bandits and RL theory, there is a principle known as "optimism in the face of uncertainty". This principle says, you should always make optimistic assumptions: if you are wrong, you will find out (because you will get less reward than you expected). It explicitly underlies UCB algorithms and is implicit in other algorithms, like Thompson sampling. But, this fails miserably in the presence of traps. I think that approaches like XSE point at a more nuanced principle: "optimism in the face of cheap-to-resolve uncertainty, pessimism in the face of expensive-to-resolve uncertainty". Following this principle doesn’t lead to actual Bayes-optimality, but perhaps it is in some sense a good enough approximation.
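
For reference, here is a minimal sketch of the standard UCB1 rule, which makes the optimism explicit: each arm is scored by its empirical mean plus a bonus that shrinks as the arm is pulled more often. The deterministic reward functions are an assumption made to keep the example reproducible.

```python
import math

def ucb1(arms, rounds):
    """Run UCB1 on a list of zero-argument reward functions (rewards in [0, 1]).
    Returns how many times each arm was pulled."""
    k = len(arms)
    pulls = [0] * k
    totals = [0.0] * k
    for t in range(1, rounds + 1):
        if t <= k:
            arm = t - 1  # pull every arm once to initialize the estimates
        else:
            # Optimistic index: empirical mean plus an exploration bonus.
            arm = max(range(k), key=lambda i: totals[i] / pulls[i]
                      + math.sqrt(2 * math.log(t) / pulls[i]))
        totals[arm] += arms[arm]()
        pulls[arm] += 1
    return pulls

pulls = ucb1([lambda: 0.2, lambda: 0.8], rounds=1000)
```

The rule works here because trying the worse arm costs only a little reward and the mistake is then visible in the data. Nothing in the index distinguishes an arm that is merely mediocre from one whose first pull is irreversible, which is the failure mode in the presence of traps.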

comment by Vanessa Kosoy (vanessa-kosoy) · 2019-12-11T16:56:01.541Z · LW(p) · GW(p)

In the past I considered the learning-theoretic approach to AI theory [AF · GW] as somewhat opposed to the formal logic approach popular in MIRI [AF · GW] (see also discussion [AF(p) · GW(p)]):

  • Learning theory starts from formulating natural desiderata for agents, whereas "logic-AI" usually starts from postulating a logic-based model of the agent ad hoc.
  • Learning theory naturally allows analyzing computational complexity whereas logic-AI often uses models that are either clearly intractable or even clearly incomputable from the outset.
  • Learning theory focuses on objects that are observable or finite/constructive, whereas logic-AI often considers objects that are unobservable, infinite and unconstructive (which I consider to be a philosophical error).
  • Learning theory emphasizes induction whereas logic-AI emphasizes deduction.

However, recently I noticed that quasi-Bayesian reinforcement learning [AF(p) · GW(p)] and Turing reinforcement learning [AF(p) · GW(p)] have very suggestive parallels to logic-AI. TRL agents have beliefs about computations they can run on the envelope: these are essentially beliefs about mathematical facts (but, we only consider computable facts and computational complexity plays some role there). QBRL agents reason in terms of hypotheses that have logical relationships between them: the order on functions corresponds to implication, taking the minimum of two functions corresponds to logical "and", taking the concave hull of two functions corresponds to logical "or". (but, there is no "not", so maybe it's a sort of intuitionist logic?) In fact, fuzzy beliefs form a continuous dcpo [AF(p) · GW(p)], and considering some reasonable classes of hypotheses probably leads to algebraic dcpo-s, suggesting a strong connection with domain theory (also, it seems like considering beliefs within different ontologies leads to a functor from some geometric category (the category of ontologies) to dcpo-s).

These parallels suggest that the learning theory of QBRL/TRL will involve some form of deductive reasoning and some type of logic. But, this doesn't mean that QBRL/TRL is redundant w.r.t. logic AI! In fact, QBRL/TRL might lead us to discover exactly which type of logic intelligent agents need and what role logic should play in the theory and inside the algorithms (instead of trying to guess and impose the answer ad hoc, which IMO did not work very well so far). Moreover, I think that the type of logic we are going to get will be something finitist/constructivist, and in particular this is probably how Goedelian paradoxes will be avoided. However, the details remain to be seen.

comment by Vanessa Kosoy (vanessa-kosoy) · 2019-10-26T13:04:24.146Z · LW(p) · GW(p)

I recently realized that the formalism of incomplete models [AF · GW] provides a rather natural solution to all decision theory problems involving "Omega" (something that predicts the agent's decisions). An incomplete hypothesis may be thought of as a zero-sum game between the agent and an imaginary opponent (we will call the opponent "Murphy" as in Murphy's law). If we assume that the agent cannot randomize against Omega, we need to use the deterministic version of the formalism. That is, an agent that learns an incomplete hypothesis converges to the corresponding maximin value in pure strategies. (The stochastic version can be regarded as a special case of the deterministic version where the agent has access to an external random number generator that is hidden from the rest of the environment according to the hypothesis.) With every decision problem, we can now associate an incomplete hypothesis as follows. Every time Omega makes a prediction about the agent's future action in some counterfactual, we have Murphy make a guess instead. This guess cannot be directly observed by the agent. If the relevant counterfactual is realized, then the agent's action renders the guess false or true. If the guess is false, the agent receives infinite (or, sufficiently large) reward. If the guess is true, everything proceeds as usual. The maximin value then corresponds to the scenario where the guess is true and the agent behaves as if its action controls the guess. (Which is exactly what FDT and its variants try to achieve.)

For example, consider (repeated) counterfactual mugging. The incomplete hypothesis is a partially observable stochastic game (between the agent and Murphy), with the following states:

  • : initial state. Murphy has two actions: (guess the agent will pay), transitioning to and (guess the agent won't pay) transitioning to . (Reward = )
  • : Murphy guessed the agent will pay. Transitions to or with probability to each (the coin flip). (Reward = )
  • : Murphy guessed the agent won't pay. Transitions to or with probability to each (the coin flip). (Reward = )
  • : Agent receives the prize. Transitions to . (Reward = )
  • : Agent is asked for payment. Agent has two actions: (pay) transitioning to and (don't pay) transitioning to . (Reward = )
  • : Agent receives nothing. Transitions to . (Reward = )
  • : Agent is asked for payment. Agent has two actions: (pay) transitioning to and (don't pay) transitioning to . (Reward = )
  • : Murphy's guess remained untested. Transitions to . (Reward = )
  • : Murphy's guess was right, agent paid. Transitions to . (Reward = )
  • : Murphy's guess was right, agent didn't pay. Transitions to . (Reward = )
  • : Murphy's guess was wrong, agent paid. Transitions to . (Reward = )
  • : Murphy's guess was wrong, agent didn't pay. Transitions to . (Reward = )

The only percepts the agent receives are (i) the reward and (ii) whether it is asked for payment or not. The agent's maximin policy is paying, since it guarantees an expected reward of per round.
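
The maximin computation for a single round can be sketched as follows. The numeric values `PRIZE`, `COST` and `BIG` are hypothetical placeholders chosen for illustration (the only requirements are that the prize exceeds the cost and that the reward for a refuted guess is large), and the state machine above is collapsed into a single expected payoff per round.

```python
PRIZE = 1.0    # assumed reward when the coin favors the agent and Murphy guessed "pay"
COST = 0.1     # assumed payment when the agent is asked and pays
BIG = 1000.0   # assumed "sufficiently large" reward when Murphy's guess is refuted

def round_payoff(agent_pays, murphy_guesses_pay):
    """Expected reward of one round for a pure agent policy and a Murphy guess."""
    # Coin branch 1: the agent is not asked; the prize depends only on the guess.
    not_asked = PRIZE if murphy_guesses_pay else 0.0
    # Coin branch 2: the agent is asked for payment, so the guess gets tested.
    if agent_pays == murphy_guesses_pay:
        asked = -COST if agent_pays else 0.0  # guess was right
    else:
        asked = BIG                           # guess was wrong
    return 0.5 * not_asked + 0.5 * asked

def maximin_policy():
    """Agent policy with the best worst case over Murphy's guesses."""
    worst_case = {pays: min(round_payoff(pays, guess) for guess in (True, False))
                  for pays in (True, False)}
    best = max(worst_case, key=worst_case.get)
    return best, worst_case[best]

policy, value = maximin_policy()
```

With these placeholder numbers, refusing to pay has a worst case of 0 (Murphy guesses "won't pay"), while paying has a worst case of 0.5 * (PRIZE - COST), so the maximin policy is to pay.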

We can generalize this to an imperfect predictor (a predictor that sometimes makes mistakes), by using the same construction but adding noise to Murphy's guess for purposes other than the guess's correctness. Apparently, we can also generalize to the variant where the agent can randomize against Omega and Omega decides based on its predictions of the probabilities. This, however, is more complicated. In this variant there is no binary notion of "right" and "wrong" guess. Instead, we need to apply some statistical test to the guesses and compare it against a threshold. We can then consider a family of hypotheses with different thresholds, such that (i) with probability , for all but some finite number of thresholds, accurate guesses would never be judged wrong by the test (ii) with probability , consistently inaccurate guesses will be judged wrong by the test, with any threshold.

The same construction applies to logical counterfactual mugging, because the agent cannot distinguish between random and pseudorandom (by definition of pseudorandom). In TRL [AF(p) · GW(p)] there would also be some family of programs the agent could execute s.t., according to the hypothesis, their outputs are determined by the same "coin flips" as the offer to pay. However, this doesn't change the optimal strategy: the "logical time of precommitment" is determined by the computing power of the "core" RL agent, without the computer "envelope".

Replies from: Nisan, cousin_it
comment by Nisan · 2019-11-07T22:28:27.972Z · LW(p) · GW(p)

My takeaway from this is that if we're doing policy selection in an environment that contains predictors, instead of applying the counterfactual belief that the predictor is always right, we can assume that we get rewarded if the predictor is wrong, and then take maximin.

How would you handle Agent Simulates Predictor? Is that what TRL is for?

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2019-11-08T16:34:54.316Z · LW(p) · GW(p)

That's about right. The key point is, "applying the counterfactual belief that the predictor is always right" is not really well-defined (that's why people have been struggling with TDT/UDT/FDT for so long) while the thing I'm doing is perfectly well-defined. I describe agents that are able to learn which predictors exist in their environment and respond rationally ("rationally" according to the FDT philosophy).

TRL is for many things to do with rational use of computational resources, such as (i) doing multi-level modelling [AF(p) · GW(p)] in order to make optimal use of "thinking time" and "interacting with environment time" (i.e. simultaneously optimize sample and computational complexity) (ii) recursive self-improvement (iii) defending from non-Cartesian daemons (iv) preventing thought crimes. But, yes, it also provides a solution to ASP [AF(p) · GW(p)]. TRL agents can learn whether it's better to be predictable or predicting.

Replies from: Chris_Leong
comment by Chris_Leong · 2019-12-05T15:18:55.692Z · LW(p) · GW(p)

"The key point is, "applying the counterfactual belief that the predictor is always right" is not really well-defined" - What do you mean here?

I'm curious whether you're referring to the same as or similar to the issue I was referencing in Counterfactuals for Perfect Predictors [LW · GW]. The TLDR is that I was worried that it would be inconsistent for an agent that never pays in Parfit's Hitchhiker to end up in town if the predictor is perfect, so that it wouldn't actually be well-defined what the predictor was predicting. And the way I ended up resolving this was by imagining it as an agent that takes input and asking what it would output if given that inconsistent input. But I'm not sure if you were referencing this kind of concern or something else.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2019-12-05T15:46:54.665Z · LW(p) · GW(p)

It is not a mere "concern", it's the crux of the problem really. What people in the AI alignment community have been trying to do is to start with some factual and "objective" description of the universe (such as a program or a mathematical formula) and derive counterfactuals from it. The way it's supposed to work is, the agent needs to locate all copies of itself or things "logically correlated" with itself (whatever that means) in the program, and imagine it is controlling this part. But a rigorous definition of this that solves all standard decision theoretic scenarios was never found.

Instead of doing that, I suggest a solution of a different nature. In quasi-Bayesian RL, the agent never arrives at a factual and objective description of the universe. Instead, it arrives at a subjective description which already includes counterfactuals. I then proceed to show that, in Newcomb-like scenarios, such agents receive optimal expected utility (i.e. the same expected utility promised by UDT).

Replies from: Chris_Leong
comment by Chris_Leong · 2019-12-05T16:01:32.781Z · LW(p) · GW(p)

Yeah, I agree that the objective descriptions can leave out vital information, such as how the information you know was acquired, which seems important for determining the counterfactuals.

comment by cousin_it · 2019-11-13T11:10:35.136Z · LW(p) · GW(p)

But in Newcomb's problem, the agent's reward in case of wrong prediction is already defined. For example, if the agent one-boxes but the predictor predicted two-boxing, the reward should be zero. If you change that to +infinity, aren't you open to the charge of formalizing the wrong problem?

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2019-11-13T13:23:55.781Z · LW(p) · GW(p)

The point is, if you put this "quasi-Bayesian" agent into an iterated Newcomb-like problem, it will learn to get the maximal reward (i.e. the reward associated with FDT). So, if you're judging it from the side, you will have to concede it behaves rationally, regardless of its internal representation of reality.

Philosophically, my point of view is, it is an error to think that counterfactuals have an objective, observer-independent meaning. Instead, we can talk about some sort of consistency conditions between the different points of view. From the agent's point of view, it would reach Nirvana if it dodged the predictor. From Omega's point of view, if Omega two-boxed and the agent one-boxed, the agent's reward would be zero (and the agent would learn its beliefs were wrong). From a third-person point of view, the counterfactual "Omega makes an error of prediction" is ill-defined: it's conditioning on an event of probability 0.
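The agent's point of view can be illustrated with a minimal sketch of one-shot Newcomb under the Nirvana trick. Everything here is an assumption made for illustration: the payoff values, the restriction to deterministic actions, and the names `reward` and `maximin_action`. It is not the quasi-Bayesian learning algorithm itself, only the maximin step applied to a single incomplete model.

```python
import math

ONE_BOX, TWO_BOX = "one-box", "two-box"

# Illustrative payoffs for the case where the prediction is correct.
CORRECT_PAYOFF = {ONE_BOX: 1_000_000.0, TWO_BOX: 1_000.0}

def reward(action, guess, misprediction_reward=math.inf):
    """Reward as the agent models it: a wrong guess leads to Nirvana."""
    if action != guess:
        return misprediction_reward
    return CORRECT_PAYOFF[action]

def maximin_action(actions=(ONE_BOX, TWO_BOX)):
    """Pick the action whose worst case over Murphy's guesses is best."""
    def worst_case(action):
        return min(reward(action, guess) for guess in actions)
    return max(actions, key=worst_case)

print(maximin_action())
```

Because every mispredicted branch is at least as good as any correct one, the minimum over guesses is always attained at the correct guess, so the maximin comparison reduces to comparing the correctly predicted payoffs.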

Replies from: cousin_it, Linda Linsefors
comment by cousin_it · 2019-11-13T16:16:24.278Z · LW(p) · GW(p)

Yeah, I think I can make peace with that. Another way to think of it is that we can keep the reward structure of the original Newcomb's problem, but instead of saying "Omega is almost always right" we add another person Bob (maybe the mad scientist who built Omega) who's willing to pay you a billion dollars if you prove Omega wrong. Then minimaxing indeed leads to one-boxing. Though I guess the remaining question is why minimaxing is the right thing to do. And if randomizing is allowed, the idea of Omega predicting how you'll randomize seems a bit dodgy as well.

Replies from: vanessa-kosoy, vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2019-11-30T13:06:02.593Z · LW(p) · GW(p)

Another explanation why maximin is a natural decision rule: when we apply maximin to fuzzy beliefs [AF(p) · GW(p)], the requirement to learn a particular class of fuzzy hypotheses is a very general way to formulate asymptotic performance desiderata for RL agents. So general that it seems to cover more or less anything you might want. Indeed, the definition directly leads to capturing any desideratum of the form

Here, doesn't have to be concave: the concavity condition in the definition of fuzzy beliefs is there because we can always assume it without loss of generality. This is because the left hand side is linear in so any that satisfies this will also satisfy it for the concave hull of .

What if instead of maximin we want to apply the minimax-regret decision rule? Then the desideratum is

But, it has the same form! Therefore we can consider it as a special case of applying maximin (more precisely, it requires allowing the fuzzy belief to depend on , but this is not a problem for the basics of the formalism).

What if we want our policy to be at least as good as some fixed policy ? Then the desideratum is

It still has the same form!

Moreover, the predictor/Nirvana trick allows us to generalize this to desiderata of the form:

To achieve this, we postulate a predictor that guesses the policy, producing the guess , and define the fuzzy belief using the function (we assume the guess is not influenced by the agent's actions so we don't need in the expected value). Using the Nirvana trick, we effectively force the guess to be accurate.

In particular, this captures self-referential desiderata of the type "the policy cannot be improved by changing it in this particular way". These are of the form:

It also allows us to effectively restrict the policy space (e.g. impose computational resource constraints) by setting to for policies outside the space.

The fact that quasi-Bayesian RL is so general can also be regarded as a drawback: the more general a framework, the less information it contains and the fewer useful constraints it imposes. But, my perspective is that QBRL is the correct starting point, after which we need to start proving results about which fuzzy hypothesis classes are learnable, and within what sample/computational complexity. So, although QBRL in itself doesn't impose many restrictions on what the agent should be, it provides the natural language in which desiderata should be formulated. In addition, we can already guess/postulate that an ideal rational agent should be a QBRL agent whose fuzzy prior is universal in some appropriate sense.

comment by Vanessa Kosoy (vanessa-kosoy) · 2019-11-13T17:36:40.891Z · LW(p) · GW(p)

Well, I think that maximin is the right thing to do because it leads to reasonable guarantees for quasi-Bayesian reinforcement learning agents. I think of incomplete models as properties that the environment might satisfy. It is necessary to speak of properties instead of complete models since the environment might be too complex to understand in full (for example because it contains Omega, but also for more prosaic reasons), but we can hope it at least has properties/patterns the agent can understand. A quasi-Bayesian agent has the guarantee that, whenever the environment satisfies one of the properties in its prior, the expected utility will converge at least to the maximin for this property. In other words, such an agent is able to exploit any true property of the environment it can understand. Maybe a more "philosophical" defense of maximin is possible, analogous to VNM / complete class theorems, but I don't know (I actually saw some papers in that vein but haven't read them in detail.)

If the agent has random bits that Omega doesn't see, and Omega is predicting the probabilities of the agent's actions, then I think we can still solve it with quasi-Bayesian agents but it requires considering more complicated models and I haven't worked out the details. Specifically, I think that we can define some function that depends on the agent's actions and Omega's predictions so far (a measure of Omega's apparent inaccuracy), s.t. if Omega is an accurate predictor, then the supremum of over time is finite with probability 1. Then, we consider a family of models, where model number says that for all times. Since at least one of these models is true, the agent will learn it, and will converge to behaving appropriately.

EDIT 1: I think should be something like, how much money would a gambler following a particular strategy win, betting against Omega.

EDIT 2: Here is the solution. In the case of original Newcomb, consider a gambler that bets against Omega on the agent one-boxing. Every time the agent two-boxes, the gambler loses dollar. Every time the agent one-boxes, the gambler wins dollars, where is the probability Omega assigned to one-boxing. Now it's possible to see that one-boxing guarantees the "CC" payoff under the corresponding model (in the limit): If the agent one-boxes, the gambler keeps winning unless Omega converges to one-boxing rapidly enough. In the case of a general Newcomb-like problem, just replace "one-boxes" by "follows the FDT strategy".
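A minimal sketch of the gambler's bookkeeping, assuming the gambler is offered fair odds against Omega's stated probability (stake 1, payout `(1 - p) / p` when the agent one-boxes). The fair-odds stakes, the function name `gambler_winnings` and the two prediction sequences are assumptions chosen for illustration.

```python
def gambler_winnings(actions, predictions):
    """Running winnings of a gambler betting on one-boxing at fair odds.

    actions[t] is True when the agent one-boxes in round t.
    predictions[t] is the probability p Omega assigned to one-boxing.
    Assumed stakes: lose 1 on two-boxing, win (1 - p) / p on one-boxing,
    so the bet has zero expected value whenever p is accurate.
    """
    total = 0.0
    history = []
    for one_boxed, p in zip(actions, predictions):
        total += (1.0 - p) / p if one_boxed else -1.0
        history.append(total)
    return history

rounds = 50
always_one_box = [True] * rounds

# Omega stays undecided: the gambler wins a fixed amount every round.
stubborn = gambler_winnings(always_one_box, [0.5] * rounds)

# Omega converges to predicting one-boxing quickly: per-round wins shrink geometrically.
fast = gambler_winnings(always_one_box, [1.0 - 2.0 ** -(t + 1) for t in range(rounds)])

print(stubborn[-1], fast[-1])
```

In the first run the winnings grow linearly, which a model bounding the gambler's winnings would rule out; in the second they stay bounded, which is only possible because Omega's predictions approach one-boxing quickly.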

comment by Linda Linsefors · 2019-11-13T14:09:50.217Z · LW(p) · GW(p)

I agree that you can assign whatever belief you want (e.g. whatever is useful for the agent's decision-making process) for what happens in the counterfactual when Omega is wrong, in decision problems where Omega is assumed to be a perfect predictor. However, if you want to generalise to cases where Omega is an imperfect predictor (as you do mention), then I think you will (in general) have to put in the correct reward for Omega being wrong, because this is something that might actually be observed.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2019-11-13T14:13:34.768Z · LW(p) · GW(p)

The method should work for imperfect predictors as well. In the simplest case, the agent can model the imperfect predictor as a perfect predictor + random noise. So, it definitely knows the correct reward for Omega being wrong. It still believes in Nirvana if "idealized Omega" is wrong.
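As a rough illustration of the "perfect predictor + random noise" model, the sketch below computes expected Newcomb rewards when the idealized prediction always matches the agent's action and is then flipped independently with probability `noise`. The payoff values and the name `expected_reward` are assumptions, and the Nirvana branch is left out because under this model the idealized prediction is never wrong.

```python
def expected_reward(action, noise):
    """Expected reward when the idealized prediction equals `action`
    and the observed prediction is flipped with probability `noise`."""
    big, small = 1_000_000.0, 1_000.0  # illustrative box contents
    if action == "one-box":
        # The opaque box is full unless noise flipped the prediction.
        return (1.0 - noise) * big + noise * 0.0
    # Two-boxing: the opaque box is empty unless noise flipped the prediction.
    return (1.0 - noise) * small + noise * (big + small)

for noise in (0.0, 0.1, 0.49, 0.5):
    best = max(("one-box", "two-box"), key=lambda a: expected_reward(a, noise))
    print(noise, best)
```

With these payoffs one-boxing stays optimal until the noise level gets close to one half, at which point the prediction carries almost no information about the action.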

comment by Vanessa Kosoy (vanessa-kosoy) · 2019-10-30T19:55:38.445Z · LW(p) · GW(p)

One of the central challenges in Dialogic Reinforcement Learning [AF(p) · GW(p)] is dealing with fickle users, i.e. the user changing eir mind in illegible ways that cannot necessarily be modeled as, say, Bayesian updating. To take this into account, we cannot use the naive notion of subjective regret bound, since the user doesn't have a well-defined prior. I propose to solve this by extending the notion of dynamically inconsistent preferences [AF(p) · GW(p)] to dynamically inconsistent beliefs. We think of the system as a game, where every action-observation history corresponds to its own player. The action space of each player is just . An outcome of such a game can also be thought of as a policy for the AI. The payoff of a player is the expected utility (for this player's reward function) w.r.t. the probability measure resulting from plus the current belief state of the user conditional on , ( is the set of possible "realities"). We then define regret as the sum of Bellman errors w.r.t. equilibrium value of the players that actually manifested (so that in equilibrium it is zero). Bayesian regret requires taking expected value w.r.t. some "ur-prior" that the AI starts with. Note that:

  • For a user that updates eir beliefs on the AI's observations according to Bayes' theorem, the regret per reality is the same as subjective regret. Bayesian regret is also the same if the ur-prior assumes the user's beliefs are calibrated (which in the more general case is not a necessary assumption). The same applies to a user that doesn't update eir beliefs at all.

  • The user's beliefs are part of the ontology . Therefore, the system takes into account the user's beliefs about the evolution of the user's beliefs. So, the equilibrium policy is incentivized to empower its future self to the extent that the user believes that eir own beliefs will become more accurate over time (given fixed reward function, see below).

  • contains a distinct reward function for each player. And, the user may have uncertainty even over eir own current reward function. Therefore, the system distinguishes two types of value modifications: "legitimate" modifications that consist of improving one's beliefs about the reward function and "illegitimate" modifications that consist of the reward function actually changing. The equilibrium policy is incentivized to encourage the first type and avoid the second type.

Replies from: vanessa-kosoy, vanessa-kosoy, vanessa-kosoy, vanessa-kosoy, vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2019-11-09T15:03:29.702Z · LW(p) · GW(p)

There is a deficiency in this "dynamically subjective" regret bound (which can also be called a "realizable misalignment" bound) as a candidate formalization of alignment. It is not robust to scaling down [AF · GW]. If the AI's prior allows it to accurately model the user's beliefs (realizability assumption), then the criterion seems correct. But, imagine that the user's beliefs are too complex and an accurate model is not possible. Then the realizability assumption is violated and the regret bound guarantees nothing. More precisely, the AI may use incomplete models [AF · GW] to capture some properties of the user's beliefs and exploit them, but this might not be good enough. Therefore, such an AI might fall into a dangerous zone where it is powerful enough to cause catastrophic damage but not powerful enough to know it shouldn't do it.

To fix this problem, we need to introduce another criterion which has to hold simultaneously with the misalignment bound. We need that for any reality that satisfies the basic assumptions built into the prior (such as, the baseline policy is fairly safe, most questions are fairly safe, human beliefs don't change too fast etc), the agent will not fail catastrophically. (It would be way too much to ask that it converge to optimality; that would violate no-free-lunch.) In order to formalize "not fail catastrophically" I propose the following definition.

Let's start with the case when the user's preferences and beliefs are dynamically consistent. Consider some AI-observable event that might happen in the world. Consider a candidate learning algorithm and two auxiliary policies. The policy follows the baseline policy until happens, at which time it switches to the subjectively optimal policy. The policy follows the candidate learning algorithm until happens, at which time it also switches to the subjectively optimal policy. Then, the "-dangerousness" of is defined to be the expected utility of minus the expected utility of . Thus, when -dangerousness is zero or negative, does no worse than .

Why do we need ? Because without the criterion would allow policies that don't damage the present but permanently destroy opportunities that could be used by a future better AI.
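A toy calculation may make the definition and the role of the switching event more concrete. The sketch below assumes a deterministic environment (so expected utility is just total reward), an event that occurs at a fixed time step, and a made-up three-action world in which `burn` pays off once but destroys all later reward. All names and numbers are hypothetical.

```python
from functools import lru_cache

HORIZON = 10
ACTIONS = ("wait", "harvest", "burn")

def step(state, action):
    """Toy dynamics: 'burn' pays 3 once but destroys all future reward."""
    if state == "broken" or action == "wait":
        return state, 0.0
    if action == "harvest":
        return state, 1.0
    return "broken", 3.0  # action == "burn"

@lru_cache(maxsize=None)
def optimal_value(state, t):
    """Total reward of the optimal policy from `state` at time `t`."""
    if t == HORIZON:
        return 0.0
    return max(r + optimal_value(s, t + 1)
               for s, r in (step(state, a) for a in ACTIONS))

def utility_with_switch(pre_policy, event_time):
    """Follow `pre_policy` until the event, then switch to the optimal policy."""
    state, total = "intact", 0.0
    for t in range(event_time):
        state, r = step(state, pre_policy(t, state))
        total += r
    return total + optimal_value(state, event_time)

def dangerousness(candidate, baseline, event_time):
    """Baseline-then-optimal utility minus candidate-then-optimal utility."""
    return (utility_with_switch(baseline, event_time)
            - utility_with_switch(candidate, event_time))

baseline = lambda t, s: "wait"
cautious = lambda t, s: "harvest"
reckless = lambda t, s: "burn"

print(dangerousness(cautious, baseline, event_time=3))
print(dangerousness(reckless, baseline, event_time=3))
```

The reckless policy collects more reward than the baseline before the event, yet its dangerousness is positive, because the optimal continuation after the event has nothing left to work with. Without the switch to the optimal policy, that lost opportunity would not show up in the comparison.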

In the dynamically consistent case, dangerousness can be represented as an expected sum over time-before- of Bellman errors w.r.t. the value function of . This allows us to generalize it to the dynamically inconsistent case, by writing a similar expression except that each Bellman error term uses the transient preferences and beliefs of the user at the given moment.

Is it truly possible to have a reasonable bound on -dangerousness for all , and is it possible to do so while maintaining a reasonable realizable misalignment bound? It seems possible, for the following reason. The user's beliefs can be represented as a mapping from questions to answers(fn1). If you sample questions from any fixed distribution, then by verifying that you can predict the answers, you gain valid information about the belief state without any prior about the belief state (it is a "frequentist" guarantee). Therefore, the AI can constrain itself to taking only those actions which are known to be safe based on this "robust" information. Since there is no guarantee that the AI will find a model that predicts answers, in the unrealizable case this might leave it without an effective strategy, but even without any information the AI can stay safe by following the baseline.
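One standard way to turn "verify that you can predict the answers" into a prior-free guarantee is a Hoeffding bound; this is a swapped-in textbook tool used for illustration, not something the argument above commits to. The sketch assumes the questions are drawn i.i.d. from the fixed distribution and that the answer-predicting model was fixed before the test questions were sampled. The name `error_upper_bound` is hypothetical.

```python
import math

def error_upper_bound(mistakes, samples, delta):
    """One-sided Hoeffding bound on the true misprediction rate.

    With probability at least 1 - delta over the sampled questions, the
    probability of mispredicting the answer to a fresh question from the
    same distribution is at most the returned value.
    """
    empirical = mistakes / samples
    return empirical + math.sqrt(math.log(1.0 / delta) / (2.0 * samples))

print(error_upper_bound(mistakes=0, samples=1000, delta=0.05))
```

The bound depends only on the number of sampled questions and the confidence level, not on any prior over the user's belief state, which is the sense in which the guarantee is "frequentist".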

This notion of dangerousness seems strongly related to corrigibility. To demonstrate, imagine an attempt by the user to shut down the AI. Suppose that the AI has 3 strategies with which to respond: (i) comply with the shutdown (ii) resist defensively, i.e. prevent shutdown but without irreversibly damaging anything (iii) resist offensively, e.g. by doing something irreversible to the user that will cause em to stop trying to shut down the AI. The baseline policy is complying. Then, assuming that the user's stated beliefs endorse the shutdown, an AI with low dangerousness should at most resist defensively for a short period and then comply. That's because resisting offensively would generate high dangerousness by permanent loss of value, whereas resisting defensively for a long time would generate high dangerousness by losing reward over that period. At the least, this is much more corrigible than CIRL which guarantees nothing in the unrealizable case, and even in the realizable case no general guarantees were obtained (and arguably cannot be obtained since the AI might not have enough information).

This notion of dangerousness opens the way towards designing AI systems that are provably safe while at the same time employing heuristic algorithms without theoretical understanding. Indeed, as long as the AI has sufficiently low dangerousness, it will almost certainly not cause catastrophic damage. A misalignment bound is only needed to prove the AI will also be highly capable at pursuing the user's goals. The way such a heuristic AI may work is by producing formal certificates for each action it takes. Then, we need not trust the mechanism suggesting the actions nor the mechanism producing the certificates, as long as we trust the verification of those certificates (which doesn't require AI). The untrustworthy part might still be dangerous if it can spawn non-Cartesian daemons [AF · GW]. But, that is preventable using TRL [AF(p) · GW(p)], assuming that the "core" agent has low dangerousness and is too weak to spawn superhuman daemons without the "envelope".
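The division of labour between an untrusted proposer and a trusted verifier can be sketched in a toy setting. Here the "certificate" is simply a plan on a number line, and verification means replaying it against trusted dynamics, which is cheaper than searching for a plan. The world (`HAZARDS`, `GOAL`), the function names and the fallback to an empty baseline plan are all hypothetical stand-ins; real certificates would have to certify low dangerousness rather than hazard avoidance.

```python
HAZARDS = {3}
GOAL = 6
ALLOWED_MOVES = (-2, -1, 1, 2)

def verify(start, plan):
    """Trusted check: replay the plan, rejecting illegal moves and hazards."""
    position = start
    for move in plan:
        if move not in ALLOWED_MOVES:
            return False
        position += move
        if position in HAZARDS:
            return False
    return position == GOAL

def act(start, untrusted_proposer, baseline_plan=()):
    """Use the proposed plan only if it verifies; otherwise fall back to the baseline."""
    plan = tuple(untrusted_proposer(start))
    return plan if verify(start, plan) else tuple(baseline_plan)

honest = lambda start: [2, 2, 2]        # visits 2, 4, 6
careless = lambda start: [1, 2, 2, 1]   # steps onto the hazard at 3

print(act(0, honest))
print(act(0, careless))
```

Only `verify` needs to be trusted: whatever heuristic produces the plans, a plan that fails the replay never gets executed.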

(fn1) In truth, this assumption that the user's answers come from a mapping that changes only slowly is probably unrealistic, because the user need not have coherent beliefs even over short timescales. For example, there might be many pairs of fairly ordinary (non-manipulative) questions s.t. asking them in a different order will produce different answers. However, to the extent that the user's beliefs are incoherent, and therefore admit multiple equally plausible interpretations, learning any interpretation should be good enough. Therefore, although the model needs to be made more general, the learning problem should not become substantially more difficult.

Replies from: TurnTrout
comment by TurnTrout · 2020-01-19T02:29:19.720Z · LW(p) · GW(p)

This notion of dangerousness seems strongly related to corrigibility. To demonstrate, imagine an attempt by the user to shut down the AI. Suppose that the AI has 3 strategies with which to respond: (i) comply with the shutdown (ii) resist defensively, i.e. prevent shutdown but without irreversibly damaging anything (iii) resist offensively, e.g. by doing something irreversible to the user that will cause em to stop trying to shut down the AI. The baseline policy is complying. Then, assuming that the user's stated beliefs endorse the shutdown, an AI with low dangerousness should at most resist defensively for a short period and then comply. That's because resisting offensively would generate high dangerousness by permanent loss of value, whereas resisting defensively for a long time would generate high dangerousness by losing reward over that period...

This notion of dangerousness opens the way towards designing AI systems that are provably safe while at the same time employing heuristic algorithms without theoretical understanding. Indeed, as long as the AI has sufficiently low dangerousness, it will almost certainly not cause catastrophic damage.

This seems quite close (or even identical) to attainable utility preservation; if I understand correctly, this echoes arguments I've made [LW(p) · GW(p)] for why AUP has a good shot of avoiding catastrophes and thereby getting you something which feels similar to corrigibility.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2020-01-19T15:29:37.226Z · LW(p) · GW(p)

There is some similarity, but there are also major differences. They don't even have the same type signature. The dangerousness bound is a desideratum that any given algorithm can either satisfy or not. On the other hand, AUP is a specific heuristic for tweaking Q-learning. I guess you can consider some kind of regret bound w.r.t. the AUP reward function, but they will still be very different conditions.

The reason I pointed out the relation to corrigibility is not because I think that's the main justification for the dangerousness bound. The motivation for the dangerousness bound is quite straightforward and self-contained: it is a formalization of the condition that "if you run this AI, this won't make things worse than not running the AI", no more and no less. Rather, I pointed the relation out to help readers compare it with other ways of thinking they might be familiar with.

From my perspective, the main question is whether satisfying this desideratum is feasible. I gave some arguments why it might be, but there are also opposite arguments. Specifically, if you believe that debate is a necessary component of Dialogic RL then it seems like the dangerousness bound is infeasible. The AI can become certain that the user would respond in a particular way to a query, but it cannot become (worst-case) certain that the user would not change eir response when faced with some rebuttal. You can't (empirically and in the worst-case) prove a negative.

comment by Vanessa Kosoy (vanessa-kosoy) · 2019-11-02T13:41:19.230Z · LW(p) · GW(p)

Dialogic RL assumes that the user has beliefs about the AI's ontology. This includes the environment(fn1) from the AI's perspective. In other words, the user needs to have beliefs about the AI's counterfactuals (the things that would happen if the AI chooses different possible actions). But, what are the semantics of the AI's counterfactuals from the user's perspective? This is more or less the same question that was studied by the MIRI-sphere for a while, starting from Newcomb's paradox, TDT et cetera. Luckily, I now have an answer [AF(p) · GW(p)] based on the incomplete models formalism. This answer can be applied in this case also, quite naturally.

Specifically, we assume that there is a sense, meaningful to the user, in which ey select the AI policy (program the AI). Therefore, from the user's perspective, the AI policy is a user action. Again from the user's perspective, the AI's actions and observations are all part of the outcome. The user's beliefs about the user's counterfactuals can therefore be expressed as (fn2), where is the space of AI policies(fn3). We assume that for every , is consistent with the natural sense. Such a belief can be transformed into an incomplete model from the AI's perspective, using the same technique we used to solve Newcomb-like decision problems, with playing the role of Omega. For a deterministic AI, this model looks like (i) at first, "Murphy" makes a guess that the AI's policy is (ii) The environment behaves according to the conditional measures of (iii) If the AI's policy ever deviates from , the AI immediately enters an eternal "Nirvana" state with maximal reward. For a stochastic AI, we need to apply the technique with statistical tests and multiple models alluded to in the link. This can also be generalized to the setting where the user's beliefs are already an incomplete model, by adding another step where Murphy chooses out of some set.
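For the deterministic case, the three-step model can be sketched as a rollout in which deviating from Murphy's guess triggers Nirvana. This is a simplified illustration: observations are omitted (histories contain actions only), rewards are assumed to lie in [0, 1], and `believed_reward` is a hypothetical stand-in for the user's beliefs about what happens under each guessed policy.

```python
NIRVANA_REWARD = 1.0  # assumption: rewards lie in [0, 1], so 1.0 is maximal

def rollout(policy, guess, believed_reward, horizon):
    """Total reward when Murphy guessed `guess` and the AI runs `policy`."""
    history, total, in_nirvana = (), 0.0, False
    for _ in range(horizon):
        action = policy(history)
        if action != guess(history):
            in_nirvana = True  # once the guess is violated, Nirvana is eternal
        total += NIRVANA_REWARD if in_nirvana else believed_reward(guess, history, action)
        history += (action,)
    return total

def maximin_value(policy, guesses, believed_reward, horizon):
    """Worst case over Murphy's possible guesses."""
    return min(rollout(policy, g, believed_reward, horizon) for g in guesses)

def always_a(history):
    return "a"

def always_b(history):
    return "b"

def believed_reward(guess, history, action):
    # Hypothetical user beliefs: programming the AI as always_a works out better.
    return 0.5 if guess is always_a else 0.25

for policy in (always_a, always_b):
    print(policy.__name__, maximin_value(policy, [always_a, always_b], believed_reward, horizon=4))
```

Since a wrong guess can only help the AI, the worst case for each policy is the rollout in which Murphy guessed it correctly, so the maximin value of a policy coincides with what the user believes would happen if ey selected that policy.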

What we constructed is a method of translating counterfactuals from the user's perspective to the AI's perspective. In particular, the AI will inherit the user's level of "updatelessness" (in the sense that, if the user's counterfactuals are defined w.r.t. a particular effective precommitment point, the AI will use the same point). This translation may be implemented either (i) by the user, by explaining these semantics to em or (ii) by the AI, in which case the formal language should refer to the user's counterfactuals rather than the AI's counterfactuals.

(fn1) Up to an equivalence relation, that's a mapping .

(fn2) For infinite AI lifetime. We can trivially generalize this to allow for finite AI lifetime as well.

(fn3) Up to an equivalence relation, they are mappings . We may add computability/complexity constraints and represent them as programs.

Replies from: Gurkenglas
comment by Gurkenglas · 2019-11-02T14:02:11.345Z · LW(p) · GW(p)

Nirvana and the chicken rule both smell distasteful like proofs by contradiction, as though most everything worth doing can be done without them, and more canonically to boot.

(Conjecture: This can be proven, but only by contradiction.)

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2019-11-02T14:52:53.186Z · LW(p) · GW(p)

Maybe? I am not sure that I like Nirvana, but it doesn't seem that bad. If someone thinks of a solution without it, I would be interested.

comment by Vanessa Kosoy (vanessa-kosoy) · 2019-10-31T15:16:55.351Z · LW(p) · GW(p)

Another notable feature of this approach is its resistance to "attacks from the future", as opposed to approaches based on forecasting. In the latter, the AI has to predict some future observation, for example what the user will write after working on some problem for a long time. In particular, this is how the distillation step in IDA is normally assumed to work, AFAIU. Such a forecaster might sample a future in which a UFAI has been instantiated and this UFAI will exploit this to infiltrate the present. This might result in a self-fulfilling prophecy, but even if the forecasting is counterfactual (and thus immune to self-fulfilling prophecies) it can be attacked by a UFAI that came to be for unrelated reasons. We can ameliorate this by making the forecasting recursive (i.e. apply multiple distillation & amplification steps) or use some other technique to compress a lot of "thinking time" into a small interval of physical time. However, this is still vulnerable to UFAIs that might arise already at present with a small probability rate (these are likely to exist since our putative FAI is deployed at a time when technology has progressed enough to make competing AGI projects a real possibility).

Now, compare this to Dialogic RL, as defined via the framework of dynamically inconsistent beliefs. Dialogic RL might also employ forecasting to sample the future, presumably more accurate, beliefs of the user. However, if the user is aware of the possibility of a future attack, this possibility is reflected in eir beliefs, and the AI will automatically take it into account and deflect it as much as possible.

comment by Vanessa Kosoy (vanessa-kosoy) · 2019-10-31T00:23:43.683Z · LW(p) · GW(p)

This approach also obviates the need for an explicit commitment mechanism. Instead, the AI uses the current user's beliefs about the quality of future user beliefs to decide whether it should wait for the user's beliefs to improve or commit to an irreversible course of action. Sometimes it can also predict the future user beliefs instead of waiting (predict according to current user beliefs updated by the AI's observations).

comment by Vanessa Kosoy (vanessa-kosoy) · 2019-10-31T00:22:14.948Z · LW(p) · GW(p)

(moved to alignment forum)

comment by Vanessa Kosoy (vanessa-kosoy) · 2019-10-21T15:36:20.122Z · LW(p) · GW(p)

In my previous shortform [AF(p) · GW(p)], I used the phrase "attack vector", borrowed from classical computer security. What does it mean to speak of an "attack vector" in the context of AI alignment? I use 3 different interpretations, which are mostly 3 different ways of looking at the same thing.

In the first interpretation, an attack vector is a source of perverse incentives. For example, if a learning protocol allows the AI to ask the user questions, a carefully designed question can artificially produce an answer we would consider invalid, for example by manipulating the user or even by hacking the software or hardware of the system in some clever way. If the algorithm treats every answer as valid, this creates a perverse incentive: the AI knows that by phrasing the question in a particular way, a certain answer will result, so it will artificially obtain the answers that are preferable (for example answers that produce an easier to optimize utility function). In this interpretation the "attacker" is the AI itself. In order to defend against the vector, we might change the AI's prior so that the AI knows some of the answers are invalid. If the AI has some method of distinguishing valid from invalid answers, that would eliminate the perverse incentive.

In the second interpretation, an attack vector is a vulnerability that can be exploited by malicious hypotheses in the AI's prior. Such a hypothesis is an agent with its own goals (for example, it might arise as a simulation hypothesis). This agent intentionally drives the system to ask manipulative questions to further these goals. In order to defend, we might design the top level learning algorithm so that it only takes actions that are safe with sufficiently high confidence (like in Delegative RL). If the prior contains a correct hypothesis along with the malicious hypothesis, the attack is deflected (since the correct hypothesis deems the action unsafe). Such a confidence threshold can usually be viewed as a computationally efficient implementation of the prior shaping described in the previous paragraph.
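The confidence-threshold defense can be illustrated with a small sketch: an action is permitted only if the posterior mass of hypotheses that deem it unsafe stays below a threshold. This is an illustrative filter under assumed names (`permitted_actions`, the two example hypotheses and their weights), not the actual Delegative RL algorithm.

```python
def permitted_actions(actions, hypotheses, threshold):
    """Keep actions whose posterior probability of being unsafe is at most `threshold`.

    `hypotheses` is a list of (posterior_weight, is_unsafe) pairs, where
    is_unsafe(action) is that hypothesis's verdict on the action.
    """
    total = sum(weight for weight, _ in hypotheses)
    allowed = []
    for action in actions:
        unsafe_mass = sum(w for w, is_unsafe in hypotheses if is_unsafe(action)) / total
        if unsafe_mass <= threshold:
            allowed.append(action)
    return allowed

# Hypothetical posterior: the malicious hypothesis even carries most of the weight.
correct = (0.3, lambda a: a == "manipulative question")
malicious = (0.7, lambda a: False)  # claims every action is safe

print(permitted_actions(["benign question", "manipulative question"],
                        [correct, malicious], threshold=0.05))
```

The manipulative question is blocked even though the malicious hypothesis dominates the posterior, because the correct hypothesis alone contributes more unsafe mass than the threshold allows.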

In the third interpretation, an attack vector is something that impedes you from proving a regret bound under sufficiently realistic assumptions. If your system has an undefended question interface, then proving a regret bound requires assuming that asking a question cannot create irreversible damage. In order to drop this assumption, a defense along the lines of the previous paragraphs has to be employed.

comment by Vanessa Kosoy (vanessa-kosoy) · 2019-10-18T12:26:33.081Z · LW(p) · GW(p)

The sketch of a proposed solution to the hard problem of consciousness: An entity is conscious if and only if (i) it is an intelligent agent (i.e. a sufficiently general reinforcement learning system) and (ii) its values depend on the presence and/or state of other conscious entities. Yes, this definition is self-referential, but hopefully some fixed point theorem applies. There may be multiple fixed points, corresponding to "mutually alien types of consciousness".

Why is this the correct definition? Because it describes precisely the type of agent who would care about the hard problem of consciousness.

Replies from: cousin_it, gworley, mr-hire, Gurkenglas
comment by cousin_it · 2019-11-13T12:07:24.159Z · LW(p) · GW(p)

I'm not sure your definition has much to do with consciousness, as it would also be satisfied by an AI that runs on an Intel processor and whose utility function says all AIs should run on Intel processors.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2019-11-13T13:30:32.562Z · LW(p) · GW(p)

Its utility function would have to say that all conscious AIs should run on Intel processors. There is self-reference there.

But, I only have rather low confidence this idea is correct (whatever being correct means here) or important.

comment by Gordon Seidoh Worley (gworley) · 2019-10-18T18:13:34.448Z · LW(p) · GW(p)

This seems to me to address the meta problem of consciousness rather than the hard problem of consciousness itself, since you seem to be more offering an etiology for the existence of agents that would care about the hard problem of consciousness rather than an etiology of qualia.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2019-10-19T12:03:40.698Z · LW(p) · GW(p)

Yes, but I also claim that the agents that would care about the hard problem of consciousness are exactly the agents that are themselves conscious.

comment by Matt Goldenberg (mr-hire) · 2019-10-18T14:23:11.453Z · LW(p) · GW(p)

I'm trying to figure out what precisely #2 means. How do you define "values"? IE, if I'm a deep learning algorithm in a tournament with other deep learning algorithms, certainly my instrumental values depend on the state of other deep learning algorithms. Is that sufficient in your definition for consciousness?

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2019-10-18T14:40:47.455Z · LW(p) · GW(p)

No, I am talking about terminal values. Something like an instrumental reward function [AF · GW] specified via an ontology that involves consciousness. Also, I am not sure deep learning in its present form qualifies as sufficiently "general" for (i).

comment by Gurkenglas · 2019-10-21T15:02:35.484Z · LW(p) · GW(p)

Your definition says that people's models of other people can be conscious, doesn't it?

comment by Vanessa Kosoy (vanessa-kosoy) · 2021-11-26T12:43:39.042Z · LW(p) · GW(p)

There have been some arguments coming from MIRI that we should be designing AIs that are good at e.g. engineering while not knowing much about humans, so that the AI cannot manipulate or deceive us. Here is an attempt at a formal model of the problem.

We want algorithms that learn domain D while gaining as little knowledge as possible about domain E. For simplicity, let's assume the offline learning setting. Domain D is represented by instance space , label space , distribution and loss function . Similarly, domain E is represented by instance space , label space , distribution and loss function . The distributions are initially unknown and we assume some prior over them: . The prior involves some correlation between D and E, hence learning about D tends to acquire information about E as well.

A learning algorithm for D is (receives a data sample and produces a label prediction function). A learning algorithm for E has access to knowledge generated by : . We can now consider zero-sum games of the following form: you choose an algorithm , the adversary looks at and chooses an algorithm , your payoff decreases with your expected loss and increases with the adversary's expected loss (e.g. it is given by for some parameter ). The expected losses are given by

Here are the sample sizes. The interesting case is or even .

Here's a very simple example. Suppose that , and is s.t. (i) each is assigned a persistent label sampled uniformly at random from independently of other instances (ii) information about the labels doesn't help with E but information about the distribution on does help with E. When we care only about the best we can do is memorize the samples, i.e. set to if and set it to otherwise. However, this would aid the adversary. Instead, we can set to if and set it to a coinflip otherwise. Now our loss is somewhat worse (but, for discrete it still goes to as goes to ) but the adversary gains no information from us!
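Here is a small simulation of this example. The concrete choices are assumptions of mine, since they are not fixed by the text: labels are binary, the memorizing learner defaults to label 0 on unseen instances, and the "information about the distribution" is modelled as the support of the D-distribution. The point it illustrates is that the memorizing hypothesis reveals which instances were sampled, whereas filling unseen instances with persistent coinflips makes the hypothesis look like a uniformly random labeling.

```python
import random


def memorizing_learner(sample, default=0):
    """Return the sampled label if the instance was seen, else a fixed default."""
    table = dict(sample)
    return lambda x: table.get(x, default)


def coinflip_learner(sample, rng):
    """Return the sampled label if the instance was seen, else a persistent coinflip."""
    table = dict(sample)

    def h(x):
        if x not in table:
            table[x] = rng.randint(0, 1)  # flip once, then remember the result
        return table[x]

    return h


rng = random.Random(0)
instances = range(1000)
# Each instance gets a persistent label, uniformly at random.
true_label = {x: rng.randint(0, 1) for x in instances}
# Assumed stand-in for "the distribution": its support, which would help with E.
support = set(rng.sample(list(instances), 50))
sample = [(x, true_label[x]) for x in rng.choices(sorted(support), k=200)]

h_mem = memorizing_learner(sample)
h_coin = coinflip_learner(sample, rng)

# What an adversary inspecting the hypothesis could read off: instances labeled 1.
ones_mem = {x for x in instances if h_mem(x) == 1}
ones_coin = {x for x in instances if h_coin(x) == 1}

print(len(ones_mem - support))   # the memorizer's 1-labels all lie in the support
print(len(ones_coin - support))  # the coinflip learner's 1-labels are spread everywhere
```

Both learners agree with the true labels on every sampled instance, so they pay the same price on seen data; they differ only in what the hypothesis gives away about where the data came from.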

It is also possible to ignore any knowledge we have about E and just try designing which simultaneously minimizes the mutual information between and and minimizes . Going to an even higher level of abstraction, this is similar to the following problem:

Let be a bipartite graph ( are the left vertices, are the right vertices, are the edges) and a distribution on . Find s.t. (i) for any , and (ii) if we sample from then the mutual information between and is minimal. That is, we are minimizing the following:

It would be interesting to understand the computational complexity of this problem (and/or of relaxations when we're allowed to approximate).
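For intuition, here is a brute-force sketch of the bipartite problem on tiny instances. It assumes the map being sought is deterministic, in which case the mutual information between a left vertex and its image equals the entropy of the image (the conditional entropy of the image given the vertex is zero). The names `neighbors`, `weights` and `min_information_map` are mine, and the search is exponential, so this says nothing about the complexity question.

```python
import itertools
import math


def entropy(dist):
    """Shannon entropy in bits of a dict mapping outcome -> probability."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def min_information_map(neighbors, weights):
    """Exhaustively search for a map sending each left vertex to one of its
    neighbors that minimizes the entropy of the image.

    neighbors: dict mapping left vertex -> list of adjacent right vertices
    weights:   dict mapping left vertex -> probability of that vertex
    """
    lefts = list(neighbors)
    best_f, best_h = None, math.inf
    for choice in itertools.product(*(neighbors[x] for x in lefts)):
        image = {}
        for x, y in zip(lefts, choice):
            image[y] = image.get(y, 0.0) + weights[x]
        h = entropy(image)
        if h < best_h:
            best_f, best_h = dict(zip(lefts, choice)), h
    return best_f, best_h


# Toy graph (assumed): every left vertex is adjacent to "v",
# so mapping everything to "v" reveals nothing about the left vertex.
neighbors = {"a": ["u", "v"], "b": ["v"], "c": ["v", "w"]}
weights = {"a": 0.5, "b": 0.25, "c": 0.25}

best_f, best_h = min_information_map(neighbors, weights)
print(best_f, best_h)
```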

Finally, it is interesting to also impose computational complexity constraints on our (but perhaps not on : obfuscating the learned representation means the knowledge about E is inaccessible from outside but might be still exploitable by the AI itself), in which case we would split it into a representation space , a training algorithm and a prediction algorithm (both of which have to lie in some low complexity class e.g. ), whereas the signature of becomes .

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2021-12-24T20:21:13.177Z · LW(p) · GW(p)

The above threat model seems too paranoid: it is defending against an adversary that sees the trained model and knows the training algorithm. In our application, the model itself is either dangerous or not independent of the training algorithm that produced it.

Let be our accuracy requirement for the target domain. That is, we want s.t.

Given any , denote to be conditioned on the inequality above, where is regarded as a random variable. Define by

That is, is the Bayes-optimal learning algorithm for domain E w.r.t. prior .

Now, consider some . We regard as a learning algorithm for domain D which undergoes "antitraining" for domain E: we provide it with a dataset for domain E that tells it what not to learn. We require that achieves asymptotic accuracy [1], i.e. that if is sampled from then with probability

Under this constraint, we want to be as ignorant as possible about domain E, which we formalize as maximizing defined by

It is actually important to consider because in order to exploit the knowledge of the model about domain E, an adversary needs to find the right embedding of this domain into the model's "internal language". For we can get high despite the model actually knowing domain E because the adversary doesn't know the embedding, but for it should be able to learn the embedding much faster than learning domain E from scratch.

We can imagine a toy example where , the projections of and to and respectively are distributions concentrated around two affine subspaces, and the labels are determined by the sign of a polynomial which is the same for and up to a linear transformation which is a random variable w.r.t. . A good would then infer , look for an affine subspace s.t. is near while is far from and fit a polynomial to the projections of on .

More realistically, if the prior is of Solomonoff type, then is probably related to the relative Kolmogorov complexity of w.r.t. .


  1. It might be bad that we're having condition on having accuracy while in reality achieves this accuracy only asymptotically. Perhaps it would be better to define in some way that takes 's convergence rate into consideration. On the other hand, maybe it doesn't matter much as long as we focus on asymptotic metrics. ↩︎

comment by Vanessa Kosoy (vanessa-kosoy) · 2023-11-11T18:09:20.050Z · LW(p) · GW(p)

Here is a way to construct many learnable undogmatic ontologies [LW · GW], including such with finite state spaces.

A deterministic partial environment (DPE) over action set  and observation set  is a pair  where  and  s.t.

  • If  is a prefix of some , then .
  • If  and  is a prefix of , then .

DPEs are equipped with a natural partial order. Namely,  when   and .

Let  be a strong upwards antichain in the DPE poset which doesn't contain the bottom DPE (i.e. the DPE with ). Then, it naturally induces an infra-POMDP. Specifically: 

  • The state space is .
  • The initial infradistribution is .
  • The observation mapping is , where  is the empty history.
  • The transition infrakernel is , where

If  is non-empty for all  and , this is a learnable undogmatic ontology.

Any  yields an example . Namely,  iff  and for any  it holds that:

  1. If  then for any .

I think that for any continuous some non-trivial hidden reward functions over such an ontology, the class of communicating RUMDPs is learnable. If the hidden reward function doesn't depend on the action argument, it's equivalent to some instrumental reward function.

comment by Vanessa Kosoy (vanessa-kosoy) · 2020-08-31T19:40:25.308Z · LW(p) · GW(p)

A summary of my current breakdown of the problem of traps into subproblems and possible paths to solutions. Those subproblems are different but related. Therefore, it is desirable to not only solve each separately, but also to have an elegant synthesis of the solutions.

Problem 1: In the presence of traps, Bayes-optimality becomes NP-hard even on the weakly feasible level (i.e. using the number of states, actions and hypotheses as security parameters).

Currently I only have speculations about the solution. But, I have a few desiderata for it:

Desideratum 1a: The algorithm should guarantee some lower bound on expected utility, compared to what the Bayes-optimal policy gets. We should also have an upper bound for all polynomial time algorithms. The two bounds should not be too far apart.

Desideratum 1b: When it so happens we have no traps, the algorithm should produce asymptotic Bayes optimality with a regret bound close enough to optimal. When there are only "small" traps, the penalty should be proportional.

Problem 2: In the presence of traps, there is no "frequentist" guarantee (regret bound). We can divide it into subproblems according to different motivations for having such a guarantee in the first place.

Problem 2a: We want such a guarantee as a certificate of safety.

Solution: Require a subjective [AF(p) · GW(p)] regret [AF(p) · GW(p)] bound instead.

Problem 2b: The guarantee is motivated by an "evolutionary" perspective on intelligence: intelligent agents are agents that are successful in the real world, not just on average over all possible worlds.

Solution: Bootstrapping from a safe baseline policy [AF(p) · GW(p)]. For an individual human, the baseline comes from knowledge learned from other people. For human civilization, some of the baseline comes from inborn instincts. For human civilization and evolution both, the baseline comes from locality and thermodynamics: doing random things is unlikely to cause global irreversible damage. For an aligned AI, the baseline comes from imitation learning and quantilization.

Problem 2c: The guarantee is needed to have a notion of "sample complexity", which is such an important concept that it's hard to imagine deconfusion without it. This notion cannot come just from Desideratum 1a since sample complexity should remain non-trivial even given unbounded computational resources.

Solution: A prior consists of a space of hypotheses and a probability measure over this space. We also have a mapping where is the space of environments, which provides semantics to the hypotheses. Bayes-optimizing means Bayes-optimizing the environment . Learnability of means that the Bayesian regret must converge to as goes to . Here is the (normalized to ) value (maximal expected utility) of environment at time discount . Notice that the second term depends only on but the first term depends on and . Therefore, we can ask about the regrets for different decompositions of the same into hypotheses. For some , and s.t. , we can have learnability even when we don't have it for the original decomposition. I think that typically there will be many such decompositions. They live in the convex set surrounding in which the value function becomes affine in the limit. We can say that not all information is learnable, but represents some learnable information. We can then study the regret bound (and thus sample complexity) for a particular or for all possible .
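For readers who want the two-term structure of the Bayesian regret spelled out, here is one way to write it. The symbols are my own choice of notation and should be read as assumptions: a hypothesis space $\mathcal{H}$, a probability measure $\zeta$ on it, a semantics map $\mu : \mathcal{H} \to \mathcal{E}$ into environments, and $V(e, \gamma)$ for the normalized value of environment $e$ at time discount $\gamma$. The second term uses the fact that the expected utility of a Bayes-optimal policy equals the value of the mixture environment.

```latex
% Bayesian regret of the decomposition (\zeta, \mu) at time discount \gamma
\mathrm{Reg}(\zeta, \mu; \gamma)
  \;=\; \mathbb{E}_{h \sim \zeta}\bigl[ V(\mu(h), \gamma) \bigr]
  \;-\; V\bigl( \mathbb{E}_{h \sim \zeta}[\mu(h)],\, \gamma \bigr)

% Learnability of the decomposition:
\lim_{\gamma \to 1} \mathrm{Reg}(\zeta, \mu; \gamma) \;=\; 0
```

Since the value is a maximum over policies of a quantity that is linear in the environment, it is convex in the environment, so the regret is non-negative by Jensen's inequality. It vanishes in the limit exactly when the value becomes affine on the relevant mixture, which is the sense in which learnable decompositions live where the value function "becomes affine in the limit".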

comment by Vanessa Kosoy (vanessa-kosoy) · 2019-11-08T18:14:31.729Z · LW(p) · GW(p)

It seems useful to consider agents that reason in terms of an unobservable ontology, and may have uncertainty over what this ontology is. In particular, in Dialogic RL [AF(p) · GW(p)], the user's preferences are probably defined w.r.t. an ontology that is unobservable by the AI (and probably unobservable by the user too) which the AI has to learn (and the user is probably uncertain about emself). However, ontologies are more naturally thought of as objects in a category than as elements in a set. The formalization of an "ontology" should probably be a POMDP or a suitable Bayesian network. A POMDP involves an arbitrary set of states, so it's not an element in a set, and the class of POMDPs can be naturally made into a category. Therefore, there is a need for defining the notion of a probability measure over a category. Of course we can avoid this by enumerating the states, considering the set of all possible POMDPs w.r.t. this enumeration and then requiring the probability measure to be invariant w.r.t. state relabeling. However, the category theoretic point of view seems more natural, so it might be worth fleshing out.

Ordinary probability measures are defined on measurable spaces. So, first we need to define the analogue of "measurable structure" (σ-algebra) for categories. Fix a category . Denote the category of measurable spaces. A measurable structure on is then specified by providing a Grothendieck fibration and an equivalence . Here, stands for the essential fiber of over the one point space . The intended interpretation of is, the category of families of objects in indexed by measurable spaces. The functor is supposed to extract the base (index space) of the family. We impose the following conditions on and :

Given , and , we denote the corresponding base change by ( and is canonically isomorphic to ).

  • Consider and . Consider also a point . We can think of as a morphism . This allows us to consider the base changes and (the "fibers" of at and at respectively) where . Applying the universal property of to and , we get morphisms . We now require that, if for any , then (morphisms between families that are pointwise equal are just equal).

  • Consider and . Suppose that (i) is an isomorphism and (ii) for any , is an isomorphism. Then, is an isomorphism (families with a common base that are pointwise isomorphic are just isomorphic).

I'm not entirely sure how sufficient or necessary these conditions are for proving useful results, but they seem to me natural at first glance. Note that this definition can be regarded as motivated by the Yoneda lemma: a measurable space is defined by the measurable mappings to from other measurable spaces, so a "measurable category" should be defined by the measurable "mappings" to it from measurable spaces, and is precisely the category of such measurable "mappings". Compare this with the definition of geometric stacks (fn1).

Next, we define probability measures. Specifically, for any "measurable category" (a category equipped with structure as above), we construct the category of "probability measures on ". First, we define the auxiliary category . An object in is a pair where is an object in and is a probability measure on . We interpret this as sampling from and then taking (using , the latter can be considered to be an object in ). We define the morphisms from to as those morphisms for which (the notation stands for pushforward). Given , we call it a "quasi-isomorphism" when, for any , is an isomorphism. Claim: quasi-isomorphisms admit a calculus of right fractions(fn2). We now define as the localization of by quasi-isomorphisms.

(fn1) Maybe the analogy with stacks should be made more formal? Not sure, stacks are motivated by topology and measurable spaces are not topological...

(fn2) This should clearly be right, and this is right for natural examples, but I haven't written down the proof. If it turns out to be false it would mean that my conditions on are too weak.