## Posts

## Comments

**vanessa-kosoy** on [Fiction] Lena (MMAcevedo) · 2021-02-24T18:16:11.266Z · LW · GW

Btw the name of the story is a reference to the Lena image.

**vanessa-kosoy** on Formal Solution to the Inner Alignment Problem · 2021-02-21T23:46:48.734Z · LW · GW

It is worth noting that this approach doesn't deal with non-Cartesian daemons, i.e. malign hypotheses that attack via the *physical side effects* of computations rather than their output. Homomorphic encryption is still the best solution I know to these.

**vanessa-kosoy** on Formal Solution to the Inner Alignment Problem · 2021-02-21T22:37:36.999Z · LW · GW

Slide 8 actually points towards a way to use imitation learning to hopefully make a competitive AI: IDA. Yet in this case, I'm not sure that your result implies safety. For IDA isn't a one-shot imitation learning problem; it's many successive imitation learning problems. Even if you limit the drift for one step of imitation learning, the model could drift further and further at each distillation step.

I don't think this is a lethal problem. The setting is not one-shot, it's imitation over some duration of time. IDA just increases the effective duration of time, so you only need to tune how cautious the learning is (which I think is controlled by a parameter in this work) accordingly: there is a cost, but it's bounded. You also need to deal with non-realizability (after enough amplifications the system is too complex for exact simulation, even if it wasn't to begin with), but this should be doable using infra-Bayesianism (I already have some notion of how that would work). Another problem with imitation-based IDA is that external unaligned AI might leak into the system, either from the future or from counterfactual scenarios in which such an AI is instantiated. This is not an issue with amplifying by parallelism (like in the presentation), but at the cost of requiring parallelizability.

**vanessa-kosoy** on Formal Solution to the Inner Alignment Problem · 2021-02-20T12:12:06.214Z · LW · GW

Alpha only needs to be set based on a guess about what the prior on the truth is. It doesn't need to be set based on guesses about possibly countably many traps of varying advisor-probability.

Hmm, yes, I think the difference comes from imitation vs. RL. In your setting, you only care about producing a good imitation of the advisor. On the other hand in my settings, I want to achieve near-optimal performance (which the advisor doesn't achieve). So I need stronger assumptions.

I'm not sure I understand whether you were saying that the ratio of probabilities that the advisor vs. agent takes an unsafe action can indeed be bounded in DL(I)RL.

Well, in DLIRL the probability that the advisor takes an unsafe action on any given round is bounded by roughly , whereas the probability that the agent takes an unsafe action over a duration of is bounded by roughly , so it's not a ratio but there is some relationship. I'm sure you can derive some relationship in DLRL too, but I haven't studied it (like I said, I only worked out the details when the advisor *never* takes unsafe actions).

**vanessa-kosoy** on Vanessa Kosoy's Shortform · 2021-02-19T18:23:19.455Z · LW · GW

I propose to call *metacosmology* the hypothetical field of study which would be concerned with the following questions:

- Studying the space of simple mathematical laws which produce counterfactual universes with intelligent life.
- Studying the distribution over utility-function-space (and, more generally, mindspace) of those counterfactual minds.
- Studying the distribution of the amount of resources available to the counterfactual civilizations, and broad features of their development trajectories.
- Using all of the above to produce a distribution over concretized simulation hypotheses.

This concept is of potential interest for several reasons:

- It can be beneficial to actually research metacosmology, in order to draw practical conclusions. However, knowledge of metacosmology can pose an infohazard, and we would need to precommit not to accept blackmail from potential simulators.
- The metacosmology knowledge of a superintelligent AI determines the extent to which it poses risk via the influence of potential simulators.
- In principle, we might be able to use knowledge of metacosmology in order to engineer an "atheist prior" for the AI that would exclude simulation hypotheses. However, this might be very difficult in practice.

**vanessa-kosoy** on Formal Solution to the Inner Alignment Problem · 2021-02-19T17:41:57.195Z · LW · GW

Okay, then I assume the agent's models of the advisor are not exclusively deterministic either?

Of course. I assume realizability, so one of the hypothesis is the true advisor behavior, which is stochastic.

What I care most about is the ratio of probabilities that the advisor vs. agent takes the unsafe action, where we don't know as programmers (so the agent doesn't get told at the beginning) any bounds on what these advisor-probabilities are. Can this modification be recast to have that property? Or does it already?

In order to achieve the optimal regret bound, you do need to know the values of and . In DLIRL, you need to know . However, AFAIU your algorithm also depends on some parameter ()? In principle, if you don't know anything about the parameters, you can set them to be some function of the time discount s.t. as the bound becomes true and the regret still goes to . In DLRL, this requires , in DLIRL . However, then you only know that regret vanishes at a certain asymptotic rate, without having a quantitative bound.

**vanessa-kosoy** on Formal Solution to the Inner Alignment Problem · 2021-02-18T23:53:55.000Z · LW · GW

I think this is completely unfair. The inner alignment problem exists even for perfect Bayesians, and solving it in that setting contributes much to our understanding. The fact that we don't have satisfactory mathematical models of deep learning performance is a different problem, which is broader than inner alignment and to first approximation orthogonal to it. Ideally, we will solve this second problem by improving our mathematical understanding of deep learning and/or other competitive ML algorithms. The latter effort is already underway by researchers unrelated to AI safety, with some results. Moreover, we can in principle come up with heuristics for applying this method of solving inner alignment (which I call "confidence thresholds" in my own work) to deep learning: e.g. use NNGP to measure confidence, or use an evolutionary algorithm with a population of networks and check how well they agree with each other. Of course, if we do this we won't have formal guarantees that it will work but, like I said, this is a broader issue than inner alignment.
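A minimal sketch of the ensemble-disagreement heuristic mentioned here, with stand-in "models" (plain functions) instead of real networks; the threshold value and the max-minus-min disagreement measure are arbitrary choices of mine, not from the original:

```python
# Hedged sketch of a "confidence threshold": query an ensemble of models;
# if their predicted probabilities disagree too much, abstain (defer)
# instead of acting. The ensemble members here are toy stand-in functions.

def ensemble_predict(models, x):
    """Return each member's probability that x is in the positive class."""
    return [m(x) for m in models]

def act_or_defer(models, x, threshold=0.2):
    """Act on the ensemble mean only when members agree within `threshold`."""
    probs = ensemble_predict(models, x)
    disagreement = max(probs) - min(probs)
    if disagreement > threshold:
        return ("defer", None)
    return ("act", sum(probs) / len(probs))

# Toy ensemble: three "models" that agree on familiar inputs (x >= 0)
# but disagree on unfamiliar ones (x < 0).
models = [
    lambda x: 0.9 if x >= 0 else 0.1,
    lambda x: 0.85 if x >= 0 else 0.5,
    lambda x: 0.95 if x >= 0 else 0.9,
]

decision_familiar = act_or_defer(models, 1.0)   # members agree -> act
decision_novel = act_or_defer(models, -1.0)     # members disagree -> defer
```

The same pattern works whether "disagreement" is measured across an explicit population of networks or across samples from an NNGP posterior.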

**vanessa-kosoy** on Formal Solution to the Inner Alignment Problem · 2021-02-18T23:37:37.086Z · LW · GW

Yes to the first question. In the DLRL paper I assume the advisor takes unsafe actions with probability exactly . However, it is straightforward to generalize the result s.t. the advisor can take unsafe actions with probability , where is the lower bound for the probability to take an optimal action (Definition 8). Moreover, in DLIRL (which, I believe, is closer to your setting) I use a "soft" assumption (see Definition 3 there) that doesn't require any probability to vanish entirely.

No to the second question. In neither setting is the advisor assumed to be deterministic.

**vanessa-kosoy** on Formal Solution to the Inner Alignment Problem · 2021-02-18T18:10:47.942Z · LW · GW

The way to avoid outputting predictions that may have been corrupted by a mesa-optimizer is to ask for help when plausible stochastic models disagree about probabilities.

This is exactly the method used in my paper about delegative RL and also an earlier essay that doesn't assume finite MDPs or access to rewards but has stronger assumptions about the advisor (so it's essentially doing imitation learning + denoising). I pointed out the connection to mesa-optimizers (which I called "daemons") in another essay.

**vanessa-kosoy** on Stuart_Armstrong's Shortform · 2021-02-16T23:44:30.663Z · LW · GW

This is a special case of a crisp infradistribution: is equivalent to , a linear equation in , so the set of all 's satisfying it is convex closed.

**vanessa-kosoy** on Why I Am Not in Charge · 2021-02-08T22:54:42.137Z · LW · GW

In my model, that’s not how someone in her position thinks at all. She has no coherent utility function. She doesn’t have one because, to the extent she ever did have one, it was trained out of her long ago, by people who were rewarding lack of utility functions and punishing those who had coherent utility functions with terms for useful things. The systems and people around her kept rewarding instinctive actions and systems, and punishing intentional actions and goals.

IMO a more accurate model is: such people do have a utility function, but how to use your brain's CPU cycles is part of your strategy. If you're in an environment where solving complex politics is essential to survival, you will spend all your cycles on solving complex politics. Moreover, if your environment gives you little slack then you have to do it myopically because there's no time for long-term planning while you're parrying the next sword thrust. At some point you don't have enough free cycles to re-evaluate your strategy of using cycles, and then you'll keep doing this even if it's no longer beneficial.

**vanessa-kosoy** on Deepmind has made a general inductor ("Making sense of sensory input") · 2021-02-02T10:13:34.404Z · LW · GW

My impression is that it's interesting because it's good at some functions that deep learning is bad at (although unfortunately the paper doesn't make any head-to-head comparisons), but certainly there are a lot of things in which transformers would beat it. In particular, I would be very surprised if it could reproduce GPT-3 or DALL-E. So, if this leads to a major breakthrough, it would probably be through merging it with deep learning somehow.

**vanessa-kosoy** on Vanessa Kosoy's Shortform · 2021-02-01T12:23:52.618Z · LW · GW

I find it interesting to build simple toy models of the human utility function. In particular, I was thinking about the aggregation of value associated with other people. In utilitarianism this question is known as "population ethics" and is infamously plagued with paradoxes. However, I believe that this is the result of trying to be impartial. Humans are very partial, and this allows for coherent ways of aggregation. Here is my toy model:

Let Alice be our viewpoint human. Consider all social interactions Alice has, categorized by some types or properties, and assign a numerical weight to each type of interaction. Let be the weight of the interaction person had with person at time (if there was no interaction at this time then ). Then, we can define Alice's *affinity* to Bob as

Here is some constant. Ofc can be replaced by many other functions.

Now, we can then define the *social distance* of Alice to Bob as

Here is some constant, and the power law was chosen rather arbitrarily, there are many functions of that can work. Dead people should probably count in the infimum, but their influence wanes over time since they don't interact with anyone (unless we count consciously thinking about a person as an interaction, which we might).

This is a time-dependent metric (or quasimetric, if we allow for asymmetric interactions such as thinking about someone or admiring someone from afar) on the set of people. If is bounded and there is a bounded number of people Alice can interact with at any given time, then there is some s.t. the number of people within distance from Alice is . We now define the reward as

Here is some constant and is the "welfare" of person at time , or whatever is the source of value of people for Alice. Finally, the utility function is a time discounted sum of rewards, probably not geometric (because hyperbolic discounting is a thing). It is also appealing to make the decision rule to be minimax-regret over all sufficiently long time discount parameters, but this is tangential.

Notice how the utility function is automatically finite and bounded, and none of the weird paradoxes of population ethics and infinitary ethics crop up, even if there is an infinite number of people in the universe. I like to visualize people-space as a tiling of hyperbolic space, with Alice standing in the center of a Poincaré or Beltrami–Klein model of it. Alice's "measure of caring" is then proportional to volume in the *model* (this probably doesn't correspond to exactly the same formula, but it's qualitatively right, and the formula is only qualitative anyway).
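Since the formulas above lost their symbols in transcription, here is a hedged numerical sketch of the toy model; every functional form below (exponential decay for affinity, reciprocal for distance, inverse-power weighting for reward) is a placeholder choice of mine, intended only to be qualitatively faithful:

```python
import math

# Toy model: affinity = time-discounted sum of interaction weights;
# social distance = a decreasing function of affinity; reward = sum of
# each person's welfare, weighted down by a power of their distance.
# All functional forms and constants here are illustrative placeholders.

def affinity(weights, t_now, decay=0.1):
    """Time-discounted sum of interaction weights {time: weight}."""
    return sum(w * math.exp(-decay * (t_now - t))
               for t, w in weights.items() if t <= t_now)

def social_distance(aff, eps=1e-9):
    """Closer friends (higher affinity) are 'nearer'."""
    return 1.0 / (aff + eps)

def reward(people, t_now, alpha=2.0):
    """Sum of welfare, weighted by distance^(-alpha); always finite."""
    total = 0.0
    for interactions, welfare in people:
        d = social_distance(affinity(interactions, t_now))
        total += welfare / (d ** alpha)
    return total

# Alice interacts repeatedly with Bob, once (weakly) with Carol.
bob = ({0: 1.0, 1: 1.0, 2: 1.0}, 1.0)    # (interactions, welfare)
carol = ({0: 0.1}, 1.0)

bob_term = reward([bob], t_now=2)
carol_term = reward([carol], t_now=2)
r = reward([bob, carol], t_now=2)
```

With bounded interaction weights, each person's contribution is bounded and the weighting falls off with distance, which is the mechanism behind the finiteness claim above.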

**vanessa-kosoy** on Belief Functions And Decision Theory · 2021-01-31T19:42:12.665Z · LW · GW

I'm not sure I understood the question, but the infra-Bayesian update is *not* equivalent to updating every distribution in the convex set of distributions. In fact, updating a crisp infra-distribution (i.e. one that can be described as a convex set of distributions) in general produces an infra-distribution that is not crisp (i.e. you need sa-measures to describe it or use the Legendre dual view).

**vanessa-kosoy** on Simulacrum 3 As Stag-Hunt Strategy · 2021-01-27T00:27:47.527Z · LW · GW

I think that all of ethics works like this: we pretend to be more altruistic / intrinsically pro-social than we actually are, even to ourselves. And then there are situations like battle of the sexes, where we negotiate the Nash equilibrium while pretending it is a debate about something objective that we call "morality".

**vanessa-kosoy** on Vanessa Kosoy's Shortform · 2021-01-24T18:10:58.002Z · LW · GW

Actually the Schwartz–Zippel algorithm can easily be adapted to this case (just imagine that types are variables over , and start from testing the identity of the types appearing inside parentheses), so we *can* validate expressions in randomized polynomial time (and, given standard conjectures, in deterministic polynomial time as well).
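For concreteness, here is a generic sketch of the Schwartz–Zippel-style randomized identity test being invoked, applied to ordinary polynomial expressions rather than type expressions; the prime and trial count are arbitrary choices of mine:

```python
import random

# Randomized polynomial identity testing: to test whether two expressions
# denote the same polynomial, evaluate both at a random point modulo a
# large prime. By the Schwartz-Zippel lemma, if the polynomials differ,
# a single random point exposes the difference with probability
# at least 1 - deg/P.

P = 2**61 - 1  # a large prime

def identical(expr_a, expr_b, num_vars, trials=20):
    for _ in range(trials):
        point = [random.randrange(P) for _ in range(num_vars)]
        if expr_a(point) % P != expr_b(point) % P:
            return False  # definitely different polynomials
    return True  # identical, except with negligible probability

# (x + y)^2 vs its expansion: the same polynomial written two ways,
# mirroring the "opening all parentheses" blow-up mentioned above --
# the test never needs to expand anything.
lhs = lambda v: (v[0] + v[1]) ** 2
rhs = lambda v: v[0] ** 2 + 2 * v[0] * v[1] + v[1] ** 2

same = identical(lhs, rhs, num_vars=2)
diff = identical(lhs, lambda v: v[0] ** 2 + v[1] ** 2, num_vars=2)
```

The point of the adaptation is exactly this: identity can be certified (probabilistically) in time polynomial in the *expression* size, even when the expanded form is exponentially long.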

**vanessa-kosoy** on Vanessa Kosoy's Shortform · 2021-01-23T17:05:18.119Z · LW · GW

Let's also explicitly describe 0th order and 1st order infra-Bayesian logic (although they should be fragments of the higher-order version).

**0-th order**

*Syntax*

Let be the set of propositional variables. We define the language :

- Any is also in
- Given ,
- Given ,

Notice there's no negation or implication. We define the set of judgements . We write judgements as (" in the context of "). A theory is a subset of .

*Semantics*

Given , a model of consists of a compact Polish space and a mapping . The latter is required to satisfy:

- . Here, we define of infradistributions as intersection of the corresponding sets
- . Here, we define of infradistributions as convex hull of the corresponding sets
- For any ,

**1-st order**

*Syntax*

We define the language using the usual syntax of 1-st order logic, where the allowed operators are , and the quantifiers and . Variables are labeled by types from some set . For simplicity, we assume no constants, but it is easy to introduce them. For any sequence of variables , we denote the set of formulae whose free variables are a subset of . We define the set of judgements .

*Semantics*

Given , a model of consists of

- For every , a compact Polish space
- For every where have types , an element of

It must satisfy the following:

- Consider variables of types and variables of types . Consider also some s.t. . Given , we can form the substitution . We also have a mapping given by . We require
- Consider variables and . Denote the projection mapping. We require
- Consider variables and . Denote the projection mapping. We require that if and only if, for all s.t. ,
- For any ,

**vanessa-kosoy** on Meetup Organizers, Our Virtual Garden is at Your Disposal · 2021-01-20T20:33:22.738Z · LW · GW

Are EA meetups welcome?

**vanessa-kosoy** on Vanessa Kosoy's Shortform · 2021-01-16T12:03:13.489Z · LW · GW

When using infra-Bayesian logic to define a simplicity prior, it is natural to use "axiom circuits" rather than plain formulae. That is, when we write the axioms defining our hypothesis, we are allowed to introduce "shorthand" symbols for repeating terms. This doesn't affect the expressiveness, but it does affect the description length. Indeed, eliminating all the shorthand symbols can increase the length *exponentially*.
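A toy illustration of the exponential gap: defining x_k as "x_{k-1} AND x_{k-1}" via shorthand symbols takes k+1 definition lines, while the fully expanded formula has 2^k leaves. The naming here is illustrative, not from the original:

```python
# Shorthand ("axiom circuit") description length vs fully expanded
# formula size, for the chain of definitions x_i := x_{i-1} AND x_{i-1}.

def circuit_size(k):
    """Number of definition lines when shorthand symbols are allowed."""
    return k + 1  # one line per x_0 ... x_k

def expanded_size(k):
    """Leaf count of x_k after all shorthands are eliminated."""
    size = 1  # x_0 is a single atom
    for _ in range(k):
        size = 2 * size  # each definition duplicates its subterm
    return size

sizes = [(circuit_size(k), expanded_size(k)) for k in (1, 10, 20)]
```

So a simplicity prior defined over plain formulae would penalize such hypotheses exponentially more than one defined over circuits.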

**vanessa-kosoy** on Vanessa Kosoy's Shortform · 2021-01-16T11:54:38.406Z · LW · GW

Instead of introducing all the "algebrator" logical symbols, we can define as the quotient by the equivalence relation defined by the algebraic laws. We then need only two extra logical atomic terms:

- For any and (permutation), denote and require
- For any and ,

However, if we do this then it's not clear whether deciding that an expression is a well-formed term can be done in polynomial time. Because, to check that the types match, we need to test the identity of algebraic expressions and opening all parentheses might result in something exponentially long.

**vanessa-kosoy** on Vanessa Kosoy's Shortform · 2021-01-16T00:01:52.812Z · LW · GW

Infra-Bayesianism can be naturally understood as semantics for a certain non-classical logic. This promises an elegant synthesis between deductive/symbolic reasoning and inductive/intuitive reasoning, with several possible applications. Specifically, here we will explain how this can work for *higher-order* logic. There might be holes and/or redundancies in the precise definitions given here, but I'm quite confident the overall idea is sound.

For simplicity, we will only work with crisp infradistributions, although a lot of this stuff can work for more general types of infradistributions as well. Therefore, will denote the space of crisp infradistributions. Given , will denote the corresponding convex set. As opposed to previously, we *will* include the empty set, i.e. there is s.t. . Given and , will mean . Given , will mean .

**Syntax**

Let denote a set which we interpret as the types of individuals (we allow more than one). We then recursively define the full set of types by:

- (intended meaning: the uninhabited type)
- (intended meaning: the one element type)
- If then
- If then (intended meaning: disjoint union)
- If then (intended meaning: Cartesian product)
- If then (intended meaning: predicates with argument of type )

For each , there is a set which we interpret as atomic terms of type . We will denote . Among those we distinguish the *logical* atomic terms:

- Symbols we will not list explicitly, that correspond to the algebraic properties of and (commutativity, associativity, distributivity and the neutrality of and ). For example, given there is a "commutator" of type .
- (intended meaning: predicate evaluation)
- Assume that for each there is some : the set of "describable" infradistributions (for example, it can be empty, or consist of all distributions with rational coefficients, or all distributions, or all infradistributions;
**EDIT**: it is probably sufficient to only have the fair coin distribution in in order for it to be possible to approximate all infradistributions on finite sets). If then

We recursively define the set of all terms . We denote .

- If then
- If and then
- If and then
- If then
- If and then

Elements of are called formulae. Elements of are called sentences. A subset of is called a theory.

**Semantics**

Given , a *model* of is the following data. To each , there must correspond some compact Polish space s.t.:

- (the one point space)

To each , there must correspond a continuous mapping , under the following constraints:

- , , and the "algebrators" have to correspond to the obvious mappings.
- . Here, is the diagonal and is the sharp infradistribution corresponding to the closed set .
- Consider and denote . Then, . Here, we use the observation that the identity mapping can be regarded as an infrakernel from to .
- is the convex hull of
- Consider and denote , and the projection mapping. Then, .
- . Notice that pullback of infradistributions is always defined thanks to adding (the empty set infradistribution).

Finally, for each , we require .

**Semantic Consequence**

Given , we say when . We say when for any model of , . It is now interesting to ask what is the computational complexity of deciding . [**EDIT**: My current best guess is co-RE]

**Applications**

As usual, let be a finite set of actions and be a finite set of observations. Require that for each there is which we interpret as the type of states producing observation . Denote (the type of all states). Moreover, require that our language has the nonlogical symbols (the initial state) and, for each , (the transition kernel). Then, every model defines a (pseudocausal) infra-POMDP. This way we can use symbolic expressions to define infra-Bayesian RL hypotheses. It is then tempting to study the control theoretic and learning theoretic properties of those hypotheses. Moreover, it is natural to introduce a prior which weights those hypotheses by length, analogous to the Solomonoff prior. This leads to some sort of bounded infra-Bayesian algorithmic information theory and a bounded infra-Bayesian analogue of AIXI.

**vanessa-kosoy** on Vanessa Kosoy's Shortform · 2021-01-11T16:44:01.117Z · LW · GW

re: #5, that doesn't seem to claim that we can infer U given their actions, which is what the impossibility of deducing preferences is actually claiming.

You misunderstand the intent. We're talking about inverse reinforcement learning. The goal is not necessarily inferring the unknown , but producing some behavior that optimizes the unknown . Ofc if the policy you're observing is optimal then it's trivial to do so by following the same policy. But, using my approach we might be able to extend it into results like "the policy you're observing is optimal w.r.t. certain computational complexity, and your goal is to produce an optimal policy w.r.t. higher computational complexity."

(Btw I think the formal statement I gave for 5 is false, but there might be an alternative version that works.)

(And as pointed out elsewhere, it isn't Stuart's thesis, it's a well known and basic result in the decision theory / economics / philosophy literature.)

I am referring to this and related work by Armstrong.

**vanessa-kosoy** on Launching Forecast, a community for crowdsourced predictions from Facebook · 2021-01-11T09:37:52.196Z · LW · GW

Hmm, what is the reasoning? Assuming it's a project that's directly related to core interests of the community?

**vanessa-kosoy** on Launching Forecast, a community for crowdsourced predictions from Facebook · 2021-01-11T07:53:43.661Z · LW · GW

IMO this post should be on the front page

**vanessa-kosoy** on Vanessa Kosoy's Shortform · 2020-12-28T07:15:40.010Z · LW · GW

Ah, okay, I see what you mean. Like how preferences are divisible into "selfish" and "worldly" components, where the selfish component is what's impacted by a future simulation of you that is about to have good things happen to it.

...I brought up the histories->states thing because I didn't understand what you were getting at, so I was concerned that something unrealistic was going on. For example, if you assume that the agent can remember its history, how can you possibly handle an environment with memory-wiping?

AMDP is only a toy model that distills the core difficulty into more or less the simplest non-trivial framework. The rewards are "selfish": there is a reward function which allows assigning utilities to histories by time discounted summation, and we consider the expected utility of a random robot sampled from a late population. And, there is no memory wiping. To describe memory wiping we indeed need to do the "unrolling" you suggested. (Notice that from the cybernetic model POV, the history is only the remembered history.)

For a more complete framework, we can use an ontology chain, but (i) instead of labels use labels, where is the set of possible memory states (a policy is then described by ), to allow for agents that don't fully trust their memory (ii) consider another chain with a bigger state space plus a mapping s.t. the transition kernels are compatible. Here, the semantics of is: the multiset of ontological states resulting from interpreting the physical state by taking the viewpoints of different agents contains.

In fact, to me the example is still somewhat murky, because you're talking about the subjective probability of a state given a policy and a timestep, but if the agents know their histories there is no actual agent in the information-state that corresponds to having those probabilities.

I didn't understand "no actual agent in the information-state that corresponds to having those probabilities". What does it mean to have an agent in the information-state?

**vanessa-kosoy** on Vanessa Kosoy's Shortform · 2020-12-27T18:17:30.439Z · LW · GW

I'm not quite sure what you are trying to say here; probably my explanation of the framework was lacking. The robots already remember the history, like in classical RL. The question about the histories is perfectly well-defined. In other words, we are already implicitly doing what you described. It's like in classical RL theory: when you're proving a regret bound or whatever, your probability space consists of histories.

I'm still confused about what you mean by "Bayesian hypothesis" though. Do you mean a hypothesis that takes the form of a non-anthropic MDP?

Yes, or a classical RL environment. Ofc if we allow infinite state spaces, then any environment can be regarded as an MDP (whose states are histories). That is, I'm talking about hypotheses which conform to the classical "cybernetic agent model". If you wish, we can call it "Bayesian cybernetic hypothesis".

Also, I want to clarify something I was myself confused about in the previous comment. For an anthropic Markov chain (when there is only one action) with a finite number of states, we *can* give a Bayesian cybernetic description, but for a general anthropic MDP we cannot even if the number of states is finite.

Indeed, consider an anthropic Markov chain over a finite state set $S$. Taking expected values of the population dynamics, we get a matrix $M \in \mathbb{R}_{\geq 0}^{S \times S}$, where $M_{st}$ is the expected number of $t$-state successors of an $s$-state robot. Assuming the chain is communicating, $M$ is an irreducible non-negative matrix, so by the Perron–Frobenius theorem it has a maximal eigenvalue $\lambda$ with a unique-up-to-scalar eigenvector $\eta$. We then get the subjective transition kernel:

$$K(t \mid s) = \frac{M_{st}\,\eta_t}{\lambda\,\eta_s}$$

Now, consider the following example of an AMDP. There are three actions $\{a, b, c\}$ and two states $\{s_0, s_1\}$. When we apply $a$ to an $s_0$ robot, it creates two $s_0$ robots, whereas when we apply $a$ to an $s_1$ robot, it leaves one $s_1$ robot. When we apply $b$ to an $s_1$ robot, it creates two $s_1$ robots, whereas when we apply $b$ to an $s_0$ robot, it leaves one $s_0$ robot. When we apply $c$ to any robot, it results in one robot whose state is $s_0$ with probability $\frac{1}{2}$ and $s_1$ with probability $\frac{1}{2}$.

Consider the following two policies. $\pi_a$ takes the sequence of actions $c, a$ and $\pi_b$ takes the sequence of actions $c, b$. A population that follows $\pi_a$ would experience the subjective probability $\frac{2}{3}$ of having been in state $s_0$ after the first action, whereas a population that follows $\pi_b$ would experience the subjective probability $\frac{1}{3}$. Hence, *subjective probabilities depend on future actions*. So, effectively anthropics produces an acausal (Newcomb-like) environment. And, we already know such environments are learnable by infra-Bayesian RL agents and (most probably) not learnable by Bayesian RL agents.
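The action-dependence of the subjective probabilities can be checked by brute force. Below is a sketch that tracks a weighted population of histories; the state/action labels are my own reconstruction, since the original symbols were lost in transcription:

```python
from fractions import Fraction

# Anthropic MDP with states s0/s1 and actions:
#   a: duplicates s0 robots, leaves s1 robots alone
#   b: duplicates s1 robots, leaves s0 robots alone
#   c: one successor robot, state uniformly random
# A population is a dict: history tuple -> weight (exact Fractions).

def step(pop, action):
    new = {}
    def add(h, w):
        new[h] = new.get(h, Fraction(0)) + w
    for hist, w in pop.items():
        s = hist[-1]
        if action == "a":
            add(hist + (s,), w * (2 if s == "s0" else 1))
        elif action == "b":
            add(hist + (s,), w * (2 if s == "s1" else 1))
        elif action == "c":
            add(hist + ("s0",), w / 2)
            add(hist + ("s1",), w / 2)
    return new

def subjective_prob_s0_after_first_action(actions):
    """Fraction of the final population that remembers s0 right after action 1."""
    pop = {("s0",): Fraction(1)}
    for act in actions:
        pop = step(pop, act)
    total = sum(pop.values())
    mass = sum(w for h, w in pop.items() if h[1] == "s0")
    return mass / total

p_a = subjective_prob_s0_after_first_action(["c", "a"])  # s0 branch doubled
p_b = subjective_prob_s0_after_first_action(["c", "b"])  # s1 branch doubled
```

The first action is identical under both policies, yet the probability a randomly sampled robot assigns to its own past differs, which is the Newcomb-like feature.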

**vanessa-kosoy** on Vanessa Kosoy's Shortform · 2020-12-26T16:31:31.278Z · LW · GW

I'm not sure what do you mean by that "unrolling". Can you write a mathematical definition?

Let's consider a simple example. There are two states: $s_0$ and $s_1$. There is just one action, so we can ignore it. $s_0$ is the initial state. An $s_0$ robot transitions into an $s_1$ robot. An $s_1$ robot transitions into an $s_0$ robot *and* an $s_1$ robot. How will our population look?

0th step: all robots remember $s_0$

1st step: all robots remember $s_0 s_1$

2nd step: 1/2 of robots remember $s_0 s_1 s_0$ and 1/2 of robots remember $s_0 s_1 s_1$

3rd step: 1/3 of robots remember $s_0 s_1 s_0 s_1$, 1/3 of robots remember $s_0 s_1 s_1 s_0$ and 1/3 of robots remember $s_0 s_1 s_1 s_1$

There is no Bayesian hypothesis a robot can have that gives correct predictions both for step 2 and step 3. Indeed, to be consistent with step 2 we must have $\Pr[s_0 \mid s_0 s_1] = \frac{1}{2}$ and $\Pr[s_1 \mid s_0 s_1] = \frac{1}{2}$. But, to be consistent with step 3 we must have $\Pr[s_0 \mid s_0 s_1] = \frac{1}{3}$, $\Pr[s_1 \mid s_0 s_1] = \frac{2}{3}$.

In other words, there is no Bayesian hypothesis s.t. we can guarantee that a randomly sampled robot on a sufficiently late time step *will have learned this hypothesis with high probability*. The apparent transition probabilities keep shifting s.t. it might always continue to seem that the world is complicated enough to prevent our robot from having learned it already.

Or, at least it's not obvious there is such a hypothesis. In this example, the apparent transition probabilities converge at late steps: the population sizes follow the Fibonacci recursion, so the ratio of $s_1$ robots to $s_0$ robots converges to the golden ratio. But, do all probabilities converge fast enough for learning to happen, in general? I don't know, maybe for finite state spaces it can work. Would definitely be interesting to check.

[EDIT: actually, in this example there is such a hypothesis but in general there isn't, see below]
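The population dynamics of this example can be verified with a short simulation (the state labels s0/s1 are mine; the original symbols were lost in transcription):

```python
# Two-state anthropic chain: an s0 robot becomes one s1 robot; an s1 robot
# becomes an s0 robot AND an s1 robot. We track full histories and check
# that the "apparent" transition probability out of the history s0,s1
# shifts between sampling times, so no single Bayesian hypothesis fits.

def evolve(histories):
    out = []
    for h in histories:
        if h[-1] == "s0":
            out.append(h + ("s1",))
        else:
            out.append(h + ("s0",))
            out.append(h + ("s1",))
    return out

pop = [("s0",)]
for _ in range(2):
    pop = evolve(pop)
# Step 2: fraction of robots whose third remembered state is s0
frac_step2 = sum(1 for h in pop if h[2] == "s0") / len(pop)

pop3 = evolve(pop)
# Step 3: among step-3 robots, fraction whose third state was s0
frac_step3 = sum(1 for h in pop3 if h[2] == "s0") / len(pop3)

# Population counts follow the Fibonacci recursion, so the s1:s0 ratio
# tends to the golden ratio (track counts only, to avoid blow-up).
n0 = sum(1 for h in pop3 if h[-1] == "s0")
n1 = sum(1 for h in pop3 if h[-1] == "s1")
for _ in range(40):
    n0, n1 = n1, n0 + n1
golden = n1 / n0
```

So a robot sampled at step 2 sees a 1/2 chance of s0 after the history s0,s1, while a robot sampled at step 3 sees 1/3 for the same history, and the limiting value is different again.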

**vanessa-kosoy** on Operationalizing compatibility with strategy-stealing · 2020-12-25T11:36:23.457Z · LW · GW

Notice that Eliezer's definition + the universal prior is essentially the same as the computationally unbounded variant of my definition of goal-directed intelligence, *if you fix the utility function and prior*. I think that when you're comparing different utility functions in your definition of strategy-stealing, it's better to emulate my definition and correct by the description complexity of the utility function. Indeed, if you have an algorithm that works equally well for all utility functions but contains a specification of the utility function inside its source code, then you get "spurious" variance if you don't correct for it (usually superior algorithms would also have to contain a specification of the utility function, so the more complex the utility function the less of them will be).

**vanessa-kosoy** on Cultural accumulation · 2020-12-06T22:37:30.562Z · LW · GW

Of particular interest would be the theory of computation, because the construction of a mechanical computer might be accomplished much earlier -- although, the construction of suitable clockwork would be required.

If you mean a *universal* mechanical computer (like Babbage's analytical engine) then, as far as I know, none was ever built, because it's actually really hard to have clockwork that good.

**vanessa-kosoy** on The LessWrong 2018 Book is Available for Pre-order · 2020-12-03T12:02:03.467Z · LW · GW

Great, thank you :)

**vanessa-kosoy** on The Parable of Predict-O-Matic · 2020-12-03T11:56:08.150Z · LW · GW

This is a rather clever parable which explains serious AI alignment problems in an entertaining form that doesn't detract from the substance.

**vanessa-kosoy** on Evolution of Modularity · 2020-12-03T11:52:47.522Z · LW · GW

I liked this post for talking about how evolution produces modularity (contrary to what is often said in this community!). This is something I suspected myself but it's nice to see it explained clearly, with backing evidence.

**vanessa-kosoy** on Book Review: The Secret Of Our Success · 2020-12-03T11:41:50.801Z · LW · GW

This is a good review of an intriguing thesis about human culture, with a bunch of neat factoids. The thesis has some major flaws IMO, but it is a useful perspective to know.

**vanessa-kosoy** on Reframing the evolutionary benefit of sex · 2020-12-03T11:33:59.284Z · LW · GW

This post gives an elegant explanation of the evolutionary benefit of sexual reproduction that was new to me, at least. I like it, although I also wish some expert added their thoughts.

**vanessa-kosoy** on Book Review: The Structure Of Scientific Revolutions · 2020-12-03T08:44:20.163Z · LW · GW

This post does a (probably) decent job of summarizing Kuhn's concept of paradigm shifts. I find paradigm shifts a useful way of thinking about the aggregation of evidence in complex domains.

**vanessa-kosoy** on The LessWrong 2018 Book is Available for Pre-order · 2020-12-02T12:06:21.533Z · LW · GW

Great news! However, it seems like there is no shipping to Israel. Is that something that can be fixed?

**vanessa-kosoy**on [Linkpost] AlphaFold: a solution to a 50-year-old grand challenge in biology · 2020-12-01T21:14:52.427Z · LW · GW

Yes, I do know the physics involved on some level, and some about the computational methods.

I think that if deep learning can predict protein folding, then it should eventually be able to predict protein binding as well, since most of the physics is the same: it's just amino acids on two different peptide chains interacting, instead of amino acids on the same chain.

On the other hand, predicting which reaction an enzyme catalyzes involves more physics, so it could be much harder; but then again, maybe it isn't. Or maybe we can at least predict which biomolecules a given protein is likely to react with, and do experimental work to find out the details.

**vanessa-kosoy**on [Linkpost] AlphaFold: a solution to a 50-year-old grand challenge in biology · 2020-11-30T23:01:51.404Z · LW · GW

If you have a protein, and you know it's designed to bind to *something*, but you don't know to what, then maybe running a lot of imprecise simulations (using its folded structure) will allow you to narrow down the list of candidates, and thereby significantly reduce the time and cost of experiments?

(Not an expert, just guessing)

**vanessa-kosoy**on It’s not economically inefficient for a UBI to reduce recipient’s employment · 2020-11-22T19:36:27.943Z · LW · GW

IMO the problem is that reducing incentives to work makes it hard to compute the actual cost of UBI. Naively, if we want to pay each person a UBI of X, all we need to do is multiply X by the size of the population. We can then infer how much it would cost each given taxpayer. But, because of reduced incentives to work, there are additional effects, such as a reduction in tax revenue and an increase in the price of labor (which propagates to other prices). The latter means we don't even know the value this X will have to the recipients.
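As a toy illustration (all numbers made up), the naive estimate is a single multiplication, but even a modest reduction in the number of taxpayers changes the per-taxpayer burden noticeably:

```python
# Toy UBI cost estimate with made-up numbers: the naive figure ignores
# any reduction in the number of taxpayers or in taxable income.
ubi_per_person = 12_000        # hypothetical annual UBI, X
population = 10_000_000
taxpayers = 5_000_000

naive_total = ubi_per_person * population
naive_per_taxpayer = naive_total / taxpayers   # 24000.0

# Suppose the UBI induces 5% of taxpayers to stop working: the total
# cost is unchanged, but the burden per remaining taxpayer rises.
remaining_taxpayers = taxpayers * 0.95
adjusted_per_taxpayer = naive_total / remaining_taxpayers

print(naive_per_taxpayer)
print(round(adjusted_per_taxpayer, 2))
```

This still understates the difficulty: it leaves out the price effects, which would require a full economic model rather than arithmetic.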

**vanessa-kosoy**on Some AI research areas and their relevance to existential safety · 2020-11-21T14:12:18.990Z · LW · GW

A lot depends on AI capability as a function of cost and time. On one extreme, there might be enough increasing returns to get a singleton: some combination of extreme investment and algorithmic advantage produces extremely powerful AI, while moderate investment or no algorithmic advantage doesn't produce even moderately powerful AI. Whoever controls the singleton has all the power. On the other extreme, returns don't rise much, resulting in personal AIs having as much collective power as corporate/government AIs, or more. In the middle, there are many powerful AIs, but still not nearly as many as there are people.

In the first scenario, to get outcome C we need the singleton to either be democratic by design, or have a very sophisticated and robust system of controlling access to it.

In the last scenario, the free market would lead to outcome B. Corporate and government actors use their access to capital to gain power through AI until the rest of the population becomes irrelevant. Effectively, AI serves as an extreme amplifier of pre-existing power differentials. Arguably, the only way to get outcome C is enforcing democratization of AI through regulation. If this seems extreme, compare it to the way our society handles physical violence. The state has a monopoly on violence, and with good reason: without this monopoly, upholding the law would be impossible. But, in the age of superhuman AI, traditional means of violence are irrelevant. The only important weapon is AI.

In the second scenario, we can manage without multi-user alignment. However, we still need multi-AI alignment, i.e. to make sure the AIs are good at coordination problems. It's possible that any sufficiently capable AI is automatically good at coordination problems, but it's not guaranteed. (Incidentally, if atomic alignment is flawed then it might actually be better for the AIs to be bad at coordination.)

**vanessa-kosoy**on Some AI research areas and their relevance to existential safety · 2020-11-20T16:51:42.909Z · LW · GW

Outcome C is most naturally achieved using "direct democracy" TAI, i.e. one that collects inputs from everyone and aggregates them in a reasonable way. We can try emulating democratic AI via single user AI, but that's hard because:

- If the number of AIs is small, the AI interface becomes a single point of failure, an actor that can hijack the interface will have enormous power.
- If the number of AIs is small, it might be unclear what inputs should be fed into the AI in order to fairly represent the collective. It requires "manually" solving the preference aggregation problem, and faults of the solution might be amplified by the powerful optimization to which it is subjected.
- If the number of AIs is more than one then we should make sure the AIs are good at cooperating, which requires research about multi-AI scenarios.
- If the number of AIs is large (e.g. one per person), we need the interface to be sufficiently robust that people can use it correctly without special training. Also, this might be prohibitively expensive.

Designing democratic AI requires good theoretical solutions for preference aggregation and the associated mechanism design problem, and good practical solutions for making it easy to use and hard to hack. Moreover, we need to get the politicians to implement those solutions. Regarding the latter, the OP argues that certain types of research can help lay the foundation by providing actionable regulation proposals.

My sense is that the OP may be more concerned about failures in which no one gets what they want rather than outcome B per se

Well, the OP did say:

(2) is essentially aiming to take over the world in the name of making it safer, which is not generally considered the kind of thing we should be encouraging lots of people to do.

I understood it as hinting at outcome B, but I might be wrong.

**vanessa-kosoy**on Some AI research areas and their relevance to existential safety · 2020-11-20T16:24:01.731Z · LW · GW

Good point, acausal trade can at least ameliorate the problem, pushing towards atomic alignment. However, we understand acausal trade too poorly to be highly confident it will work. And, "making acausal trade work" might in itself be considered outside of the desiderata of atomic alignment (since it involves multiple AIs). Moreover, there are also actors that have a very low probability of becoming TAI users but whose support is beneficial for TAI projects (e.g. small donors). Since they have no counterfactual AI to bargain on their behalf, it is less likely acausal trade works here.

**vanessa-kosoy**on Some AI research areas and their relevance to existential safety · 2020-11-19T13:42:47.682Z · LW · GW

Among other things, this post promotes the thesis that (single/single) AI alignment is insufficient for AI existential safety and the current focus of the AI risk community on AI alignment is excessive. I'll try to recap the idea the way I think of it.

We can roughly identify 3 dimensions of AI progress: AI capability, atomic AI alignment and social AI alignment. Here, atomic AI alignment is the ability to align a single AI system with a single user, whereas social AI alignment is the ability to align the sum total of AI systems with society as a whole. Depending on the relative rates at which those 3 dimensions develop, there are roughly 3 possible outcomes (ofc in reality it's probably more of a spectrum):

Outcome A: The classic "paperclip" scenario. Progress in atomic AI alignment doesn't keep up with progress in AI capability. Transformative AI is unaligned with any user, as a result the future contains virtually nothing of value to us.

Outcome B: Progress in atomic AI alignment keeps up with progress in AI capability, but progress in social AI alignment doesn't keep up. Transformative AI is aligned with a small fraction of the population, resulting in this minority gaining absolute power and abusing it to create an extremely inegalitarian future. Wars between different factions are also a concern.

Outcome C: Both atomic and social alignment keep up with AI capability. Transformative AI is aligned with society/humanity as a whole, resulting in a benevolent future for everyone.

Outcome C is the outcome we want (with the exception of people who decided to gamble on being part of the elite in outcome B). Arguably, C > B > A (although it's possible to imagine scenarios in which B < A). How does this translate into research priorities? This depends on several parameters:

- The "default" pace of progress in each dimension: e.g. if we assume atomic AI alignment will be solved in time anyway, then we should focus on social AI alignment.
- The inherent difficulty of each dimension: e.g. if we assume atomic AI alignment is relatively hard (and will therefore take a long time to solve) whereas social AI alignment becomes relatively easy once atomic AI alignment is solved, then we should focus on atomic AI alignment.
- The extent to which each dimension depends on others: e.g. if we assume it's impossible to make progress in social AI alignment without reaching some milestone in atomic AI alignment, then we should focus on atomic AI alignment for now. Similarly, some argued we shouldn't work on alignment at all before making more progress in capability.
- More precisely, the last two can be modeled jointly as the cost of marginal progress in a given dimension as a function of total progress in all dimensions.
- The extent to which outcome B is bad for people not in the elite: If it's not too bad then it's more important to prevent outcome A by focusing on atomic AI alignment, and vice versa.

The OP's conclusion seems to be that social AI alignment should be the main focus. Personally, I'm less convinced. It would be interesting to see more detailed arguments about the above parameters that support or refute this thesis.

**vanessa-kosoy**on Thoughts on Voting Methods · 2020-11-18T19:37:54.294Z · LW · GW

I think that pie-cutting is usually negative-sum, because of diminishing returns and transaction costs. So, if you could make utilitarianism into a voting system it would at least ameliorate the problem (ofc we can't easily do that because of dishonesty). However, ideally what we probably want is not utilitarianism but something like a bargaining solution. Moreover, in practice we don't know the utility functions, so we should assume some prior distribution over possible utility functions and choose the voting system that minimizes some kind of expected regret.
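The last idea can be sketched as a Monte Carlo computation: fix a prior over utility profiles, then score a voting rule by its expected utilitarian regret under that prior. The rule (plurality), the prior (i.i.d. uniform utilities), and all parameters below are illustrative choices, not a claim about which prior is right:

```python
import random

def expected_regret(n_voters=5, n_candidates=4, trials=2000, seed=0):
    """Monte Carlo estimate of the expected utilitarian regret of
    plurality voting, under an i.i.d. uniform prior over cardinal
    utilities (an illustrative prior, not a canonical one)."""
    rng = random.Random(seed)
    total_regret = 0.0
    for _ in range(trials):
        # Sample a utility profile from the prior.
        profile = [[rng.random() for _ in range(n_candidates)]
                   for _ in range(n_voters)]
        # Plurality: each voter votes for their top candidate.
        votes = [max(range(n_candidates), key=lambda c: u[c]) for u in profile]
        winner = max(range(n_candidates), key=votes.count)
        # Utilitarian optimum: the candidate maximizing total utility.
        totals = [sum(u[c] for u in profile) for c in range(n_candidates)]
        best = max(range(n_candidates), key=lambda c: totals[c])
        total_regret += totals[best] - totals[winner]
    return total_regret / trials

print(expected_regret())
```

Swapping in a different rule (approval, Borda, a bargaining-style rule) and comparing the estimates is then straightforward; the harder question is which prior over utility functions to use.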

**vanessa-kosoy**on Thoughts on Voting Methods · 2020-11-18T10:35:46.054Z · LW · GW

I think that the intuition that D is a good compromise candidate (in the second example) is wrong, since each of the voters would prefer a *random* candidate out of {A,B,C} to D. In other words, a uniform lottery over {A,B,C} is a Pareto improvement on D.
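The check is easy to make concrete with hypothetical cardinal utilities (the numbers below are mine, not from the original post): give each voter cyclic rankings over {A, B, C}, and let D, the "compromise", sit just below each voter's second choice. Every voter then strictly prefers the uniform lottery to D:

```python
import statistics

# Hypothetical utilities: three voters with cyclic rankings over {A, B, C},
# and the compromise candidate D worth 1.9 to everyone (just below each
# voter's second choice). Illustrative numbers only.
utilities = {
    "voter1": {"A": 3, "B": 2, "C": 1, "D": 1.9},
    "voter2": {"B": 3, "C": 2, "A": 1, "D": 1.9},
    "voter3": {"C": 3, "A": 2, "B": 1, "D": 1.9},
}

for voter, u in utilities.items():
    lottery_value = statistics.mean([u["A"], u["B"], u["C"]])  # uniform lottery
    assert lottery_value > u["D"]  # the lottery Pareto-improves on D
    print(voter, lottery_value, u["D"])
```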

**vanessa-kosoy**on On Arguments for God · 2020-11-14T14:44:12.815Z · LW · GW

The important difference is that theists have a lot of specific assumptions about what the god(s) do(es). In particular, in the simulation hypothesis, there is no strong reason to assume the gods are in any way benevolent or care about any human-centric concepts.

**vanessa-kosoy**on What are Examples of Great Distillers? · 2020-11-12T21:24:24.893Z · LW · GW

Seconding John Baez

**vanessa-kosoy**on The Inefficient Market Hypothesis · 2020-11-08T18:23:57.099Z · LW · GW

I have three problems with this argument.

First, it's not always possible to bet capital. For example, suppose you figured out quantum gravity. How would you bet capital on that?

Second, secrecy is costly and it's not always worth it to pay the price. For example, it's much easier to find collaborators if you go public with your idea instead of keeping it secret.

Third, sometimes there is no short-term cost to reputation. If your idea goes against established beliefs, but you have really good *arguments* for it, other people won't necessarily think you're nuts; or at least, the people who think you're nuts might be offset by the people who think you're a genius.

**vanessa-kosoy**on Multiple Worlds, One Universal Wave Function · 2020-11-07T14:22:12.974Z · LW · GW

Then what would you call reality? It sure seems like it's well-described as a mathematical object to me.

I call it "reality". It's irreducible. But I feel like this is not the most productive direction to hash out the disagreement.

Put a simplicity prior over the combined difficulty of specifying a universe and specifying you within that universe. Then update on your observations.

Okay, but then the separation between "specifying a universe" and "specifying you within that universe" is meaningless. Sans this separation, you are just doing simplicity-prior Bayesian inference. If that's what you're doing, the Copenhagen interpretation is what you end up with (modulo the usual problems with Bayesian inference).

You can mathematically well-define 1) a Turing machine with access to randomness that samples from a probability measure and 2) a Turing machine which actually computes all the histories (and then which one you find yourself in is an anthropic question). What quantum mechanics says, though, is that (1) actually doesn't work as a description of reality, because we see interference from those other branches, which means we know it has to be (2).

I don't see how you get (2) out of quantum mechanics.

**vanessa-kosoy**on Multiple Worlds, One Universal Wave Function · 2020-11-06T11:33:37.532Z · LW · GW

I disagree. "In what mathematical entity do we find ourselves?" is a map-territory confusion. We are not in a mathematical entity; we use mathematics to construct *models* of reality. And, in any case, without "locating yourself within the object", it's not clear how you know whether your theory is true, so it's very much pertinent to physics.

Moreover, I'm not sure how this perspective justifies MWI. Presumably, the wavefunction contains multiple "worlds", hence you conclude that multiple worlds "exist". However, consider an alternative universe with stochastic classical physics. The "mathematical entity" would be a probability measure over classical histories. So it can also be said to contain "multiple worlds". But in that universe, everyone would be comfortable saying there's just one non-deterministic world. So, you need something else to justify the multiple worlds, but I'm not sure what. Maybe you would say the stochastic universe also has multiple worlds, but then it starts looking like a philosophical assumption that doesn't follow from physics.