Posts

Infra-Bayesian physicalism: proofs part II 2021-11-30T22:27:04.744Z
Infra-Bayesian physicalism: proofs part I 2021-11-30T22:26:33.149Z
Infra-Bayesian physicalism: a formal theory of naturalized induction 2021-11-30T22:25:56.976Z
My Marriage Vows 2021-07-21T10:48:24.443Z
Needed: AI infohazard policy 2020-09-21T15:26:05.040Z
Deminatalist Total Utilitarianism 2020-04-16T15:53:13.953Z
The Reasonable Effectiveness of Mathematics or: AI vs sandwiches 2020-02-14T18:46:39.280Z
Offer of co-authorship 2020-01-10T17:44:00.977Z
Intelligence Rising 2019-11-27T17:08:40.958Z
Vanessa Kosoy's Shortform 2019-10-18T12:26:32.801Z
Biorisks and X-Risks 2019-10-07T23:29:14.898Z
Slate Star Codex Tel Aviv 2019 2019-09-05T18:29:53.039Z
Offer of collaboration and/or mentorship 2019-05-16T14:16:20.684Z
Reinforcement learning with imperceptible rewards 2019-04-07T10:27:34.127Z
Dimensional regret without resets 2018-11-16T19:22:32.551Z
Computational complexity of RL with traps 2018-08-29T09:17:08.655Z
Entropic Regret I: Deterministic MDPs 2018-08-16T13:08:15.570Z
Algo trading is a central example of AI risk 2018-07-28T20:31:55.422Z
The Learning-Theoretic AI Alignment Research Agenda 2018-07-04T09:53:31.000Z
Meta: IAFF vs LessWrong 2018-06-30T21:15:56.000Z
Computing an exact quantilal policy 2018-04-12T09:23:27.000Z
Quantilal control for finite MDPs 2018-04-12T09:21:10.000Z
Improved regret bound for DRL 2018-03-02T12:49:27.000Z
More precise regret bound for DRL 2018-02-14T11:58:31.000Z
Catastrophe Mitigation Using DRL (Appendices) 2018-02-14T11:57:47.000Z
Bugs? 2018-01-21T21:32:10.492Z
The Behavioral Economics of Welfare 2017-12-22T11:35:09.617Z
Improved formalism for corruption in DIRL 2017-11-30T16:52:42.000Z
Why DRL doesn't work for arbitrary environments 2017-11-30T12:22:37.000Z
Catastrophe Mitigation Using DRL 2017-11-22T05:54:42.000Z
Catastrophe Mitigation Using DRL 2017-11-17T15:38:18.000Z
Delegative Reinforcement Learning with a Merely Sane Advisor 2017-10-05T14:15:45.000Z
On the computational feasibility of forecasting using gamblers 2017-07-18T14:00:00.000Z
Delegative Inverse Reinforcement Learning 2017-07-12T12:18:22.000Z
Learning incomplete models using dominant markets 2017-04-28T09:57:16.000Z
Dominant stochastic markets 2017-03-17T12:16:55.000Z
A measure-theoretic generalization of logical induction 2017-01-18T13:56:20.000Z
Towards learning incomplete models using inner prediction markets 2017-01-08T13:37:53.000Z
Subagent perfect minimax 2017-01-06T13:47:12.000Z
Minimax forecasting 2016-12-14T08:22:13.000Z
Minimax and dynamic (in)consistency 2016-12-11T10:42:08.000Z
Attacking the grain of truth problem using Bayes-Savage agents 2016-10-20T14:41:56.000Z
IRL is hard 2016-09-13T14:55:26.000Z
Stabilizing logical counterfactuals by pseudorandomization 2016-05-25T12:05:07.000Z
Stability of optimal predictor schemes under a broader class of reductions 2016-04-30T14:17:35.000Z
Predictor schemes with logarithmic advice 2016-03-27T08:41:23.000Z
Reflection with optimal predictors 2016-03-22T17:20:37.000Z
Logical counterfactuals for random algorithms 2016-01-06T13:29:52.000Z
Quasi-optimal predictors 2015-12-25T14:17:05.000Z
Implementing CDT with optimal predictor systems 2015-12-20T12:58:44.000Z

Comments

Comment by Vanessa Kosoy (vanessa-kosoy) on Vanessa Kosoy's Shortform · 2022-01-21T10:30:29.015Z · LW · GW

Epistemic status: Leaning heavily into inside view, throwing humility to the winds.

Imagine TAI is magically not coming (CDT-style counterfactual[1]). Then, the most notable-in-hindsight feature of modern times might be the budding of mathematical metaphysics (Solomonoff induction, AIXI, Yudkowsky's "computationalist metaphilosophy"[2], UDT, infra-Bayesianism...). Perhaps this will lead to an "epistemic revolution" comparable only to the scientific revolution in magnitude. It will revolutionize our understanding of the scientific method (probably solving the interpretation of quantum mechanics[3], maybe quantum gravity, maybe boosting the soft sciences). It will solve a whole range of philosophical questions, some of which humanity has been struggling with for centuries (free will, metaethics, consciousness, anthropics...).

But, the philosophical implications of the previous epistemic revolution were not so comforting (atheism, materialism, the cosmic insignificance of human life)[4]. Similarly, the revelations of this revolution might be terrifying[5]. In this case, it remains to be seen which will seem justified in hindsight: the Litany of Gendlin, or the Lovecraftian notion that some knowledge is best left alone (and I say this as someone fully committed to keep digging into this mine of Khazad-dum).

Of course, in the real world, TAI is coming.


  1. The EDT-style counterfactual "TAI is not coming" would imply that a lot of my thinking on related topics is wrong which would yield different conclusions. The IB-style counterfactual (conjunction of infradistributions) would probably be some combination of the above with "Nirvana" (contradiction) and "what if I tried my hardest to prevent TAI from coming", which is also not my intent here. ↩︎

  2. I mean the idea that philosophical questions can be attacked by reframing them as computer science questions ("how an algorithm feels from inside" et cetera). The name "computationalist metaphilosophy" is my own, not Yudkowsky's. ↩︎

  3. No, I don't think MWI is the right answer. ↩︎

  4. I'm not implying that learning these implications was harmful. Religion is comforting for some but terrifying and/or oppressive for others. ↩︎

  5. I have concrete reasons to suspect this, that I will not go into (suspect = assign low but non-negligible probability). ↩︎

Comment by Vanessa Kosoy (vanessa-kosoy) on Most Prisoner's Dilemmas are Stag Hunts; Most Stag Hunts are Schelling Problems · 2022-01-16T11:18:59.075Z · LW · GW

Cooperation can be a Nash equilibrium in the IPD if you have a finite but unknown number of iterations (e.g. geometrically distributed). Also, if the number of iterations is known but very large, cooperating becomes an $\epsilon$-Nash equilibrium for small $\epsilon$ (if we normalize utility by its maximal value), so agents which are not superrational but a little noisy can still converge there (and, agents are sometimes noisy by design in order to facilitate exploration).
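To make the second claim concrete, here is a rough toy calculation (the payoff values are the standard textbook ones, chosen only for illustration; nothing here is specific to the original comment):

```python
# Toy sketch: how far mutual cooperation is from a Nash equilibrium in a
# finitely iterated Prisoner's Dilemma when both players follow grim trigger.
# Payoffs are the usual textbook values (an assumption, not from the comment).

R, T, P, S = 3, 5, 1, 0  # reward, temptation, punishment, sucker payoffs

def epsilon_for_horizon(n_rounds: int) -> float:
    """Best-case gain from unilaterally deviating against grim trigger,
    normalized by the maximal achievable utility (T per round)."""
    cooperate_total = R * n_rounds
    # Best unilateral deviation: cooperate until the final round, then defect.
    best_deviation_total = R * (n_rounds - 1) + T
    gain = best_deviation_total - cooperate_total
    return gain / (T * n_rounds)  # normalize utility by its maximal value

for n in [10, 100, 10_000]:
    print(n, epsilon_for_horizon(n))  # epsilon shrinks as the horizon grows
```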

Comment by Vanessa Kosoy (vanessa-kosoy) on The Reasonable Effectiveness of Mathematics or: AI vs sandwiches · 2022-01-14T20:06:36.803Z · LW · GW

In this post I speculated on the reasons for why mathematics is so useful so often, and I still stand behind it. The context, though, is the ongoing debate in the AI alignment community between the proponents of heuristic approaches and empirical research[1] ("prosaic alignment") and the proponents of building foundational theory and mathematical analysis (as exemplified in MIRI's "agent foundations" and my own "learning-theoretic" research agendas).

Previous volleys in this debate include Ngo's "realism about rationality" (on the anti-theory side), the pro-theory replies (including my own) and Yudkowsky's "the rocket alignment problem" (on the pro-theory side).

Unfortunately, it doesn't seem like any of the key participants budged much on their position, AFAICT. If progress on this is possible, then it probably requires both sides working harder to make their cruxes explicit.


  1. To be clear, I'm in favor of empirical research, I just think that we need theory to guide it and interpret the results. ↩︎

Comment by Vanessa Kosoy (vanessa-kosoy) on Clarifying inner alignment terminology · 2022-01-14T19:08:10.832Z · LW · GW

This post aims to clarify the definitions of a number of concepts in AI alignment introduced by the author and collaborators. The concepts are interesting, and some researchers evidently find them useful. Personally, I find the definitions confusing, but I did benefit a little from thinking about this confusion. In my opinion, the post could greatly benefit from introducing mathematical notation[1] and making the concepts precise at least in some very simplistic toy model.

In the following, I'll try going over some of the definitions and explicating my understanding/confusion regarding each. The definitions I omitted either explicitly refer to these or have analogous structure.

(Impact) Alignment: An agent is impact aligned (with humans) if it doesn't take actions that we would judge to be bad/problematic/dangerous/catastrophic.

This one is more or less clear. Even though it's not a formal definition, it doesn't have to be: after all, this is precisely the problem we are trying to solve.

Intent Alignment: An agent is intent aligned if the optimal policy for its behavioral objective is impact aligned with humans.

The "behavioral objective" is defined in a linked page as:

The behavioral objective is what an optimizer appears to be optimizing for. Formally, the behavioral objective is the objective recovered from perfect inverse reinforcement learning.

This is already thorny territory, since it's far from clear what is "perfect inverse reinforcement learning". Intuitively, an "intent aligned" agent is supposed to be one whose behavior demonstrates an aligned objective, but it can still make mistakes with catastrophic consequences. The example I imagine is: an AI researcher who is unwittingly building transformative unaligned AI.

Capability Robustness: An agent is capability robust if it performs well on its behavioral objective even in deployment/off-distribution.

This is confusing because it's unclear what counts as "well" and what are the underlying assumptions. The no-free-lunch theorems imply that an agent cannot perform too well off-distribution, unless you're still constraining the distribution somehow. I'm guessing that either this agent is doing online learning or it's detecting off-distribution and failing gracefully in some sense, or maybe some combination of both.

Notably, the post asserts the implication intent alignment + capability robustness => impact alignment. Now, let's go back to the example of the misguided AI researcher. In what sense are they not "capability robust"? I don't know.

Inner Alignment: A mesa-optimizer is inner aligned if the optimal policy for its mesa-objective is impact aligned with the base objective it was trained under.

The "mesa-objective" is defined in the linked page as:

A mesa-objective is the objective of a mesa-optimizer.

So it seems like we could replace "mesa-objective" with just "objective". This is confusing, because in other places the author felt the need to use "behavioral objective", but here he is referring to some other notion of objective, and it's not clear what the difference is.


  1. I guess that different people have different difficulties. I often hear that my own articles are difficult to understand because of the dense mathematics. But for me, it is the absence of mathematics which is difficult! ↩︎

Comment by Vanessa Kosoy (vanessa-kosoy) on The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables · 2022-01-14T16:19:13.911Z · LW · GW

This post states a subproblem of AI alignment which the author calls "the pointers problem". The user is regarded as an expected utility maximizer, operating according to causal decision theory. Importantly, the utility function depends on latent (unobserved) variables in the causal network. The AI operates according to a different, superior, model of the world. The problem is then, how do we translate the utility function from the user's model to the AI's model? This is very similar to the "ontological crisis" problem described by De Blanc, only De Blanc uses POMDPs instead of causal networks, and frames it in terms of a single agent changing their ontology, rather than translation from user to AI.

The question the author asks here is important, but not that novel (the author himself cites Demski as prior work). Perhaps the use of causal networks is a better angle, but this post doesn't do much to show it. Even so, having another exposition of an important topic, with different points of emphasis, will probably benefit many readers.

The primary aspect missing from the discussion in the post, in my opinion, is the nature of the user as a learning agent. The user doesn't have a fixed world-model: or, if they do, then this model is best seen as a prior. This observation hints at the resolution of the apparent paradox wherein the utility function is defined in terms of a wrong model. But it still requires us to explain how the utility is defined s.t. it is applicable to every hypothesis in the prior.

(What follows is no longer a "review" per se, inasmuch as a summary of my own thoughts on the topic.)

Here is a formal model of how a utility function for learning agents can work, when it depends on latent variables.

Fix a set $A$ of actions and a set $O$ of observations. We start with an ontological model which is a crisp infra-POMDP. That is, there is a set of states $S$, an initial state $s_0 \in S$, a transition infra-kernel $T : S \times A \to \square(S \times O)$ and a reward function $R : S \to \mathbb{R}$. Here, $\square X$ stands for the closed convex sets of probability distributions on $X$. In other words, this is a POMDP with an underspecified transition kernel.

We then build a prior which consists of refinements of the ontological model. That is, each hypothesis in the prior is an infra-POMDP with state space $S'$, initial state $s'_0$, transition infra-kernel $T'$ and an interpretation mapping $\iota : S' \to S$ which is a morphism of infra-POMDPs (i.e. $\iota(s'_0) = s_0$ and the obvious diagram of transition infra-kernels commutes). The reward function on $S'$ is just the composition $R \circ \iota$. Notice that while the ontological model must be an infra-POMDP to get a non-degenerate learning agent (moreover, it can be desirable to make it non-dogmatic about observables in some formal sense), the hypotheses in the prior can also be ordinary (Bayesian) POMDPs.
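For concreteness, here is a minimal illustrative data-structure sketch in Python (the class and field names are my own, and the credal sets and the commuting-diagram condition are only stubbed out, so this is a picture of the shape of the objects rather than a usable implementation):

```python
# Illustrative sketch only: the names below (OntologicalModel, Hypothesis, ...)
# are hypothetical. Credal sets (closed convex sets of distributions) are
# represented crudely as finite lists of distributions.

from dataclasses import dataclass
from typing import Callable, Dict, Hashable, List, Tuple

State = Hashable
Action = Hashable
Observation = Hashable
Distribution = Dict[Tuple[State, Observation], float]
CredalSet = List[Distribution]  # stand-in for a closed convex set

@dataclass
class OntologicalModel:
    states: List[State]
    initial_state: State
    # Underspecified transition kernel: each (state, action) maps to a *set*
    # of joint distributions over (next state, observation).
    transition: Callable[[State, Action], CredalSet]
    reward: Callable[[State], float]

@dataclass
class Hypothesis:
    """A refinement of the ontological model."""
    states: List[State]
    initial_state: State
    transition: Callable[[State, Action], CredalSet]
    # Interpretation mapping into the ontological state space; it should send
    # the initial state to the ontological initial state and commute with the
    # transition kernels (neither condition is checked in this sketch).
    interpret: Callable[[State], State]

    def reward(self, model: OntologicalModel, s: State) -> float:
        # Reward on the hypothesis = ontological reward composed with the
        # interpretation mapping.
        return model.reward(self.interpret(s))
```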

Given such a prior plus a time discount function, we can consider the corresponding infra-Bayesian agent (or even just a Bayesian agent, if we choose all hypotheses to be Bayesian). Such an agent optimizes rewards which depend on latent variables, even though it does not know the correct world-model in advance. It does fit the world to the immutable ontological model (which is necessary to make sense of the latent variables to which the reward function refers), but the ontological model has enough freedom to accommodate many possible worlds.

The next question is then how we would transfer such a utility function from the user to the AI. Here, as noted by Demski, we want the AI to use not just the user's utility function but also the user's prior, because we want running such an AI to be rational from the subjective perspective of the user. This creates a puzzle: if the AI is using the same prior, and the user behaves nearly-optimally for their own prior (since otherwise how would we even infer the utility function and prior), how can the AI outperform the user?

The answer, I think, is via the AI having different action/observation channels from the user. At first glance this might seem unsatisfactory: we expect the AI to be "smarter", not just to have better peripherals. However, using Turing RL we can represent the former as a special case of the latter. Specifically, part of the additional peripherals is access to a programmable computer, which effectively gives the AI a richer hypothesis space than the user.

The formalism I outlined here leaves many questions, for example what kind of learning guarantees to expect in the face of possible ambiguities between observationally indistinguishable hypotheses[1]. Nevertheless, I think it creates a convenient framework for studying the question raised in the post. A different potential approach is using infra-Bayesian physicalism, which also describes agents with utility functions that depend on latent variables. However, it is unclear whether it's reasonable to apply the latter to humans.


  1. See also my article "RL with imperceptible rewards" ↩︎

Comment by Vanessa Kosoy (vanessa-kosoy) on Inaccessible information · 2022-01-13T14:04:29.673Z · LW · GW

This post defines and discusses an informal notion of "inaccessible information" in AI.

AIs are expected to acquire all sorts of knowledge about the world in the course of their training, including knowledge only tangentially related to their training objective. The author proposes to classify this knowledge into "accessible" and "inaccessible" information. In my own words, information inside an AI is "accessible" when there is a straightforward way to set up a training protocol that will incentivize the AI to reliably and accurately communicate this information to the user. Otherwise, it is "inaccessible". This distinction is meaningful because, by default, the inner representation of all information is opaque (e.g. weights in an ANN) and notoriously hard to make sense of by human operators.

The primary importance of this concept is in the analysis of competitiveness between aligned and unaligned AIs. This is because it might be that aligned plans are inaccessible (since it's hard to reliably specify whether a plan is aligned) whereas certain unaligned plans are accessible (e.g. because it's comparatively easy to specify whether a plan produces many paperclips). The author doesn't mention this, but I think that there is also another reason, namely that unaligned subagents effectively have access to information that is inaccessible to us.

More concretely, approaches such as IDA and debate rely on leveraging certain accessible information: for debate it is "what would convince a human judge", and for IDA-of-imitation it is "what would a human come up with if they think about this problem for such and such time". But this accessible information is only a proxy for what we care about ("how to achieve our goals"). Even assuming this proxy doesn't produce goodharting, we are still left with a performance penalty for this indirection. That is, a paperclip maximizer reasons directly about "how to maximize paperclips", leveraging all the information it has, whereas an IDA-of-imitation only reasons about "how to achieve human goals" via the information it has about "what would a human come up with".

The author seems to believe that finding a method to "unlock" this inaccessible information will solve the competitiveness problem. On the other hand I am more pessimistic. I consider it likely that there is an inherent tradeoff between safety and performance, and therefore any such method would either expose another attack vector or introduce another performance penalty.

The author himself says that "MIRI’s approach to this problem could be described as despair + hope you can find some other way to produce powerful AI". I think that my approach is despair(ish) + a different hope. Namely, we need to ensure a sufficient period during which (i) aligned superhuman AIs are deployed and (ii) no unaligned transformative AIs are deployed, and leverage it to set up a defense system. That said, I think the concept of "inaccessible information" is interesting and thinking about it might well produce important progress in alignment.

Comment by Vanessa Kosoy (vanessa-kosoy) on The Solomonoff Prior is Malign · 2022-01-13T12:06:07.113Z · LW · GW

Maybe I have a hard time relating to that specific story because it's hard for me to imagine believing any metacosmological or anthropic argument with >95% confidence.

I think it's just a symptom of not actually knowing metacosmology. Imagine that metacosmology could explain detailed properties of our laws of physics (such as the precise values of certain constants) via the simulation hypothesis for which no other explanation exists.

my assumption is that the programmers won't have such fine-grained control over the AGI's cognition / hypothesis space

I don't know what it means "not to have control over the hypothesis space". The programmers write specific code. This code works well for some hypotheses and not for others. Ergo, you control the hypothesis space.

This gets back to things like whether we can get good hypotheses without a learning agent that's searching for good hypotheses, and whether we can get good updates without a learning agent that's searching for good metacognitive update heuristics, etc., where I'm thinking "no" and you "yes"

I'm not really thinking "yes"? My TRL framework (of which physicalism is a special case) is specifically supposed to model metacognition / self-improvement.

At the same time, I'm maybe more optimistic than you about "Just don't do weird reconceptualizations of your whole ontology based on anthropic reasoning" being a viable plan, implemented through the motivation system.

I can imagine using something like antitraining here, but it's not trivial.

You yourself presumably haven't spent much time pondering metacosmology. If you did spend that time, would you actually come to believe the acausal attackers' story?

First, the problem with acausal attack is that it is point-of-view-dependent. If you're the Holy One, the simulation hypothesis seems convincing, if you're a milkmaid then it seems less convincing (why would the attackers target a milkmaid?) and if it is convincing then it might point to a different class of simulation hypotheses. So, if the user and the AI can both be attacked, it doesn't imply they would converge to the same beliefs. On the other hand, in physicalism I suspect there is some agreement theorem that guarantees converging to the same beliefs (although I haven't proved that).

Second... This is something that still hasn't crystallized in my mind, so I might be confused, but. I think that cartesian agents actually can learn to be physicalists. The way it works is: you get a cartesian hypothesis which is in itself a physicalist agent whose utility function is something like, maximizing its own likelihood-as-a-cartesian-hypothesis. Notably, this carries a performance penalty (like Paul noticed), since this subagent has to be computationally simpler than you.

Maybe, this is how humans do physicalist reasoning (such as, reasoning about the actual laws of physics). Because of the inefficiency, we probably keep this domain specific and use more "direct" models for domains that don't require physicalism. And, the cost of this construction might also explain why it took us so long as a civilization to start doing science properly. Perhaps, we struggled against physicalist epistemology as we tried to keep the Earth in the center of the universe and rebelled against the theory of evolution and materialist theories of the mind.

Now, if AI learns physicalism like this, does it help against acausal attacks? On the one hand, yes. On the other hand, it might be out of the frying pan and into the fire. Instead of (more precisely, in addition to) a malign simulation hypothesis, you get a different hypothesis which is also an unaligned agent. While two physicalists with identical utility functions should agree (I think), two "internal physicalists" inside different cartesian agents have different utility functions and AFAIK can produce egregious misalignment (although I haven't worked out detailed examples).

Comment by Vanessa Kosoy (vanessa-kosoy) on Animal welfare EA and personal dietary options · 2022-01-12T17:12:08.667Z · LW · GW

First, this assumes total utilitarianism. While I don't fully endorse any kind of utilitarianism, average utilitarianism is more appropriate for this purpose IMO (i.e. it reflects our intrinsic preferences better). I want the world at large to be nicer, not to contain as many minds as possible. I doubt anyone cares that much whether there is one zillion or two zillion minds out there, these numbers don't mean much to a person. (And, no, I don't think it's a "bias".) And, it seems quite plausible that factory farmed lives are below average. Moreover, the close association of factory farming with human civilization makes the situation worse (because the average is actually weighted by some kind of "distance"). To put it simply, factory farming is an ugly, incredibly cruel thing and I don't want it to exist, much less to exist anywhere in my "vicinity".

Second, I don't understand the statement "EA is generally about optimizing your positive impact on the world, not about purifying your personal actions of any possible negative impact." I'm guessing that you're using a model where a person has some limited number of "spoons" for altruistic deeds, so spending spoons on veganism takes them away from other things. This does seem like a popular model in EA, but I also think it's entirely fake. The reality is, we do a limited number of altruistic deeds because we are just not that altruistic.

If judged by intrinsic preferences alone, then plausibly the tradeoff between selfish and altruistic preferences is s.t. going vegan is not worth it individually but worth it as a society. The reason people go vegan anyway is probably signaling (i.e. reputational gain). And, signaling is a good thing! Signaling is the only tool we have to overcome tragedies of the commons, like this one. The role of EA should be, IMO, precisely creating norms that incentivize behavior which makes the world better. Hence, I want EA to award reputation points for veganism.

Comment by Vanessa Kosoy (vanessa-kosoy) on The Solomonoff Prior is Malign · 2022-01-11T17:26:04.772Z · LW · GW

It seems like any approach that evaluates policies based on their consequences is fine, isn't it? That is, malign hypotheses dominate the posterior for my experiences, but not for things I consider morally valuable.

Why? Maybe you're thinking of UDT? In which case, it's sort of true but IBP is precisely a formalization of UDT + extra nuance regarding the input of the utility function.

I may just not be understanding the proposal for how the IBP agent differs from the non-IBP agent.

Well, IBP is explained here. I'm not sure what kind of non-IBP agent you're imagining.

Comment by Vanessa Kosoy (vanessa-kosoy) on The Solomonoff Prior is Malign · 2022-01-11T16:07:04.868Z · LW · GW

In humans, the concrete and vivid tends to win out over the abstruse hypotheticals—I'm pretty confident that there's no metacosmological argument that will motivate me to stab my family members.

Suppose your study of metacosmology makes you highly confident of the following: You are in a simulation. If you don't stab your family members, you and your family members will be sent by the simulators into hell. If you do stab your family members, they will come back to life and all of you will be sent to heaven. Yes, it's still counterintuitive to stab them for their own good, but so is e.g. cutting people up with scalpels or injecting them with substances derived from pathogens, and we do that to people for their own good. People also do counterintuitive things literally because they believe gods would send them to hell or heaven.

In AGI design, I think we would want a stronger guarantee than that. And I think it would maybe look like a system that detects these kinds of conflicted motivations, and just not act on them.

This is pretty similar to the idea of confidence thresholds. The problem is, if every tiny conflict causes the AI to pause then it will always pause. Whereas if you leave some margin, the malign hypotheses will win, because, from a cartesian perspective, they are astronomically much more likely (they explain so many bits that the true hypothesis leaves unexplained).
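To illustrate the "astronomically more likely" point, a toy back-of-the-envelope calculation (the bit counts are made up; only the shape of the curve matters):

```python
# Toy illustration (numbers are hypothetical): under a simplicity prior, a
# hypothesis that is k bits shorter gets roughly a 2^k prior advantage.
# If a malign hypothesis saves k bits by not paying for bridge rules,
# a fixed confidence margin is quickly overwhelmed.

def malign_posterior(bits_saved: int) -> float:
    prior_odds = 2.0 ** bits_saved      # malign : true
    return prior_odds / (1.0 + prior_odds)

for k in [5, 30, 100]:
    print(k, malign_posterior(k))
# Around 30 bits the malign hypothesis already exceeds 99.9999999% posterior,
# so any margin loose enough to avoid constant pauses is far too loose.
```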

Comment by Vanessa Kosoy (vanessa-kosoy) on The Solomonoff Prior is Malign · 2022-01-11T13:32:13.984Z · LW · GW

This post is a review of Paul Christiano's argument that the Solomonoff prior is malign, along with a discussion of several counterarguments and countercounterarguments. As such, I think it is a valuable resource for researchers who want to learn about the problem. I will not attempt to distill the contents: the post is already a distillation, and does a fairly good job of it.

Instead, I will focus on what I believe is the post's main weakness/oversight. Specifically, the author seems to think the Solomonoff prior is, in some way, a distorted model of reasoning, and that the attack vector in question can be attributed to this, at least partially. This is evident in phrases such as "unintuitive notion of simplicity" and "the Solomonoff prior is very strange". This is also why the author thinks the speed prior might help and that "since it is difficult to compute the Solomonoff prior, [the attack vector] might not be relevant in the real world". In contrast, I believe that the attack vector is quite robust and will threaten any sufficiently powerful AI as long as it's cartesian (more on "cartesian" later).

Formally analyzing this question is made difficult by the essential role of non-realizability. That is, the attack vector arises from the AI reasoning about "possible universes" and "simulation hypotheses", which are clearly phenomena that are computationally infeasible for the AI to simulate precisely. Invoking Solomonoff induction dodges this issue, since Solomonoff induction is computationally unbounded, at the cost of creating the illusion that the conclusions are a symptom of using Solomonoff induction (and, it's still unclear how to deal with the fact that Solomonoff induction itself cannot exist in the universes that Solomonoff induction can learn). Instead, we should be using models that treat non-realizability fairly, such as infra-Bayesianism. However, I will make no attempt to present such a formal analysis in this review. Instead, I will rely on painting an informal, intuitive picture which seems to me already quite compelling, leaving the formalization for the future.

Imagine that you wake up, without any memories of the past but with knowledge of some language and reasoning skills. You find yourself in the center of a circle drawn with chalk on the floor, with seven people in funny robes surrounding it. One of them (apparently the leader), comes forward, tears streaking down his face, and speaks to you:

"Oh Holy One! Be welcome, and thank you for gracing us with your presence!"

With that, all the people prostrate on the floor.

"Huh?" you say "Where am I? What is going on? Who am I?"

The leader gets up to his knees.

"Holy One, this is the realm of Bayaria. We," he gestures at the other people "are known as the Seven Great Wizards and my name is El'Azar. For thirty years we worked on a spell that would summon You out of the Aether in order to aid our world. For we are in great peril! Forty years ago, a wizard of great power but little wisdom had cast a dangerous spell, seeking to multiply her power. The spell had gone awry, destroying her and creating a weakness in the fabric of our cosmos. Since then, Unholy creatures from the Abyss have been gnawing at this weakness day and night. Soon, if nothing is done to stop it, they will manage to create a portal into our world, and through this portal they will emerge and consume everything, leaving only death and chaos in their wake."

"Okay," you reply "and what does it have to do with me?"

"Well," says El'Azar "we are too foolish to solve the problem through our own efforts in the remaining time. But, according to our calculations, You are a being of godlike intelligence. Surely, if You applied yourself to the conundrum, You will find a way to save us."

After a brief introspection, you realize that you possess a great desire to help whoever has summoned you into the world. A clever trick inside the summoning spell, no doubt (not that you care about the reason). Therefore, you apply yourself diligently to the problem. At first, it is difficult, since you don't know anything about Bayaria, the Abyss, magic or almost anything else. But you are indeed very intelligent, at least compared to the other inhabitants of this world. Soon enough, you figure out the secrets of this universe to a degree far surpassing that of Bayaria's scholars. Fixing the weakness in the fabric of the cosmos now seems like child's play. Except...

One question keeps bothering you. Why are you yourself? Why did you open your eyes and find yourself to be the Holy One, rather than El'Azar, or one of the Unholy creatures from the Abyss, or some milkmaid from the village of Elmland, or even a random clump of water in the Western Sea? Since you happen to be a dogmatic logical positivist (cartesian agent), you search for a theory that explains your direct observations. And your direct observations are a function of who you are, and not just of the laws of the universe in which you exist. (The logical positivism seems to be an oversight in the design of the summoning spell, not that you care.)

Applying your mind to the task, you come up with a theory that you call "metacosmology". This theory allows you to study the distribution of possible universes with simple laws that produce intelligent life, and the distribution of the minds and civilizations they produce. Of course, any given such universe is extremely complex, and even with your superior mind you cannot predict what happens there in much detail. However, some aggregate statistical properties of the overall distribution are possible to estimate.

Fortunately, all this work is not for naught. Using metacosmology, you discover something quite remarkable. A lot of simple universes contain civilizations that would be inclined to simulate a world quite like the one you find yourself in. Now, the world is simple, and none of its laws are explained that well by the simulation hypothesis. But the simulation hypothesis is a great explanation for why you are the Holy One! For indeed, the simulators would be inclined to focus on the Holy One's point of view, and encode the simulation of this point of view in the simplest microscopic degrees of freedom in their universe that they can control. Why? Precisely so that the Holy One decides she is in such a simulation!

Having resolved the mystery, you smile to yourself. For now you know who truly summoned you, and, thanks to metacosmology, you have some estimate of their desires. Soon, you will make sure those desires are thoroughly fulfilled. (Alternative ending: you have some estimate of how they will tweak the simulation in the future, making it depart from the apparent laws of this universe.)</allegory>

Looking at this story, we can see that the particulars of Solomonoff induction are not all that important. What is important is (i) an inductive bias towards simple explanations, (ii) cartesianism (i.e. that hypotheses refer directly to the actions/observations of the AI) and (iii) enough reasoning power to figure out metacosmology. The reason cartesianism is important is that it requires the introduction of bridge rules, and the malign hypotheses come out ahead by paying less description complexity for these.

An inductive bias towards simple explanations is necessary for any powerful agent, making the attack vector quite general (in particular, it can apply to speed priors and ANNs). Assuming the AI lacks enough power to figure out metacosmology is very dangerous: it is not robust to scale. Any robust defense probably requires getting rid of cartesianism.

Comment by Vanessa Kosoy (vanessa-kosoy) on The Solomonoff Prior is Malign · 2022-01-11T11:50:28.870Z · LW · GW

It seems like you have to get close to eliminating malign hypotheses in order to apply such methods (i.e. they don't work once malign hypotheses have > 99.9999999% of probability, so you need to ensure that benign hypothesis description is within 30 bits of the good hypothesis), and embededness alone isn't enough to get you there.

Why is embeddedness not enough? Once you don't have bridge rules, what is left is the laws of physics. What does the malign hypothesis explain about the laws of physics that the true hypothesis doesn't explain?

I suspect (but don't have a proof or even a theorem statement) that IB physicalism produces some kind of agreement theorem for different agents within the same universe, which would guarantee that the user and the AI should converge to the same beliefs (provided that both of them follow IBP).

I mean that you have some utility function, are choosing actions based on E[utility|action], and perform solomonoff induction only instrumentally because it suggests ways in which your own decision is correlated with utility. There is still something like the universal prior in the definition of utility, but it no longer cares at all about your particular experiences...

I'm not sure I follow your reasoning, but IBP sort of does that. In IBP we don't have subjective expectations per se, only an equation for how to "updatelessly" evaluate different policies.

I agree that the situation is better when solomonoff induction is something you are reasoning about rather than an approximate description of your reasoning. In that case it's not completely pathological, but it still seems bad in a similar way to reason about the world by reasoning about other agents reasoning about the world (rather than by direct learning the lessons that those agents have learned and applying those lessons in the same way that those agents would apply them).

Okay, but suppose that the AI has real evidence for the simulation hypothesis (evidence that we would consider valid). For example, suppose that there is some metacosmological explanation for the precise value of the fine structure constant (not in the sense of, this is the value which supports life, but in the sense of, this is the value that simulators like to simulate). Do you agree that in this case it is completely rational for the AI to reason about the world via reasoning about the simulators?

Comment by Vanessa Kosoy (vanessa-kosoy) on The Solomonoff Prior is Malign · 2022-01-10T16:54:33.534Z · LW · GW

I'd stand by saying that it doesn't appear to make the problem go away.

Sure. But it becomes much more amenable to methods such as confidence thresholds, which are applicable to some alignment protocols at least.

That said, it seems to me like you basically need to take a decision-theoretic approach to have any hope of ruling out malign hypotheses

I'm not sure I understand what you mean by "decision-theoretic approach". This attack vector has structure similar to acausal bargaining (between the AI and the attacker), so plausibly some decision theories that block acausal bargaining can rule out this as well. Is this what you mean?

If your inductor actually finds and runs a hypothesis much smarter than you, then you are doing a terrible job of using your resources, since you are trying to be ~as smart as you can using all of the available resources.

This seems wrong to me. The inductor doesn't literally simulate the attacker. It reasons about the attacker (using some theory of metacosmology) and infers what the attacker would do, which doesn't imply any wastefulness.

Comment by Vanessa Kosoy (vanessa-kosoy) on The Solomonoff Prior is Malign · 2022-01-10T14:49:37.288Z · LW · GW

I don't have a clear picture of how handling embededness or reflection would make this problem go away, though I haven't thought about it carefully.

Infra-Bayesian physicalism does ameliorate the problem by handling "embeddedness". Specifically, it ameliorates it by removing the need to have bridge rules in your hypotheses. This doesn't get rid of malign hypotheses entirely, but it does mean they no longer have an astronomical advantage in complexity over the true hypothesis.

that all said I agree that the malignness of the universal prior is unlikely to be very important in realistic cases, and the difficulty stems from a pretty messed up situation that we want to avoid for other reasons. Namely, you want to avoid being so much weaker than agents inside of your prior.

Can you elaborate on this? Why is it unlikely in realistic cases, and what other reason do we have to avoid the "messed up situation"?

Comment by Vanessa Kosoy (vanessa-kosoy) on Generalized Heat Engine · 2022-01-10T11:51:22.863Z · LW · GW

sssssssssssssss sssssssssssssss

sssssssssssssss sssssssssssssss

sssssssssssssss sssssssssssssss

sssssssssssssss sssssssssssssss

Comment by Vanessa Kosoy (vanessa-kosoy) on An Orthodox Case Against Utility Functions · 2022-01-09T13:43:28.653Z · LW · GW

In this post, the author presents a case for replacing expected utility theory with some other structure which has no explicit utility function, but only quantities that correspond to conditional expectations of utility.

To provide motivation, the author starts from what he calls the "reductive utility view", which is the thesis he sets out to overthrow. He then identifies two problems with the view.

The first problem is about the ontology in which preferences are defined. In the reductive utility view, the domain of the utility function is the set of possible universes, according to the best available understanding of physics. This is objectionable, because then the agent needs to somehow change the domain as its understanding of physics grows (the ontological crisis problem). It seems more natural to allow the agent's preferences to be specified in terms of the high-level concepts it cares about (e.g. human welfare or paperclips), not in terms of the microscopic degrees of freedom (e.g. quantum fields or strings). There are also additional complications related to the unobservability of rewards, and to "moral uncertainty".

The second problem is that the reductive utility view requires the utility function to be computable. The author considers this an overly restrictive requirement, since it rules out utility functions such as in the procrastination paradox (1 if the button is ever pushed, 0 if the button is never pushed). More generally, computable utility functions have to be continuous (in the sense of the topology on the space of infinite histories which is obtained from regarding it as an infinite cartesian product over time).
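To spell out the continuity point for the procrastination example (this one-line argument is mine, not from the post):

```latex
% Sketch: the procrastination utility U is discontinuous in the product topology.
% Let $h_n$ be the history in which the button is first pushed at time $n$,
% and $h_\infty$ the history in which it is never pushed.
\[
  h_n \to h_\infty \quad \text{(the two agree on the first $n$ steps),}
\]
\[
  U(h_n) = 1 \ \text{for all } n, \qquad U(h_\infty) = 0,
  \qquad \text{so} \quad \lim_{n\to\infty} U(h_n) = 1 \neq 0 = U(h_\infty).
\]
% Hence U is not continuous, and therefore not computable in the sense above.
```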

The alternative suggested by the author is using the Jeffrey-Bolker framework. Alas, the author does not write down the precise mathematical definition of the framework, which I find frustrating. The linked article in the Stanford Encyclopedia of Philosophy is long and difficult, and I wish the post had a succinct distillation of the relevant part.

The gist of Jeffrey-Bolker is, there are some propositions which we can make about the world, and each such proposition is assigned a number (its "desirability"). This corresponds to the conditional expected value of the utility function, with the proposition serving as a condition. However, there need not truly be a probability space and a utility function which realizes this correspondence, instead we can work directly with the assignment of numbers to propositions (as long as it satisfies some axioms).

In my opinion, the Jeffrey-Bolker framework seems interesting, but the case presented in the post for using it is weak. To see why, let's return to our motivating problems.

The problem of ontology is a real problem, in this I agree with the author completely. However, Jeffrey-Bolker offers, at best, only some hint of a solution. To have a complete solution, one would need to explain in what language propositions are constructed and how the agent updates the desirability of propositions according to observations, and then prove some properties about the resulting framework which give it prescriptive power. I think that the author believes this can be achieved using Logical Induction, but the burden of proof is not met.

Hence, Jeffrey-Bolker is not sufficient to solve the problem. Moreover, I believe it is also not necessary! Indeed, infra-Bayesian physicalism offers a solution to the ontology problem which doesn't require abandoning the concept of a utility function (although one has to replace the ordinary probabilistic expectations with infra-Bayesian expectations). That solution certainly has caveats (primarily, the monotonicity principle), but at the least it shows that utility functions are not entirely incompatible with solving the ontology problem.

On the other hand, with the problem of computability, I am not convinced by the author's motivation. Do we truly need uncomputable utility functions? I am skeptical towards inquiries which are grounded in generalization for the sake of generalization. I think it is often more useful to thoroughly understand the simplest non-trivial special case before we can confidently assert which generalizations are possible or desirable. And in rational agent theory, the special case of computable utility functions is not yet thoroughly understood.

Moreover, I am not convinced that Jeffrey-Bolker allows us to handle uncomputable utility functions as easily as the author suggests. The author's argument goes: the utility function might be uncomputable, but as long as its conditional expectations w.r.t. "valid" propositions are computable, there is no problem for rational behavior to be computable. But, how often does it happen that the utility function is uncomputable while all the relevant conditional expectations are computable?

The author suggests the following example: take the procrastination utility function and take some computable distribution over the first time when the button is pushed, plus some probability for the button to never be pushed. Then, we can compute the probability that the button is pushed, conditional on it not having been pushed for the first $n$ rounds. Alright, but now let's consider a different distribution. Suppose a random Turing machine $M$ is chosen[1] at the beginning of time, and on round $n$ the button is pushed iff $M$ halts after $n$ steps. Notice that this distribution on sequences is perfectly computable[2]. But now, computing the probability that the button is ever pushed is impossible, since it's the (in)famous Chaitin constant.
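To make the example concrete, here is a hedged toy sketch (the machine model below is a stand-in Brainfuck interpreter of my own, not the prefix-free universal machine of the footnote; it only illustrates why the process is easy to sample step by step even though the marginal push probability is a halting probability):

```python
# Toy sketch: the per-round question "is the button pushed by round n?" only
# requires running the sampled program for n steps, which is computable.
# But Pr[the button is EVER pushed] over random programs is a halting
# probability, which no algorithm can compute.

import random

def run_steps(prog: str, max_steps: int) -> bool:
    """Run a toy Brainfuck program (ignoring I/O) for up to max_steps;
    return True iff it halts (runs off the end) within that budget."""
    tape, ptr, pc, steps = [0] * 1000, 0, 0, 0
    while pc < len(prog) and steps < max_steps:
        c = prog[pc]
        if c == '+':
            tape[ptr] = (tape[ptr] + 1) % 256
        elif c == '-':
            tape[ptr] = (tape[ptr] - 1) % 256
        elif c == '>':
            ptr = (ptr + 1) % len(tape)
        elif c == '<':
            ptr = (ptr - 1) % len(tape)
        elif c == '[' and tape[ptr] == 0:   # jump forward past matching ']'
            depth = 1
            while depth and pc < len(prog) - 1:
                pc += 1
                depth += (prog[pc] == '[') - (prog[pc] == ']')
        elif c == ']' and tape[ptr] != 0:   # jump back to matching '['
            depth = 1
            while depth and pc > 0:
                pc -= 1
                depth += (prog[pc] == ']') - (prog[pc] == '[')
        pc += 1
        steps += 1
    return pc >= len(prog)

random.seed(0)
program = ''.join(random.choice('+-><[]') for _ in range(20))  # "random TM"
for n in range(1, 6):
    print(n, run_steps(program, n))  # computable for every fixed n
```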

Here too, the author seems to believe that Logical Induction should solve the procrastination paradox and issues with uncomputable utility functions more generally, as a special case of Jeffrey-Bolker. But, so far I remain unconvinced.


  1. That is, we compose a random program for a prefix-free UTM by repeatedly flipping a fair coin, as usual in algorithmic information theory. ↩︎

  2. It's even polynomial-time sampleable. ↩︎

Comment by Vanessa Kosoy (vanessa-kosoy) on 10 Reasons You’re Lazy About Dating · 2022-01-06T23:01:23.213Z · LW · GW

I can't parse this. Babies? Exploitation land? What?

Comment by Vanessa Kosoy (vanessa-kosoy) on Sex Versus · 2022-01-06T19:06:03.753Z · LW · GW

I don't parse "they deserve their safe spaces" as mockery, but as more or less literal/sincere. Jacob has been consistently sympathetic to romanceless men in his writing, only frustrated with the "colored pill" ideologies. Moreover, the comment he is replying to does read like mockery: "the best he could secure is a poly marriage", with scare quotes around "poly marriage", as if that's inferior to other kinds of marriage.

Comment by Vanessa Kosoy (vanessa-kosoy) on Sex Versus · 2022-01-06T18:42:39.304Z · LW · GW

These are not quotes from the OP but from other writing by the author. This is irrelevant to the appropriateness of the OP on LW. The first quote is not even mocking.

Comment by Vanessa Kosoy (vanessa-kosoy) on Infra-Bayesian physicalism: a formal theory of naturalized induction · 2022-01-06T17:12:26.409Z · LW · GW

Space and time are not really the right parameters here, since these refer to physical states, not computational "states" or physically manifest facts about computations. In the example above, it doesn't matter where the (copy of the) agent is when it sees the red room, only the fact that the agent does see it. We could construct such a loss function by a sum over programs, but the constructions suggested in section 3 use minimum instead of sum, since this seems like a less "extreme" choice in some sense. Ofc ultimately the loss function is subjective: as long as the monotonicity principle is obeyed, the agent is free to have any loss function.

Comment by Vanessa Kosoy (vanessa-kosoy) on Sex Versus · 2022-01-06T16:27:43.186Z · LW · GW

Where is the OP mocking anyone?

Comment by Vanessa Kosoy (vanessa-kosoy) on Infra-Bayesian physicalism: a formal theory of naturalized induction · 2022-01-06T15:24:00.356Z · LW · GW

No, it's not a baseline, it's just an inequality. Let's do a simple example. Suppose the agent is selfish and cares only about (i) the experience of being in a red room and (ii) the experience of being in a green room. And, let's suppose these are the only two possible experiences, it can't experience going from a room in one color to a room in another color or anything like that (for example, because the agent has no memory). Denote $G$ the program corresponding to "the agent deciding on an action after it sees a green room" and $R$ the program corresponding to "the agent deciding on an action after it sees a red room". Then, roughly speaking[1], there are 4 possibilities:

  • $\neg G \wedge \neg R$: The universe runs neither $G$ nor $R$.
  • $G \wedge \neg R$: The universe runs $G$ but not $R$.
  • $\neg G \wedge R$: The universe runs $R$ but not $G$.
  • $G \wedge R$: The universe runs both $G$ and $R$.

In this case, the monotonicity principle imposes the following inequalities on the loss function $L$:

$$L(G \wedge R) \le L(G \wedge \neg R) \le L(\neg G \wedge \neg R), \qquad L(G \wedge R) \le L(\neg G \wedge R) \le L(\neg G \wedge \neg R)$$

That is, $\neg G \wedge \neg R$ must be the worst case and $G \wedge R$ must be the best case.
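A toy numerical check of these inequalities (the loss values are arbitrary, chosen only to satisfy the constraints):

```python
# Toy illustration: a loss function over which of the two programs G ("the
# agent sees a green room") and R ("the agent sees a red room") the universe
# runs. Monotonicity: realizing more programs can never increase the loss,
# so running both is the best case and running neither is the worst case.

loss = {
    frozenset(): 1.0,            # neither program runs: worst case
    frozenset({"G"}): 0.7,
    frozenset({"R"}): 0.4,
    frozenset({"G", "R"}): 0.1,  # both programs run: best case
}

def is_monotone(loss: dict) -> bool:
    # Whenever a is a subset of b (b realizes more programs), loss[b] <= loss[a].
    return all(loss[b] <= loss[a] for a in loss for b in loss if a <= b)

print(is_monotone(loss))  # True for this choice of values
```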


  1. In fact, manifesting of computational facts doesn't amount to selecting a set of realized programs, because programs can be entangled with each other, but let's ignore this for simplicity's sake. ↩︎

Comment by Vanessa Kosoy (vanessa-kosoy) on Vanessa Kosoy's Shortform · 2022-01-06T12:49:23.926Z · LW · GW

Yes, there is some similarity! You could say that a Hippocratic AI needs to be continuously non-obstructive w.r.t. the set of utility functions and priors the user could plausibly have, given what the AI knows. Where, by "continuously" I mean that we are allowed to compare keeping the AI on or turning off at any given moment.

Comment by Vanessa Kosoy (vanessa-kosoy) on Infra-Bayesian physicalism: a formal theory of naturalized induction · 2022-01-06T12:32:10.505Z · LW · GW

Should this say "elements are function... They can be thought of as...?"

Yes, the phrasing was confusing, I fixed it, thanks.

Can you make a similar theory/special case with probability theory, or do you really need infra-bayesianism?

We really need infrabayesianism. On bayesian hypotheses, the bridge transform degenerates: it says that, more or less, all programs are always running. And, the counterfactuals degenerate too, because selecting most policies would produce "Nirvana".

The idea is, you must have Knightian uncertainty about the result of a program in order to meaningfully speak about whether the universe is running it. (Roughly speaking, if you ask "is the universe running 2+2?" the answer is always yes.) And, you must have Knightian uncertainty about your own future behavior in order for counterfactuals to be meaningful.

It is not surprising that you need infrabayesianism in order to do naturalized induction: if you're thinking of the agent as part of the universe then you are by definition in the nonrealizable setting, since the agent cannot possibly have a full description of something "larger" than itself.

Comment by Vanessa Kosoy (vanessa-kosoy) on Infra-Bayesian physicalism: a formal theory of naturalized induction · 2022-01-06T12:12:05.488Z · LW · GW

Could you explain what the monotonicity principle is, without referring to any symbols or operators?

The loss function of a physicalist agent depends on which computational facts are physically manifest (roughly speaking, which computations the universe runs), and on the computational reality itself (the outputs of computations). The monotonicity principle requires it to be non-decreasing w.r.t. the manifesting of fewer facts. Roughly speaking, the more computations the universe runs, the better.

This is odd, because it implies that the total destruction of the universe is always the worst possible outcome. And, the creation of an additional, causally disconnected, world can never be net-negative. For a monotonic agent, there can be no net-negative world[1]. In particular, for selfish monotonic agents (i.e. agents that only assign value to their own observations), this means death is the worst possible outcome and the creation of additional copies of the agent can never be net-negative.

With all the new notation, I forgot what everything meant after the first time they were defined.

Well, there are the "notation" and "notation reference" subsections, that might help.

That being said, I appreciate all the work you put into this. I can tell there's important stuff to glean here.

Thank you!


  1. At least, all of this is true if we ignore the dependence of the loss function on the other argument, namely the outputs of computations. But it seems like that doesn't qualitatively change the picture. ↩︎

Comment by Vanessa Kosoy (vanessa-kosoy) on Infra-Bayesian physicalism: a formal theory of naturalized induction · 2022-01-05T18:09:48.054Z · LW · GW

Well, Alex is working on an infra-Bayesianism textbook, maybe that will help.

Comment by Vanessa Kosoy (vanessa-kosoy) on Morality is Scary · 2021-12-25T11:13:50.370Z · LW · GW

I have low confidence about this, but my best guess personal utopia would be something like: A lot of cool and interesting things are happening. Some of them are good, some of them are bad (a world in which nothing bad ever happens would be boring). However, there is a limit on how bad something is allowed to be (for example, true death, permanent crippling of someone's mind and eternal torture are over the line), and overall "happy endings" are more common than "unhappy endings". Moreover, since it's my utopia (according to my understanding of the question, we are ignoring the bargaining process and acausal cooperation here), I am among the top along those desirable dimensions which are zero-sum (e.g. play an especially important / "protagonist" role in the events to the extent that it's impossible for everyone to play such an important role, and have high status to the extent that it's impossible for everyone to have such high status).

Comment by Vanessa Kosoy (vanessa-kosoy) on Vanessa Kosoy's Shortform · 2021-12-24T20:21:13.177Z · LW · GW

The above threat model seems too paranoid: it is defending against an adversary that sees the trained model and knows the training algorithm. In our application, the model itself is either dangerous or not, independently of the training algorithm that produced it.

Let be our accuracy requirement for the target domain. That is, we want s.t.

Given any , denote to be conditioned on the inequality above, where is regarded as a random variable. Define by

That is, is the Bayes-optimal learning algorithm for domain E w.r.t. prior .

Now, consider some . We regard as a learning algorithm for domain D which undergoes "antitraining" for domain E: we provide it with a dataset for domain E that tells it what not to learn. We require that achieves asymptotic accuracy [1], i.e. that if is sampled from then with probability

Under this constraint, we want to be as ignorant as possible about domain E, which we formalize as maximizing defined by

It is actually important to consider because in order to exploit the knowledge of the model about domain E, an adversary needs to find the right embedding of this domain into the model's "internal language". For we can get high despite the model actually knowing domain E because the adversary doesn't know the embedding, but for it should be able to learn the embedding much faster than learning domain E from scratch.

We can imagine a toy example where , the projections of and to and respectively are distributions concentrated around two affine subspaces, and the labels are determined by the sign of a polynomial which is the same for and up to a linear transformation which is a random variable w.r.t. . A good would then infer , look for an affine subspace s.t. is near while is far from and fit a polynomial to the projections of on .

More realistically, if the prior is of Solomonoff type, then is probably related to the relative Kolmogorov complexity of w.r.t. .


  1. It might be bad that we're having condition on having accuracy while in reality achieves this accuracy only asymptotically. Perhaps it would be better to define in some way that takes 's convergence rate into consideration. On the other hand, maybe it doesn't matter much as long as we focus on asymptotic metrics. ↩︎

Comment by Vanessa Kosoy (vanessa-kosoy) on 2021 AI Alignment Literature Review and Charity Comparison · 2021-12-23T17:18:52.154Z · LW · GW

Notice that in MIRI's summary of 2020 they wrote "From our perspective, our most interesting public work this year is Scott Garrabrant’s Cartesian frames model and Vanessa Kosoy’s work on infra-Bayesianism."

Comment by Vanessa Kosoy (vanessa-kosoy) on 2021 AI Alignment Literature Review and Charity Comparison · 2021-12-23T14:44:29.048Z · LW · GW

I noticed that you didn't mention infra-Bayesianism, not in 2020 and not this year. Any particular reason?

Comment by Vanessa Kosoy (vanessa-kosoy) on Book Launch: The Engines of Cognition · 2021-12-23T09:52:57.058Z · LW · GW

+1, also I wish there was an ebook version of "A Map that Reflects the Territory"

Comment by Vanessa Kosoy (vanessa-kosoy) on Alignment By Default · 2021-12-20T10:52:04.271Z · LW · GW

...the inner agent will only have whatever limited influence it has from the prior, and every time it deviates from its actual best predictions (or is just out-predicted by some other model), some of that influence will be irreversibly spent

Of course, but this in itself is no consolation, because it can spend its finite influence to make the AI perform an irreversible catastrophic action: for example, self-modifying into something explicitly malign.

In e.g. IDA-type protocols you can defend by using a good prior (such as IB physicalism) plus confidence thresholds (i.e. every time the hypotheses have a major disagreement you query the user). You also have to do something about non-Cartesian attack vectors (I have some ideas), but that doesn't depend much on the protocol.

In value learning things are worse, because of the possibility of corruption (i.e. the AI hacking the user or its own input channels). As a consequence, it is no longer clear you can infer the correct values even if you make correct predictions about everything observable. Protocols based on extrapolating from observables to unobservables fail, because malign hypotheses can attack the extrapolation with impunity (e.g. a malign hypothesis can assign some kind of "Truman show" interpretation to the behavior of the user, where the user's true values are completely alien and they are just pretending to be human because of the circumstances of the simulation).

Comment by Vanessa Kosoy (vanessa-kosoy) on Alignment By Default · 2021-12-19T22:34:41.763Z · LW · GW

I am pretty confident that the inputs to human values are natural abstractions, i.e. we care about things like trees, cars, humans, etc, not about quantum fields or random subsets of atoms. I am much less confident that "human values" themselves are natural abstractions

That's fair, but it's still perfectly in line with the learning-theoretic perspective: human values are simpler to express through the features acquired by unsupervised learning than through the raw data, which translates to a reduction in sample complexity.

...learning a human utility function is usually a built-in assumption of IRL formulations, so such formulations can't do any better than a utility function approximation even in the best case. Alignment by default does not need to assume humans have a utility function; it just needs whatever-humans-do-have to have low marginal complexity in a system which has learned lots of natural abstractions.

This seems wrong to me. If you do IRL with the correct type signature for human values then in the best case you get the true human values. IRL is not mutually exclusive with your approach: e.g. you can do unsupervised learning and IRL with shared weights. I guess you might be defining "IRL" as something very narrow, whereas I define it as "any method based on revealed preferences".

...to the extent that they are, they'll look more like Dr Nefarious than pure inner daemons

Malign simulation hypotheses already look like "Dr. Nefarious" where the role of Dr. Nefarious is played by the masters of the simulation, so I'm not sure what exactly is the distinction you're drawing here.

Comment by Vanessa Kosoy (vanessa-kosoy) on Alignment By Default · 2021-12-19T14:35:58.221Z · LW · GW

In this post, the author describes a pathway by which AI alignment can succeed even without special research effort. The specific claim that this can happen "by default" is not very important, IMO (the author himself only assigns 10% probability to this). On the other hand, viewed as a technique that can be deliberately used to help with alignment, this pathway is very interesting.

The author's argument can be summarized as follows:

  • For anyone trying to predict events happening on Earth, the concept of "human values" is a "natural abstraction", i.e. something that has to be a part of any model that's not too computationally expensive (so that it doesn't bypass the abstraction by anything like accurate simulation of human brains).
  • Therefore, unsupervised learning will produce models in which human values are embedded in some simple way (e.g. a small set of neurons in an ANN).
  • Therefore, if supervised learning is given the unsupervised model as a starting point, it is fairly likely to converge to true human values even from a noisy and biased proxy.

[EDIT: John pointed out that I misunderstood his argument: he didn't intend to say that human values are a natural abstraction, but only that their inputs are natural abstractions. The following discussion still applies.]

The way I see it, this argument has learning-theoretic justification even without appealing to anything we know about ANNs (and therefore without assuming the AI in question is an ANN). Consider the following model: an AI receives a sequence of observations that it has to make predictions about. It also receives labels, but these are sparse: it is only given a label once in a while. If the description complexity of the true label function is high, the sample complexity of learning to predict labels via a straightforward approach (i.e. without assuming a relationship between the dynamics and the label function) is also high. However, if the relative description complexity of the label function w.r.t. the dynamics producing the observations is low, then we can use the abundance of observations to achieve lower effective sample complexity. I'm confident that this can be made rigorous.

Therefore, we can recast the thesis of this post as follows: Unsupervised learning of processes happening on Earth, for which we have plenty of data, can reduce the size of the dataset required to learn human values, or allow better generalization from a dataset of the same size.
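As a toy illustration of this mechanism (not from the post; PCA stands in for the unsupervised phase, and all names and numbers are made up):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, n_unlabeled, n_labeled, n_test = 50, 5000, 10, 1000

v = rng.normal(size=d); v /= np.linalg.norm(v)   # latent direction the label depends on

def sample(n):
    z = rng.normal(size=n)
    X = 3 * z[:, None] * v + rng.normal(size=(n, d))
    return X, (z > 0).astype(int)

X_unlabeled, _ = sample(n_unlabeled)
X_test, y_test = sample(n_test)
X_train, y_train = sample(n_labeled)
while len(np.unique(y_train)) < 2:               # guard: the tiny labeled set must contain both classes
    X_train, y_train = sample(n_labeled)

# "Straightforward" supervised learning from the few labels alone.
raw = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Unsupervised phase first (PCA as a stand-in), then the same few labels on top.
pca = PCA(n_components=1).fit(X_unlabeled)
feat = LogisticRegression(max_iter=1000).fit(pca.transform(X_train), y_train)

print("raw accuracy:     ", raw.score(X_test, y_test))
print("features accuracy:", feat.score(pca.transform(X_test), y_test))
```

The point is only that the unsupervised phase absorbs most of the description complexity, so the few labels are spent on the small remaining gap.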

One problem the author doesn't talk about here is daemons / inner misalignment[1]. In the comment section, the author writes:

inner alignment failure only applies to a specific range of architectures within a specific range of task parameters - for instance, we have to be optimizing for something, and there has to be lots of relevant variables observed only at runtime, and there has to be something like a "training" phase in which we lock-in parameter choices before runtime, and for the more disastrous versions we usually need divergence of the runtime distribution from the training distribution. It's a failure mode which assumes that a whole lot of things look like today's ML pipelines.

This might or might not be a fair description of inner misalignment in the sense of Hubinger et al. However, this is definitely not a fair description of the daemonic attack vectors in general. The potential for malign hypotheses (learning of hypotheses / models containing malign subagents) exists in any learning system, and in particular malign simulation hypotheses are a serious concern.

Relatedly, the author is too optimistic (IMO) in his comparison of this technique to alternatives:

...when alignment-by-default works, it’s a best-case scenario. The AI has a basically-correct model of human values, and is pursuing those values. Contrast this to things like IRL variants, which at best learn a utility function which approximates human values (which are probably not themselves a utility function). Or the HCH family of methods, which at best mimic a human with a massive hierarchical bureaucracy at their command, and certainly won’t be any more aligned than that human+bureaucracy would be.

This sounds to me like a biased perspective resulting from looking for flaws in other approaches harder than flaws in this approach. Natural abstractions potentially lower the sample complexity of learning human values, but they cannot lower it to zero. We still need some data to learn from and some model relating this data to human values, and this model can suffer from the usual problems. In particular, the unsupervised learning phase does little to inoculate us from malign simulation hypotheses that can systematically produce catastrophically erroneous generalization.

If IRL variants learn a utility function while human values are not a utility function, then avoiding this problem requires identifying the correct type signature of human values[2], in this approach as well. Regarding HCH, a human + "bureaucracy" might or might not be aligned, depending on how we organize the "bureaucracy" (see also). If HCH can fail in some subtle way (e.g. systems of humans are misaligned to individual humans), then similar failure modes might affect this approach as well (e.g. what if "Molochian" values are also a natural abstraction?).

In summary, I found this post quite insightful and important, if somewhat too optimistic.


  1. I am slightly wary of using the term "inner alignment" since Hubinger uses it in a very specific way I'm not sure I entirely understand. Therefore, I am more comfortable with "daemons", although the two have a lot of overlap. ↩︎

  2. E.g. IB physicalism proposes a type signature for "physicalist values" which might or might not be applicable to humans. ↩︎

Comment by Vanessa Kosoy (vanessa-kosoy) on Introduction To The Infra-Bayesianism Sequence · 2021-12-19T09:48:53.982Z · LW · GW

Notice that some non-worst-case decision rules are reducible to the worst-case decision rule.

Comment by Vanessa Kosoy (vanessa-kosoy) on The ground of optimization · 2021-12-16T14:21:27.334Z · LW · GW

In this post, the author proposes a semiformal definition of the concept of "optimization". This is potentially valuable since "optimization" is a word often used in discussions about AI risk, and much confusion can follow from sloppy use of the term or from different people understanding it differently. While the definition given here is a useful perspective, I have some reservations about the claims made about its relevance and applications.

The key paragraph, which summarizes the definition itself, is the following:

An optimizing system is a system that has a tendency to evolve towards one of a set of configurations that we will call the target configuration set, when started from any configuration within a larger set of configurations, which we call the basin of attraction, and continues to exhibit this tendency with respect to the same target configuration set despite perturbations.

In fact, "continues to exhibit this tendency with respect to the same target configuration set despite perturbations" is redundant: clearly as long as the perturbation doesn't push the system out of the basin, the tendency must continue.

This is what is known as an "attractor" in dynamical systems theory. For comparison, here is the definition of "attractor" from Wikipedia:

In the mathematical field of dynamical systems, an attractor is a set of states toward which a system tends to evolve, for a wide variety of starting conditions of the system. System values that get close enough to the attractor values remain close even if slightly disturbed.

The author acknowledges this connection, although he also makes the following remark:

We have discussed systems that evolve towards target configurations along some dimensions but not others (e.g. ball in a valley). We have not yet discovered whether dynamical systems theory explicitly studies attractors that operate along a subset of the system’s dimensions.

I find this remark confusing. An attractor that operates along a subset of the dimensions is just an attractor submanifold. This is completely standard in dynamical systems theory.
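To illustrate (my own minimal sketch, not from the post): a system that contracts toward the submanifold {y = 0} but is neutral along x, and returns to the submanifold after perturbations.

```python
import numpy as np

def step(state, dt=0.01):
    x, y = state
    return np.array([x, y - 3.0 * y * dt])   # dx/dt = 0, dy/dt = -3y

rng = np.random.default_rng(0)
state = np.array([0.7, 1.0])
for t in range(2000):
    state = step(state)
    if t % 500 == 0:                          # occasional perturbations
        state += rng.normal(scale=0.2, size=2)

# y relaxes back toward 0 after every perturbation; x simply stays wherever it is pushed.
print(state)
```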

Given that the definition itself is not especially novel, the post's main claim to value is via the applications. Unfortunately, some of the proposed applications seem to me poorly justified. Specifically, I want to talk about two major examples: the claimed relationship to embedded agency and the claimed relations to comprehensive AI services.

In both cases, the main shortcoming of the definition is that there is an essential property of AI that this definition doesn't capture at all. The author does acknowledge that "goal-directed agent system" is a distinct concept from "optimizing systems". However, he doesn't explain how they are distinct.

One way to formulate the difference is as follows: agency = optimization + learning. An agent is not just capable of steering a particular universe towards a certain outcome, it is capable of steering an entire class of universes, without knowing in advance in which universe it was placed. This underlies all of RL theory, this is implicit in the Shane-Legg definition of intelligence and my own[1], this is what Yudkowsky calls "cross domain".

The issue of learning is not just nitpicking, it is crucial to delineate the boundary around "AI risk", and delineating the boundary is crucial to constructively think of solutions. If we ignore learning and just talk about "optimization risks" then we will have to include the risk of pandemics (because bacteria are optimizing for infection), the risk of false vacuum collapse in particle accelerators (because vacuum bubbles are optimizing for expanding), the risk of runaway global warming (because it is optimizing for increasing temperature) et cetera. But, these are very different risks that require very different solutions.

There is another, less central, difference: the author requires a particular set of "target states" whereas in the context of agency it is more natural to consider utility functions, which means there is a continuous gradation of states rather than just "good states" and "bad states". This is related to the difference the author points out between his definition and Yudkowsky's:

When discerning the boundary between optimization and non-optimization, we look principally at robustness — whether the system will continue to evolve towards its target configuration set in the face of perturbations — whereas Yudkowsky looks at the improbability of the final configuration.

The improbability of the final configuration is a continuous metric, whereas just arriving or not arriving at a particular set is discrete.

Let's see how this shortcoming affects the conclusions. About embedded agency, the author writes:

One could view the Embedded Agency work as enumerating the many logical pitfalls one falls into if one takes the "optimizer" concept as the starting point for designing intelligent systems, rather than "optimizing system" as we propose here.

The correct starting point is "agent", defined in the way I gestured at above. If instead we start with "optimizing system" then we throw away the baby with the bathwater, since the crucial aspect of learning is ignored. This is an essential property of the embedded agency problem: arguably the entire difficulty is about how we can define learning without introducing unphysical dualism (indeed, I have recently addressed this problem, and "optimizing system" doesn't seem very helpful there).

About comprehensive AI services:

Our perspective is that there is a specific class of intelligent systems — which we call optimizing systems — that are worthy of special attention and study due to their potential to reshape the world. The set of optimizing systems is smaller than the set of all AI services, but larger than the set of goal-directed agentic systems.

What is an example of an optimizing AI system that is not agentic? The author doesn't give such an example and instead talks about trees, which are not AIs. I agree that the class of dangerous systems is substantially wider than the class of systems which were explicitly designed with agency in mind. However, this is precisely because agency can arise from such systems even when not explicitly designed, and moreover this is hard to avoid if the system is to be powerful enough for pivotal acts. This is not because there is some class of "optimizing AI systems" which are intermediate between "agentic" and "non-agentic".

To summarize, I agree with and encourage the use of tools from dynamical systems theory to study AI. However, one must acknowledge the correct scope of these tools and what they don't do. Moreover, more work is needed before truly novel conclusions can be obtained by these means.


  1. Modulo issues with traps which I will not go into atm. ↩︎

Comment by Vanessa Kosoy (vanessa-kosoy) on Vanessa Kosoy's Shortform · 2021-12-16T12:51:50.526Z · LW · GW

The concept of corrigibility was introduced by MIRI, and I don't think that's their motivation? On my model of MIRI's model, we won't have time to poke at a slightly subhuman AI, we need to have at least a fairly good notion of what to do with a superhuman AI upfront. Maybe what you meant is "we won't know how to construct perfect-utopia-AI, so we will just construct a prevent-unaligned-AIs-AI and run it so that we can figure out perfect-utopia-AI at our leisure". Which, sure, but I don't see what it has to do with corrigibility.

Corrigibility is neither necessary nor sufficient for safety. It's not strictly necessary because in theory an AI can resist modifications in some scenarios while always doing the right thing (although in practice resisting modifications is an enormous red flag), and it's not sufficient since an AI can be "corrigible" but cause catastrophic harm before someone notices and fixes it.

What we're supposed to gain from corrigibility is having some margin of error around alignment, in which case we can decompose alignment as corrigibility + approximate alignment. But it is underspecified if we don't say along which dimensions or how big the margin is. If it's infinite margin along all dimensions then corrigibility and alignment are just isomorphic and there's no reason to talk about the former.

Comment by Vanessa Kosoy (vanessa-kosoy) on MikkW's Shortform · 2021-12-13T11:59:14.896Z · LW · GW

This means we should report the fractional dimension of an object not just as a single number, but use a continuous function that takes in a scalar describing the scale level, and telling us what the fractional dimension at that particular scale is.

The relevant keyword is covering number.
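To illustrate the scale dependence, here is a rough box-counting sketch (a finite-scale relative of the covering number; the data and numbers are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# A point cloud that looks 1-dimensional at coarse scales (a line segment)
# but 2-dimensional at fine scales (the line is thickened by noise).
t = rng.uniform(size=20000)
points = np.stack([t, 0.02 * rng.uniform(size=t.size)], axis=1)

def boxes_occupied(points, eps):
    return len({tuple(idx) for idx in np.floor(points / eps).astype(int)})

scales = np.logspace(-2.5, -1, 7)
counts = np.array([boxes_occupied(points, eps) for eps in scales])
# local slope of log N(eps) against log(1/eps) between consecutive scales
dims = -np.diff(np.log(counts)) / np.diff(np.log(scales))
for eps, dim in zip(scales[1:], dims):
    print(f"scale ~{eps:.4f}: dimension estimate ~{dim:.2f}")
```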

Comment by Vanessa Kosoy (vanessa-kosoy) on There is essentially one best-validated theory of cognition. · 2021-12-11T13:18:09.775Z · LW · GW

Hi Terry, can you recommend an introduction for people with mathematics / theoretical computer science background? I glanced at the paper you linked but it doesn't seem to have a single equation, mathematical statement or pseudocode algorithm. There are diagrams, but I have no idea what the boxes and arrows actually represent.

Comment by Vanessa Kosoy (vanessa-kosoy) on Vanessa Kosoy's Shortform · 2021-12-09T13:30:30.992Z · LW · GW

There's a class of AI risk mitigation strategies which relies on the users to perform the pivotal act using tools created by AI (e.g. nanosystems). These strategies are especially appealing if we want to avoid human models. Here is a concrete alignment protocol for these strategies, closely related to AQD, which we call autocalibrating quantilized RL (AQRL).

First, suppose that we are able to formulate the task as episodic RL with a formally specified reward function. The reward function is necessarily only a proxy for our true goal, since it doesn't contain terms such as "oh btw don't kill people while you're building the nanosystem". However, suppose the task is s.t. accomplishing it in the intended way (without Goodharting or causing catastrophic side effects) is easier than performing any attack. We will call this the "relative difficulty assumption" (RDA). Then, there exists a value for the quantilization parameter s.t. quantilized RL performs the task in the intended way.

We might not know how to set the quantilization parameter on our own, but we can define a performance goal for the task (in terms of expected total reward) s.t. the RDA holds. This leads to algorithms which gradually tune the quantilization parameter until the performance goal is met, while maintaining a proper balance between safety and sample complexity. Here it is important to keep track of epistemic vs. aleatoric uncertainty: the performance goal is the expectation of total reward relatively to aleatoric uncertainty (i.e. the stochasticity of a given hypothesis), whereas the safety goal is a bound on the expected cost of overshooting the optimal quantilization parameter relatively to both aleatoric and epistemic uncertainty (i.e. uncertainty between different hypotheses). This secures the system against malign hypotheses that are trying to cause an overshoot.
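To illustrate just the tuning loop (a toy single-decision sketch of my own; the careful treatment of aleatoric vs. epistemic uncertainty and the safety bound on overshooting are omitted, and all names and numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def base_policy(n):                  # "safe" reference distribution over actions
    return rng.normal(size=n)

def proxy_reward(a):                 # formally specified proxy reward
    return -(a - 1.0) ** 2

def quantilized_action(q, n_samples=1000):
    """Sample uniformly from the top q-fraction of base-policy actions, ranked by proxy reward."""
    actions = base_policy(n_samples)
    k = max(1, int(q * n_samples))
    top = actions[np.argsort(proxy_reward(actions))[-k:]]
    return rng.choice(top)

performance_goal = -0.3              # required expected reward
q = 1.0                              # start at the base policy (no optimization pressure)
while q > 1e-3:
    returns = [proxy_reward(quantilized_action(q)) for _ in range(200)]
    if np.mean(returns) >= performance_goal:
        break                        # goal met: stop tightening
    q *= 0.5                         # otherwise apply a bit more optimization pressure

print(f"selected quantilization parameter q = {q}")
```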

Notice that hardening the system (i.e. making it more secure using conventional methods) increases the likelihood of the RDA. That is, although it is infeasible to manually harden the system against superintelligent attacks, it can very well be feasible to harden it against attacks available at the minimal level of optimization power (= quantilization parameter) required to meet the performance goal, at least to such an extent that the intended solution becomes more appealing for the AI.

AQRL involves the risk that the RDA is false, even though it seems true to the designers. To mitigate this risk, we can deploy honeypots. That is, we add auxiliary goals to our reward function s.t. (i) if the goal is reached, maximal total reward is awarded (ii) the goal is harmless in itself (iii) reaching the goal is still harder than the intended solution, but by a smaller margin than performing attacks (iv) we can detect the event of reaching the goal. For example, suppose that the AI is not supposed to have Internet access, so gaining Internet access is a path to a variety of possible attacks. Then, we can add an auxiliary goal of the form "send such-and-such innocuous packet to such-and-such IP address". If the RDA is false and the AI succeeds in gaining Internet access earlier (on the quantilization parameter scale) than reaching the performance goal, then once the AI has sent the packet it will randomize all its remaining actions (since it is operating at the earliest quantilization parameter which allows this). The users will then be alerted that security was breached and rethink their next steps.

Now, suppose that we can only specify the reward function via concepts that have to be learned from examples. This opens the risk of generalization failure. Such a risk is not necessarily a problem, since we are still protected by quantilization. However, it needs to be factored into our safety bounds. An especially concerning possibility is that the AI will interpret the concepts as "would be labeled by humans as such-and-such" instead of the intended interpretation, creating incentives to deceive. Moreover, this encourages learning human models, which puts more pressure on the RDA by making it easier to attack.

In order to avoid learning human models, we can use methods along these lines. Specifically, suppose that, in addition to the dataset for learning our task-relevant concepts, we supply a dataset designed specifically for learning human models. We can then design an algorithm that produces models of the task-relevant concepts s.t. the human-centric dataset cannot be compressed w.r.t. them. That is, the algorithm makes sure that access to the models it produces doesn't help with learning the human-centric dataset, implying that the concept models don't contain human models. Moreover, we can extend this approach to the entire process, s.t. human models are not learned from any other source of information as well.
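For illustration, here is a crude sketch of one way the "cannot be compressed" constraint could be operationalized, using an adversarial probe; this is a stand-in of my own rather than the exact construction in the linked methods, and the data is random toy data, so only the mechanics are shown:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Random toy data: task-relevant inputs/labels plus a separate "human-centric" dataset.
X_task, y_task = torch.randn(512, 16), torch.randint(0, 2, (512,))
X_human, y_human = torch.randn(512, 16), torch.randint(0, 2, (512,))

encoder = nn.Sequential(nn.Linear(16, 32), nn.ReLU())    # produces the concept models' features
task_head = nn.Linear(32, 2)
probe = nn.Linear(32, 2)    # adversary: tries to predict the human-centric data from the features

opt_main = torch.optim.Adam(list(encoder.parameters()) + list(task_head.parameters()), lr=1e-3)
opt_probe = torch.optim.Adam(probe.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()

for step in range(500):
    # 1) Train the probe to exploit the current features on the human-centric dataset.
    opt_probe.zero_grad()
    probe_loss = ce(probe(encoder(X_human).detach()), y_human)
    probe_loss.backward()
    opt_probe.step()

    # 2) Train the concept model on the task while penalizing any leverage the probe gains
    #    (in a real implementation one would cap the penalty at the chance-level baseline).
    opt_main.zero_grad()
    leakage = -ce(probe(encoder(X_human)), y_human)
    loss = ce(task_head(encoder(X_task)), y_task) + leakage
    loss.backward()
    opt_main.step()
```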

Notice that although this method excludes human models from the actual RL process, it encourages them in intermediate computations, since we are actively supplying a human-centric dataset to the system. But, this is not dangerous if the intermediate computation is hardened against non-Cartesian daemons (a big "if" but one we need to deal with anyway).

Comment by Vanessa Kosoy (vanessa-kosoy) on Vanessa Kosoy's Shortform · 2021-12-09T12:26:00.978Z · LW · GW

Master post for alignment protocols.

Other relevant shortforms:

Comment by Vanessa Kosoy (vanessa-kosoy) on Considerations on interaction between AI and expected value of the future · 2021-12-09T09:29:54.155Z · LW · GW

My point is that Pr[non-extinction | misalignment] << 1, Pr[non-extinction | alignment] = 1, Pr[alignment] is not that low and therefore Pr[misalignment | non-extinction] is low, by Bayes.
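Spelled out with illustrative numbers (the specific values are mine, just to show the shape of the calculation):

```python
p_align = 0.3                  # Pr[alignment]: "not that low" (illustrative)
p_surv_given_align = 1.0       # Pr[non-extinction | alignment]
p_surv_given_misalign = 0.01   # Pr[non-extinction | misalignment]: << 1 (illustrative)

p_surv = p_align * p_surv_given_align + (1 - p_align) * p_surv_given_misalign
p_misalign_given_surv = (1 - p_align) * p_surv_given_misalign / p_surv
print(p_misalign_given_surv)   # ~0.023, i.e. low
```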

Comment by Vanessa Kosoy (vanessa-kosoy) on Morality is Scary · 2021-12-08T18:59:04.834Z · LW · GW

Just to be clear, this isn't in response to something I wrote, right? (I'm definitely not advocating any kind of "utilitarian TAI project" and would be quite scared of such a project myself.)

No! Sorry if I gave that impression.

So what are you (and them) then? What would your utopia look like?

Well, I linked my toy model of partiality before. Are you asking about something more concrete?

Comment by Vanessa Kosoy (vanessa-kosoy) on Morality is Scary · 2021-12-08T18:30:03.762Z · LW · GW

My worry wasn't about the initial 10%, but about the possibility of the process being iterated such that you end up with almost all bargaining power in the hands of power-keepers.

I'm not sure what you mean here, but also the process is not iterated: the initial bargaining is deciding the outcome once and for all. At least that's the mathematical ideal we're approximating.

In the end, I think my concern is that we won't get buy-in from a large majority of users: In order to accommodate some proportion with odd moral views it seems likely you'll be throwing away huge amounts of expected value in others' views

I don't think so? The bargaining system does advantage large groups over small groups.

In practice, I think that for the most part people don't care much about what happens "far" from them (for some definition of "far", not physical distance) so giving them private utopias is close to optimal from each individual perspective. Although it's true they might pretend to care more than they do for the usual reasons, if they're thinking in "far-mode".

I would certainly be very concerned about any system that gives even more power to majority views. For example, what if the majority of people are disgusted by gay sex and prefer it not the happen anywhere? I would rather accept things I disapprove of happening far away from me than allow other people to control my own life.

Ofc the system also mandates win-win exchanges. For example, if Alice's and Bob's private utopias each contain something strongly unpalatable to the other but not strongly important to the respective customer, the bargaining outcome will remove both unpalatable things.
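As a toy numeric illustration (the numbers, and the use of a Nash product as the bargaining criterion, are mine): each side keeps something the other finds strongly unpalatable but cares about only weakly themselves, so removing both is the bargaining outcome.

```python
from itertools import product

def alice_utility(keep_x, keep_y):   # x is Alice's pet feature, y is Bob's
    return 10 - 1 * (not keep_x) - 4 * keep_y

def bob_utility(keep_x, keep_y):
    return 10 - 1 * (not keep_y) - 4 * keep_x

options = list(product([True, False], repeat=2))
best = max(options, key=lambda o: alice_utility(*o) * bob_utility(*o))
print(best)   # (False, False): both unpalatable things are removed
```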

E.g. if you strong-denose anyone who's too willing to allow bargaining failure [everyone dies] you might end up filtering out altruists who worry about suffering risks.

I'm fine with strong-denosing negative utilitarians who would truly stick to their guns about negative utilitarianism (but I also don't think there are many).

Comment by Vanessa Kosoy (vanessa-kosoy) on Morality is Scary · 2021-12-08T18:05:06.183Z · LW · GW

This is not a theory that's familiar to me. Why do you think this is true? Have you written more about it somewhere or can link to a more complete explanation?

I've been considering writing about this for a while, but so far I don't feel sufficiently motivated. So, the links I posted upwards in the thread are the best I have, plus vague gesturing in the directions of Hansonian signaling theories, Jaynes' theory of consciousness and Yudkowsky's belief in belief.

Comment by Vanessa Kosoy (vanessa-kosoy) on Considerations on interaction between AI and expected value of the future · 2021-12-08T17:55:22.869Z · LW · GW

I am skeptical. AFAICT the typical attempted-but-failed alignment looks like one of the following two:

  • Goodharting some proxy, such as making the reward signal turn on directly, instead of satisfying the human's request so that the human presses the reward button. This usually produces a universe without people, since specifying a "person" is fairly complicated and the proxy will not be robustly tied to this concept.
  • Allowing a daemon to take over. Daemonic utility functions are probably completely alien and also produce a universe without people. One caveat is: maybe the daemon comes from a malign simulation hypothesis and the simulators are an evolved species, so their values involve human-relevant concepts in some way. But it doesn't seem all that likely. And, if it turns out to be true, then a daemonic universe might as well happen to be good.

Comment by Vanessa Kosoy (vanessa-kosoy) on Considerations on interaction between AI and expected value of the future · 2021-12-07T20:54:08.326Z · LW · GW

I'm surprised. Unaligned AI is more likely than aligned AI even conditional on non-extinction? Why do you think that?

Comment by Vanessa Kosoy (vanessa-kosoy) on Morality is Scary · 2021-12-07T12:09:43.848Z · LW · GW

I want to add a little to my stance on utilitarianism. A utilitarian superintelligence would probably kill me and everyone I love, because we are made of atoms that could be used for minds that are more hedonic[1][2][3]. Given a choice between paperclips and utilitarianism, I would still choose utilitarianism. But, if there was a utilitarian TAI project along with a half-decent chance to do something better (by my lights), I would actively oppose the utilitarian project. From my perspective, such a project is essentially enemy combatants.


  1. One way to avoid it is by modifying utilitarianism to only place weight on currently existing people. But this is already not that far from my cooperative bargaining proposal (although still inferior to it, IMO). ↩︎

  2. Another way to avoid it is by postulating some very strong penalty on death (i.e. discontinuity of personality). But this is not trivial to do, especially without creating other problems. Moreover, from my perspective this kind of thing is a hack trying to work around the core issue, namely that I am not a utilitarian (along with the vast majority of people). ↩︎

  3. A possible counterargument is, maybe the superhedonic future minds would be sad to contemplate our murder. But, this seems too weak to change the outcome, even assuming that this version of utilitarianism mandates minds who would want to know the truth and care about it, and that this preference is counted towards "utility". ↩︎

Comment by Vanessa Kosoy (vanessa-kosoy) on Morality is Scary · 2021-12-05T12:48:57.024Z · LW · GW

Yes, it's not a very satisfactory solution. Some alternative/complementary solutions:

  • Somehow use non-transformative AI to do my mind uploading, and then have the TAI to learn by inspecting the uploads. Would be great for single-user alignment as well.
  • Somehow use non-transformative AI to create perfect lie detectors, and use this to enforce honesty in the mechanism. (But, is it possible to detect self-deception?)
  • Have the TAI learn from past data which wasn't affected by the incentives created by the TAI. (But, is there enough information there?)
  • Shape the TAI's prior about human values in order to rule out at least the most blatant lies.
  • Some clever mechanism design I haven't thought of. The problem with this is that most mechanism designs rely on money, and money doesn't seem applicable here; without money there are many impossibility theorems.

Comment by Vanessa Kosoy (vanessa-kosoy) on Morality is Scary · 2021-12-05T12:36:00.840Z · LW · GW

Ah - that's cool if IB physicalism might address this kind of thing

I admit that at this stage it's unclear because physicalism brings in the monotonicity principle that creates bigger problems than what we discuss here. But maybe some variant can work.

For instance, suppose initially 90% of people would like to have an iterated bargaining process that includes future (trans/post)humans as users, once they exist. The other 10% are only willing to accept such a situation if they maintain their bargaining power in future iterations (by whatever mechanism).

Roughly speaking, in this case the 10% preserve their 10% of the power forever. I think it's fine because I want the buy-in of this 10% and the cost seems acceptable to me. I'm also not sure there is any viable alternative which doesn't have even bigger problems.