Is Infra-Bayesianism Applicable to Value Learning?

post by RogerDearnaley (roger-d-1) · 2023-05-11T08:17:55.470Z · LW · GW · 4 comments

This is a question post.

My impression, as someone just starting to learn Infra-Bayesianism, is that it's about caution: lower bounds on utility, which is exactly how anything trying to overcome the Optimizer's Curse should be reasoning [LW · GW], especially in an environment already heavily optimized by humans, where utility will have a lot more downside than upside uncertainty. The utility score is thus central both to the argmax-min process and to the relationship between sa-measures and a-measures.

However, this makes it not intuitively obvious how to apply Infra-Bayesianism to Value Learning, where the utility function mapping physical states of the environment to utility values is initially very uncertain, and is an important part of what the AI is performing (Infra-)Bayesian updates on. So, a question for people who already understand Infra-Bayesianism: is it in fact applicable to Value Learning? If so, does it apply in the following way: the human effective utility function (a priori unknown, likely quite complex, and possibly not fully computable or realizable for the agent) that maps physical states to human, and thus also value-learner-agent, utility values is treated as an (important) part of the environment, so that the min-over-environments ('Murphy') step of the argmax-min process includes making the most pessimistic still-viable assumptions about it?
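To make the question concrete, here is a toy sketch (a crude finite caricature of the argmax-min idea, not actual Infra-Bayesianism; all names and numbers are hypothetical) of what treating the unknown human utility function as part of the environment would look like:

```python
# Toy sketch: the agent's uncertainty about human values is represented as
# a set of candidate utility functions; 'Murphy' gets to pick the worst
# still-viable candidate for whichever action the agent takes.

candidate_utilities = [
    {"A": 1.0, "B": 0.5, "C": 0.0},   # hypothesis 1 about human values
    {"A": 0.2, "B": 0.6, "C": 0.1},   # hypothesis 2
    {"A": 0.4, "B": 0.3, "C": 0.9},   # hypothesis 3
]
actions = ["A", "B", "C"]

def maximin_action(actions, utilities):
    # Murphy minimizes over hypotheses for each action;
    # the agent maximizes over actions given that worst case.
    return max(actions, key=lambda a: min(u[a] for u in utilities))

print(maximin_action(actions, candidate_utilities))  # "B" (worst case 0.3)
```

Here action "A" looks best under hypothesis 1, but its worst case (0.2) is lower than "B"'s (0.3), so the worst-case reasoner plays "B".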

To ask a follow-on question: if so, would cost-effectively reducing uncertainty in the human effective utility function (i.e. doing research on the alignment problem), so as to reduce Murphy's future room to maneuver, be a convergent intermediate strategy for any value-learner agent using Infra-Bayesian reasoning? Or would such a system conclude that learning more about the human effective utility function is pointless, on the assumption that Murphy will always ensure it lives in the worst of all possible environments, so that decreasing uncertainty about utility only ever moves the upper bound on it, not the lower one?
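Continuing the toy caricature above (hypothetical numbers, not real Infra-Bayesianism), the follow-on question has a concrete shape: if Murphy can only choose among hypotheses still consistent with the evidence, then ruling a hypothesis out can raise the guaranteed lower bound, so learning is not pointless under worst-case reasoning:

```python
# Toy sketch: ruling out a candidate utility hypothesis (via evidence about
# human values) can raise the maximin value, because 'Murphy' may only pick
# among hypotheses that remain viable.

candidates = [
    {"A": 1.0, "B": 0.5},   # hypothesis 1
    {"A": 0.1, "B": 0.4},   # hypothesis 2 (suppose evidence rules this out)
]

def maximin_value(actions, utilities):
    return max(min(u[a] for u in utilities) for a in actions)

before = maximin_value(["A", "B"], candidates)       # 0.4 (best play: B)
after = maximin_value(["A", "B"], candidates[:1])    # 1.0 (best play: A)
print(before, after)
```

The guaranteed utility rises from 0.4 to 1.0 once hypothesis 2 is eliminated, which is the sense in which reducing uncertainty moves the lower bound, not just the upper one.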

[I'm trying to learn Infra-Bayesianism, but my math background is primarily from Theoretical Physics, so I'm more familiar with functional analysis via field-theory Feynman path integrals than with Pure Math concepts like Banach spaces. The main Infra-Bayesianism sequence's Pure Math approach is thus rather heavy going for me.]


answer by Charlie Steiner · 2023-05-13T05:27:32.935Z · LW(p) · GW(p)

Take this with a big grain of salt, but I'll just tell you my impression.

Theoretically, I think it's useful in that it tells us that a lot is possible even in non-realizable settings.

As a guide to practice, I think there's plenty of room to do better. Ideally I'd want a representation that leverages composition of hypotheses with each other, and that natively does its reasoning in a non-extremizing way that makes more sense to humans (even if it's mathematically equivalent to argmax/argmin on some function).

Presently I think it's a mistake to identify a good infrabayesian-physicalist utility function with a good human-intuitive-decision-theory utility function: the utility numbers that get assigned to states don't have to make sense to humans. This is an obstacle to value-learning approaches that care about having a feedback loop of human reflection on the AI's value-learning process, which I think is important.

comment by Roger Dearnaley · 2023-05-14T11:02:53.925Z · LW(p) · GW(p)

We want our value-learner AI to learn the same preference order over outcomes as humans, which requires its goal to be finding (or at least learning to act according to) a utility function as close as possible to some aggregate of ours (if humans actually had utility functions rather than a collection of cognitive biases [LW · GW]), up to an arbitrary monotonically-increasing mapping. We also want its preference order over probability distributions of outcomes to match ours, which requires it to find a utility function that matches ours up to an increasing affine transformation (a positive rescaling plus a shift). So, once it has made good progress on its value learning, its utility function ought to make a lot of sense to us.
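The distinction between the two equivalence classes can be checked numerically. Here is a toy sketch (hypothetical utilities and lotteries) showing that a monotone transform preserves the ordering of outcomes but can change the ordering of probability distributions, while an increasing affine transform preserves both:

```python
# Toy check: monotone vs. increasing-affine transforms of a utility function.

def expected(u, lottery):
    # lottery: list of (outcome, probability) pairs
    return sum(p * u[x] for x, p in lottery)

u = {"x": 0.0, "y": 1.0, "z": 10.0}
monotone = {k: v ** 3 for k, v in u.items()}    # order-preserving, nonlinear
affine = {k: 2 * v + 5 for k, v in u.items()}   # positive scale plus shift

safe = [("y", 1.0)]                 # y for certain
risky = [("x", 0.9), ("z", 0.1)]    # mostly x, small chance of z

# Under u, the two lotteries are tied: 1.0 vs 0.9*0 + 0.1*10 = 1.0.
# The affine transform preserves the tie (7.0 vs 7.0), but the cubic
# transform makes the risky lottery look far better (1.0 vs 100.0).
print(expected(u, safe), expected(u, risky))
print(expected(affine, safe), expected(affine, risky))
print(expected(monotone, safe), expected(monotone, risky))
```

So matching human preferences over mere outcomes pins the utility function down only up to a monotone map, while matching preferences over gambles pins it down up to affine rescaling.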

Replies from: Charlie Steiner
comment by Charlie Steiner · 2023-05-14T13:43:40.358Z · LW(p) · GW(p)

if humans actually had utility functions

Yeah, humans lack a unique utility function. I know what you mean informally, just don't get bogged down mathematizing something we don't have.

So, once it has made good progress on its value learning, its utility function ought to make a lot of sense to us.

Do you think this is a desideratum, or a guarantee?

I'll say the key point plainly: suppose some policy is "the good policy." Which utility function causes an agent to follow the good policy will be different depending on how the agent makes decisions. For a given "good policy," the utility functions that produce that policy can look weird to humans if worst-case reasoning steps are sprinkled into the agent's decision-making.
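This point can be illustrated with a toy example (all numbers hypothetical): the same "good policy" requires different utility functions depending on whether the agent reasons by expected value or by worst case, and the worst-case version can look distorted to humans:

```python
# Toy illustration: which utility function yields a given policy depends on
# the decision rule. Suppose the "good policy" is to choose action A.

utils = {"A": {"e1": 10.0, "e2": 0.0},   # human-intuitive utilities
         "B": {"e1": 1.0, "e2": 1.0}}

def best_expected(utils):
    # uniform prior over environments e1, e2
    return max(utils, key=lambda a: sum(utils[a].values()) / 2)

def best_maximin(utils):
    return max(utils, key=lambda a: min(utils[a].values()))

print(best_expected(utils))   # "A": intuitive utilities give the good policy
print(best_maximin(utils))    # "B": worst-case reasoning picks differently

# To make a worst-case reasoner choose A, the utilities must be distorted,
# e.g. by claiming A's bad outcome isn't actually that bad:
weird = {"A": {"e1": 10.0, "e2": 2.0}, "B": {"e1": 1.0, "e2": 1.0}}
print(best_maximin(weird))    # "A", but via utilities that look wrong to us
```

The `weird` utilities recover the good policy under maximin, but only by asserting something humans would disagree with about the bad outcome.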

Replies from: Roger Dearnaley
comment by Roger Dearnaley · 2023-05-15T00:10:15.741Z · LW(p) · GW(p)

I take your point that the way an Infra-Bayesian system makes decisions isn't the same as a human's: it presumably doesn't share our cognitive biases, and the pessimism element ('Murphy') in it seems stronger than in most humans. I normally assume that if there's something I don't understand about the environment that's injecting noise into the outcomes of my actions, the noise-related parts of the results aren't going to be well-optimized, so they will be worse than I could have achieved with full understanding; but even leaving things to chance, I may sometimes get some good luck along with the bad. I don't generally assume that everything I can't control will have literally the worst possible outcome. So I guess in Infra-Bayesian terms I'm assuming that Murphy is somewhat constrained by laws that I'm not yet aware of, and may never be aware of.

My take on Murphy is that it's a systematization of the force of entropy trying to revert the environment to a thermodynamic-equilibrium state, together with the common fact that the utility of that equilibrium state is usually pretty low. One of the flaws I see in Infra-Bayesianism is that there are sometimes (hard-to-reach but physically possible) states whose utility to me is even lower than the thermodynamic equilibrium, where increasing entropy would actually improve things: for example, a policy that scores less than 20% on a 5-option multiple-choice quiz, and so does worse than random guessing, or a minefield left over after a war, which is actually worse than a blasted wasteland. In a hellworld, randomly throwing monkey wrenches into the gears is a moderately effective strategy. In those unusual cases, Infra-Bayesianism's Murphy no longer aligns with the actual effects of entropy/Knightian uncertainty.
