New Paper: Infra-Bayesian Decision-Estimation Theory

post by Vanessa Kosoy (vanessa-kosoy), Diffractor · 2025-04-10T09:17:38.966Z · LW · GW · 4 comments

This is a link post for https://arxiv.org/abs/2504.06820


Diffractor is the first author of this paper.
Official title: "Regret Bounds for Robust Online Decision Making"

Abstract: We propose a framework which generalizes "decision making with structured observations" by allowing robust (i.e. multivalued) models. In this framework, each model associates each decision with a convex set of probability distributions over outcomes. Nature can choose distributions out of this set in an arbitrary (adversarial) manner, that can be nonoblivious and depend on past history. The resulting framework offers much greater generality than classical bandits and reinforcement learning, since the realizability assumption becomes much weaker and more realistic. We then derive a theory of regret bounds for this framework. Although our lower and upper bounds are not tight, they are sufficient to fully characterize power-law learnability. We demonstrate this theory in two special cases: robust linear bandits and tabular robust online reinforcement learning. In both cases, we derive regret bounds that improve state-of-the-art (except that we do not address computational efficiency).
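To make the "robust (multivalued) model" idea concrete, here is a minimal sketch (the Bernoulli setup and all names are invented for illustration, not the paper's notation): each action maps to an interval of Bernoulli success probabilities, i.e. a convex set of distributions over the outcome, and nature may adversarially pick any point in that set.

```python
# Toy robust model: each action's "model" is an interval [lo, hi] of
# Bernoulli success probabilities -- a convex set of distributions
# over the outcome {0, 1}.  Nature may pick any point in the set.

MODEL = {
    "safe":  (0.45, 0.55),  # tight set: behaviour nearly pinned down
    "risky": (0.20, 0.90),  # loose set: nature has lots of freedom
}

def worst_case_value(action):
    """Expected reward if nature picks the adversarial distribution."""
    lo, _ = MODEL[action]
    return lo  # reward = outcome, so the adversary picks the lowest mean

def robust_best_action():
    """The action a max-min (robust) agent would choose."""
    return max(MODEL, key=worst_case_value)

print(robust_best_action())       # "safe": guarantees 0.45 vs 0.20
print(worst_case_value("risky"))  # 0.2
```

Note how robustness weakens realizability: the "safe" model can be (approximately) true of a messy real environment as long as the environment's behaviour stays inside the interval, with no requirement that it follow one fixed distribution.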

In our new paper, we generalize Foster et al.'s theory of "decision-estimation coefficients" to the "robust" (infra-Bayesian [AF · GW]) setting. The former is the most general known theory of regret bounds for multi-armed bandits and reinforcement learning, and comes close to giving tight bounds for all "reasonable" hypothesis classes. In our work, we get an analogous theory, even though our bounds are not quite as tight.

Remarkably, the result also establishes a tight connection between infra-Bayesianism and Garrabrant induction. Specifically, the algorithm which demonstrates the upper bound works by computing beliefs in a Garrabrant-induction-like manner[1], and then acting on these beliefs via an appropriate trade-off between infra-Bayesian exploitation and exploration (defined using the "decision-estimation" approach).

It seems quite encouraging that the two different theories which came out of thinking about "logical uncertainty [? · GW]" (infra-Bayesianism and Garrabrant induction) can be unified in this manner[2], boosting our confidence that we are on the right path.

The paper assumes no prior familiarity with either infra-Bayesianism or Garrabrant induction.

  1. ^

    In the start of each episode.

  2. ^

    Although we think that a fuller treatment of logical uncertainty requires introducing metacognition [AF · GW].

4 comments


comment by faul_sname · 2025-04-11T00:13:19.516Z · LW(p) · GW(p)

Alright, as I've mentioned before [LW(p) · GW(p)] I'm terrible at abstract thinking, so I went through the post and came up with a concrete example. Does this seem about right?

We are running a quantitative trading firm, and we care about the closing prices of the S&P 500 stocks. We have a forecasting system which is designed in a robust, "prediction market" style. Instead of having each model output a single precise forecast, each model outputs a set of probability distributions over tomorrow’s closing prices. That is, for each action $a$ (for example, “buy” or “sell” a particular stock, or more generally “execute a trade decision”), our model $M$ returns a set

$$M(a) \subseteq \Delta(\mathcal{O}),$$

which means that $M(a)$ is a nonempty, convex set of probability distributions over outcomes.

This definition allows "nature" (the market) to choose the distribution that actually occurs from within the set provided by our model. In our robust framework, we assume that the market might select the worst-case distribution in $M(a)$. In other words, our multivalued model is a function

$$M : \mathcal{A} \to \Box\Delta(\mathcal{O}),$$

with $\Box\Delta(\mathcal{O})$ representing the collection of nonempty convex sets of probability distributions over outcomes. Instead of pinning down exactly what tomorrow’s price will be if we take a particular action, each model provides us a “menu” of plausible distributions over possible closing prices.

We do this because there is some hidden-to-us world state that we cannot measure directly, and this state might even be controlled adversarially by market forces. Instead of trying to infer or estimate the exact hidden state, we posit that there are a finite number of plausible candidates for what this hidden state might be. For each candidate hidden state, we associate one probability distribution over the closing prices. Thus, when we look at the model for an action, rather than outputting a single forecast, the model gives us a “menu” of distributions, each corresponding to one possible hidden state scenario. In this way, we deliberately refrain from committing to a single prediction about the hidden state, thereby preparing for a worst-case (or adversarial) realization.
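That "menu of distributions" picture can be sketched as follows (the hidden states, outcome buckets, and numbers here are all invented for illustration): each candidate hidden state contributes one distribution over outcomes, the credal set is the convex hull of those distributions, and because the expected payoff is linear in the distribution, the worst case over the hull is attained at one of its vertices.

```python
# Candidate hidden states, each giving one distribution over three
# outcome buckets for tomorrow's close: (down, flat, up).
HIDDEN_STATES = {
    "calm":    (0.2, 0.6, 0.2),
    "squeeze": (0.1, 0.2, 0.7),
    "selloff": (0.7, 0.2, 0.1),
}

# Payoff of a "buy" decision under each outcome bucket.
PAYOFF = (-1.0, 0.0, 1.0)

def expected_payoff(dist):
    return sum(p * u for p, u in zip(dist, PAYOFF))

def worst_case_payoff():
    """Min expected payoff over the credal set.  The set is the convex
    hull of the candidate distributions, and the objective is linear
    in the distribution, so the minimum sits at a vertex -- we only
    need to check the finitely many candidates."""
    return min(expected_payoff(d) for d in HIDDEN_STATES.values())

print(worst_case_payoff())  # -0.6, from the "selloff" vertex
```

(As Vanessa notes in the reply below the finiteness is just a convenience of this sketch; the framework allows arbitrary convex credal sets, not only polytopes.)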

Given a bettor $b$, MDP $M$ and policy $\pi$, we define $X_b$, the aggregate betting function, as

$$X_b(n) = \sum_{k=1}^{n} b(\tau_k, \pi_k),$$

where $\tau_k$ and $\pi_k$ are the trajectory in episode $k$ and the policy in episode $k$, respectively.

Next, our trading system aggregates these imprecise forecasts via a prediction-market-like mechanism. Inside our algorithm we maintain a collection of “bettors”. Each "bettor" (besides the pessimistic and uniform bettors, which do the obvious thing their names imply) corresponds to one of our underlying models (or to aspects of a model). Each bettor $b$ is associated with its own preferred prediction (derived from its hypothesis set) and a current “wealth” $w_b$ (i.e. credibility). Instead of simply choosing one model, every bettor places a bet based on (?)how well our market prediction aligns with its own view(?).

To calculate our market prediction $\mu$, we solve a convex minimization problem that balances the differing opinions of all our bettors (weighted by their current wealth $w_b$) in such a way that it (?)minimizes their expected value on update(?).

The key thing here is that we separate the predictable / non-adversarial parts of our environment from the possibly-adversarial ones, so our market prediction $\mu$ reflects our best estimate of the outcomes of our actions if the parts of the universe we don't observe are out to get us.
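A loose toy version of that aggregation step (this is my own illustration, not the paper's actual mechanism): each bettor has a belief and a wealth, and the market prediction is the price minimizing the largest wealth-weighted disagreement, so that no single bettor can expect to extract unbounded profit from the market.

```python
# Toy "prediction market" aggregation: each bettor holds a belief
# about P(event) and a wealth (credibility).  The market prediction
# minimizes the largest wealth-weighted gap to any bettor's belief.

BETTORS = {
    "optimist":    {"belief": 0.9, "wealth": 1.0},
    "pessimist":   {"belief": 0.2, "wealth": 1.0},
    "well-funded": {"belief": 0.7, "wealth": 4.0},
}

def max_exploitability(price):
    """Largest wealth-weighted gap between a bettor's belief and the
    market price -- a stand-in for that bettor's expected profit."""
    return max(b["wealth"] * abs(b["belief"] - price)
               for b in BETTORS.values())

def market_prediction(steps=10_000):
    """Grid search over [0, 1] for the minimax price.  A real
    implementation would solve the convex problem directly."""
    grid = [i / steps for i in range(steps + 1)]
    return min(grid, key=max_exploitability)

print(market_prediction())  # 0.6: pulled toward the wealthy bettor
```

Raising a bettor's wealth pulls the prediction toward its belief; in the actual mechanism the wealth is updated by betting outcomes, Garrabrant-induction style, which is what bounds every bettor's earnings over time.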

Is this a reasonable interpretation? If so, I'm pretty interested to see where you go with this. 

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2025-04-13T17:41:33.901Z · LW(p) · GW(p)

It's roughly on the right track, but here are some inaccuracies in your description that stood out to me:

  • There is no requirement that the "hidden state space" is finite. It is perfectly fine to consider a credal set which is not a polytope (i.e. not a convex hull of a finite set of distributions).
  • The point of how market prices are computed, missing from your description, is that they prevent any bettor from making unbounded earnings (essentially, by making them bet against each other). This is the same principle as Garrabrant induction. In particular, this implies that if any of our models is true then the market predictions will converge to lying inside the corresponding credal set.
  • The market predictions do not somehow assume that "the parts of the universe we don't observe are out to get us". Thanks to the pessimistic bettor, they do satisfy the "not too optimistic" condition, but that's "not too optimistic" relative to the true environment.
  • Your entire description only talks about the "estimation" part, not about the "decision" part. 
comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2025-04-10T09:25:14.781Z · LW(p) · GW(p)

Congratulations on this paper. It seems like a major result. 

Any chance of more exposition for those of us less cognitively-inclined? =)

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2025-04-10T09:33:38.150Z · LW(p) · GW(p)

Thank you <3

Any chance of more exposition for those of us less cognitively-inclined? =)

Read the paper! :)

It might seem long at first glance, but all the results are explained in the first 13 pages; the rest is just proofs. If you don't care about the examples, you can stop on page 11. Naturally, I welcome any feedback on the exposition there.