Posts
Comments
This paper discusses two semantics for Bayesian inference in the case where the hypotheses under consideration are known to be false.
- Verisimilitude: p(h) = the probability that that h is closest to the truth [according to some measure of closeness-to-truth] among hypotheses under consideration
- Counterfactual: p(h) = the probability of h given the (false) supposition that one of the hypotheses under consideration is true
In any case, it’s unclear what motivates making decisions by maximizing expected value against such probabilities, which seems like a problem for boundedly rational decision-making.
mildly disapprove of words like "a widely-used strategy"
The text says “A widely-used strategy for arguing for norms of rationality involves avoiding dominated strategies”, which is true* and something we thought would be familiar to everyone who is interested in these topics. For example, see the discussion of Dutch book arguments in the SEP entry on Bayesianism and all of the LessWrong discussion on money pump/dominance/sure loss arguments (e.g., see all of the references in and comments on this post). But fair enough, it would have been better to include citations.
"we often encounter claims"
We did include (potential) examples in this case. Also, similarly to the above, I would think that encountering claims like “we ought to use some heuristic because it has worked well in the past” is commonplace among readers so didn’t see the need to provide extensive evidence.
*Granted, we are using “dominated strategy” in the wide sense of “strategy that you are certain is worse than something else”, which glosses over technical points like the distinction between dominated strategy and sure loss.
What principles? It doesn’t seem like there’s anything more at work here than “Humans sometimes become more confident that other humans will follow through on their commitments if they, e.g., repeatedly say they’ll follow through”. I don’t see what that has to do with FDT, more than any other decision theory.
If the idea is that Mao’s forming the intention is supposed to have logically-caused his adversaries to update on his intention, that just seems wrong (see this section of the mentioned post).
(Separately I’m not sure what this has to do with not giving into threats in particular, as opposed to preemptive commitment in general. Why were Mao’s adversaries not able to coerce him by committing to nuclear threats, using the same principles? See this section of the mentioned post.)
I don't think FDT has anything to do with purely causal interactions. Insofar as threats were actually deterred here this can be understood in standard causal game theory terms. (I.e., you claim in a convincing manner that you won't give in -> People assign high probability to you being serious -> Standard EV calculation says not to commit to threat against you.) Also see this post.
Awesome sequence!
I wish that discussions of anthropics were clearer about metaphysical commitments around personal identity and possibility. I appreciated your discussions of this, e.g., in Section XV. I agree with what you, though, that it is quite unclear what justifies the picture “I am sampled from the set of all possible people-in-my-epistemic situation (weighted by probability of existence)”. I take it the view of personal identity at work here is something like “‘I’ am just a sequence of experiences S”, and so I know I am one of the sequences of experiences consistent with my current epistemic situation E. But the straightforward Bayesian way of thinking about this would seem to be: “I am sampled from all of the sequences of experiences S consistent with E, in the actual world”.
(Compare with: I draw a ball from an urn, which either contains (A) 10 balls or (B) 100 balls, 50% chance each. I don’t say “I am indifferent between the 110 possible balls I could’ve drawn, and therefore it’s 10:1 that this ball came from (B).” I say that with 50%, ball came from (A) and with 50% the ball came from (B). Of course, there may be some principled difference between this and how you want to think about anthropics, but I don’t see what it is yet.)
This is just minimum reference class SSA, which you reject because of its verdict in God’s Coin Toss with Equal Numbers. I agree that this result is counterintuitive. But I think it becomes much more acceptable if (1) we get clear about the notion of personal identity at work and (2) we try to stick with standard Bayesianism. mrcSSA also avoids many of the apparent problems you list for SSA. Overall I think mrcSSA's answer to God's Coin Toss with Equal Numbers is a good candidate for a "good bullet" =).
(Cf. Builes (2020), part 2, who argues that if you have a deflationary view of personal identity, you should use (something that looks extensionally equivalent to) mrcSSA.)
But it's true that if you had been aware from the beginning that you were going to be threatened, you would have wanted to give in.
To clarify, I didn’t mean that if you were sure your counterpart would Dare from the beginning, you would’ve wanted to Swerve. I meant that if you were aware of the possibility of Crazy types from the beginning, you would’ve wanted to Swerve. (In this example.)
I can’t tell if you think that (1) being willing to Swerve in the case that you’re fully aware from the outset (because you might have a sufficiently high prior on Crazy agents) is a problem. Or if you think (2) this somehow only becomes a problem in the open-minded setting (even though the EA-OMU agent is acting according to the exact same prior as they would've if they started out fully aware, once their awareness grows).
(The comment about regular ol exploitability suggests (1)? But does that mean you think agents shouldn't ever Swerve, even given arbitrarily high prior mass on Crazy types?)
What if anything does this buy us?
In the example in this post, the ex ante utility-maximizing action for a fully aware agent is to Swerve. The agent starts out not fully aware, and so doesn’t Swerve unless they are open-minded. So it buys us being able to take actions that are ex ante optimal for our fully aware selves when we otherwise wouldn’t have due to unawareness. And being ex ante optimal from the fully aware perspective seems preferable to me than being, e.g., ex ante optimal from the less-aware perspective.
More generally, we are worried that agents will make commitments based on “dumb” priors (because they think it’s dangerous to think more and make their prior less dumb). And EA-OMU says: No, you can think more (in the sense of becoming aware of more possibilities), because the right notion of ex ante optimality is ex ante optimality with respect to your fully-aware prior. That's what it buys us.
And revising priors based on awareness growth differs from updating on empirical evidence because it only gives other agents incentives to make you aware of things you would’ve wanted to be aware of ex ante.
they need to gradually build up more hypotheses and more coherent priors over time
I’m not sure I understand—isn't this exactly what open-mindedness is trying to (partially) address? I.e., how to be updateless when you need to build up hypotheses (and, as mentioned briefly, better principles for specifying priors).
If I understand correctly, you’re making the point that we discuss in the section on exploitability. It’s not clear to me yet why this kind of exploitability is objectionable. After all, had the agent in your example been aware of the possibility of crazy agents from the start, they would have wanted to swerve, and non-crazy agents would want to take advantage of this. So I don’t see how the situation is any worse than if the agents were making decisions under complete awareness.
Can you clarify what “the problem” is and why it “recurs”?
My guess is that you are saying: Although OM updatelessness may work for propositions about empirical facts, it’s not clear that it works for logical propositions. For example, suppose I find myself in a logical Counterfactual Mugging regarding the truth value of a proposition P. Suppose I simultaneously become aware of P and learn a proof of P. OM updatelessness would want to say: “Instead of accounting for the fact that you learned that P is true in your decision, figure out what credence you would have assigned to P had you been aware of it at the outset, and do what you would have committed to do under that prior”. But, we don’t know how to assign logical priors.
Is that the idea? If so, I agree that this is a problem. But it seems like a problem for decision theories that rely on logical priors in general, not OM updatelessness in particular. Maybe you are skeptical that any such theory could work, though.
The model is fully specified (again, sorry if this isn’t clear from the post). And in the model we can make perfectly precise the idea of an agent re-assessing their commitments from the perspective of a more-aware prior. Such an agent would disagree that they have lost value by revising their policy. Again, I’m not sure exactly where you are disagreeing with this. (You say something about giving too much weight to a crazy opponent — I’m not sure what “too much” means here.)
Re: conservation of expected evidence, the EA-OMU agent doesn’t expect to increase their chances of facing a crazy opponent. Indeed, they aren’t even aware of the possibility of crazy opponents at the beginning of the game, so I’m not sure what that would mean. (They may be aware that their awareness might grow in the future, but this doesn’t mean they expect their assessments of the expected value of different policies to change.) Maybe you misunderstand what we mean by "unawareness"?
For this to be wrong, the opponent must be (with some probability) irrational - that's a HUGE change in the setup
For one thing, we’re calling such agents “Crazy” in our example, but they need not be irrational. They might have weird preferences such that Dare is a dominant strategy. And as we say in a footnote, we might more realistically imagine more complex bargaining games, with agents who have (rationally) made commitments on the basis of as-yet unconceived of fairness principles, for example. An analogous discussion would apply to them.
But in any case, it seems like the theory should handle the possibility of irrational agents, too.
You can't just say "Alice has wrong probability distributions, but she's about to learn otherwise, so she should use that future information". You COULD say "Alice knows her model is imperfect, so she should be somewhat conservative, but really that collapses to a different-but-still-specific probability distribution.
Here’s what I think you are saying: In addition to giving prior mass to the hypothesis that her counterpart is Normal, Alice can give prior mass to a catchall that says “the specific hypotheses I’ve thought of are all wrong”. Depending on the utilities she assigns to different policies given that the catchall is true, then she might not commit to Dare after all.
I agree that Alice can and should include a catchall in her reasoning, and that this could reduce the risk of bad commitments. But that doesn’t quite address the problem we are interested in here. There is still a question of what Alice should do once she becomes aware of the specific hypothesis that the predictor is Crazy. She could continue to evaluate her commitments from the perspective of her less-aware self, or she could do the ex-ante open-minded thing and evaluate commitments from the priors she should have had, had she been aware of the things she’s aware of now. These two approaches come apart in some cases, and we think that the latter is better.
You don't need to bring updates into it, and certainly don't need to consider future updates. https://www.lesswrong.com/tag/conservation-of-expected-evidence means you can only expect any future update to match your priors.
I don’t see why EA-OMU agents should violate conservation of expected evidence (well, the version of the principle that is defined for the dynamic awareness setting).
Thanks Dagon:
Any mechanism to revoke or change a commitment is directly giving up value IN THE COMMON FORMULATION of the problem
Can you say more about what you mean by “giving up value”?
Our contention is that the ex-ante open-minded agent is not giving up (expected) value, in the relevant sense, when they "revoke their commitment" upon becoming aware of certain possible counterpart types. That is, they are choosing the course of action that would have been optimal according to the priors that they believe they should have set at the outset of the decision problem, had they been aware of everything they are aware of now. This captures an attractive form of deference — at the time it goes updateless / chooses its commitments, such an agent recognizes its lack of full awareness and defers to a version of itself that is aware of more considerations relevant to the decision problem.
As we say, the agent does make themselves exploitable in this way (and so “gives up value” to exploiters, with some probability). But they are still optimizing the right notion of expected value, in our opinion.
So I’d be interested to know what, more specifically, your disagreement with this perspective is. E.g., we briefly discuss a couple of alternatives (close-mindedness and awareness growth-unexploitable open-mindedness). If you think one of those is preferable I’d be keen to know why!
This model doesn't seem to really specify the full ruleset that it's optimizing for
Sorry that this isn’t clear from the post. I’m not sure which parts were unclear, but in brief: It’s a sequential game of Chicken in which the “predictor” moves first; the predictor can fully simulate the “agent’s” policy; there are two possible types of predictor (Normal, who best-responds to their prediction, and Crazy, who Dares no matter what); and the agent starts off unaware of the possibility of Crazy predictors, and only becomes aware of the possibility of Crazy types when they see the predictor Dare.
If a lack of clarity here is still causing confusion, maybe I can try to clarify further.
I also suspect you're conflating updates of knowledge with strength and trustworthiness of commitment. It's absolutely possible (and likely, in some formulations about timing and consistency) that a player can rationally make a commitment, and then later regret it, WITHOUT preferring at the time of commitment not to commit.
I’m not sure I understand your first sentence. I agree with the second sentence.
Thanks for sharing, I'm happy that someone is looking into this. I'm not an expert in the area, but my impression is that this is consistent with a large body of empirical work on "procedural fairness", i.e., people tend to be happier with outcomes that they consider to have been generated by a fair decision-making process. It might be interesting to replicate studies from that literature with an AI as the decision-maker.
Done!
[I work at CAIF and CLR]
Thanks for this!
I recommend making it clearer that CAIF is not focused on s-risk and is not formally affiliated with CLR (except for overlap in personnel). While it’s true that there is significant overlap in CLR’s and CAIF’s research interests, CAIF’s mission is much broader than CLR’s (“improve the cooperative intelligence of advanced AI for the benefit of all”), and its founders + leadership are motivated by a variety of catastrophic risks from AI.
Also, “foundational game theory research” isn’t an accurate description of CAIF’s scope. CAIF is interested in a variety of fields relevant to the cooperative intelligence of advanced AI systems. While this includes game theory and decision theory, I expect that a majority of CAIF’s resources (measured in both grants and staff time) will be directed at machine learning, and that we’ll also support work from the social and natural sciences. Also see Open Problems in Cooperative AI and CAIF’s recent call for proposals for a better sense of the kinds of work we want to support.
[ETA] I don’t think “foundational game theory research” is an accurate description of CLR’s scope, either, though I understand how public writing could give that impression. It is true that several CLR researchers have worked and are currently working on foundational game & decision theory research. But people work on a variety of things. Much of our recent technical and strategic work on cooperation is grounded in more prosaic models of AI (though to be fair much of this is not yet public; there are some forthcoming posts that hopefully make this clearer, which I can link back to when they’re up.) Other topics include risks from malevolent actors and AI forecasting.
[Edit 14/9] Some of these "forthcoming posts" are up now.
A few thoughts on this part:
I guess [coordination failures between AIs] feels like mainly the type of thing that we can outsource to AIs, once they’re sufficiently capable. I don’t see a particularly strong reason to think that systems that are comparably powerful as humans, or more powerful than humans, are going to make obvious mistakes in how they coordinate. You have this framing of AI coordination. We could also just say politics, right? Like we think that geopolitics is going to be hard in a world where AIs exist. And when you have that framing, you’re like, geopolitics is hard, but we’ve made a bunch of progress compared with a few hundred years ago where there were many more wars. It feels pretty plausible that a bunch of trends that have led to less conflict are just going to continue. And so I still haven’t seen arguments that make me feel like this particular problem is incredibly difficult, as opposed to arguments which I have seen for why the alignment problem is plausibly incredibly difficult.
I agree that a lot of thinking on how to make AI cooperation go well can be deferred to when we have highly capable AI assistants. But there is still the question of how human overseers will make use of highly capable AI assistants when reasoning about tricky bargaining problems, what kinds of commitments to make and so on. Some of these problems are qualitatively different than the problems of human geopolitics. And I don’t see much reason for confidence that early AIs and their overseers will think sufficiently clearly about this by default, that is, without some conceptual groundwork having been laid going into a world with the first powerful AI assistants. (This and this are examples of conceptual groundwork I consider valuable to have done before we get powerful AI assistants.)
There is also the possibility that we lose control of AGI systems early on, but it’s still possible to reduce risks of worse-than-extinction outcomes due to cooperation failures involving those systems. This work might not be delegable.
(Overall, I agree that thinking specific to AI cooperation should be a smaller part of the existential risk reduction portfolio than generic alignment, but maybe a larger portion than the quote here suggests.)
We are now using a new definition of s-risks. I've edited this post to reflect the change.
New definition:
S-risks are risks of events that bring about suffering in cosmically significant amounts. By “significant”, we mean significant relative to expected future suffering.
Note that it may turn out that the amount of suffering that we can influence is dwarfed by suffering that we can’t influence. By “expectation of suffering in the future” we mean “expectation of action-relevant suffering in the future”.
Ok, thanks for that. I’d guess then that I’m more uncertain than you about whether human leadership would delegate to systems who would fail to accurately forecast catastrophe.
It’s possible that human leadership just reasons poorly about whether their systems are competent in this domain. For instance, they may observe that their systems perform well in lots of other domains, and incorrectly reason that “well, these systems are better than us in many domains, so they must be better in this one, too”. Eagerness to deploy before a more thorough investigation of the systems’ domain-specific abilities may be exacerbated by competitive pressures. And of course there is historical precedent for delegation to overconfident military bureaucracies.
On the other hand, to the extent that human leadership is able to correctly assess their systems’ competence in this domain, it may be only because there has been a sufficiently successful AI cooperation research program. For instance, maybe this research program has furnished appropriate simulation environments to probe the relevant aspects of the systems’ behavior, transparency tools for investigating cognition about other AI systems, norms for the resolution of conflicting interests and methods for robustly instilling those norms, etc, along with enough researcher-hours applying these tools to have an accurate sense of how well the systems will navigate conflict.
As for irreversible delegation — there is the question of whether delegation is in principle reversible, and the question of whether human leaders would want to override their AI delegates once war is underway. Even if delegation is reversible, human leaders may think that their delegates are better suited to wage war on their behalf once it has started. Perhaps because things are simply happening so fast for them to have confidence that they could intervene without placing themselves at a decisive disadvantage.
The US and China might well wreck the world by knowingly taking gargantuan risks even if both had aligned AI advisors, although I think they likely wouldn't.
But what I'm saying is really hard to do is to make the scenarios in the OP (with competition among individual corporate boards and the like) occur without extreme failure of 1-to-1 alignment
I'm not sure I understand yet. For example, here’s a version of Flash War that happens seemingly without either the principals knowingly taking gargantuan risks or extreme intent-alignment failure.
-
The principals largely delegate to AI systems on military decision-making, mistakenly believing that the systems are extremely competent in this domain.
-
The mostly-intent-aligned AI systems, who are actually not extremely competent in this domain, make hair-trigger commitments of the kind described in the OP. The systems make their principals aware of these commitments and (being mostly-intent-aligned) convince their principals “in good faith” that this is the best strategy to pursue. In particular they are convinced that this will not lead to existential catastrophe.
-
The commitments are triggered as described in the OP, leading to conflict. The conflict proceeds too quickly for the principals to effectively intervene / the principals think their best bet at this point is to continue to delegate to the AIs.
-
At every step both principals and AIs think they’re doing what’s best by the respective principals’ lights. Nevertheless, due to a combination of incompetence at bargaining and structural factors (e.g., persistent uncertainty about the other side’s resolve), the AIs continue to fight to the point of extinction or unrecoverable collapse.
Would be curious to know which parts of this story you find most implausible.
Yeah I agree the details aren’t clear. Hopefully your conditional commitment can be made flexible enough that it leaves you open to being convinced by agents who have good reasons for refusing to do this world-model agreement thing. It’s certainly not clear to me how one could do this. If you had some trusted “deliberation module”, which engages in open-ended generation and scrutiny of arguments, then maybe you could make a commitment of the form “use this protocol, unless my counterpart provides reasons which cause my deliberation module to be convinced otherwise”. Idk.
Your meta-level concern seems warranted. One would at least want to try to formalize the kinds of commitments we’re discussing and ask if they provide any guarantees, modulo equilibrium selection.
It seems like we can kind of separate the problem of equilibrium selection from the problem of “thinking more”, if “thinking more” just means refining one’s world models and credences over them. One can make conditional commitments of the form: “When I encounter future bargaining partners, we will (based on our models at that time) agree on a world-model according to some protocol and apply some solution concept (e.g. Nash or Kalai-Smorodinsky) to it in order to arrive at an agreement.”
The set of solution concepts you commit to regarding as acceptable still poses an equilibrium selection problem. But, on the face of it at least, the “thinking more” part is handled by conditional commitments to act on the basis of future beliefs.
I guess there’s the problem of what protocols for specifying future world-models you commit to regarding as acceptable. Maybe there are additional protocols that haven’t occurred to you, but which other agents may have committed to and which you would regard as acceptable when presented to you. Hopefully it is possible to specify sufficiently flexible methods for determining whether protocols proposed by your future counterparts are acceptable that this is not a problem.
Nice post! I’m excited to see more attention being paid to multi-agent stuff recently.
A few miscellaneous points:
-
I get the impression that the added complexity of multi- relative to single-agent systems has not been adequately factored into folks’ thinking about timelines / the difficulty of making AGI that is competent in a multipolar world. But I’m not confident in that.
-
I think it’s possible that conflict / bargaining failure is a considerable source of existential risk, in addition to suffering risk. I don’t really have a view on how it compares to other sources, but I’d guess that it is somewhat underestimated, because of my impression that folks generally underestimate the difficulty of getting agents to get along (even if they are otherwise highly competent).
Neat post, I think this is an important distinction. It seems right that more homogeneity means less risk of bargaining failure, though I’m not sure yet how much.
Cooperation and coordination between different AIs is likely to be very easy as they are likely to be very structurally similar to each other if not share basically all of the same weights
In what ways does having similar architectures or weights help with cooperation between agents with different goals? A few things that come to mind:
- Having similar architectures might make it easier for agents to verify things about one another, which may reduce problems of private information and inability to credibly commit to negotiated agreements. But of course increased credibility is a double-edged sword as far as catastrophic bargaining failure is concerned, as it may make agents more likely to commit to carrying out coercive threats.
- Agents with more similar architectures / weights will tend to have more similar priors / ways of modeling their counterparts and as well as notions of fairness in bargaining, which reduces risk of bargaining failure . But as systems are modified or used to produce successor systems, they may be independently tuned to do things like represent their principal in bargaining situations. This tuning may introduce important divergenes in whatever default priors or notions of fairness were present in the initial mostly-identical systems. I don’t have much intuition for how large these divergences would be relative to those in a regime that started out more heterogeneous.
- If a technique for reducing bargaining failure only works if all of the bargainers use it (e.g., surrogate goals), then homogeneity could make it much more likely that all bargainers used the technique. On the other hand, it may be that such techniques would not be introduced until after the initial mostly-identical systems were modified / successor systems produced, in which case there might still need to be coordination on common adoption of the technique.
Also, the correlated success / failure point seems to apply to bargaining as well as alignment. For instance, multiple mesa-optimizers may be more likely under homogeneity, and if these have different mesa-objectives (perhaps due to being tuned by principals with different goals) then catastrophic bargaining failure may be more likely.
Makes sense. Though you could have deliberate coordinated training even after deployment. For instance, I'm particularly interested in the question of "how will agents learn to interact in high stakes circumstances which they will rarely encounter?" One could imagine the overseers of AI systems coordinating to fine-tune their systems in simulations of such encounters even after deployment. Not sure how plausible that is though.
I don't think bayesianism gives you particular insight into that for the same reasons I don't think it gives you particular insight into human cognition
In the areas I focus on, at least, I wouldn’t know where to start if I couldn’t model agents using Bayesian tools. Game-theoretic concepts like social dilemma, equilibrium selection, costly signaling, and so on seem indispensable, and you can’t state these crisply without a formal model of preferences and beliefs. You might disagree that these are useful concepts, but at this point I feel like the argument has to take place at the level of individual applications of Bayesian modeling, rather than a wholesale judgement about Bayesianism.
misleading concepts like "boundedly rational" (compare your claim with the claim that a model in which all animals are infinitely large helps us identify properties that are common to "boundedly sized" animals)
I’m not saying that the idealized model helps us identify properties common to more realistic agents just because it's idealized. I agree that many idealized models may be useless for their intended purpose. I’m saying that, as it happens, whenever I think of various agentlike systems it strikes me as useful to model those systems in a Bayesian way when reasoning about some of their aspects --- even though the details of their architectures may differ a lot.
I didn’t quite understand why you said “boundedly rational” is a misleading concept, I’d be interested to see you elaborate.
if we have no good reason to think that explicit utility functions are something that is feasible in practical AGI
I’m not saying that we should try to design agents who are literally doing expected utility calculations over some giant space of models all the time. My suggestion was that it might be good --- for the purpose of attempting to guarantee safe behavior --- to design agents which in limited circumstances make decisions by explicitly distilling their preferences and beliefs into utilities and probabilities. It's not obvious to me that this is intractable. Anyway, I don't think this point is central to the disagreement.
I agree with the rejection of strong Bayesianism. I don’t think it follows from what you’ve written, though, that “bayesianism is not very useful as a conceptual framework for thinking either about AGI or human reasoning”.
I'm probably just echoing things that have been said many times before, but:
You seem to set up a dichotomy between two uses of Bayesianism: modeling agents as doing something like "approximate Solomonoff induction", and Bayesianism as just another tool in our statistical toolkit. But there is a third use of Bayesianism, the way that sophisticated economists and political scientists use it: as a useful fiction for modeling agents who try to make good decisions in light of their beliefs and preferences. I’d guess that this is useful for AI, too. These will be really complicated systems and we don’t know much about their details yet, but it will plausibly be reasonable to model them as “trying to make good decisions in light of their beliefs and preferences”. In turn, the Bayesian framework plausibly allows us to see failure modes that are common to many boundedly rational agents.
Perhaps a fourth use is that we might actively want to try to make our systems more like Bayesian reasoners, at least in some cases. For instance, I mostly think about failure modes in multi-agent systems. I want AIs to compromise with each other instead of fighting. I’d feel much more optimistic about this if the AIs could say “these are our preferences encoded as utility functions, these are our beliefs encoded as priors, so here is the optimal bargain for us given some formal notion of fairness” --- rather than hoping that compromise is a robust emergent property of their training.
The new summary looks good =) Although I second Michael Dennis' comment below, that the infinite regress of priors is avoided in standard game theory by specifying a common prior. Indeed the specification of this prior leads to a prior selection problem.
The formality of "priors / equilibria" doesn't have any benefit in this case (there aren't any theorems to be proven)
I’m not sure if you mean “there aren’t any theorems to be proven” or “any theorem that’s proven in this framework would be useless”. The former is false, e.g. there are things to prove about the construction of learning equilibria in various settings. I’m sympathetic with the latter criticism, though my own intuition is that working with the formalism will help uncover practically useful methods for promoting cooperation, and point to problems that might not be obvious otherwise. I'm trying to make progress in this direction in this paper, though I wouldn't yet call this practical.
The one benefit I see is that it signals that "no, even if we formalize it, the problem doesn't go away", to those people who think that once formalized sufficiently all problems go away via the magic of Bayesian reasoning
Yes, this is a major benefit I have in mind!
The strategy of agreeing on a joint welfare function is already a heuristic and isn't an optimal strategy; it feels very weird to suppose that initially a heuristic is used and then we suddenly switch to pure optimality
I’m not sure what you mean by “heuristic” or “optimality” here. I don’t know of any good notion of optimality which is independent of the other players, which is why there is an equilibrium selection problem. The welfare function selects among the many equilibria (i.e. it selects one which optimizes the welfare). I wouldn't call this a heuristic. There has to be some way to select among equilibria, and the welfare function is chosen such that the resulting equilibrium is acceptable by each of the principals' lights.
both players want to optimize the welfare function (making it a collaborative game)
The game is collaborative in the sense that a welfare function is optimized in equilibrium, but the principals will in general have different terminal goals (reward functions) and the equilibrium will be enforced with punishments (cf. tit-for-tat).
the issue is primarily that in a collaborative game, the optimal thing for you to do depends strongly on who your partner is, but you may not have a good understanding of who your partner is, and if you're wrong you can do arbitrarily poorly
Agreed, but there's the additional point that in the case of principals designing AI agents, the principals can (in theory) coordinate to ensure that the agents "know who their partner is". That is, they can coordinate on critical game-theoretic parameters of their respective agents.
Chimpanzees, crows, and dolphins are capable of impressive feats of higher intelligence, and I don’t think there’s any particular reason to think that Neanderthals are capable of doing anything qualitatively more impressive
This seems like a pretty cursory treatment of what seems like quite a complicated and contentious subject. A few possible counterexamples jump to mind. These are just things I remember coming across when browsing cognitive science sources over the years.
- Theory of mind;
- Imagination;
- Folk biology [https://en.wikipedia.org/wiki/Folk_biology];
- "Number sense".
My nonexpert sense is that it is at least controversial both how each of this is connected with language, and the extent to which nonhumans are capable of them.
Yep, fixed, thanks :)
Fixed, thanks :)
Should be "same", fixed, thanks :)
In model-free RL, policy-based methods choose policies by optimizing a noisy estimate of the policy's value. This is analogous to optimizing a noisy estimate of prediction accuracy (i.e., accuracy on the training data) to choose a predictive model. So we often need to trade variance for bias in the policy-learning case (i.e., shrink towards simpler policies) just as in the predictive modeling case.
There are "reliabilist" accounts of what makes a credence justified. There are different accounts, but they say (very roughly) that a credence is justified if it is produced by a process that is close to the truth on average. See (this paper)[https://philpapers.org/rec/PETWIJ-2].
Frequentist statistics can be seen as a version of reliabilism. Criteria like the Brier score for evaluating forecasters can also be understood in a reliabilist framework.
Maybe pedantic but, couldn't we just look at the decision process as a sequence of episodes from the POMDP, and formulate the problem in terms of the regret incurred by our learning algorithm in this decision process? In particular, if catastrophic outcomes (i.e., ones which dominate the total regret) are possible, then a low-regret learning algorithm will have to be safe while still gathering some information that helps in future episodes. (On this view, the goal of safe exploration research is the same as the goal of learning generally: design low-regret learning algorithms. It's just that the distribution of rewards in some cases implies that low-regret learning algorithms have to be "safe" ones.)
I definitely think it's worth exploring. I have the intuition that creating a single agent might be difficult for various logistical and political reasons, and so it feels more robust to figure out the multiagent case. But I would certainly like to have a clearer picture of how and under what circumstances several AI developers might implement a single compromise agent.
Ah, I see now that I did not make this clear at all. The main thing in the case of war is that, under certain payoff structures, a state might not be able to credibly commit to the terms of a peaceful settlement if it is expected to increase in power relative to its counterpart. Thus the state who expects to lose relative power will sometimes rather wage preventative war (while it is still relatively strong) than settle. This is still a problem in models with complete information and divisible stakes.
I'll try to edit the text to make this clearer soon, thanks for bringing it up.
It seems plausible that if players could truthfully disclose private information and divide stakes, the ability to credibly commit would often not be needed
Even if the players can find a settlement that they both prefer to conflict (e.g., flipping a coin to decide who gets the territory) there's still the problem of committing to honoring that settlement (you might still just attack me if the coin doesn't land in your favor). So I think there's still a problem. But maybe you're saying that if there's really no private information, then there is no credibility problem, because players can anticipate defections because they know everything about their counterpart? Something like that?
Do you think focusing on s-risks leads to meaningfully different technical goals than focusing on other considerations?
I think it definitely leads to a difference in prioritization among the things one could study under the broad heading of AI safety. Hopefully this will be clear in the body of the agenda. And, some considerations around possible downsides of certain alignment work might be more salient to those focused on s-risk; the possibility that attempts at alignment with human values could lead to very bad “near misses” is an example. (I think some other EAF researchers have more developed views on this than myself.) But, in this document and my own current research I’ve tried to choose directions that are especially important from the s-risk perspective but which are also valuable by the lights of non-s-risk-focused folks working in the area.
[Just speaking for myself here]
I find myself someone confused by s-risks as defined here
For what it’s worth, EAF is currently deliberating about this definition and it might change soon.