AI Neorealism: a threat model & success criterion for existential safety

post by davidad · 2022-12-15T13:42:11.072Z · LW · GW · 1 comments

Contents

  Threat Model
  Success Model
  Related work

Threat Model

There are many ways for AI systems to cause a catastrophe from which Earth-originating life could never recover. All of the following seem plausible to me:

But instead of trying to enumerate every possible failure mode and then shaping incentives to make each one less likely to come up, I typically work under a quasi-worst-case assumption: that, perhaps as a matter of bad luck with random initialisation,

On the one hand, unlike a typical "prosaic" threat model, the neorealist threat model does not rely on empirical facts about the inductive biases of the network architectures that happen to be practically successful. A realist justification for this is that there may be a phase transition as architectures scale up [AF · GW] which drastically changes both their capabilities profile and this kind of inductive bias (vaguely analogous to the evolution of cultural knowledge-transfer within biological life).

On the other hand, unlike (a typical understanding of) a "worst-case assumption," the last clause leaves open the possibility of hiding concrete facts about our world from an arbitrarily powerful model [AF · GW]. The framing in terms of functions also highlights an ontology of AI that respects extensional equivalence [AF · GW], in which imputations of "deceptive mesa-optimisers hiding inside" are discarded in favour of "capable but misaligned outputs on out-of-distribution inputs".
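As a minimal sketch of what that ontology amounts to (my own notation, not taken from the linked post): two learned models are identified exactly when they compute the same input–output function,

$$f \sim g \quad\iff\quad \forall x \in \mathcal{X},\; f(x) = g(x),$$

so a claim like "there is a deceptive mesa-optimiser inside" has operational content only insofar as it predicts misaligned outputs $f(x)$ on some reachable out-of-distribution inputs $x$.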

One can make progress under this assumption in two ways: by designing training contexts which couple safety guarantees to the training objective, e.g. a guarantee of shutdown within a time bound with arbitrarily high probability [AF · GW]; and by working on ways to obtain instance-specific guarantees about learned functions that continue to hold out-of-distribution, e.g. via model-checking, regret bounds, or policy certificates [AF · GW].
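To illustrate the shape such a guarantee might take (a sketch in my own notation, assuming a learned policy $\pi$ inducing a distribution over trajectories $\tau$, not a formula from the linked posts): a time-bounded shutdown guarantee asserts

$$\Pr_{\tau \sim \pi}\!\big[\,\exists\, t \le T :\ \tau_t \in \mathsf{Shutdown}\,\big] \;\ge\; 1 - \delta,$$

where $T$ is the time bound and $\delta$ can be made arbitrarily small. The research problem is to establish such a statement for the specific learned $\pi$, e.g. by model-checking the composed policy–environment system or via certificates produced alongside training, rather than merely in expectation over the training distribution.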

Success Model

For me, the core question of existential safety is this:

It is not, for example, "how can we build an AI that is aligned with human values, including all that is good and beautiful?" or "how can we build an AI that optimises the world for whatever the operators actually specified?" Those could be useful subproblems, but they are not the top-level problem about AI risk (and, in my opinion, given current timelines and a quasi-worst-case assumption, they are probably not on the critical path at all).

From a neorealist perspective, the ultimate criterion for "goodness" of an AI strategy is that it represents

I am optimistic about the prospects of negotiations to adopt AI strategies that clear this bar, once such strategies become available, even if they do not strictly meet traditional standards of "competitiveness". On the other hand, any strategy that doesn't clear this bar seems to require unrealistic governance victories to be implemented in reality. I hope this articulation helps clarify how governance and strategy considerations bear on the relative merits of technical safety research directions.

Related work [LW · GW]

1 comment


comment by Chipmonk · 2023-07-13T06:20:17.729Z · LW(p) · GW(p)

I think you missed an opportunity by saying "ethically end the acute risk period" instead of "the safety-first approach", heh