In response to critiques of Guaranteed Safe AI

Post by Nora_Ammann · 2025-01-31


In this post, I discuss a number of critiques of Guaranteed Safe AI (GSAI) approaches that I’ve encountered since releasing this position paper.

The critiques I’m responding to here are taken from a range of places, including personal conversations and social media. The two most substantive written/recorded critiques I'm aware of (and had in mind when writing this post) are Andrew Dickson’s Limitations on Formal Verification for AI Safety [LW · GW] and Zac Hatfield-Dodds’s talk Formal Verification is Overrated. I’ve done my best to do those critiques justice in my depiction of them, but I certainly welcome clarifications or pushback where I seem to have missed the original point of a given critique.

To be clear, the position I'm defending here (and will be adding nuance to throughout this piece) is not that GSAI is definitely going to succeed, nor that it is a solution to all of AI risk. What I'm arguing is that, ex ante, GSAI is worth serious investment, and that none of the critiques I’ve encountered so far succeed in undermining that claim.

To make this discussion more readable and searchable, I’ve organized the critiques into categories and subcategories, each of which contains one or several more specific formulations of objections and (where available) examples of where this objection has been raised in context.

  1. The proposed architecture won’t work
    1. You can’t get a good enough world model
    2. You can’t get good enough guarantees
    3. You can’t get good enough safety specifications
  2. Other approaches are a better use of time
    1. GSAI does not address the most serious risks
    2. AI won’t help (soon) enough
    3. Lack of 'portability' and independent verification of proof certificate

Some of these critiques, I will argue, are based on misunderstandings of the original proposal, while others are rooted in what seem to be genuine disagreements and empirical uncertainties. Depending on the context, I will either try to restate the original proposals more clearly, or discuss why I believe the alleged limitations do not, on the whole, undermine the reasons for thinking that GSAI is a worthwhile family of agendas to pursue. 

Two more caveats before I start: 

*** 

In section II, I will address a selection of critiques which I found to be (a combination of) most common and substantive. Before that, in section I, I provide a brief, self-contained summary of the “Towards Guaranteed Safe AI” paper. Feel free to skip this section if you’re already familiar with the paper. If not, this will help set the stage and make this post more accessible. 

Acknowledgements

Many of the ideas presented have been informed by discussions with David 'davidad' Dalrymple, as well as the other authors of the Towards Guaranteed Safe AI paper. I'm additionally grateful to Evan Miyazono, Ben Goldhaber and Rose Hadshar for useful comments on earlier drafts. Any mistakes are my own.

I. Brief summary of “Towards Guaranteed Safe AI” 

I provide a short, self-contained summary of the paper Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems.

‘Guaranteed Safe AI’ (GSAI) refers to a family of approaches which share a high-level architecture – involving a world model, a safety specification and a verifier – which together can provide high-assurance safety guarantees for advanced AI systems. Instead of having a powerful AI system interact with the world directly, its outputs are mediated through the three-component architecture, verifying that the outputs conform to the safety specification within certain bounds. 

This family of approaches stands in contrast to current practices that primarily rely on empirical evaluations – identifying examples of unsafe behaviour (e.g. through independent testing and red-teaming) which can then be patched. However, due to the inability to test across all possible inputs and future deployment environments, such methods cannot provide rigorous, quantifiable safety guarantees. 

Given the wide-reaching applications of advanced AI systems and, in particular, the potential use of AI in safety-critical domains, we believe that such systems should be required to comply with high and rigorous safety standards – just as is the case with other safety-critical infrastructure such as aircraft, nuclear power plants, medical devices, etc. Above some risk or capability thresholds, the burden of demonstrating such safety guarantees should be on the systems’ developers, and the provided evidence must be adequate to justify high confidence that a given AI system is safe.

The general architecture of GSAI consists of three core components. 

  1. The safety specification (or distribution over safety specifications) corresponds to a property that we wish an AI system to satisfy.
  2. The world model (or distribution over world models) is the environment in which the safety specification is specified. A world model describes the dynamics of the AI system’s environment in such a way that it can answer what would happen in the world as a result of a given output from the AI.
  3. The verifier checks, relative to the world model, that the system does in fact adhere to the safety specification and determines the (probabilistic) safety envelope in which that system will continue to operate safely.
Figure taken from Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems.
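To make the division of labour between these three components concrete, here is a minimal illustrative sketch in Python. The interface names (`WorldModel`, `SafetySpec`, `Verifier`, `gatekeeper`) are my own shorthand, not the paper’s, and the sketch deliberately glosses over everything that makes each component hard to build:

```python
from typing import Any, Protocol


class WorldModel(Protocol):
    def predict(self, output: Any) -> Any:
        """Return a (possibly imprecise) overapproximation of the set of
        world states that could result from the AI emitting `output`."""
        ...


class SafetySpec(Protocol):
    def is_satisfied_by(self, predicted_states: Any) -> bool:
        """True iff every state admitted by the prediction is acceptable."""
        ...


class Verifier(Protocol):
    def check(self, model: WorldModel, spec: SafetySpec, output: Any) -> bool:
        """Produce a proof-backed judgement that `output`, as modelled by
        `model`, satisfies `spec` (within some quantified bound)."""
        ...


def gatekeeper(ai_output: Any, model: WorldModel, spec: SafetySpec, verifier: Verifier) -> Any:
    """Mediate an untrusted AI's proposed output through the three components."""
    if verifier.check(model, spec, ai_output):
        return ai_output   # release the verified output into the world
    return None            # otherwise withhold it, or fall back to a safe default
```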

Each of the components can be implemented at different levels of rigour and tractability. There exists a range of research avenues, many of which we discuss in the paper, that could plausibly produce such a GSAI system in a satisfactory and feasible manner, while remaining competitive with AI systems without such guarantees. (See sections 3.2 to 3.4 for more details.)

The safety of GSAI applications should be further bolstered by an adequate deployment infrastructure. This could look like monitoring the AI system at runtime, checking whether the incoming observational data is consistent with the mathematical model of the environment in order to spot potentially safety-relevant anomalies. If such anomalies are detected, the advanced AI system would be disabled, or transition to a trustworthy backup system, to ensure that there is no unaccounted-for risk of unsafe behaviour. Furthermore, it might be useful to modulate the time horizon of the AI outputs which are verified, in order to increase the reliability of safety guarantees or increase the tractability of formal verification. 

Important challenges exist for successful implementations of GSAI, as well as reasons for optimism. For example, there already exist useful current-day or near-future applications of GSAI (e.g. in passenger aviation, rail, or autonomous vehicles), substantiating the claim of their tractability and usefulness. Furthermore, while it can be challenging to create sufficiently accurate world models, the crux of the task is not to create a maximally precise (or complete) model, but to make sure the model preserves soundness. In other words, the world model is not required to eliminate all sources of uncertainty, but instead must be capable of appropriately handling such uncertainty, such that we arrive at a model that is a sound overapproximation of the true dynamics of the given context of use. Perhaps most importantly, we can make use of current-day frontier AI tooling to enhance and accelerate our ability to build the required world models or make verification tractable. 

Beyond a plausible avenue to high-assurance quantitative safety guarantees, GSAI approaches promise the following further benefits: 

Ultimately, I argue that we need an “anytime” portfolio approach of R&D efforts towards safe AI, and that GSAI approaches are a critical part of that. This will allow us to maximise the expected effectiveness of the feasible safety techniques at each stage.

II. Commonly-encountered critiques, and some responses

Below, I have compiled the critiques that I’ve found to be (a combination of) most common and substantive.

1. The proposed architecture won’t work

1.1. You can’t get a good enough world model

Critique: 

I’ve encountered a number of reasons for skepticism about the tractability of making formal models of a context of use which are good enough that they can underpin trustworthy, dependable safety guarantees. For example: 

Response:

Yes, GSAI guarantees are about a model of the world. And yes, a model of the world is never a perfect representation of the actual world. 

However, if we’re thoughtful, we’re still able to derive substantive statements about safety from this setup. In particular, if our model is able to soundly overapproximate the true dynamics of the world, then we are able to conservatively guarantee our safety specifications. How might we do that? 

First, let us unpack what I mean by sound overapproximation. The point is that we’re not using a deterministic model. We’re not even using a (merely) probabilistic model. We’re using a model that is more conservative than that, but it is conservative in a specific way. The model we’re after allows us to make probabilistic claims about the distribution of values that some variable is going to take, but it also allows us to be agnostic about the value of some other variable. What does agnostic mean here? Imagine our world model makes a claim about the temperature in the actual system – say it claims that the temperature is going to stay below 50 degrees. In other words, the model is agnostic about what exact value the temperature variable is going to take – it just says it will be less than 50. Importantly, this claim can be accurate without needing to be exact. And in order to achieve sound guarantees, what we need first and foremost is accuracy, not exactness.

The language can be slightly confusing here at first. In formal methods, we make a distinction between soundness and completeness – soundness means that everything the model says is true, and completeness means that everything that is true can be deduced from the model. A similar distinction is made in mathematics, differentiating between accuracy (~soundness) and precision/exactness (~completeness). In essence: to achieve useful safety guarantees, our model needs to preserve soundness, but it does not necessarily require complete knowledge.

In other words, we want our formal model to be able to incorporate different sources of uncertainty: Bayesian/probabilistic uncertainty (i.e. uncertainty across possible world trajectories), but also non-determinism (i.e. uncertainty across possible distributions over possible world trajectories). We can incorporate non-determinism using imprecise probability (or infra-Bayesian probabilities).[1] If we manage to do so successfully, we end up with a model that is a conservative but sound overapproximation of the true ‘real world’ dynamics.
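As a toy illustration of what checking a safety specification against such a conservative model could look like (all numbers and names below are invented for illustration, and the "model" is a single hard-coded bound):

```python
from dataclasses import dataclass


@dataclass
class ImpreciseBound:
    """An interval of probabilities: the model asserts only that the true
    probability of the event lies somewhere in [lower, upper]."""
    lower: float
    upper: float


def prob_temperature_exceeds_50() -> ImpreciseBound:
    """Toy world model output: instead of claiming "P(temperature > 50) = 0.003"
    exactly, it commits only to a conservative, sound bound on that probability."""
    return ImpreciseBound(lower=0.0, upper=0.01)   # illustrative numbers


def satisfies_spec(bound: ImpreciseBound, max_allowed: float) -> bool:
    """Check the safety spec against the *worst case* admitted by the model.
    If even the upper bound is below the threshold, the guarantee is sound
    whenever the bound itself soundly overapproximates reality."""
    return bound.upper <= max_allowed


# Safety spec: the probability of exceeding 50 degrees must be at most 1%.
assert satisfies_spec(prob_temperature_exceeds_50(), max_allowed=0.01)
```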

Ok, so, our quest is to find a way to construct formal world models that are both dependable and tractable. Now, in addition to accounting for different sources of uncertainty, it is also the case that, for many problems, it is sufficient to model only a certain abstraction layer without needing to model all micro-level processes. Take, for example, Google DeepMind’s recent table-tennis robot. The robot is trained entirely in a simulation of the table-tennis arena, which was fit to data. The hardest part to model was the contact of the soft, squishy rubber with the ping-pong ball. But they learned an abstraction, using three extra springs, that turns out to be good enough for the purpose of beating most humans. (In this case, the abstraction doesn’t have a certified error bound, but runtime monitoring could detect if real data deviates from the abstraction and switch to a backup controller.)

Note that our world model – and the guarantees that depend on it – does not require complete data about the initial conditions of the system in question. If we were using a deterministic model, complete information about initial conditions would indeed be critical. But this is no longer the case if we incorporate stochasticity or non-determinism. In other words, the full initial conditions would be critical if we required the predictions of our models to be precise, but it often suffices to be able to predict that, for example, a given variable will not exceed a certain value. 

Finally, insofar as we’re worried about the mismatch between our world model and the true, real-world dynamics, there are additional layers of defence beyond the model itself that can help make sure that any differences do not end up causing critical safety failures. For example, a runtime monitor – a common design feature in systems engineering today – provides further safety assurance post-deployment by tracking the deployed system in real time and noticing when the observed data deviates from what the model predicted. If such a deviation is observed – implying that the model the guarantees depend on has potentially safety-relevant errors – the system falls back to a more conservative operational policy, restoring the safety of the system. (I will say a few more things about deployment infrastructure that can help secure GSAI systems in section 1.3.)
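A schematic sketch of such a monitor (the environment, the `tolerance` threshold, the world model’s methods and both policies are placeholders of my own, not an existing API): the monitor compares incoming observations against what the world model predicted, and hands control to a conservative backup policy as soon as the deviation exceeds a pre-agreed bound.

```python
def run_with_monitor(advanced_policy, backup_policy, world_model, env, tolerance):
    """Run the advanced policy while observations stay consistent with the
    world model; otherwise fall back to the conservative backup policy."""
    policy = advanced_policy
    state = env.reset()
    while not env.done():
        action = policy(state)
        predicted = world_model.predict(state, action)   # what the model expects
        state = env.step(action)                         # what actually happens
        if world_model.deviation(predicted, state) > tolerance:
            # Observed dynamics left the envelope the guarantees depend on:
            # the proof no longer applies, so switch to the backup controller.
            policy = backup_policy
    return env.log()
```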

Still, would this be enough to expand the reach of our current formal modelling abilities to the complexity of relevant cyber-physical systems? Sure, there exist systems today whose formally verified properties we depend on. But these systems are simpler than where we need to get to, and their creation is currently very labour-intensive. That’s not a great combination of starting conditions. 

On top of what I’ve already discussed above, a key crux here is that I believe there are a couple of different ways that we can leverage progress in frontier AI to help us build these models better, faster, and for more complex environments. I discuss this in more detail in section 2.2 (“AI won’t help (soon) enough”). For now, let us just note that, while I concur that there is certainly space for reasonable disagreement about how much and how soon we can leverage AI capabilities to our advantage, it seems clearly within a reasonable distribution of beliefs that, given concerted effort, such tooling will come online soon and transform our abilities substantially enough to extend the reach of formal modelling from relatively simple systems to increasingly complex cyber-physical systems. 

Remember that any strategy for creating safe AI systems must rely on some beliefs or assumptions about the real world – assumptions which could be wrong. Short of a solution to ambitious alignment, it is not immediately clear that GSAI approaches require more, or more contentious, assumptions than other AI safety approaches. Unlike other approaches, GSAI makes these assumptions explicit and puts front and center the development of workflows whose central goal is making the world model more dependable. It does so by innovating on the modelling language in order to account for a wider range of types of uncertainty; by pushing for those models to be human-auditable, rather than black boxes; and by enabling the iterative and collaborative creation of model specs. 

1.2. You can’t get good enough guarantees

Critique: 

Some reasons for scepticism I’ve encountered here are:[2]

Response: 

Types of Guarantees

The critique suggests a false dichotomy between guarantees of 100% confidence on the one hand, and mere estimates that come with no guarantees on the other. 

There are different types of proofs, which grant different degrees of assurance, and which vary in how tractable they are in different contexts. For example: 

  1. A boolean proof establishes the truth or falsity of a statement, e.g. “Subsystem A and subsystem B will never be operating at the same time.”
  2. A proof of a sound bound establishes a valid upper or lower bound on a quantity or function. In our context, it means that we know for sure that the probability is ≤ or ≥ some number.
    1. E.g. “We can guarantee that there is a ≤1% chance that the temperature of subsystem C will exceed 15°C over the course of its operation.” (See figure 4, levels 8-9 in the GSAI paper.)
  3. A proof of (asymptotic) convergence to a correct statement establishes that a quantity or function approaches a specific value as a variable approaches a particular limit (such as infinity or zero). In our context, it means that the output will converge, in the limit of e.g. compute, to an upper (or lower) bound on the probability of some event. Such a proof might additionally make statements about how the value will converge, e.g. asymptotically. This proof is weaker in that we can’t trust a given likelihood estimate (due to the possibility of missing modes), but at least its trustworthiness increases monotonically with compute.
    1. E.g. “If we were to run this computation for long enough, we could guarantee that the likelihood of the system exiting the safe zone is below some firm bound. We currently expect this bound to be at 1% likelihood.” (See figure 4, levels 6-7 in the GSAI paper.) 
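Schematically, and in my own notation rather than the paper’s, the three kinds of statement might be written as follows, where M denotes the world model:

```latex
% (i) Boolean proof: the property holds outright in the world model M.
M \models \neg(\text{A active} \wedge \text{B active})

% (ii) Proof of a sound bound: a hard, one-sided bound on a probability.
P_M\!\left(\exists\, t:\ T_C(t) > 15^{\circ}\mathrm{C}\right) \le 0.01

% (iii) Asymptotic convergence: the computed estimate p_n approaches a sound
% bound as compute n grows, even though any finite p_n may still miss modes.
\lim_{n \to \infty} p_n \ \ge\ P_M(\text{system leaves the safe zone})
```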

Furthermore, (as discussed in section 1.1) guarantees are expressed with respect to a formal model of the context of use which can itself incorporate different sources of uncertainty, adding further nuance to the specific claim that is being guaranteed. (See also this comment from Augustin [LW(p) · GW(p)] making the same point.)

Time horizon

It certainly is an open question over what time horizons a given policy can be verified to be safe or recoverable.[3] It’s worth pointing out, though, that even guarantees over relatively short time horizons could go a long way. 

Citing from the GSAI preprint (p. 16): “To increase the reliability of safety guarantees, one may instead choose to deploy an AI system for a finite time horizon, such that, if at the end of that time horizon the AI system is not successfully verified for redeployment, operations shut down or transition to a safe mode.” There is a trade-off between how quickly the next policy needs to be verified for redeployment, and how much easier it is to verify that a given policy is safe or recoverable over a shorter time horizon. The right balance will vary depending on the context. 

Furthermore, different architectural choices for the AI controller and gatekeeper can further help alleviate these concerns. For example, the so-called black-box simplex architecture involves an advanced controller and a baseline controller. The former is verified not with respect to whether it will always keep the system within the safe portion of the state space, but with respect to whether it will keep the system within a part of the state space that is recoverable (to a safe trajectory) by the (more conservative) baseline controller. As a result, while we’re running the advanced controller, we only have to verify the recoverability of the system over a short time horizon (at least two time steps). This way, if we predict that, after two more steps on the advanced controller, we would exit the recoverable area, we can switch over to the baseline controller in the next step. In other words, the advanced controller by design has to be verified only over short time horizons, while maintaining high-assurance safety guarantees. This is a specific example of what such an architecture could look like, but there might also be other, better ways to do this.
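A minimal sketch of that switching logic (the controllers, the model’s `step` function and the `recoverable` predicate are placeholders of my own; a real design would formally verify the short-horizon lookahead rather than merely simulate it):

```python
def simplex_step(state, advanced, baseline, world_model, recoverable, horizon=2):
    """One control step of a simplex-style architecture: use the advanced
    controller only if, `horizon` steps further under its control, the system
    is still predicted to be recoverable by the baseline controller."""
    predicted = state
    for _ in range(horizon):
        predicted = world_model.step(predicted, advanced(predicted))
    if recoverable(predicted):
        return advanced(state)    # stay on the advanced controller for now
    return baseline(state)        # otherwise hand over to the verified baseline
```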

Overall, the concern that trustworthy guarantees might only be tractable for short time horizons, even if true, would not be particularly detrimental to the usefulness of GSAI. 

Steganography, and other funny business 

The key here is that we want to formally verify not only the correctness but also the uniqueness of a proposed solution. If we have a solution that is proven to solve the task at hand without conflicting with the safety specs (i.e. it is correct), and we also have a proof saying that this solution is unique, then the causal power of the system proposing this solution (the contained advanced AI) is limited to at most one bit of information: it can either give the unique correct answer, or it can decline to answer. This would defend against risks from steganography. 

If needed, it’s also possible to give up some rigour to improve tractability. Rather than requiring uniqueness, we could upper-bound the amount of entropy contained in the choice of which solution the AI proposes. 
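In information-theoretic terms (a rough formalisation of my own, not taken from the paper): if the verifier only accepts outputs from an admissible set S, then the AI’s freedom of choice can convey at most

```latex
I_{\text{leak}} \ \le\ \log_2\!\big(|S| + 1\big)\ \text{bits}
\qquad \text{(the ``}{+1}\text{'' accounts for the option to decline to answer)}
```

A uniqueness proof means S has exactly one element, hence at most one bit of leakage; the relaxed variant instead requires only that the entropy of the AI’s choice among admissible solutions stays below some small budget.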

1.3. You can’t get good enough safety specifications

Critique:

Critiques I’ve encountered here include: 

Response: 

First, note that GSAI will often try to specify safety in a way that is more conservative than specifying human preferences wholesale – namely, by specifying which parts of the state space are okay or not okay to enter, in an inherently context-specific way. We can separately determine functional/performance-related specifications and, for example, conceive of the resulting verification problem as a reach-avoid problem. This also implies that GSAI is not trying to be a solution for ambitious (intent) alignment or agent foundations – instead (to put it slightly simplistically): GSAI is safety engineering applied to ASL-4 AI, where the goal is to allow us to safely leverage (parts of) the transformative potential of these advanced AI systems. 

Second, we don’t even necessarily need to specify some holistic notion of safety. I already mentioned the black-box simplex architecture above, in section 1.2. This architecture allows us to verify the advanced controller not with respect to its safety, but with respect to the easier-to-verify property of reachability – i.e. keeping the overall system within a part of the state space that is recoverable by the backup controller. Other setups that are similar to this black-box simplex architecture, but differ in the details, could provide similar benefits. 

Third, we can leverage narrow AI systems to help spot existing lacunas or inconsistencies in our safety specs by, for example, generating surprising implications or counterexamples based on the current specifications. To further unlock our ability to iterate on and improve the specifications over time, it is critical to have an appropriate computational implementation of our formal model and specifications, including version control and collaborative editing affordances. 

Finally, most AI safety approaches face similar challenges in terms of the intent-specification gap. One advantage that many GSAI approaches offer is that they make the specification fully explicit and auditable by different humans. (This is unlike, for example, Constitutional AI or similar approaches, where it is not entirely clear how the ‘constitution’ really cashes out in terms of the AI system’s behavioural propensities.) While this certainly does not on its own suffice to overcome the intent-specification gap, it does come with important benefits, such as enabling more social accountability and feedback from a wide range of stakeholders. 

2. Other approaches are a better use of time

2.1. Does not address the most serious risks 

Critique: 

Response: 

There are a number of things I wish to address here. Upfront, I want to clarify: I don’t believe, or wish to claim, that GSAI is a full or general panacea for AI risk. In fact, I think there are specific threat models and problems that GSAI is a promising solution to – and other problems I don’t think GSAI is a fit for. In the following, I hope to clarify how I think about the scope of GSAI, and also explain why, despite the limitations of that scope, I believe that GSAI approaches have a critical role to play in securing and stabilising humanity’s trajectory towards superintelligent AI. But let us start at the beginning. 

Threat model: what AI capability levels does GSAI target? 

AI systems with different levels of capabilities come with different risks, and require different mitigation strategies. It is thus natural to ask: what level of capability and associated risks does GSAI propose to be a solution to? 

In short, GSAI seeks to be a solution to the problem of safely leveraging (parts of) the transformative potential of AI systems that are, on the one hand, more capable than systems that can significantly enhance the ability of rogue actors to cause large-scale harm and, on the other hand, not capable enough to discover and exploit new physical principles in silico. 

Note that ASLs as used here do not exactly reflect Anthropic’s definition of AI Safety Levels.

The level of capability being targeted here is one where current-day safety methods (such as evals and post-training methods like circuit breakers) are likely insufficient to provide adequate levels of safety. At the same time, GSAI is not proposing to be a solution to ambitious alignment. For example, GSAI approaches are not concerned with robustly guiding or altering the goals, intentions or thoughts of a highly advanced AI system. Instead, GSAI is concerned with building a high-security containment solution for said system, in such a way that we are still able to elicit useful information from it.[4]

So, GSAI addresses the band of capabilities where AI systems become potentially no longer safe to interact with directly due to their potential ability to model and exploit human psychology; and/or where they pose a significant risk of self-exfiltrating and autonomously replicating similar to a piece of (autonomous & potentially highly destructive) malware; and/or where they are capable of autonomous R&D and self-improvement (which is why, at this point, a case of model (self-)exfiltration could have catastrophic consequences). 

What are the current plans[5] of leading AI labs for when AI systems achieve this level of capability? As I understand it, the answers are approximately that, while they do not currently have concrete mitigation plans, the “plan” is to have a better idea of what to do by the time these capabilities arrive. Sometimes, the answer also includes a (voluntary) commitment to pausing development and deployment if, at the time that we reach those levels of capability, we do not have a sensible plan for how to mitigate the associated risks. 

Insofar as GSAI offers a concrete engineering solution to this problem, I do believe that GSAI approaches, if successful, have the potential to ‘make a serious dent’ in the context of AI safety. 

What kinds of problems does GSAI aim to solve, and which does it not?

The model-based approach of GSAI is, in essence, best suited for domains that we have a solid scientific understanding of, and which are not dominated by chaotic dynamics. 

This makes GSAI approaches less suited for problems that depend on individual humans’ actions in a way that cannot reliably be abstracted by, e.g., the law of large numbers. Similarly, I don’t think GSAI is a promising approach for building safe chatbots – natural language is such an open domain that formally specifying safety seems intractable in this context. 

On the other hand, GSAI approaches are a good fit for use cases like management or control problems in, say, electrical grids, supply chains or R&D processes. In general there are a lot of socially and economically valuable domains where the needed scientific understanding does exist – more than you might think at first sight. 

Furthermore, even if GSAI methods are used to produce safe applications of AI systems in relatively narrow domains, these applications could go a long way in reducing more immediate risks. A key way in which I expect GSAI will start manifesting substantial strategic importance is by helping us secure critical infrastructure. This in turn would reduce the risk of rogue human or AI actors causing large-scale havoc, which is plausibly critical for preserving our collective ability to navigate through this AI transition, stabilising multi-stakeholder dynamics, and enabling pivotal processes [LW · GW]. Examples of such applications include: 

Finally, and importantly, while each individual GSAI application might be relatively narrow in its domain of application, we can still achieve a relatively wide and general reach if we are able to produce many specific GSAI applications across a wide range of contexts. In other words, while individual GSAI applications might be correctly described as narrow, we could still achieve significant degrees of generality via the GSAI workflow (i.e. the workflow that produces specific GSAI implementations).  

Let us briefly touch on how I see GSAI approaches potentially securing cyber and bio specifically. 

Securing cyberspace

One of the critiques in section 1.1 was that while using formal methods along the lines of GSAI might be feasible in the domain of software, it won’t be tractable to extend this approach to physical systems. Now, whether or not we’re optimistic about extending the scope of GSAI approaches, it is worth highlighting how big a deal it would be if we even “just” successfully secured all critical cyber systems through GSAI-type approaches. 

The last decade has seen some important efforts and successes – e.g. a fully formally verified microkernel called seL4, or DARPA’s HACMS/SMACCM program, which established the feasibility of using formal methods to eliminate all exploitable bugs. However, until now this has been highly labour-intensive work with limited scope for automation, which has limited its scalability. Now, LLMs trained with “RL from proof assistant feedback” are beginning to show rapid progress in automated theorem proving, with success rates on the miniF2F benchmark rising from 26% to 63% in the last year. With focused investment – including in inferring formal specifications from legacy code and test suites, and in user-interface tooling along the lines of the Atlas Computing research agenda – automated formal verification could be a real game changer for cybersecurity. In particular, I think there is a strong argument that this route to addressing cyber risks is vastly superior to approaches that try to leverage AI for cybersecurity without formal verification. For one, the defence-offence balance is not very attractive in the case of coding assistants that can help find and fix bugs: such AI systems are just as useful for finding and exploiting those bugs (e.g. by rogue actors). Furthermore, there is emerging evidence that coding assistants currently make our code less secure. Formally verified software, on the other hand, not only gives a marginal advantage to defence, but could plausibly allow us to exit the cyber cat-and-mouse game entirely. 

Protection against biosecurity hazards

The sample quote above indicates skepticism that GSAI approaches will be able to substantially reduce biosecurity risks (“[W]e should not expect provably-safe DNA synthesis machines, …“). The critique (in the full post) claims that virulence is predictable only by simulating an entire body at the molecular level. Now, if what you wanted was precision – like knowing exactly how many times person A will cough after having been infected with a certain virus – then, yes, you would need to measure all the initial conditions to an impossible extent and do an impossible amount of computation. But if you want to guarantee that a particular virus will not infect a human, you “just” have to look pairwise at every protein coded for by the virus and every protein expressed on the boundary (membrane) of human cells, and model their interaction (within a tiny, 100nm-radius patch of cell membrane) to see if the viral protein can catalyze viral entry into the human cell. If not, it ain’t virulent. 
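A deliberately oversimplified sketch of that pairwise screening idea (every name here is a placeholder; in particular, `can_catalyze_entry` stands in for what is in reality a hard local biophysical modelling problem, not an existing tool):

```python
def flag_as_potentially_virulent(viral_proteins, human_membrane_proteins, can_catalyze_entry):
    """Conservative screen: clear the sequence only if *no* viral protein is
    predicted to be able to catalyze entry into a human cell; otherwise flag it.
    Soundness of the overall guarantee rests on `can_catalyze_entry` being a
    sound overapproximation of the underlying biophysics."""
    for viral_protein in viral_proteins:
        for membrane_protein in human_membrane_proteins:
            if can_catalyze_entry(viral_protein, membrane_protein):
                return True    # at least one plausible entry route: flag it
    return False               # no modelled entry route: conservatively cleared
```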

This is a form of the sound overapproximation I discussed earlier. Notably, it would also rule out, say, some helpful gene therapies. Nevertheless, this gives us something substantial to work with in terms of the reduction of biorisk. Now, how concretely could we use this setup to bootstrap a more comprehensive “civilisational immune system”? (I’m not claiming that this is the only promising solution, nor do I intend to imply that the following story contains enough detail to be necessarily convincing. I’m sharing it, in part, as an illustrative and hopefully thought-provoking potential setup.) 

Building on the above approach, we want to analyse DNA sequences (specifically, those reachable by bench-top manipulation) for their binding affinity to human transmembrane proteins. Doing so, we can identify and ‘tag’ all fragments with significant potential for facilitating cell entry (via receptor-mediated endocytosis or membrane fusion). Of course, binding propensity alone is too broad a measure, capturing all dangerous sequences but also many useful ones (such as drugs with important therapeutic/medical functions). So, in the next step, we want to build up a repository of antibodies capable of neutralizing each potential threat protein, and simultaneously deny the synthesis of fragments for which this problem is unsolvable. Next, we want to store each solution in a big database, make sure we are able to synthesize the monoclonal antibodies at points of care, and build out a monitoring system which scans for any of these potential threats in (highly scaled-up) metagenomics. In sum, the interplay between these different capabilities adds up to a comprehensive “civilisational immune system”, capable of identifying potential threats and quickly producing antibodies when any such threat is actually observed. 

Of course, this is beyond current capabilities in various ways, and it will not be an easy feat to build out the full set of capabilities. But I do think that, by being smart about how we leverage AI progress, substantial progress is possible. If you believe that a “compressed 21st century” is ahead of us, you might also be inclined to believe that these sorts of civilisational capabilities will become possible, and are worth shooting for. 

The role of assurance in coordination

The final reason I want to discuss here for why GSAI could be a big deal concerns the critical role that assurance plays in coordination. 

GSAI approaches can produce independently auditable assurances that certain properties are true, without requiring mutual trust between the relevant parties. These safety-relevant properties could in principle be anything that is specifiable: that a DNA sequence being synthesized has not been flagged as potentially dangerous, that a compute cluster has not used more than X FLOP during a training run, etc. If we can formalise the safety properties that we want to agree on between a certain set of stakeholders, GSAI approaches can produce independently auditable assurances that these properties are true for a given AI model.

Such properties could even be verified in zero-knowledge, further reducing the need to share potentially sensitive information between stakeholders beyond what is strictly necessary to provide assurance and enable coordination. In comparison, an assurance that, say, a collectively determined red line has not been crossed, where this assurance is based on evaluations and red-teaming, is much harder to build justified trust around – at least without sharing a lot of additional, likely commercially and strategically sensitive data. 

2.2. AI won’t help (soon) enough

Critique

Response: 

In order for GSAI approaches to plausibly be ready in time, we require a substantial ramp-up in our ability to produce formal models and specifications, compared to the historical base rate. However, AI progress seems poised to solve large parts of this problem for us. 

A central way to leverage AI for this is by means of an AI assistant for declarative modelling. Such a formal modelling assistant might make suggestions for parts of the declarative model upon human instruction; it might suggest edits to an existing model; it might generate counterexamples relative to the current model specs (e.g. based on predicted surprise for the human) to help the human notice lacunas, contradictions or misspecifications; etc.
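Schematically, one iteration of such a workflow might look like the following (all objects and method names here are hypothetical, sketched purely to illustrate the human-in-the-loop shape of the process):

```python
def assisted_modelling_step(human, assistant, model, specs):
    """One iteration of a hypothetical AI-assisted declarative modelling loop:
    the assistant drafts and stress-tests, the human remains the final arbiter,
    and the model stays a human-auditable artifact throughout."""
    draft = assistant.suggest_edit(model, instruction=human.next_instruction())
    # Surface surprising implications / counterexamples of the current draft,
    # so the human can spot lacunas, contradictions or misspecifications.
    findings = assistant.find_counterexamples(draft, specs)
    return draft if human.approves(draft, findings) else model
```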

This is very similar to how AI is already extremely useful today in classical programming (e.g. Copilot, Cursor). It’s not a stretch to expect AI assistants for declarative mathematical modelling soon, which could help us elicit safety specifications, e.g. by producing informative counterexamples, proof certificates, etc. And we already have evidence that this capability is coming online (see e.g. WorldCoder) – despite no large sums of funding having flowed into this line of R&D – as well as quickly increasing performance on benchmarks pertaining to mathematics. 

Of course, we don’t want to fully automate world modelling. If we did, we would make ourselves susceptible to errors resulting from mistakes or misalignment (e.g. scheming). Instead, we want to train narrow AI systems to become highly proficient in our modelling language and to assist human experts in writing the world models. The formal models themselves ideally remain fully human-auditable. The key requirement is that this AI-augmented process does not itself open new avenues of risk. 

There are other ways we should be able to leverage AI, beyond the “programming assistant”-type setup I talked about above. For example, we could use AI systems to mine the scientific literature to extract a hypothesis space that AI systems can later reason over, potentially by doing approximate Bayesian inference (along the lines of Yoshua Bengio’s Cautious Scientist AI proposal, building on GFlowNets). Because this piece is long already, I’m not going to discuss this approach in more detail here. 

Notably, while the bitter lesson is often taken to imply that symbolic approaches in general are doomed, in actuality the bitter lesson is an argument against hard-coded heuristics, and for anything that scales well through parallelisation – such as learning and search. And it is “merely” learning and search that we’re exploiting when it comes to AI systems capable of assisting with formal modelling and verification efforts. There are important open technical challenges, as well as a need for more data, but those don’t seem insurmountable given consolidated R&D efforts. 

At the same time, I don’t think that AI capable of using formal methods implies hard take-off because those capabilities are very much of the type that we’re already seeing today (e.g. WorldCoder from above, strong programming assistants, steady progress on math benchmarks, etc.), and we pretty clearly do not appear to be in a hard take-off world as things stand.

2.3. Lack of 'portability' & independent verification of proof certificate

Critique

Response: 

We can combine a given proof certificate with cryptographic authentication methods to reliably establish that the proof at hand belongs to a specified system and has not been changed in the interim. This could be supported by hardware-enabled governance mechanisms such as flexHEG. As the sample quote points out, to be really confident we also need to be able to trust that the hardware itself has not been tampered with (see e.g. here). 
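As a minimal sketch of the hash-and-sign idea (the function below is my own illustration; a real scheme would sign the resulting digest with a key held in trusted hardware, and would need a far more careful notion of “the system” than a single weights hash):

```python
import hashlib
import json


def certificate_digest(system_weights_hash: str, spec_text: str, proof_blob: bytes) -> str:
    """Bind a proof to a specific system and specification by hashing all three
    together; any later change to weights, spec, or proof changes the digest."""
    payload = json.dumps({
        "system": system_weights_hash,
        "spec": hashlib.sha256(spec_text.encode()).hexdigest(),
        "proof": hashlib.sha256(proof_blob).hexdigest(),
    }, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()


# The digest can then be signed (e.g. inside secure hardware), so that a third
# party can check that the proof really pertains to the deployed system.
```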

3. Conclusion

This concludes our little tour – for now. 

Some of these critiques seem to stem from misunderstandings of the proposed approach. For example, GSAI does not promise “100% safety”, nor does it claim to solve ambitious alignment. 

Other critiques do seem to respond directly to what we're proposing with GSAI. I actually think that many of these critiques or reservations are pretty reasonable. But I don't think they succeed in undermining GSAI as a worthwhile AI safety approach that warrants a set of consolidated R&D efforts. 

Ultimately, I believe that GSAI is proposing a promising pathway for solving problems that are likely critical for the survival of humanity: providing an engineering solution to contain and safely leverage ~ASL-4 AI; building out civilisational resilience; enabling coordination. 

If you think I have missed some key critiques, or misunderstood/misrepresented any of the above critiques, I'd appreciate hearing about that. 

  1. ^

    The Safeguarded AI programme’s Technical Area 1.1 is funding researchers to develop a modelling language with these and additional properties.

  2. ^

    As an aside, in this talk titled “Formal Verification is Overrated”, Zac argues that “model weights are just completely intractable to reason about.” I largely agree. To clarify, GSAI approaches do not seek to make guarantees about large AI models, but about the formal world model. (Said world model might contain parts that are represented by small neural nets, but nothing close to the size of the large AI models people in AI Safety are generally concerned with.) As such, I consider this critique not centrally relevant to GSAI.

  3. ^

    We explain what we mean by recoverable in Section 1.3.

  4. ^

    GSAI and the control agenda are similar in the level of capability being targeted, and in that they both, in essence, try to securely contain and use untrusted powerful AI systems without necessarily trying to align the contained AI. 

  5. ^

    As specified in e.g. Anthropic’s Responsible Scaling Policy, OpenAI’s Preparedness Framework or Google DeepMind’s Frontier Safety Framework. 
