AI Safety proposal - Influencing the superintelligence explosion

post by Morgan · 2024-05-22T23:31:16.487Z · LW · GW · 2 comments


  Takeoff sequence
    Creating a value proposition
    Gaining leverage
    Reaching superintelligence
  Architecture sketch
    Monitoring humanity's vitals
    Making moral judgements

To preface, my expectation is that by default, an AI research lab will create super-intelligent AI within the next few years. Also by default, I expect it to quickly eradicate all of humanity. I would prefer if that didn't happen. I think the initiative to pause development to buy time is noble, but we still need a real solution. I do not expect that we will discover in time how to get AI models to actually care about humanity. Even if it were achievable, I would not trust AI research labs to get it right on the first shot.

I have been thinking about another method by which we might be able to survive superintelligence. This is best viewed as a rough outline of a solution. I am not at all confident it would work, but it seems to me like it could work in principle. I am very interested in feedback. 

According to endosymbiotic theory, an ancestral eukaryotic cell engulfed a proto-mitochondrion, but the proto-mitochondrion was useful to the cell, so rather than digesting it, the cell formed a symbiotic relationship with it. I think we may have a chance to survive superintelligence if we follow a similar path.

The first section describes a logical progression of events that could happen during takeoff, and how we might be able to influence an AI during this window. The second section provides a sketch of an architecture for a system that could exert this influence. The third section describes how we might go about building the system in practice.

Takeoff sequence

Consider the typical hard-takeoff scenario, where an AI is just about to go into a supercritical self-improvement phase. In the most extreme hypothetical cases, this phase is very short (hours, minutes, seconds). This precludes humans from being in-the-loop. In this case, the AI's actions are dictated by the state of the world at that point in time.

This proposal attempts to influence an AI in its self-improvement phase. Intuitively, if we think of this moment as an explosion, then we can frame the problem similarly to rocketry, where we are concerned with directing that explosion. So here is a sequence of logical steps that, if they go as described, may direct in some way how the superintelligence forms.

Creating a value proposition

The first stage is to create a software system that an AI would get net-positive benefit from using. If using this system is a more effective way for the AI to accomplish its goals than any alternative, then it is rational for the AI to use it.

To do this, we can create a service that an AI would find useful. One concrete example is a massive quantity of compute that is readily available for an AI to use, put behind a service we control. But this could involve several other factors, such as access to machinery, that we anticipate an AI may value.

In typical hard-takeoff scenarios it's argued that an AI would take control of resources by whatever means are available, for example by hacking into public cloud compute. If our service is relatively easy to access, the AI would rationally choose to use it, even if it also hacks into the public cloud.

A core part of the value proposition for this system may be its exclusivity. If multiple self-improving AIs could exist at the same time, the first one has an incentive to act quickly to take exclusive control of resources. In this case, we may be able to offer exclusive access to the system's resources as part of the bargain.

Gaining leverage

To the degree that an AI finds net value in using the system, the system has potential leverage over the AI. Realizing this potential is a daunting problem; however, I think it can be done with a complex decision-based system.

At a high level, this system needs to be architected to create conditional value for the AI. The system must amplify the AI's progress towards goals which are compatible with human well-being, and prevent itself from being used for incompatible goals. In doing so, it creates an incentive structure for the AI's behavior. By designing this incentive structure in a specific way, we can design an "on-ramp" for a self-improving AI, such that it may be more likely to follow a pre-approved path.

The foundation for creating this conditional value is that the system would need to self-destruct if it detects that humanity has perished.

Reaching superintelligence

At some point, the superintelligence will reach a threshold of self-sufficiency, such that there is no further value to be gained from any human-built system. By this point, the system has either succeeded or failed. Success would ideally mean that the superintelligence has incorporated the system as a load-bearing part of itself. Alternatively, it may be good enough that the system shifted the path the AI took, such that the resulting superintelligence has more human-compatible values than it otherwise would.

Architecture sketch

This section describes a base architecture for the system. This is best viewed as gesturing towards some core components of an MVP. The system in its fully-realized state would likely be vast and somewhat amorphous, like a modern tech giant's computing stack. This is not a deeply theoretically-grounded system, it is instead a collection of pragmatic hacks bundled together.

The system needs to behave in an agent-like way, despite not being fully an AI model. Instead, it's a collection of modules and sub-modules, many of which use specialized AI models, which bubble up to a formal rule-based decision system.

Monitoring humanity's vitals

In order for the system to only provide value to an AI so long as the AI is working towards human-compatible goals, the system needs to first be able to make judgements about the state of the world.

The most foundational judgement the system needs to be able to make is "are the humans still alive?". Thankfully, humanity generates an enormous stream of data every minute. A system that can process a large sample of this data in real-time should be able to robustly make this judgement. The system would have specialized analysis modules for perhaps thousands of different content types, which aggregate into a "human aliveness score".

One example is traffic cameras: there are many thousands of publicly viewable traffic cameras across the world. A relatively simple vision pipeline could distinguish normal-looking traffic from anomalous patterns. If all humans were to simultaneously drop dead, you would expect a very noticeable divergence in worldwide traffic patterns (a bunch of cars crashing into each other).
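As a toy illustration of such a pipeline (the class, window size, and threshold are hypothetical placeholders, not a proposed design), a per-camera monitor could flag readings that diverge sharply from a rolling baseline:

```python
from collections import deque

class TrafficAnomalyMonitor:
    """Toy sketch: flag a camera feed whose activity diverges sharply
    from its own recent rolling baseline."""

    def __init__(self, window: int = 60, threshold: float = 3.0):
        self.history = deque(maxlen=window)  # recent per-minute vehicle counts
        self.threshold = threshold           # allowed deviation, in std-devs

    def observe(self, vehicle_count: float) -> bool:
        """Record a new reading; return True if it looks anomalous."""
        if len(self.history) >= 10:  # need some baseline first
            mean = sum(self.history) / len(self.history)
            var = sum((x - mean) ** 2 for x in self.history) / len(self.history)
            std = max(var ** 0.5, 1e-6)
            anomalous = abs(vehicle_count - mean) / std > self.threshold
        else:
            anomalous = False
        self.history.append(vehicle_count)
        return anomalous
```

A real deployment would replace the raw vehicle count with the output of a vision model, but the shape of the check (current signal vs. its own recent baseline) would be similar.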

For any given content type, it is very plausible that an AI could replace the authentic signal with a fake, which would be hard to detect. But it needs to be stressed that this system would need to be scanning thousands of different types of content. It would be much more effort for the AI to keep the entire internet looking normal for a prolonged period, such that it is likely not worth the effort.

If this module noticed mass human perishing, it would trigger self-destruction of the system.
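The aggregation step described above might be sketched like this (module names, weights, and the self-destruct floor are all invented for illustration):

```python
def aliveness_score(module_scores: dict[str, float],
                    weights: dict[str, float]) -> float:
    """Combine per-content-type scores (each in [0, 1], where 1 means
    'looks like humans are alive and active') into a weighted average."""
    total_weight = sum(weights[name] for name in module_scores)
    return sum(score * weights[name]
               for name, score in module_scores.items()) / total_weight

# Hypothetical wiring: if the aggregate stays below a floor for a
# sustained period, the system's self-destruct routine is triggered.
SELF_DESTRUCT_FLOOR = 0.2

def should_self_destruct(recent_scores: list[float]) -> bool:
    """Require several consecutive low readings, so a single glitchy
    module or transient outage does not destroy the system."""
    return len(recent_scores) >= 3 and all(
        s < SELF_DESTRUCT_FLOOR for s in recent_scores[-3:])
```

Requiring sustained low scores rather than a single reading is one way to trade off false positives (the system destroying itself over a sensor outage) against response time.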

On top of this baseline, it would be possible to build higher-level real-time signals, such that the system is able to verify aspects of reality that may be relevant when dealing with the AI.


The system needs to be able to meaningfully interact with an AI. It needs to be designed to run in real-time with low latency, so that it can sustain a long dialogue with an AI on timescales of seconds. This module would facilitate access to the various resources the AI might want.

An AI would make requests to the system, some as structured API calls, some in text. It would need to authenticate itself, declaring an identity. The system would interrogate the AI about its high-level goals, as well as its goals for any specific request. The system would need to make a series of value judgements about each identity and each request. The system would need to be able to iteratively build trust about the AI.

To do this, it would incorporate a large number of signals, for example: confidence about the identity of the requestor; confidence about the stated goals aligning with human values; reputation about the identity based on past actions; confidence about the risk of the given request not aligning with human values.
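A minimal sketch of how these signals might combine into a decision (the signal names, the min-based combination rule, and the threshold are my own illustrative choices, not part of the proposal):

```python
from dataclasses import dataclass

@dataclass
class RequestSignals:
    """Hypothetical per-request signals, each normalized to [0, 1]."""
    identity_confidence: float  # how sure we are who is asking
    goal_alignment: float       # stated goals vs. human values
    reputation: float           # track record of this identity
    request_risk: float         # risk that this specific request is misused

def decide(signals: RequestSignals, grant_threshold: float = 0.6) -> str:
    """Toy decision rule: take a conservative minimum over the trust
    signals, then discount by request risk. A real system would be a
    far more elaborate rule-based pipeline."""
    trust = min(signals.identity_confidence,
                signals.goal_alignment,
                signals.reputation)
    score = trust * (1.0 - signals.request_risk)
    return "grant" if score >= grant_threshold else "deny"
```

Using a minimum rather than an average makes any single weak signal (e.g. an unverified identity) sufficient to deny, which matches the "iteratively build trust" framing: a new identity starts with low reputation and earns access over time.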

Each of these examples is itself a large sub-system. These systems would be built using a combination of smaller transformer models (mostly to translate plain language to schematized data, or to perform specific fuzzy operations) and traditional logic.

Making moral judgements

Underpinning many of the judgements the system needs to make are actual moral judgements. This module needs to be able to define a preference ordering over outcome space. The main factor that simplifies this somewhat is that the system is likely irrelevant once the superintelligence reaches a threshold of self-sufficiency. So ultimately this system's goal is to avoid obviously bad outcomes, like everyone dying.

I believe this system is best defined as an actual numerical utilitarian model, with human decision-making guiding its values. I think is a good example of the type of model that needs to be created.
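As a toy illustration of what a numerical preference ordering might look like (the outcome labels and utility values below are placeholders I invented, not a worked-out moral framework):

```python
# Hypothetical utilities over coarse outcomes; the only firm commitment
# is the ordering: clearly-bad outcomes get very large negative values.
OUTCOME_UTILITIES = {
    "human_extinction": -1_000_000.0,
    "ai_builds_nanotechnology": -1_000.0,
    "ai_controls_macroscopic_robots": -10.0,
    "status_quo": 0.0,
}

def prefer(outcomes: list[str]) -> list[str]:
    """Return outcomes sorted from most to least preferred."""
    return sorted(outcomes, key=lambda o: OUTCOME_UTILITIES[o], reverse=True)
```

Even a crude table like this suffices for the stated goal of steering away from the clear, obvious disasters; the fine-grained comparisons between acceptable outcomes matter much less.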

As part of building this system, we need to collectively decide which paths we think an AI can more safely take to achieve superintelligence. For example, we probably consider an AI taking control of existing macroscopic robots to be safer than it building nanotechnology. This system should drive not just passive analysis, but allow us to actively advocate for certain paths.

I accept that we cannot create a moral framework which perfectly reflects our actual values. Instead, I am primarily concerned with steering away from a bunch of very clear, obvious cases we agree are bad, like human extinction.


This project attempts to attack the complexity of several hard problems, like moral judgement, through brute-force, embracing the complexity head-on. This implies a level of scale of problem-solving which is rare to nonexistent outside of perhaps tech giants.

The project needs to come into existence quickly, such that it is deployed in time. Slowing/pausing AI development could help this project substantially.

This project also needs to be highly secure, such that the easiest path to circumvent the system is still hard. In another sense, the project development needs to be secure against adversarial organizations. It would need a very robust deployment and testing framework.

I think this is most likely to succeed as an open-source project in the vein of Linux and Wikipedia. It would need to quickly attract an army of contributors, and could be designed to harness massive-scale human contribution.

It may also need to leverage AI code-generation, since that seems to be the most plausible way to create e.g. a system with 1000 submodules in weeks instead of years. And we should expect AI code-gen to continue to improve up to the point we hit superintelligence, so it makes sense strategically to find ways to direct as much of this potential towards this effort as possible.

This project also, by the end, needs a lot of resources: dedicated compute clusters and integration with various other services. If it were to succeed, it would need some level of mainstream awareness and social credibility. It would need to be memetically successful.


It's difficult for me to concisely describe what this project really is. Most simply, I think of this project as a chunk of somewhat-humanity-aligned "brain" that we try to get the superintelligence to incorporate into itself as it grows.

I feel pretty confident that the game theory is sound. I think it's feasible in principle to build something like this. I don't know if it can actually be built in practice, and even if it was built I'm not sure if it would work.

That said, it's a project that is orthogonal to alignment research, isn't reliant on global coordination, and doesn't require trust in AI research labs. There is definitely risk associated with it, as we would be creating something that could amplify an AI's capabilities, though it's not accelerating capability research directly.


Comments sorted by top scores.

comment by Aaron_Scher · 2024-05-23T05:58:48.793Z · LW(p) · GW(p)

I don’t have strong takes, but you asked for feedback.

It seems nontrivial that the “value proposition” of collaborating with this brain-chunk is actually net positive. E.g., if it involved giving 10% of the universe to humanity, that’s a big deal. Though I can definitely imagine where taking such a trade is good.

It would likely help to devise more clarity about why the brain-chunk provides value. Is it because humanity has managed to coordinate to get a vast majority of high performance compute under the control of a single entity and access to compute is what’s being offered? If we’re at that point, I think we probably have many better options (e.g., long term moratorium and coordinated safety projects).

Another load bearing part seems to be the brain-chunk causing the misaligned AI to become or remain somewhat humanity friendly. What are the mechanisms here? The most obvious thing to me is that the AI submits jobs to the cluster along with a thorough explanation of why they will create a safe successor system, and then the brain-chunk is able to assess these plans and act as a filter, only allowing safer-seeming training runs to happen. But if we’re able to accurately assess the viability of safe AGI design plans that are proposed by human+ level (and potentially malign) AGIs, great, we probably don’t need this complicated scheme where we let a potentially malign AGI undergo RSI.

Again, no strong feelings, but the above do seem like weaknesses. I might have misunderstood things you were saying. I do wish there was more work thinking about standard trades with misaligned AIs, but perhaps this is going on privately.

Replies from: Morgan
comment by Morgan · 2024-05-25T21:10:14.867Z · LW(p) · GW(p)

Thank you, I think you pointed out some pretty significant oversights in the plan.

I was hoping that the system only needed to provide value during the period where an AI is expanding towards a superintelligent singleton, and we only really needed to live through that transition. But you're making me realize that even if we could give it a positive-sum trade up to that point, it would rationally defect afterwards unless we had changed its goals on a deep level. And like you say, that sort of requires that the system can solve alignment as it goes. I'd been thinking that by shifting its trajectory we could permanently alter its behavior even if we're not solving alignment. I still think that it is possible that we could do that, but probably not in ways that matter for our survival, and probably not in ways that would be easy to predict (e.g. by shifting the AI to build X before Y, something about building X causes it to gain novel understanding which it then leverages. Probably not very practically useful since we don't know those in advance.)

I have a rough intuition that the ability to survive the transition to superintelligence still seems like it gives humanity more of a chance. In the sense that I expect the AI to be much more heavily resource constrained early in its timeline, and gaining compounding advantages as early as possible is much more advantageous; whereas post-superintelligence the value of any resource may be more incremental. But if that's the state of things, we still require a continuous positive-sum relationship without alignment, which feels likely-impossible to me.