Davidad's Bold Plan for Alignment: An In-Depth Explanation

post by Charbel-Raphaël (charbel-raphael-segerie), Gabin (gabin-kolly) · 2023-04-19T16:09:01.455Z · LW · GW · 34 comments

Contents

  Context
        How to read this post?
  Bird's-eye view
  Fine-grained Scheme
    Step 1: Understand the problem
    Step 2: Devise a plan
    Step 3: Examine the solution
    Step 4: Carry out the plan
  Hypotheses discussion
  About Category theory and Infra-Bayesianism
  High level criticism
  High Level Hopes
    Intuition pump for the feasibility of creating a highly detailed world model
  Comparison with OpenAI’s Plan
  Conclusion
  Annex
    Governance strategy
    Roadmap
      Stage 1: Early research projects 
      Stage 2: Industry actors first projects
      Stage 3: Labs commitments
      Stage 4: International consortium to build OAA. 
    Some important testable first research projects
      Formalization of the world model
      Try to fine-tune a language model as a heuristic for model-checking
      Defining a sufficiently expressive formal meta-ontology for world-models
      Experimenting with the compositional version control system
      Getting traction on the deontic feasibility hypothesis
      Some other projects
    Types explanation

Gabin Kolly and Charbel-Raphaël Segerie contributed equally to this post. Davidad proofread this post.

Thanks to Vanessa Kosoy, Siméon Campos, Jérémy Andréoletti, Guillaume Corlouer, Jeanne S., Vladimir I. and Clément Dumas for useful comments.

Context

Davidad has proposed an intricate architecture [AF · GW] aimed at addressing the alignment problem, which necessitates extensive knowledge to comprehend fully. We believe that there are currently insufficient public explanations of this ambitious plan. The following is our understanding of the plan, gleaned from discussions with Davidad.

This document adopts an informal tone. The initial sections offer a simplified overview, while the latter sections delve into questions and relatively technical subjects. This plan may seem extremely ambitious, but the appendix provides further elaboration on certain sub-steps and potential internship topics, which would enable us to test some ideas relatively quickly.

Davidad’s plan is an entirely new paradigmatic approach to address the hard part of alignment: The Open Agency Architecture aims at “building an AI system that helps us ethically end the acute risk period without creating its own catastrophic risks that would be worse than the status quo”.

This plan is motivated by the assumption that current paradigms for model alignment won't be successful:

Unlike OpenAI's plan, which is a meta-level plan that delegates the task of finding an alignment solution to future AI, Davidad's plan is an object-level plan that takes drastic measures to prevent future problems. It also rests on assumptions that can be tested relatively quickly (see the annex).

Plan’s tldr: Use near-AGIs to build a detailed world simulation, then train an AI within it and formally verify that it adheres to coarse preferences and avoids catastrophic outcomes.

How to read this post?

This post is much easier to read than the original post, but we are aware that it still contains a significant amount of technical material. Here's a way to read this post gradually:

For more technical details, you can read:

Bird's-eye view

The plan comprises four key steps (a toy code sketch follows the list):

  1. Understand the problem: This entails formalizing the problem, similar to deciphering the rules of an unfamiliar game like chess. In this case, the focus is on understanding reality and human preferences.
    1. World Modeling: Develop a comprehensive and intelligent model of the world capable of being used for model-checking. This could be akin to an ultra-realistic video game built on the finest existing scientific models. Achieving a sufficient model falls under the Scientific Sufficiency Hypothesis (a discussion of those hypotheses can be found later on) and would be one of the most ambitious scientific projects in human history.
    2. Specification Modeling: Generate a list of moral desiderata, such as a model that safeguards humans from catastrophes. The Deontic Sufficiency Hypothesis posits that it is possible to find an adequate model of these coarse preferences.
  2. Devise a plan: With the world model and desiderata encoded in a formal language, we can now strategize within this framework. Similar to chess, a model can be trained to develop effective strategies. Formal verification can then be applied to these strategies, which is the basis of the Model-Checking Feasibility Hypothesis.
  3. Examine the solution: Upon completing step 2, a solution (in the form of a neural network implementing a policy or strategy) is obtained, along with proof that the strategy adheres to the established desiderata. This strategy can be scrutinized using various AI safety techniques, such as interpretability and red teaming.
  4. Carry out the plan: The model is employed in the real world to generate high-level strategies, with the individual components of these strategies executed by RL agents specifically trained for each task and given time-bound goals.
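To make the flow between these steps concrete, here is a toy Python sketch of how they fit together. Every name in it is a placeholder of ours for a large open research problem, not an existing system or API.

    def open_agency_architecture(corpus, stakeholders, components):
        # Toy skeleton of the OAA pipeline. `components` bundles the pieces;
        # each one stands in for a major open research effort.
        world_model = components.formalize_world_model(corpus)                 # Step 1a: World Modeling
        desiderata = components.elicit_desiderata(stakeholders)                # Step 1b: Specification Modeling
        policy = components.train_policy(world_model, desiderata)              # Step 2: search for a plan
        certificate = components.model_check(policy, world_model, desiderata)  # Step 2: formal verification
        if certificate.holds and components.human_review(policy, certificate): # Step 3: interpretability, red teaming
            return components.deploy_time_bounded(policy)                      # Step 4: time-bounded RL executors
        return None  # otherwise: revise the world model, the desiderata, or the training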

The plan is dubbed "Open Agency Architecture" because it necessitates collaboration among numerous humans, remains interpretable and verifiable, and functions more as an international organization or "agency" rather than a singular, unified "agent." The name Open Agency was drawn from Eric Drexler’s Open Agency Model [AF · GW], along with many high-level ideas.

Here is the main diagram. (An explanation of the notations is provided here [LW · GW]):

Fine-grained Scheme


Here is a more detailed breakdown of Davidad’s plan:

Step 1: Understand the problem

Step 2: Devise a plan

Step 3: Examine the solution

Step 4: Carry out the plan

Hypotheses discussion

Scientific Sufficiency Hypothesis: This hypothesis posits that we will be able to create a simulation of the world that is precise enough so that verifying a model's behavior within the simulation would ensure its proper behavior in the real world. More specifically, as we are operating in an infra-Bayesian setting, we need a large enough set of world models to ensure that reality is not too distant. Additionally, this hypothesis includes the assumption that the simulation will not be excessively computationally demanding since we intend to train a model on it. Davidad agrees that this project would be one of humanity's most significant science projects, but he believes it would still be less costly than the Large Hadron Collider.
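As a gesture at what "a large enough set of world models" buys us, here is a minimal toy sketch (our own construction, and a statistical stand-in for the formal model checking discussed below) of evaluating a policy against an ensemble of candidate simulators and accepting only the worst-case result:

    import random

    def estimate_safety(policy, simulator, n_rollouts=1000, horizon=100, seed=0):
        # Monte-Carlo estimate of P(no catastrophe) under one candidate world model.
        rng = random.Random(seed)
        safe = 0
        for _ in range(n_rollouts):
            state = simulator.reset(rng)
            ok = True
            for _ in range(horizon):
                state, catastrophe = simulator.step(state, policy(state), rng)
                if catastrophe:
                    ok = False
                    break
            safe += ok
        return safe / n_rollouts

    def worst_case_safety(policy, candidate_models, threshold=0.999):
        # Infra-Bayesian flavour: the guarantee is only as strong as the worst
        # model in the (hopefully reality-containing) set of world models.
        lower_bound = min(estimate_safety(policy, m) for m in candidate_models)
        return lower_bound, lower_bound >= threshold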

There are several challenges associated with the Scientific Sufficiency Hypothesis:

  1. Completeness: The world is incredibly complex, and to create a sufficiently precise model, we might need a large portion of all human knowledge. This represents an immense amount of data, even with the assistance of LLMs. However, a sufficient model does not necessarily require solving all scientific domains. For instance, we don't need advanced physics because an AI that passes the model checking step will be very cautious and unlikely to engage in risky physics experiments. If there is a risk that the AI might interfere with experiments conducted at CERN, the model checking process would prevent it from doing so, even without a unified understanding of physics. We only need to model quantum chemistry and higher levels. But even when restricting ourselves to levels above quantum chemistry, there are still numerous scales to consider, each requiring a separate scientific theory. It is not computationally feasible to simulate an entire room from the molecular level. Furthermore, it remains unclear how much of the biosphere and human civilization must be simulated for this plan to succeed. 
  2. Formalization: Much of our current knowledge about the world is not in a readily computable form. It would need to be translated into a computable form before it could be used in this context. Although this is a challenging task, there is already some prior work demonstrating the feasibility of automatic translation from natural language to a formal language, which offers some hope for this endeavor. [More details [LW · GW]]
  3. Lack of computational Infra-Bayes prior works: As far as the authors are aware, no infra-bayesian simulation has been carried out before, especially not one of this magnitude. As a result, it remains unclear how we could train a model on such a simulation and whether it would even be computationally feasible.
  4. Flexible meta-ontology: The system's underlying meta-ontology must be both expansive and adaptable to accommodate the diverse models and theories required for the simulation. This would be similar to creating a new programming language that efficiently incorporates the concept of time. To achieve this, Davidad is exploring novel category theoretic ideas in the attempt to unify in a single formalism:
    • random differential equations
    • stochastic differential equations
    • stochastic partial differential equations
    • continuous-time Markov chains
    • discrete-time Markov chains
    • probabilistic graphical models
    • probabilistic logic programming
  5. Difficulty of understanding the framework: The framework would necessitate input from numerous experts. However, due to the complexity of the meta-ontology, they may encounter difficulties when verifying the work, particularly as it will be in an infra-Bayesian format. Vanessa claims that only three people worldwide fully understand the infra-Bayesian sequence [LW · GW], which poses a significant obstacle to establishing a large-scale international project. However, some efforts show that it’s possible to develop software, such as AlgebraicJulia, that tries to offer both the versatility of category theory and user-friendliness for non-mathematicians.

We believe that addressing the “Lack of computational Infra-Bayes prior works” and “Flexible meta-ontology” concerns is crucial before delving too deep into this plan, as they can be tackled independently of the other aspects.

 

Deontic Sufficiency Hypothesis: This hypothesis posits that it is possible to identify desiderata that are adequate to ensure the model doesn't engage in undesirable behavior. Davidad is optimistic that it's feasible to find desiderata ensuring safety for a few weeks, until a better solution is discovered, making this a weaker requirement than fully solving outer alignment. For instance, Davidad suggests that even without a deep understanding of music, you can be confident your hearing is safe by ensuring the sound pressure level remains below 80 decibels. However, since the model would still be executing a pivotal process with significant influence, relying on a partial solution for decades could be risky. [More details [LW · GW]]

Model-Checking Feasibility Hypothesis: This hypothesis states that it should be possible to prove bounds on the values the model will achieve for the desiderata. Given the task's magnitude, model-checking would mostly be done automatically. In the last few years, we have seen immense progress in the verification of neural networks: in 2017, the best methods could verify networks of around a hundred neurons; by 2022, branch-and-bound techniques could verify networks with on the order of a million neurons [More details]. Currently, model-checking for RL agents is limited to simple environments, and the authors are uncertain whether it can be scaled up within just a few years.
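As a toy illustration of the kind of neural-network verification being referred to, here is a minimal interval-bound-propagation check on a small ReLU network. This is a standard, deliberately crude relative of the branch-and-bound methods mentioned above; the network and input box below are made up.

    import numpy as np

    def affine_bounds(W, b, lo, hi):
        # Propagate elementwise input bounds [lo, hi] through x -> W @ x + b.
        W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
        return W_pos @ lo + W_neg @ hi + b, W_pos @ hi + W_neg @ lo + b

    def ibp_bounds(layers, lo, hi):
        # Sound (but loose) output bounds of a feed-forward ReLU network,
        # valid for every input in the box [lo, hi].
        for i, (W, b) in enumerate(layers):
            lo, hi = affine_bounds(W, b, lo, hi)
            if i < len(layers) - 1:  # ReLU on hidden layers only
                lo, hi = np.maximum(lo, 0.0), np.maximum(hi, 0.0)
        return lo, hi

    # Example: try to certify that output 0 always exceeds output 1 on the input box.
    rng = np.random.default_rng(0)
    layers = [(rng.normal(size=(8, 4)), np.zeros(8)),
              (rng.normal(size=(2, 8)), np.zeros(2))]
    lo, hi = ibp_bounds(layers, np.full(4, -0.1), np.full(4, 0.1))
    certified = bool(lo[0] - hi[1] > 0)  # conservative: False means "could not certify", not "unsafe"
    print(certified)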

Time-bounded Optimization Thesis: This hypothesis proposes that we can discover training techniques and reward functions that encourage time-bounded optimization behavior. A suggestion in this direction is provided here [LW · GW]. This hypothesis allows us to bypass the problem of corrigibility quite simply: “we can define time-inhomogeneous reward [i.e. the reward becomes negative after a time-limit], and this provides a way of "composing" different reward functions; while this is not a way to build a shutdown button, it is a way to build a shutdown timer, which seems like a useful technique in our safety toolbox.”
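A minimal sketch of what such a "shutdown timer" via time-inhomogeneous reward could look like in code; this is our own illustration, written against a generic reset/step environment interface rather than any particular RL library:

    class ShutdownTimerWrapper:
        # Wrap an environment so that, after `time_limit` steps, further activity
        # is penalised: the agent is rewarded for finishing and standing down, not
        # for persisting. A toy instance of a time-inhomogeneous reward.

        def __init__(self, env, time_limit, penalty=1.0):
            self.env, self.time_limit, self.penalty = env, time_limit, penalty
            self.t = 0

        def reset(self):
            self.t = 0
            return self.env.reset()

        def step(self, action):
            obs, reward, done, info = self.env.step(action)
            self.t += 1
            if self.t > self.time_limit:
                reward = -self.penalty  # the reward becomes negative past the time limit
            return obs, reward, done, info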

About Category theory and Infra-Bayesianism

Why Infra-Bayesianism: We want the world model we create to be accurate and resilient when facing uncertainty and errors in modeling, since we want it to perform well in real-world situations. Infra-Bayesianism offers a way to address these concerns.
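Very roughly, and glossing over most of the actual machinery, the core move is to replace a single belief distribution with a set of them and to score outcomes by their worst case over that set. In heavily simplified notation (ours, not the sequence's):

\[
  \underline{\mathbb{E}}_{\Theta}[f] \;=\; \inf_{\theta \in \Theta} \, \mathbb{E}_{x \sim \theta}[f(x)],
\]

where $\Theta$ is a convex, closed set of (sub)probability distributions over outcomes $X$ and $f$ is a bounded loss or utility. A safety desideratum such as "catastrophe has probability at most $\epsilon$" must then hold for every $\theta \in \Theta$, which is what makes the resulting guarantee robust to the world model being somewhat wrong.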

Why Category Theory: A key to effectively understanding the world may lie in exploring relationships and mappings. Functional programming and category theory are promising options for this task. Category theory enables us to represent complex relationships across various levels of abstraction, which is crucial for constructing a world model that incorporates different competitive theories at different scales of size and time in a collaborative manner. Moreover, it is useful to express infra-bayesianism within a category-theoretic framework. The main bottleneck currently appears to be creating an adequate meta-ontology using category theory. [More details here [LW · GW], and here]

High level criticism

Here are our main high level criticisms of the plan:

High Level Hopes

This plan also has very good properties, and we don’t think that a project of this scale is out of the question:

Intuition pump for the feasibility of creating a highly detailed world model

Here's an intuition pump to demonstrate that creating a highly detailed world model might be achievable: Humans have already managed to develop Microsoft Flight Simulator or The Sims. There is a level of AI capability at which models will be able to rapidly code such realistic video games. Davidad’s plan, which involves reviewing 2 million scientific papers (among which only a handful contain crucial information) to extract scientific knowledge, is only a bit more difficult, and seems possible. Davidad tweeted this to illustrate the idea:

Comparison with OpenAI’s Plan

Comparison with OpenAI’s Plan: At least, Davidad's plan is an object-level plan, unlike OpenAI's plan, which is a meta-level plan that delegates the role of coming up with a plan to smarter language models. However, this plan also requires very powerful language models to be able to formalize the world model, etc. Therefore, it seems to us that this plan also requires roughly AGI-level capability. But at the same time, Davidad's plan might just be one of the plans that OpenAI's automatic alignment researchers could come up with. At least, Davidad's plan does not destroy the world with an AI race if it fails.

The main technical crux: We think the main difficulty is not this level of capability, but the fact that this level of capability is beyond the ability to publish papers at conferences like NeurIPS, which we perceive as the threshold for recursive self-improvement. So this plan demands robust global coordination to avoid foom. And models helping with alignment research seem much more easily attainable than the creation of this world model, so OpenAI’s plan may still be more realistic.

Conclusion

This plan is crazy. But the problem that we are trying to solve is also crazy hard. The plan offers intriguing concepts, and an unorthodox approach is preferable to no strategy at all. Numerous research avenues could stem from this proposal, including automatic formalization and model verification, infra-Bayesian simulations, and potentially a category-theoretic mega-meta-ontology. As Nate Soares said: “I'm skeptical that davidad's technical hopes will work out, but he's in the rare position of having technical hopes plus a plan that is maybe feasible if they do work out”. We express our gratitude to Davidad for presenting this innovative plan and engaging in meaningful discussions with us.

EffiSciences played a role in making this post possible through their field building efforts.

Annex

Much of the content in this appendix was written by Davidad, and only lightly edited by us. The annex contains:

Governance strategy

Does OAA help with governance? Does it make certain governance problems easier/harder?

Here is davidad’s answer:

Roadmap

Here is the ambitious scaling roadmap, in which things play out as fast as possible.

Stage 1: Early research projects 

Timeline: Q3 2023. Tldr: Academic work done / PoC experiments 

Experiment with the following internships (which are described in the last section of this annex):

Succeed at the "Try to fine-tune a language model as a heuristic for model-checking" project (described below).

Stage 2: Industry actors first projects

Timeline: Q4 2023. Tldr: Get industry actors interested, starting projects inside labs.

Needs: 

Stage 3: Labs commitments

Timeline: Late 2024 or 2025. We need to get to Stage 3 no later than 2028.  Tldr: Make a kind of formal arrangement to get labs to collectively agree to increase their investment in OAA. This is the critical thing.

Needs: 

Stage 4: International consortium to build OAA. 

Timelines: In order for this to not feel like a slowdown to capabilities accelerationists, we should credibly reach Stage 4 by 2029. Start chat & planning for it in 2024.

Some important testable first research projects

Formalization of the world model

We can get started on this by using an existing modeling language, such as PRISM, and applying methodologies being used for autoformalization (e.g. “distilled back translation”: starting with a big dataset of models in the language, like MCC, using the LLM to translate them into natural language, and then fine-tuning on the transpose of that dataset to get it to learn the formal language).
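As a rough sketch of what this "distilled back translation" loop could look like in practice (the function names and the fine-tuning interface below are placeholders of ours, not an existing toolchain):

    def build_autoformalization_dataset(formal_models, llm_translate):
        # Back-translation: start from existing formal models (e.g. a corpus of
        # models in the target language, like the MCC dataset mentioned above)
        # and use an LLM to produce natural-language descriptions of each one.
        pairs = []
        for model_source in formal_models:
            description = llm_translate(
                "Describe this probabilistic model in plain English:\n" + model_source)
            # "Transpose" of the dataset: we will train on description -> formal model.
            pairs.append({"prompt": description, "completion": model_source})
        return pairs

    def distilled_back_translation(formal_models, llm_translate, fine_tune):
        # Fine-tune a model to map informal descriptions back into the formal language.
        dataset = build_autoformalization_dataset(formal_models, llm_translate)
        return fine_tune(dataset)  # placeholder for whatever training stack is used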

Try to fine-tune a language model as a heuristic for model-checking

Prompting/fine-tuning/RL’ing LLMs to be useful assistants in writing (and extending) formal world models and specifications. Here would be the process:

Defining a sufficiently expressive formal meta-ontology for world-models

Defining a sufficiently expressive formal meta-ontology for world-models with multiple scientific explanations at different levels of abstraction (and spatial and temporal granularity) having overlapping domains of validity, with all combinations of {Discrete, Continuous} and {time, state, space}, and using an infra-bayesian notion of epistemic state (specifically, convex compact down-closed subsets of subprobability space) in place of a Bayesian state. Here are 3 subjects of internship on this subtopic:

Experimenting with the compositional version control system

“Developing version-control formalisms and software tools that decompose these models in natural ways and support building complex models via small incremental patches (such that each patch is fully understandable by a single human who is an expert in the relevant domain).” This requires leveraging theories like double-pushout rewriting and δ-lenses to develop a principled version-control system for collaborative and forking edits to world-models, multiple overlapping levels of abstraction, and incremental compilation in response to small edits.
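This is nowhere near double-pushout rewriting or δ-lenses, but as a toy illustration of the workflow being described (small, individually checkable patches to a decomposed world model), something like the following could be a starting point; every name here is ours:

    from dataclasses import dataclass
    import hashlib

    def digest(text: str) -> str:
        return hashlib.sha256(text.encode()).hexdigest()[:12]

    @dataclass
    class Patch:
        module: str       # which submodel the patch touches, e.g. "epidemiology.sir"
        base_hash: str    # hash of the fragment the expert actually reviewed
        new_source: str   # replacement fragment, small enough for one human to review

    def apply_patch(world_model: dict, patch: Patch, check_module) -> dict:
        # Apply one incremental patch and re-check only the touched submodel
        # (a stand-in for incremental compilation in response to small edits).
        current = world_model[patch.module]
        if digest(current) != patch.base_hash:
            raise ValueError("stale patch: the module changed since it was reviewed")
        if not check_module(patch.new_source):
            raise ValueError("patched module fails its local checks")
        updated = dict(world_model)
        updated[patch.module] = patch.new_source
        return updated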

Getting traction on the deontic feasibility hypothesis

Davidad believes that using formalisms such as Markov Blankets would be crucial in encoding the desiderata that the AI should not cross boundary lines at various levels of the world-model. We only need to “imply high probability of existential safety”, so according to davidad, “we do not need to load much ethics or aesthetics in order to satisfy this claim (e.g. we probably do not get to use OAA to make sure people don't die of cancer, because cancer takes place inside the Markov Blanket, and that would conflict with boundary preservation; but it would work to make sure people don't die of violence or pandemics)”. Discussing this hypothesis more thoroughly seems important.
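For intuition about the Markov-blanket framing, here is the standard graphical-models notion computed on a toy Bayesian network; this is only meant to convey the "boundary" intuition, not Davidad's actual boundary formalism:

    def markov_blanket(node, parents):
        # Markov blanket of `node` in a Bayesian network given as a dict mapping
        # each node to its set of parents: the node's parents, its children, and
        # its children's other parents. Conditioned on this set, the node is
        # independent of the rest of the network -- the intuition behind treating
        # it as a boundary that policies should not cross.
        children = {c for c, ps in parents.items() if node in ps}
        co_parents = {p for c in children for p in parents[c]} - {node}
        return parents.get(node, set()) | children | co_parents

    # Toy example: an agent's internal state is shielded from the wider world
    # by its sensory inputs and actions.
    graph = {
        "internal_state": {"sensory_input"},
        "action": {"internal_state"},
        "world": {"action"},
        "sensory_input": {"world"},
    }
    print(markov_blanket("internal_state", graph))  # {'sensory_input', 'action'} (order may vary)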

Some other projects

Types explanation

Explanation in layman's terms of the types in the main schema. Those notations are the same as those used in reinforcement learning.
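For readers who have not seen them, the standard reinforcement-learning types look roughly as follows (our rendering, not the schema's exact notation):

\[
  S \ \text{(states)}, \quad A \ \text{(actions)}, \quad T : S \times A \to \Delta(S) \ \text{(transitions)}, \quad R : S \times A \to \mathbb{R} \ \text{(reward)}, \quad \pi : S \to \Delta(A) \ \text{(policy)},
\]

where $\Delta(X)$ denotes probability distributions over $X$; in the OAA schema, these Bayesian distributions are presumably replaced by their infra-Bayesian counterparts (sets of distributions), as discussed above.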

34 comments

Comments sorted by top scores.

comment by So8res · 2023-04-19T18:04:38.867Z · LW(p) · GW(p)

(For context vis-a-vis my enthusiasm about this plan, see this comment [LW(p) · GW(p)]. In particular, I'm enthusiastic about fleshing out and testing some specific narrow technical aspects of one part of this plan. If that one narrow slice of this plan works, I'd have some hope that it can be parlayed into something more. I'm not particularly compelled by the rest of the plan surrounding the narrow-slice-I-find-interesting (in part because I haven't looked that closely at it for various reasons), and if the narrow-slice-I-find-interesting works out then my hope in it mostly comes from other avenues. I nevertheless think it's to Davidad's credit that his plan rests on narrow specific testable technical machinery that I think plausibly works, and has a chance of being useful if it works.)

comment by simeon_c (WayZ) · 2023-04-19T16:33:28.140Z · LW(p) · GW(p)

Thanks for writing that up Charbel & Gabin. Below are some elements I want to add.

In the last 2 months, I spent more than 20h with David talking and interacting with his ideas and plans, especially in technical contexts. 
As I spent more time with David, I got extremely impressed by the breadth and the depth of his knowledge. David has cached answers to a surprisingly high number of technically detailed questions on his agenda, which suggests that he has pre-computed a lot of things regarding his agenda (even though it sometimes looks very weird at first sight). I noticed that I have never met anyone as smart as him.

Regarding his ability to devise a high level plan that works in practice, David has built a technically impressive crypto (today ranked 22nd) following a similar methodology, i.e. devising the plan from first principles. 

Finally, I'm excited by the fact that David seems to have a good ability to build ambitious coalitions with researchers, which is a great upside for governance and for such an ambitious proposal. Indeed, he has a strong track record of convincing researchers to work on his stuff after talking for a couple hours, because he often has very good ideas on their field.

These elements, combined with my increasing worry that scaling LLMs at breakneck speed is not far from certain to kill us, make me want to back this proposal heavily and pour a lot of resources into it.

I'll thus personally dedicate in my own capacity an amount of time and resources to try to speed that up, in the hope (10-20%) that in a couple of years it could become a credible proposal as an alternative to scaled LLMs. 

comment by Fabien Roger (Fabien) · 2024-04-10T21:36:39.470Z · LW(p) · GW(p)

I don't think I understand what is meant by "a formal world model".

For example, in the narrow context of "I want to have a screen on which I can see what python program is currently running on my machine", I guess the formal world model should be able to detect if the model submits an action that exploits a zero-day that tampers with my ability to see what programs are running. Does that mean that the formal world model has to know all possible zero-days? Does that mean that the software and the hardware have to be formally verified? Are formally verified computers roughly as cheap as regular computers? If not, that would be a clear counter-argument to "Davidad agrees that this project would be one of humanity's most significant science projects, but he believes it would still be less costly than the Large Hadron Collider."

Or is the claim that it's feasible to build a conservative world model that tells you "maybe a zero-day" very quickly once you start doing things not explicitly within a dumb world model?

I feel like this formally-verifiable computers claim is either a good counterexample to the main claims, or an example that would help me understand what the heck these people are talking about.

Replies from: Paul W
comment by Paul W · 2024-05-16T08:25:03.873Z · LW(p) · GW(p)

I believe that the current trends for formal verification, say, of traditional programs or small neural networks, are more about conservative overapproximations (called abstract interpretations). You might want to have a look at this: https://caterinaurban.github.io/pdf/survey.pdf 
To be more precise, it appears that so-called "incomplete formal methods" (3.1.1.2 in the survey I linked) are more computationally efficient, even though they can produce false negatives.
Does that answer your question?

Replies from: Fabien
comment by Fabien Roger (Fabien) · 2024-05-17T12:51:09.460Z · LW(p) · GW(p)

Not entirely. This makes me slightly more hopeful that we can have formal guarantees of computer systems, but is the field advanced enough that it would be feasible to have a guaranteed no-zero-day evaluation and deployment codebase that is competitive with a regular codebase? (Given a budget of 1 LHC for both the codebase inefficiency tax + the time to build the formal guarantees for the codebase.)

(And computer systems are easy mode, I don't even know how you would start to build guarantees like "if you say X, then it's proven that it doesn't persuade humans of things in ways they would not have approved of beforehand.")

Replies from: Paul W
comment by Paul W · 2024-05-18T16:04:12.025Z · LW(p) · GW(p)

Is the field advanced enough that it would be feasible to have a guaranteed no-zero-day evaluation and deployment codebase that is competitive with a regular codebase?

As far as I know (I'm not an expert), such absolute guarantees are too hard right now, especially if the AI you're trying to verify is arbitrarily complex. However, the training process ought to yield an AI with specific properties. I'm not entirely sure I got what you meant by "a guaranteed no-zero-day evaluation and deployment codebase". Would you mind explaining more?
 

"Or is the claim that it's feasible to build a conservative world model that tells you "maybe a zero-day" very quickly once you start doing things not explicitly within a dumb world model?" 

I think that's closer to the idea: you {reject and penalize, during training} as soon as the AI tries something that might be "exploiting a zero-day", in the sense that the world-model can't rule out this possibility with high confidence[1]. That way, the training process is expected to reward simpler, more easily verified actions.


Then, a key question is "what else you do want from your AI ?": of course, it is supposed to perform critical tasks, not just "let you see what program is running"[2], so there is tension between the various specifications you enter. The question of how far you can actually go, how much you can actually ask for, is both crucial, and wide open, as far as I can tell.

  1. ^

    Some of the uncertainty lies in how accurate and how conservative the world-model is; you won't get a "100% guarantee" anyway, especially since you're only aiming for probabilistic bounds within the model.

  2. ^

    Otherwise, a sponge would do.

Replies from: Fabien
comment by Fabien Roger (Fabien) · 2024-05-20T12:06:29.283Z · LW(p) · GW(p)

I was thinking that the formal guarantees would be about state evaluations (i.e. state -> badness bounds) - which would require sth like "showing there is no zero-day" (since "a code-base with a zero-day" might be catastrophically bad if no constraints are put on actions). Thanks for pointing out they can be about action (i.e. (state, action) -> badness bounds), which seem intuitively easier to get good bounds for (you don't need to show there are no zero-days, just that the currently considered action is extremely unlikely to exploit a potential zero-day).

I'd be curious to know what kind of formal process could prove that (codebase, codebase-interaction) pairs are provably not-bad (with high probability, and with a false positive rate low enough if you trained an AI to minimize it). My guess is that there is nothing like that on the horizon (that could become competitive at all), but I could be wrong.

("let you see what program is running" was an example of a very minimal safety guarantee I would like to have, not a representative safety guarantee. My point is that I'd be surprised if people got even such a simple and easy safety guarantee anytime soon, using formal methods to check AI actions that actually do useful stuff.)

comment by dr_s · 2023-04-20T12:19:10.001Z · LW(p) · GW(p)

We only need to model quantum chemistry and higher levels.

As someone with years of practical experience in quantum chemistry simulation, you can't overstate how much heavy lifting that "only" is doing here. We are not close, not even remotely close, not even we-can-at-least-see-it-on-the-horizon close to the level of completeness required here. For a very basic example, we can't even reliably and straightforwardly predict via quantum simulations whether a material will be a superconductor. Even guessing what quantum mechanics does to the dynamics of atomic nuclei is crazy hard and expensive, I'm talking days and days of compute on hundreds of cores thrown at a single cube of 1 nm side.

The problem here is that the reason why we'd want ASI is because we expect it might see patterns where we don't, and thread the needle of discovery in the hyperdimensional configuration space of possibility without having to brute force its way through it. But we have to brute force it, right now. If ASI found a way to make nanomachines that relies on more exotic principles than basic organic chemistry, or is subtly influenced by some small effect of dispersion forces that can't be reliably simulated with our usual approximations, then we'd need to be able to simulate theory to at least that level to get at that point of understanding. We need ASI to interpret what ASI is doing efficiently...

My immediate impression is that this doesn't blast the whole plan open. I think you can reasonably decouple the social, economic and moral aspects of the model from the scientific one. The first one is also hard to pin down, but for very different reasons, and I think we might make some progress in that sense. It's also more urgent, because current LLMs aren't particularly smart at doing science, but they're already very expert talkers (and bullshitters). Then we just don't let the AI directly perform scientific experiments. Instead, we have it give us recipes, together with a description of what they are expected to do, and the AI's best guess of their effect on society and why they would be beneficial. If the AI is properly aligned to the social goals, which it should be at this point if it has been developed iteratively within the bounds of this model, it shouldn't straight up lie. Any experiments are then to be performed with high levels of security, airgaps, lockdown protocols, the works. As we go further, we might then incorporate "certified" ASIs in the governance system to double-check any other proposals from different ASIs, and so on and so forth.

IMO that's as good as it gets. If the values and the world model of the AI are reliable, then it shouldn't just create grey goo and pass it as a new energy technology. It shouldn't do it out of malice, and shouldn't do it by mistake, especially early on when its scientific capabilities would still be relatively limited. At that point of course developing AI tools to e.g. solve the quantum structure and dynamics of a material without having to muck about with DFT, quantum Monte Carlo or coupled cluster simulations would have to be a priority (both for the model's sake and because it would be mighty useful). And if it turns out that's just not possible, then no ASI should be able to come up with anything so wild we can't double check it either.

Replies from: Roman Leventov
comment by Roman Leventov · 2023-09-18T17:04:44.360Z · LW(p) · GW(p)

"Solving quantum chemistry" is not the domain of ASI, it's a task for a specialised model, such as AlphaFold. An ASI, it if need to solve quantum chemistry, would not "cognise" it directly (or "see patterns" in it) but rather develop an equivalent of AlphaFold for quantum chemistry, potentially including quantum computers into its R&D program plan.

comment by Charlie Sanders (charlie-sanders) · 2023-04-20T12:47:44.808Z · LW(p) · GW(p)

My intuition is that a simulation such as the one being proposed would take far longer to develop than the timeline outlined in this post. I’d posit that the timeline would be closer to 60 years than 6.

Also, a suggestion for tl;dr: The Truman Show for AI.

Replies from: PeterMcCluskey, alexander-gietelink-oldenziel, jacob_cannell, charbel-raphael-segerie
comment by PeterMcCluskey · 2023-04-20T19:03:36.129Z · LW(p) · GW(p)

Agreed.

Davidad seems to be aiming for what I'd call infeasible rigor, presumably in hope of getting something that would make me more than 95% confident of success.

I expect we could get to 80% confidence with this basic approach, by weakening the expected precision of the world model, and evaluating the AI on a variety of simulated worlds, to demonstrate that the AI's alignment is not too sensitive to the choice of worlds. Something along the lines of the simulations in Jake Cannell's LOVE in a simbox [LW(p) · GW(p)].

Is 80% confidence the best we can achieve? I don't know.

comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2023-04-21T16:29:57.314Z · LW(p) · GW(p)

My understanding is that the world model is more like a very coarse projection of the world than a simulation.

It's not the case that the AGI has to be fooled into thinking the simulation is real like in the Truman Show (I like the name tho!).

Davidad only tries to achieve 'safety' - not alignment. Indeed the AI may be fully unaligned.

The proposal is different from simulation proposals like Jacob Cannell's LOVE in a simbox where one tries to align the values of the AI.

In davidad's proposal the actual AGI is physically boxed and cannot interact with the world except through proposing policies inside this world model (which get formally checked in the second stage).

One way of thinking about it is that davidad's proposal is really an elaborate boxing protocol, but there are multiple boxes here:

  • The physical Faraday cage that houses the hardware
  • The interface constraint that constrains the AI to only output into the formal world model
  • The formal cage that is achieved by verifying the behaviour through mathmagic.

Although the technical challenges seem daunting, especially on such short timelines, this is not where I am most skeptical. The key problem, as with all boxing proposals, is more of a governance and coordination problem.

comment by jacob_cannell · 2023-04-21T17:17:01.165Z · LW(p) · GW(p)

Generally agree with that intuition, but as others point out the real Truman Show for AI is simboxing (which is related [LW(p) · GW(p)], but focuses more on knowledge containment to avoid deception issues during evaluations).

Davidad is going for more formal safety where you can mostly automatically verify all the knowledge in the agent's world model, presumably verify prediction rollouts, verify the selected actions correspond to futures that satisfy the bargaining solution, etc etc. The problem is this translates to a heavy set of constraints on the AGI architecture.

LOVES is instead built on the assumption that alignment can not strongly dictate the architecture - as the AGI architecture is determined by the competitive requirements of efficient intelligence/inference (and so is most likely a large brain-like ANN). We then search for the hyperparams and minimal tweaks which maximize alignment on top of that (ANN) architecture.

comment by Charbel-Raphaël (charbel-raphael-segerie) · 2023-04-20T19:56:30.716Z · LW(p) · GW(p)

I don't have the same intuition for the timeline, but I really like the tl;dr suggestion!

comment by bgold · 2023-04-27T14:17:33.934Z · LW(p) · GW(p)

First off thank you for writing this, great explanation.

  • Do you anticipate acceleration risks from developing the formal models through an open, multilateral process? Presumably others could use the models to train and advance the capabilities of their own RL agents. Or is the expectation that regulation would accompany this such that only the consortium could use the world model?
  • Would the simulations be exclusively for 'hard science' domains - ex. chemistry, biology - or would simulations of human behavior,  economics, and politics also be needed? My expectation is that it would need the latter, but I imagine simulating hundreds of millions of intelligent agents would dramatically (prohibitively?) increase the complexity and computational costs.
Replies from: gabin-kolly
comment by Gabin (gabin-kolly) · 2023-05-02T23:08:39.148Z · LW(p) · GW(p)
  • The formal models don't need to be open and public, and probably shouldn't be. Of course this adds a layer of difficulty, since it is harder to coordinate on an international scale and invite a lot of researchers to help on your project when you also want some protection against your model being stolen or published on the internet. It is perhaps okay if it is open source in the case where it is very expensive to train a model in this simulation and no other group can afford it.
  • Good question. I don't know, and I don't think that I have a good model of what the simulation would look like. Here is what my (very simplified, probably wrong) model of Davidad would say:
    • We only want to be really sure that the agent is locally nice. In particular, we want the agent to not hurt people (or perhaps only if we can be sure that there are good reasons, for example if they were going to hurt someone). The agent should not hurt them with weapons, or by removing the oxygen, or by increasing radiations. For that, we need to find a mathematical model of human boundaries, and then we need to formally verify that these boundaries will be respected. Since the agent is trained using time-bounded RL, after a short period of time it will not have any effect anymore on the world (if time-bounded RL indeed works), and the stakeholders will be able to determine if the policy had a good impact on the world or not, and if not, train another agent and/or change the desiderata and/or improve the formal model. That's why it is more important to have a fine model of chemistry and physics, and we can do with a coarser model of economics and politics. In particular, we should not simulate millions of people.
  • Is it reasonable? I don't know, and until I see this mathematical model of human boundaries, or a very convincing prototype, I'll be a bit skeptical.
comment by Chipmonk · 2023-05-04T17:10:45.056Z · LW(p) · GW(p)

I've compiled most if not all of everything Davidad has said about «boundaries» (which are mentioned in this post in connection with the "deontic feasibility hypothesis" and "elicitors") to date here: «Boundaries» and AI safety compilation [LW · GW]. Also see: «Boundaries» for formalizing a bare-bones morality [LW · GW]

comment by Chipmonk · 2023-09-15T23:56:28.175Z · LW(p) · GW(p)

FWIW I find using the word "alignment" when specifically what you mean is "safety" to be confusing. I consider AI alignment to be "getting actively positive things that you want because of AI". I consider AI safety to be "getting no actively negative things to happen because of AI". And it seems that Davidad's approach is much more about safety than it is alignment [LW · GW].

comment by Chipmonk · 2023-05-11T15:48:21.412Z · LW(p) · GW(p)

The Overleaf project linked in the last word of "Why Category Theory" is restricted.

comment by Ozyrus · 2023-04-20T05:22:00.338Z · LW(p) · GW(p)

Very interesting. Might need to read it few more times to get it in detail, but seems quite promising.

I do wonder, though; do we really need a sims/MFS-like simulation?

It seems right now that an LLM wrapped in an LMCA [LW · GW] is what early AGI will look like. That probably means that they will "see" the world via text descriptions fed into them by their sensory tools, and act using action tools via text queries (also described here [LW · GW]).

Seems quite logical to me that this very paradigm is dualistic in nature. If an LLM can act in the real world using an LMCA, then it can model the world using some different architecture, right? Otherwise it will not be able to act properly.

Then why not test LMCA agent using its underlying LLM + some world modeling architecture? Or a different, fine-tuned LLM.

 

Replies from: Darcy
comment by Dalcy (Darcy) · 2023-05-12T02:59:57.679Z · LW(p) · GW(p)

I think the point of having an explicit human-legible world model / simulation is to make the desiderata formally verifiable, which I don't think would be possible with a blackbox system (like an LLM w/ wrappers).

comment by Lorxus · 2024-08-02T16:46:33.048Z · LW(p) · GW(p)

, and here]

This overleaf link appears to be restricted-access-only?

comment by Review Bot · 2024-03-17T09:21:23.765Z · LW(p) · GW(p)

The LessWrong Review [? · GW] runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2024. The top fifty or so posts are featured prominently on the site throughout the year.

Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?

comment by wassname · 2024-01-01T02:54:02.060Z · LW(p) · GW(p)

This system's complexity almost precludes the use of Python for modeling; a new functional programming language specifically designed for this task would likely be necessary, potentially incorporating category theory. Humans would verify the world model line by line.

 

Python is Turing complete though, so it should be enough. Of course we might prefer another language, especially one optimised for reading rather than writing, since that will be most of what humans do.

comment by wassname · 2024-01-01T02:52:36.682Z · LW(p) · GW(p)

To expedite the development of this world model using LLMs, methods such as this one could be employed.

 

Check out this World Model using transformers (IRIS-delta https://openreview.net/forum?id=o8IDoZggqO). It seems like a good place to start, and I've been tinkering. It's slow going because of my lack of compute however.

comment by Chipmonk · 2023-12-23T20:36:51.667Z · LW(p) · GW(p)

Getting traction on the deontic feasibility hypothesis

Was using the word "feasibility" here, instead of "sufficiency" like it is everywhere else in this post (and this post [AF(p) · GW(p)]) just an accidental oversight? Or somehow intentional?

Replies from: charbel-raphael-segerie
comment by Charbel-Raphaël (charbel-raphael-segerie) · 2023-12-23T22:49:40.887Z · LW(p) · GW(p)

ah, no this is a mistake. Thanks

comment by Dawn Drescher (Telofy) · 2023-11-23T12:59:46.947Z · LW(p) · GW(p)

Thanks so much for the summary! I'm wondering how this system could be bootstrapped in the industry using less powerful but current-levels-of-general AIs. Building a proof of concept using a Super Mario world is one thing, but what I would find more interesting is a version of the system that can make probabilistic safety guarantees for something like AutoGPT so that it is immediately useful and thus more likely to catch on. 

What I'm thinking of here seems to me a lot like ARC Evals with probably somewhat different processes. Humans doing tasks that should, in the end, be automated. But that's just how I currently imagine it after a few minutes of thinking about it. Would something like that be so far from OAA to be uninformative toward the goal of testing, refining, and bootstrapping the system?

Unrelated: Developing a new language for the purpose of the world modeling would introduce a lot of potential for bugs and there'd be no ecosystem of libraries. If the language is a big improvement over other functional languages, has good marketing, and is widely used in the industry, then that could change over the course of ~5 years – the bugs would largely get found and an ecosystem might develop – but that seems very hard, slow, risky, and expensive to pull off. Maybe Haskell could do the trick too? I've done some correctness proofs of simple Haskell programs at the university, and it was quite enjoyable.

comment by Roman Leventov · 2023-09-18T16:54:51.131Z · LW(p) · GW(p)

I don't think deontic desiderata and constraints are practically sufficient to train (obtain) ethical behaviour that we would approve of: virtue ethics (a non-deontic lens on ethics, vs. deontology and consequentialism, which are deontic) exists and is still viable for a reason. That being said, seeding the policy training stage with LLM-based heuristics could effectively play the role of "virtue ethics" here: literally, these LLMs could be prompted to suggest policies like a "benevolent steward of the civilisation". If my guess (that deontic desiderata and constraints are not enough) is true, then these heuristic policies should leave a non-trivial mark on the final trained policy, i.e., effectively select among desiderata-conforming policies.


RL Limitations: While reinforcement learning has made significant advancements, it still has limitations, as evidenced by MuZero's inability to effectively handle games like Stratego. To address these limitations, assistance from large language models might be required to bootstrap the training process. However, it remains unclear how to combine the strengths of both approaches effectively—leveraging the improved reliability and formal verifiability offered by reinforcement learning while harnessing the advanced capabilities of large language models.

In general, I think the place of RL policies in the OAA would be better served by hierarchical embedding predictive architecture with GFlowNets as planner (actor) policies [LW · GW]. This sidesteps the problem of "designing a reward function" because GFlowNet-EM algorithm (Zhang et al., 2022) can jointly train the policy and the energy (reward) function based on the feedback from the simulator and desiderata/constraints model.

In fact, the LLM-based "exemplary actor" [LW · GW] (which I suggested to "distill" into an H-JEPA with GFlowNet actors here [LW · GW]) could be seen as an informal model of both Compositional World Model and Policy Search Heuristics in the OAA. And the whole process that I described in that article could be used as the first "heuristical" sub-step in the Step 2 of the OAA, which would serve three different purposes:

  • Reduce the cost of the training of the policy on the feedback from simulated world model alone, which can be prohibitively expensive because the reward signal is too sparse. Which is the stated purpose of using Policy Search Heuristics in the OAA.
  • Connect the policy with language and thus make the policy explainable: the side effect of training the H-JEPA with GFlowNet actors on the outputs of an LLM-based agent is that we also get an auxiliary model that produces "reference" explanations for the given plans. We can reasonably expect that these explanations will be "faithful" rather than degenerate or deceptive, as I argue here [LW · GW].
  • The LLM-based "exemplary actor" may incorporate some "virtue ethics" and thus address the potential "deontic insufficiency" (as I touch upon in the beginning of this comment).

That being said, I speculated [LW · GW] that even this "heuristical" sub-step may take orders of magnitude more compute and/or orders of magnitude larger DNNs than next-gen SoTA LLMs circa 2024-2026. The "main" training with Simulated World Model and desiderata/constraints could take much more compute, still. Thus, I have serious concerns [LW(p) · GW(p)] whether training such a policy will be computationally feasible, especially considering the problem of World Model invalidation as any new bit of scientific knowledge enters the picture.

On the other hand, I doubt that complex systems such as societies, ecosystems, economics/markets, and the civilisation could be modelled long-term with reasonable fidelity. Also, as OAA is introduced into the civilisation, it generates Gödelian problems of the limits of self-modelling. Furthermore, the highest levels of planning and modelling in the OAA might be dominated by the issues of political bargaining rather than finding a workable policy (at this stage, the civilisation might be sufficiently powerful to execute a wide range of reasonable policies if everyone agrees with the plan).  Given all this, it seems that Steps 2-4 might even be unnecessary, and instead we have people or LLMs suggesting relatively short-term plans and bargaining them. The massive step forward here is just the formalisation of desiderata, constraints, and bargaining, all of which in turn rest on the World Model.

Replies from: davidad
comment by davidad · 2023-09-18T17:20:03.527Z · LW(p) · GW(p)

I like the idea of trying out H-JEPA with GFlowNet actors.

I also like the idea of using LLM-based virtue ethics as a regularizer, although I would still want deontic guardrails that seem good enough to avoid catastrophe.

comment by Roman Leventov · 2023-09-18T14:55:09.242Z · LW(p) · GW(p)

The main technical crux: We think the main difficulty is not this level of capability, but the fact that this level of capability is beyond the ability to publish papers at conferences like NeurIPS, which we perceive as the threshold for recursive self-improvement. So this plan demands robust global coordination to avoid foom. And models helping with alignment research seem much more easily attainable than the creation of this world model, so OpenAI’s plan may still be more realistic.

I'd call this the main (or one of the main) strategic crux. The main technical cruxes are the simulation feasibility, the feasibility of training a good policy from the sparse signal from the simulation if the latter is feasible, and the "political bargain" part, in particular the issue of representing nested, overlapping, and nebulous stake-holders/constituents (such as families, communities, societies, nations, ecosystems, etc.)

comment by Chipmonk · 2023-07-14T23:28:03.371Z · LW(p) · GW(p)

For the deontic feasibility hypothesis, do you have any expectation for whether formalizing the moral desiderata (specifically: «boundaries») will or should ultimately be done by 1) humans; or 2) automated AI alignment assistants? @davidad [LW · GW]

Replies from: davidad
comment by davidad · 2023-07-22T18:05:38.094Z · LW(p) · GW(p)

The formal desiderata should be understood, reviewed, discussed, and signed-off on by multiple humans. However, I don't have a strong view against the use of Copilot-style AI assistants. These will certainly be extremely useful in the world-modeling phase, and I suspect will probably also be worth using in the specification phase. I do have a strong view that we should have automated red-teamers try to find holes in the desiderata.