Why I find Davidad's plan interesting

post by Paul W · 2024-05-20T08:13:15.950Z · LW · GW · 0 comments


Abstract 

The aim of this text is to give an overview of Davidad’s safety plan [LW · GW], while also outlining some of its limitations and challenges. Additionally, I’ll explain why I would like to contribute to this field.
 

Disclaimers

I am not Davidad; I have tried to convey his ideas as I understand them. While my interpretation may not be exact, I hope it still holds value. Also, this post does not focus on the technical details; I might write another one later, with a deeper technical discussion[1].

Epistemic status

I began exploring these questions during a weekend-long research sprint. Then, for a couple of months, I kept thinking about them and reading related posts (such as Davidad’s List of AI Safety Problems [LW · GW]). I also discussed them with Charbel, who co-authored this post on Davidad's plan [LW · GW]. While empirical evidence may be limited at this stage, I anticipate more to emerge as efforts progress[2]. I'm currently learning potentially relevant mathematical theory. However, the aim here is to offer a broad explanation of key ideas rather than an in-depth analysis of the technical problems.


Acknowledgements

Thanks to Charbel-Raphaël Segerie for discussing these questions with me. Thanks to Jeanne S. for useful suggestions.
 

 

Outline


0/ Definitions

I/ The plan, and why it might yield positive results

II/ Technical questions, risks, objections

III/ Why I'm interested in contributing
 

0/Definitions

In what follows, I make a distinction between “Artificial General Intelligence” and “Artificial Super Intelligence”: the latter has strongly superhuman performance, while the former is roughly comparable to a human being (if we ignore the speed and duplicability factors).

I shall use “AI Safety” in a loose way[3], essentially relying on the “notkilleveryoneism” idea: the point is to limit the overall amount of harm[4] caused by AIs (and humans), while keeping some of the benefits of powerful AIs.

A “formalized world-model” is one inside which formal proofs (which can accommodate relevant forms of nondeterminism) can be derived, typically regarding the consequences of some actions. A key underlying assumption, of course, is that the model is close enough to the real world, and that it describes the phenomena that are relevant to AI safety[5].

By “meaningful formal guarantees”, I mean safety-relevant properties (of the AI, or of the system comprising the AI and its deployment environment, or variations of these) derived within a formalized framework, rather than established only by empirical testing. The point is that one can then trust these guarantees roughly as much as one trusts the formalized world-model.
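To make these two definitions more concrete, here is a toy sketch of my own (not something taken from Davidad's plan): the "world-model" is a tiny, fully specified stochastic transition system, and the "guarantee" is a probability bound computed exactly from the model rather than estimated by sampling or testing. All state names, transition probabilities and the bound are made-up assumptions.

```python
# Toy sketch (my own assumptions, not Davidad's plan): a minimal "formalized
# world-model" given as a Markov chain over three abstract states, and a
# safety property checked exactly against the model (no empirical sampling).

STATES = ["nominal", "degraded", "unsafe"]
TRANSITIONS = {
    "nominal":  {"nominal": 0.99, "degraded": 0.01},
    "degraded": {"nominal": 0.80, "degraded": 0.199, "unsafe": 0.001},
    "unsafe":   {"unsafe": 1.0},  # absorbing: once unsafe, always unsafe
}

def prob_unsafe_within(horizon: int, start: str = "nominal") -> float:
    """Exact probability of reaching 'unsafe' within `horizon` steps,
    obtained by propagating the full distribution step by step."""
    dist = {s: 0.0 for s in STATES}
    dist[start] = 1.0
    for _ in range(horizon):
        new_dist = {s: 0.0 for s in STATES}
        for s, p in dist.items():
            for t, q in TRANSITIONS[s].items():
                new_dist[t] += p * q
        dist = new_dist
    # Because 'unsafe' is absorbing, its mass at the horizon equals the
    # probability of having ever reached it within the horizon.
    return dist["unsafe"]

if __name__ == "__main__":
    bound = 1e-2  # made-up acceptable risk level
    p = prob_unsafe_within(horizon=100)
    print(f"P(unsafe within 100 steps) = {p:.5f}  (bound: {bound})")
    print("Property holds in the model:", p <= bound)
```

The guarantee is only as good as the model: if the real system can leave the three modelled states, or if the probabilities are wrong, the bound says nothing about reality. Davidad's proposal targets vastly richer, nondeterministic, multi-scale world-models and machine-checkable proofs rather than exhaustive enumeration; the toy only illustrates what "derived in a formalized framework, as opposed to empirical testing" means.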

When I mention “strategy”, “strategic thinking”, etc., I refer to the art of formulating relevant intermediate steps in order to achieve a given terminal goal, and of amending them whenever the context changes. I am not using the term in the standard deterministic game-theoretic sense[6]. The main distinction I make is between “strategic questions” and “technical questions”.

When I say "promoting AI Safety", I mean "gathering resources (funds, compute, talent), getting relevant regulation passed and enforced, making the industry use safer solutions, etc." 


 

I/Overview of the plan

 

Context

In my mind, the main underlying assumptions of the plan are the following:


Goals

It's crucial to distinguish between the ambitious long-term vision of the plan and the more achievable intermediate steps. The ultimate objective of the plan might be summarized as "mitigating the risks posed by advanced AIs to a level where we can then focus on further advancing alignment and AI Safety in general." For example, if the aim is indeed to address the problem of "notkilleveryoneism," the assumptions underpinning the third and fourth steps may become less tenable.

Conversely, the incremental goals involve establishing an alternative AI development framework, demonstrating the compatibility of safety measures with economic interests, and formulating a positive objective that encourages cooperation among stakeholders[7]. This is supposed to help promote AI Safety.

 

Means

Now that I have outlined the objectives, let's explore potential pathways to achieve them. Here are the primary technical insights I've identified (more details are available in Davidad's post [LW · GW]).

 

Now, regarding strategy, these ideas seem relevant:

Finally, I'll outline several potential positive outcomes even in the event of failure:

II/Technical Questions, Risks, Objections

 

Problems to solve

Obviously, there are many (at least) very difficult[13] technical problems to solve to have any chance of success, especially for the maximal goal. The following questions seem most relevant[14]:
 

How it could fail

Even if the technical requirements are fulfilled, there are several potential pitfalls:



The bigger picture

Now, as a counterpoint to my own arguments: even if my understanding of this plan is accurate and its underlying assumptions are valid, I would still need to evaluate it against other plans and determine the most effective approach. It's important to note that the optimal plan, or even just viable ones, might involve investing resources in a combination of proposals[20]. Discussing this kind of strategic question would be worth a dedicated post.
 

III/Why I'm interested in contributing

The perspective of an individual with limited yet potentially valuable resources is not the same as that of humanity as a whole (assuming such a notion is even coherent). There may exist better overall plans to which I would not be able to contribute as significantly. For instance, my strengths lie more in mathematical research than in policy-making[21].

Furthermore, I believe that having even a vague sense of the end goal, coupled with at least a glimmer of hope that my work could be impactful, significantly enhances the quality of my research efforts. While I'm hesitant to argue that blindly optimistic researchers outperform those who are fully aware (akin to the different skill sets of a captain versus a general), I also refrain from claiming the opposite[22].

Regarding other math-oriented fields within AI Safety research, such as Singular Learning Theory or Agent Foundations, either I'm not as persuaded by their end goals (in the case of SLT), or I perceive my potential contributions as less influential (in AF, SLT). This isn't to suggest that these areas should be disregarded[23].

These are the primary reasons why I am currently set on trying to contribute to Davidad’s programme, one way or another.

 

In case other people are interested: there are several areas, requiring various profiles, where much work is yet to be done. A tentative list:

See also page 6 of the call for proposals, and page 11 of the programme thesis, from which most of this was extracted.

  1. ^

    If you're interested in a short and more technical summary, I can recommend this comment [LW(p) · GW(p)].

  2. ^

    One important first step is to assess the feasibility of the maximal version of this plan.

  3. ^

    If you really need precise definitions, feel free to plug in your own, and see if the text then makes sense.

  4. ^

    Yes, from this point of view, you need to define precisely enough what you mean by harm and benefits. Or rather, the outcome you get depends on your specifications.

  5. ^

    Yes, that's a lot.

  6. ^

    Key features are imperfect information, bounded computing power, and randomness; see for instance Clausewitz's notion of friction. As far as I know, these constraints imply that strategy is more about finding and applying good heuristics than about solving puzzles. Feel free to provide counterarguments in the comments.

  7. ^

    One might even view fostering cooperation as the main goal of the plan. 

  8. ^

    Yes, we would really like to cover all of them.

  9. ^

    One could argue that the AI being trained and verified might find bugs in the world-model, and exploit them. This possibility might be worth investigating.

  10. ^

    The idea here is that these proof-searching AIs can be boxed and are not a threat. They would only be trained to "play the formal proof-searching game".

  11. ^

    As far as I know, training good researchers takes time and resources, and the AI Safety field is short on senior people.

  12. ^

    For instance, work on nondeterministic multi-scale modelling, or development of processes that help express human desiderata, might be used elsewhere.

  13. ^

    Knowing precisely how difficult they are is itself quite a challenge. The first steps involve getting an approximate answer.

  14. ^

    Some tentative answers already exist, but I won't say much about them here. See for instance ARIA's call for proposals, pages 8-10.

  15. ^

    It might be worth recalling that if your model, or your verification process, takes too long, then it is useless.

  16. ^

    Static analysis would mean checking the AI once, with a given version of the world-model, then deploying it. Note that, if you have a functioning shutdown process, you are still able to iterate.

  17. ^

    Dual-use issues apply to these tools as well, not only to AIs.

  18. ^

    I am afraid I do not know enough about economics and politics to quantify this issue. The game-theoretic analysis (Appendix B) aims at providing an answer.

  19. ^

    Even if it respects the desiderata, a drastic action, formally proved rational, might or might not be accepted by all stakeholders. 

  20. ^

    Consider a basic scenario: suppose there are two proposals, only one of which can be effective in reality. Assuming that the resources required to discern their efficacy don't hinder future success, you could optimize your chances by initially dividing your resources between the two and then directing your focus accordingly.

    However, this isn't the only conceivable scenario. Alternatively, both plans might have potential, but necessitate almost all available resources to yield results. In such a case, concentrating on a single plan would be more prudent.

    The question then arises: which scenario reflects our current reality?
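
    For concreteness, here is a rough numerical version of these two scenarios. This is a toy model of my own, not something from the plan; every number is an illustrative assumption, not an estimate.

    ```python
    # Toy model (my own assumptions): a plan succeeds iff it is workable in
    # reality and receives at least the fraction of resources it needs.
    def succeeds(resources: float, workable: bool, needed: float) -> float:
        return 1.0 if (workable and resources >= needed) else 0.0

    # Scenario 1: exactly one of the two plans is workable (we don't know which),
    # each needs 80% of total resources, and spending 10% on "scouting" reveals
    # which one is workable without hindering future success.
    needed, scout = 0.80, 0.10
    commit_blindly = 0.5 * succeeds(1.0, True, needed)      # right plan only half the time
    scout_then_focus = succeeds(1.0 - scout, True, needed)  # always end up on the workable plan
    print("Scenario 1: commit blindly =", commit_blindly, "| scout then focus =", scout_then_focus)

    # Scenario 2: both plans are workable, but each needs ~95% of total resources.
    needed = 0.95
    split = succeeds(0.5, True, needed)        # hedging leaves both under-resourced
    concentrate = succeeds(1.0, True, needed)  # committing fully to either one works
    print("Scenario 2: split =", split, "| concentrate =", concentrate)
    ```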

  21. ^

    And yet, policy-making might be more effective than mathematical research in this case.

  22. ^

    Feel free to give relevant arguments in the comments, if you thought about these questions.

  23. ^

    At least, additional arguments would be required.
