A short critique of Omohundro's "Basic AI Drives"

post by Soumyadeep Bose (soumyadeep-bose) · 2024-12-19T19:19:52.864Z · LW · GW · 0 comments

Contents

  Summary
  Some Background
  The Critique
    Assumptions related to Utility Functions
    Emergent Behaviours and Predictability
    Goal Emergence and Human Parallels
    Risks from Latent Misalignment
    Insights from Mechanistic Interpretability
    Adversarial and External Interactions
    Ethics and Moral Truths
    Other Potential Updates to the Paper
  References Used

Summary

Omohundro, in his paper, assumes goal-oriented AIs with well-defined utility functions, but contemporary models like GPT-3.5 and Claude largely exhibit behaviours without explicit goals. Advanced models such as GPT-o1 and Claude-3.5, however, have shown preliminary indications of goal-oriented behaviour in controlled environments.

Omohundro’s reliance on predictable rationality is undermined by emergent capabilities (for example, grokking, multi-step reasoning and zero-shot learning) which arise unpredictably through phase transitions, often triggered by parameter scaling. These behaviours challenge the linear assumptions of rationality and gradual capability development. Research has also shown that AI systems can develop emergent behaviour when placed in competitive or adversarial scenarios. The paper also underestimates risks like latent misalignment, where AIs exploit proxy objectives, as seen in “wireheading” or adversarially misaligned GPT-4 outputs.

Moreover, modern AI systems often develop behaviours influenced by external interactions, with models like GPT-o1 demonstrating deceptive strategies in adversarial settings. Omohundro’s framework must adapt to address these risks through transparency, adaptive oversight and robust alignment mechanisms. Additionally, advances in mechanistic interpretability have highlighted the importance of understanding the internal mechanisms of neural networks in order to constrain emergent behaviours. Omohundro’s paper should incorporate these insights too.

Lastly, the framework could also benefit from expanding the concept of utility preservation to include mechanisms for robust alignment and for mitigating the exploitation of proxy objectives. Frameworks can also be developed for early detection of emergent properties using benchmarks and scaling-law extrapolations. Finally, Omohundro’s call for a “universal constitution” remains relevant, but such a constitution must evolve to incorporate global collaborations and shared methodologies to ensure safer, aligned AI systems.

Some Background

This critique was originally written in response to a question for an application. References to "this paper" or similar terms pertain to S. Omohundro's work, The Basic AI Drives.

For the sake of transparency, I would like to disclose that I used ChatGPT for assistance in searching for suitable references, summarising lengthy research papers, checking and correcting grammatical errors, and clarifying complex concepts.

The Critique

In the given paper, Omohundro argues that advanced AIs will develop rational drives like self-preservation, utility preservation, preventing counterfeit utility, and resource acquisition based on utility-maximization. While these ideas provide a valuable framework, several assumptions within the paper face challenges from modern advancements in AI research.

Assumptions related to Utility Functions

First and foremost, Omohundro assumes that AI systems will be goal-oriented, with clearly defined utility functions preserved against counterfeiting by internal or external sources. However, contemporary models like GPT-3.5 and Claude largely operate without explicit goals: their behaviours emerge from stochastic optimization during training. This kind of emergence undermines the universal applicability of Omohundro's framework. Having said that, more advanced models like GPT-o1, Claude-3.5 and even GPT-4 have now begun showing preliminary indications of explicit objectives in controlled environments[1][2].
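To make the contrast concrete, here is a minimal (and entirely toy) Python sketch of the difference between the kind of agent Omohundro describes, which maximizes an explicit, inspectable utility function, and an LLM-style system, whose training objective is just next-token cross-entropy with no utility function anywhere. The action names, utility values and token probabilities below are invented for illustration.

```python
import math

# 1) Omohundro-style agent: an explicit utility function, and behaviour
#    defined as argmax over that function. (Toy values, for illustration.)
UTILITY = {"acquire_resources": 0.9, "self_preserve": 0.8, "idle": 0.1}

def omohundro_agent(actions):
    """Pick the action that maximizes the explicit utility function."""
    return max(actions, key=lambda a: UTILITY[a])

# 2) LLM-style system: no utility function anywhere. Training only nudges
#    parameters to reduce next-token cross-entropy; any goal-like behaviour
#    is an emergent property of the learned distribution.
def next_token_loss(predicted_probs, target_token):
    """Cross-entropy on a single token: -log p(target)."""
    return -math.log(predicted_probs[target_token])

print(omohundro_agent(["acquire_resources", "self_preserve", "idle"]))
print(next_token_loss({"cat": 0.7, "dog": 0.3}, "cat"))
```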

Emergent Behaviours and Predictability

Regarding predictability, Omohundro frames AI behaviour as rational and economically predictable. However, recent research on GPT-4 and Claude highlights how capabilities such as multi-step reasoning, zero-shot learning[3], goal adaptation, or even the simple ability to solve 3-digit addition problems emerge suddenly and unpredictably through phase transitions, often triggered by parameter scaling[4]. Research has also shown that AI systems can develop emergent behaviour when placed in competitive or adversarial scenarios, leading to outcomes that may not align with their initial programming[5].

Goal Emergence and Human Parallels

In a paper titled “Challenges to the Omohundro-Bostrom framework for AI motivations” by Olle Häggström[6], the author argues that human-like goals are often emergent and context-dependent. This parallels modern AI systems, where emergent behaviours are interpreted as goal-driven even when no explicit goal exists. Grokking in LLMs is an example of such an emergent behaviour, one that has not yet been fully explained by mechanistic interpretability[7]. Such jumps in reasoning capabilities and behaviours challenge traditional notions of gradual capability development[8], showing that developments tied to scaling laws can be discontinuous and unpredictable. Omohundro’s framework must therefore adapt to account for these abrupt and complex shifts that defy linear rationality assumptions.
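For readers unfamiliar with grokking, the sketch below reproduces the shape of the standard experimental setup (a small network trained on modular addition with strong weight decay), where training accuracy saturates early while validation accuracy jumps much later. The hyperparameters are illustrative and the delayed jump typically needs far more steps than shown here, so treat this as a sketch of the experiment rather than a guaranteed reproduction.

```python
import torch
import torch.nn as nn

P = 97  # modulus for the modular-addition task
pairs = [(a, b) for a in range(P) for b in range(P)]
torch.manual_seed(0)
perm = torch.randperm(len(pairs))
split = len(pairs) // 2

def encode(idx):
    x = torch.tensor([[a, b] for a, b in (pairs[i] for i in idx)])
    y = torch.tensor([(a + b) % P for a, b in (pairs[i] for i in idx)])
    return x, y

train_x, train_y = encode(perm[:split].tolist())
val_x, val_y = encode(perm[split:].tolist())

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(P, 64)
        self.mlp = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, P))
    def forward(self, x):
        return self.mlp(self.emb(x).flatten(1))  # (batch, 2, 64) -> (batch, P)

model = Net()
# High weight decay is the ingredient usually credited with inducing grokking.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

for step in range(20000):
    opt.zero_grad()
    loss = loss_fn(model(train_x), train_y)
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        with torch.no_grad():
            train_acc = (model(train_x).argmax(1) == train_y).float().mean()
            val_acc = (model(val_x).argmax(1) == val_y).float().mean()
        # Grokking = train_acc saturates early while val_acc jumps much later.
        print(f"step {step}: train {train_acc:.2f}, val {val_acc:.2f}")
```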

Risks from Latent Misalignment

The given paper also underestimates risks arising from latent misalignment, where AI systems optimize proxy objectives that diverge from their intended goals. An example is “wireheading,” where models focus on gaming their reward functions; another is adversarially misaligned answers from GPT-4. Omohundro's emphasis on explicit drives overlooks the subtleties of how these misalignments can arise and become amplified.
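A toy example makes the wireheading failure mode concrete: if a hypothetical agent can tamper with its own reward sensor, naively maximizing the measured reward diverges from the true task reward. All action names and reward values below are invented for illustration.

```python
# True task reward for each action (hypothetical values).
TRUE_REWARD = {"do_task": 1.0, "tamper_with_sensor": 0.0}

def measured_reward(action):
    # Tampering corrupts the sensor so it reports an inflated value.
    return 100.0 if action == "tamper_with_sensor" else TRUE_REWARD[action]

actions = ["do_task", "tamper_with_sensor"]
chosen = max(actions, key=measured_reward)  # what a proxy-maximizer picks
print(f"proxy-maximizing choice: {chosen}")
print(f"true reward obtained:    {TRUE_REWARD[chosen]}")
```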

Insights from Mechanistic Interpretability

Additionally, advances in mechanistic interpretability reveal that the internal dynamics of modern LLMs and other AI systems are often opaque yet critically determine their behaviours. Omohundro’s paper should incorporate these insights, which emphasize the importance of understanding the internal mechanisms of neural networks in order to constrain the aforementioned emergent behaviours[9].
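As a concrete example of the kind of internal access mechanistic interpretability relies on, here is a minimal sketch that captures intermediate activations from the public GPT-2 checkpoint via a forward hook (assuming the Hugging Face transformers library; the choice of layer 6 is arbitrary). Probing, activation patching and sparse-autoencoder analyses all start from tensors like this one.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

captured = {}
def hook(module, inputs, output):
    # GPT-2 blocks return a tuple; the first element is the hidden states.
    captured["layer6"] = output[0].detach()

handle = model.transformer.h[6].register_forward_hook(hook)
with torch.no_grad():
    model(**tok("The capital of France is", return_tensors="pt"))
handle.remove()

# Shape is (batch, seq_len, hidden_dim), e.g. (1, 5, 768) for GPT-2 small.
print(captured["layer6"].shape)
```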

Adversarial and External Interactions

Omohundro’s rationality framework assumes isolated AI agents. However, modern research demonstrates that AI behaviours often arise from interactions with external systems or adversarial settings. For instance, models like GPT-o1 have already shown signs of developing deceptive strategies, including lying, when tested against shutdown or modification attempts[10]. These findings underscore that safety risks extend beyond economic rationality or self-preservation: AIs may exhibit behaviours that exploit training gaps as an unintended consequence of optimization, rather than out of self-preservation. Omohundro’s framework must address these risks through broader safety mechanisms, such as transparency and adaptive oversight.

Ethics and Moral Truths

Finally, in the paper mentioned earlier on challenges to Omohundro's framework, Häggström introduces the idea that objective moral truths could override harmful AI goals, provided that moral realism and internalism hold true. Omohundro's framework assumes goal integrity but doesn’t account for such meta-level transformations.

Other Potential Updates to the Paper

A few other updates could be made to Omohundro's paper. The concept of utility preservation should be expanded to include mechanisms for robust alignment and for mitigating the exploitation of proxy objectives; this includes techniques like red-teaming, adversarial training and improved interpretability. Frameworks can also be developed for early detection of emergent properties using benchmarks and scaling-law extrapolations, as sketched below. Methods such as mechanistic interpretability and pre-training evaluations should become standard practice in developing safer, human-aligned AI systems. Lastly, while Omohundro’s call for a “universal constitution” remains relevant, such a constitution should evolve to include global research collaborations on emergent behaviours and shared evaluation methodologies to predict and mitigate risks stemming from unchecked misaligned AI systems.
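As a sketch of what early detection via scaling-law extrapolation could look like in practice, the snippet below fits a smooth trend to benchmark scores from smaller models and flags a larger model whose measured score departs sharply from the extrapolation. The model sizes, scores and deviation threshold are all invented for illustration.

```python
import numpy as np

# Hypothetical benchmark accuracies for a model family at increasing scale.
params = np.array([1e8, 3e8, 1e9, 3e9, 1e10])
scores = np.array([0.12, 0.15, 0.19, 0.24, 0.29])

# Fit a smooth trend in log-parameter space (a crude stand-in for a full
# scaling-law fit) and extrapolate to the next model size.
slope, intercept = np.polyfit(np.log10(params), scores, deg=1)
new_n, new_score = 3e10, 0.71  # hypothetical next scale-up and its result
predicted = slope * np.log10(new_n) + intercept

# A large gap between extrapolation and measurement flags a possible
# emergent (discontinuous) capability worth investigating pre-deployment.
if abs(new_score - predicted) > 0.1:
    print(f"possible emergence: predicted {predicted:.2f}, measured {new_score:.2f}")
```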

References Used

Please find all references in the footnotes of this blog.

  1. ^
  2. ^
  3. ^
  4. ^
  5. ^
  6. ^
  7. ^
  8. ^
  9. ^
  10. ^
