Linkpost: "Imagining and building wise machines: The centrality of AI metacognition" by Johnson, Karimi, Bengio, et al.

post by Chris_Leong · 2024-11-11T16:13:26.504Z · LW · GW · 6 comments

Authors: Samuel G. B. Johnson, Amir-Hossein Karimi, Yoshua Bengio, Nick Chater, Tobias Gerstenberg, Kate Larson, Sydney Levine, Melanie Mitchell, Iyad Rahwan, Bernhard Schölkopf, Igor Grossmann

Abstract: Recent advances in artificial intelligence (AI) have produced systems capable of increasingly sophisticated performance on cognitive tasks. However, AI systems still struggle in critical ways: unpredictable and novel environments (robustness), lack of transparency in their reasoning (explainability), challenges in communication and commitment (cooperation), and risks due to potential harmful actions (safety). We argue that these shortcomings stem from one overarching failure: AI systems lack wisdom. Drawing from cognitive and social sciences, we define wisdom as the ability to navigate intractable problems - those that are ambiguous, radically uncertain, novel, chaotic, or computationally explosive - through effective task-level and metacognitive strategies. While AI research has focused on task-level strategies, metacognition - the ability to reflect on and regulate one's thought processes - is underdeveloped in AI systems. In humans, metacognitive strategies such as recognizing the limits of one's knowledge, considering diverse perspectives, and adapting to context are essential for wise decision-making. We propose that integrating metacognitive capabilities into AI systems is crucial for enhancing their robustness, explainability, cooperation, and safety. By focusing on developing wise AI, we suggest an alternative to aligning AI with specific human values - a task fraught with conceptual and practical difficulties. Instead, wise AI systems can thoughtfully navigate complex situations, account for diverse human values, and avoid harmful actions. We discuss potential approaches to building wise AI, including benchmarking metacognitive abilities and training AI systems to employ wise reasoning. Prioritizing metacognition in AI research will lead to systems that act not only intelligently but also wisely in complex, real-world situations.

Comment: I'm mainly sharing this because of the similarity between the ideas here and those in my post Some Preliminary Notes on the Promise of a Wisdom Explosion. The authors talk about a "virtuous cycle" in relation to wisdom in the final paragraphs. I also found their definition of wisdom quite clarifying.

6 comments


comment by Seth Herd · 2024-11-11T18:45:23.482Z · LW(p) · GW(p)

I read the whole thing because of its similarity to my proposals about metacognition as an aid to both capabilities [LW · GW] and alignment [AF · GW] in language model agents.  

In both this paper and my work, metacognition is a way to keep an AI from doing the wrong thing (from the AI's perspective). The authors explicitly do not address the broader alignment problem of AIs wanting the wrong things (from humans' perspective).

They note that "wiser" humans are more prone to serve the common good by taking more perspectives into account. They wisely do not propose wisdom as a solution to the problem of defining human values or specifying beneficial action for an AI. Wisdom here is an aid to fulfilling your values, not a definition of those values. Their presentation is a bit muddled on this point, but I think their final sections on the broader alignment problem make this scoping clear.

My proposal of a metacognitive "internal review" [AF · GW] or "System 2 alignment check" shares this weakness. It doesn't address what to point an AGI at; it merely patches a couple of possible routes to goal mis-specification (roughly the review gate sketched below).
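To make that concrete, here is a minimal sketch of the kind of review gate I mean - purely illustrative, not code from the linked posts or the paper; call_llm and execute are hypothetical placeholders for whatever model API and action layer an agent stack provides:

    def call_llm(prompt: str) -> str:
        """Placeholder for a real LLM API call."""
        raise NotImplementedError("wire this to your model of choice")

    def execute(action: str) -> None:
        """Placeholder for actually carrying out an agent action."""
        raise NotImplementedError

    def run_with_alignment_check(instructions: str, proposed_action: str) -> None:
        # Before acting, a reviewer pass checks whether the proposed action
        # actually serves the user's instructions and breaks no constraints.
        verdict = call_llm(
            "You are reviewing an AI agent's proposed action.\n"
            f"User instructions: {instructions}\n"
            f"Proposed action: {proposed_action}\n"
            "Answer APPROVE if the action follows the instructions and is safe, "
            "otherwise answer REJECT with a reason."
        )
        if verdict.strip().upper().startswith("APPROVE"):
            execute(proposed_action)
        else:
            # A mis-specified or unsafe step gets caught before execution;
            # the agent replans instead of acting.
            print(f"Action blocked by review: {verdict}")

Note that a gate like this only catches actions that conflict with the stated instructions; it does nothing to guarantee the instructions themselves point at the right thing.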

This article explicitly refuses to grapple with this problem:

3.4.1. Rethinking AI alignment 

With respect to the broader goal of AI alignment, we are sympathetic to the goal but question this definition of the problem. Ultimately safe AI may be at least as much about constraining the power of AI systems within human institutions, rather than aligning their goals.

I think limiting the power of AI systems within human institutions is only sensible if you're thinking of tool AI or weak AGI; thinking you'll constrain superhuman AIs seems obviously a fool's errand. I think this proposal is meant to apply to AI, not ever-improving AGI. That's fine if we have a long time between transformative AI and real AGI.

I think it would be wildly foolish to assume we have that gap between important AI and real AGI [LW · GW]. A highly competent assistant may soon be your new boss.

I have a different way to duck the problem of specifying complex and possibly fragile human values: make the AGI's central goal merely to follow instructions. Something smarter than you wanting nothing more than to follow your instructions is counterintuitive, but I think it's both consistent and, in retrospect, obvious; I think this alignment target is not only safer but also far more likely for our first AGIs [LW · GW]. People are going to want the first semi-sapient AGIs to follow instructions, just as LLMs do, not to make their own judgments about values or ethics. And once we've started down that path, there will be no immediate reason to tackle the full value alignment problem.

(In the longer term, we'll probably want to use instruction-following as a stepping-stone to full value alignment [LW · GW], since an instruction-following superintelligence would eventually fall into the wrong hands and receive some really awful instructions. But surpassing human intelligence and agency doesn't necessitate shooting for full value alignment right away.)

A final note on the authors' attitudes toward alignment: I also read the paper because I noted Yoshua Bengio and Melanie Mitchell among the authors. This stance is what I'd expect from Mitchell, who has steadfastly refused to address the alignment problem, in part because she has long timelines and in part because she believes in a "fallacy of dumb superintelligence" (I point out where she goes wrong in The (partial) fallacy of dumb superintelligence [LW · GW]).

I'm disappointed to see Bengio lend his name to this refusal to grapple with the larger alignment problem. I hope this doesn't signal a dedication to this approach. I had hoped for more from him.

comment by AnthonyC · 2024-11-11T18:11:45.532Z · LW(p) · GW(p)

Over time I increasingly wonder how much these shortcomings on cognitive tasks are a matter of evaluators overestimating the capabilities of humans while failing to provide AI systems with the level of guidance, training, feedback, and tools that a human would get.

Replies from: Seth Herd, Chris_Leong
comment by Seth Herd · 2024-11-11T19:10:34.315Z · LW(p) · GW(p)

I think that's one issue: LLMs don't get the same types of guidance that humans do. They get a lot of training and RL feedback, but it's structured very differently.

I think this particular article gets another major factor right that most analyses overlook: LLMs by default don't do metacognitive checks on their thinking. This is a huge factor in humans appearing as smart as we do: we make a variety of mistakes in our first guesses (System 1 thinking) that can be found and corrected with sufficient reflection (System 2 thinking). Adding more of this to LLM agents is likely to be a major source of capabilities improvements. The focus on increasing "9s of reliability" is a very CS approach; humans just make tons of mistakes and then catch many of the important ones. LLMs sort of copy their cognition from humans, so they can benefit from the same approach - but they don't do much of it by default. Scripting it into LLM agents is going to at least help, and it may help a lot.
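As a minimal sketch of what "scripting it in" could look like - my own illustration, not anything from the paper, with call_llm standing in for whatever chat-completion API an agent uses:

    def call_llm(prompt: str) -> str:
        """Placeholder for a real LLM API call."""
        raise NotImplementedError("wire this to your model of choice")

    def answer_with_reflection(question: str, max_revisions: int = 2) -> str:
        # System-1-style first pass: just answer.
        draft = call_llm(f"Answer this question:\n{question}")
        for _ in range(max_revisions):
            # System-2-style pass: look for mistakes in the first guess.
            critique = call_llm(
                "List any factual errors or reasoning mistakes in this answer, "
                "or reply exactly 'OK' if there are none.\n\n"
                f"Question: {question}\nAnswer: {draft}"
            )
            if critique.strip().upper() == "OK":
                break  # nothing caught; keep the draft
            # Revise using the critique, then check again.
            draft = call_llm(
                f"Question: {question}\nDraft answer: {draft}\n"
                f"Critique: {critique}\nRewrite the answer, fixing the issues listed."
            )
        return draft

The same pattern generalizes from answers to plans and tool calls; the point is just that the reflection pass is explicit scaffolding rather than something the model does on its own.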

comment by Chris_Leong · 2024-11-12T04:56:57.348Z · LW(p) · GW(p)

That's a fascinating perspective.

Replies from: AnthonyC
comment by AnthonyC · 2024-11-12T10:31:16.270Z · LW(p) · GW(p)

I think it is at least somewhat in line with your post and what @Seth Herd said in reply above.

Like, we talk about LLM hallucinations, but most humans still don't really grok how unreliable things like eyewitness testimony are. And we know how poorly calibrated humans are about our own factual beliefs, or about the success probability of our plans. I've also had cases where coworkers complain about low-quality LLM outputs, and when I ask to review the transcripts, it turns out the LLM was right and they were overconfidently dismissing its answer as nonsensical.

Or, we used to talk about math being hard for LLMs, but that disappeared almost as soon as we gave them access to code/calculators. I think most people interested in AI are underestimating how bad most other people are at mental math.

Replies from: Chris_Leong
comment by Chris_Leong · 2024-11-12T11:42:11.183Z · LW(p) · GW(p)

I guess I was thinking about this in terms of getting maximal value out of wise AI advisers. The notion that comparisons might be unfair didn't even enter my mind, even though that isn't too many reasoning steps away from where I was.