A Case for the Least Forgiving Take On Alignment

post by Thane Ruthenis · 2023-05-02T21:34:49.832Z · LW · GW · 82 comments

Contents

  1. Introduction
  2. Why Believe This?
  3. Is "General Intelligence" a Thing?
  4. What Is "General Intelligence"?
  5. A Caveat
  6. The Case of LLMs
  7. The Subsequent Difficulties
  8. Closing Thoughts

1. Introduction

The field of AI Alignment is a pre-paradigmatic one, and the primary symptom of that is the wide diversity of views across it. Essentially every senior researcher has their own research direction, their own idea of what the core problem is and how to go about solving it.

The differing views can be categorized along many dimensions. Here, I'd like to focus on a specific cluster of views, one corresponding to the most "hardcore", unforgiving take on AI Alignment. It's the view held by people like Eliezer Yudkowsky, Nate Soares, and John Wentworth, and not shared by Paul Christiano or the staff of major AI Labs.

According to this view:

I share this view. In my case, there's a simple generator of it; a single belief that causes my predictions to diverge sharply from the more optimistic models.

From one side, this view postulates a sharp discontinuity, a phase change. Once a system gets to AGI, its capabilities will skyrocket, while its internal dynamics will shift dramatically. It will break "nonrobust" alignment guarantees [LW · GW]. It will start thinking in ways that confuse previous interpretability efforts. It will implement strategies it never thought of before.

From another side, this view holds that any system which doesn't have the aforementioned problems will be useless for intellectual progress. Can't have a genius engineer who isn't also a genius schemer; can't have a scientist-modeling simulator which doesn't wake up to being a shoggoth.

What ties it all together is the belief that the general-intelligence property is binary. A system is either an AGI, or it isn't, with nothing in-between. If it is, it's qualitatively more capable than any pre-AGI system, and also works in qualitatively different ways. If it's not, it's fundamentally "lesser" than any generally-intelligent system, and doesn't have truly transformative capabilities.

In the rest of this post, I will outline some arguments for this, sketch out what "general intelligence" means in this framing, do a case-study of LLMs showcasing why this disagreement is so difficult to resolve, then elaborate on how the aforementioned alignment difficulties follow from it all.

Disclaimer: This post does not represent the views of Eliezer Yudkowsky, Nate Soares, or John Wentworth. I am fairly confident that I'm pointing towards an actual divergence between their models and the models of most AI researchers, but they may (and do) disagree with the framings I'm using, or the importance I ascribe to this specific divergence.


2. Why Believe This?

It may seem fairly idiosyncratic. At face value, human cognition is incredibly complex and messy. We don't properly understand it, we don't understand how current AIs work either — whyever would we assume there's some single underlying principle all general intelligence follows? Even if it's possible, why would we expect it?

First, let me draw a couple analogies to normalize the idea.

Exhibit A: Turing-completeness. If a set of principles for manipulating data meets this requirement, it's "universal", and in its universality it's qualitatively more capable than any system which falls "just short" of meeting it. A Turing-complete system can model any computable mathematical system, including any other Turing-complete system. A system which isn't Turing-complete... can't.

Exhibit B: Probability theory. It could be characterized as the "correct" system for doing inference from a limited first-person perspective, such that anything which reasons correctly would implement it. And this bold claim has solid theoretical support: a simple set of desiderata uniquely constrains the axioms of probability theory, while any deviation from these desiderata leads to a very malfunctioning system. (See e. g. the first chapters of Jaynes' Probability Theory.)
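For reference, the rules that fall out of those desiderata in Jaynes' treatment are the product rule and the sum rule. The notation below is the common textbook one, not something this post defines:

```latex
\begin{align}
P(A, B \mid C) &= P(A \mid B, C)\, P(B \mid C) \\
P(A \mid C) + P(\lnot A \mid C) &= 1
\end{align}
```

Everything else in the standard calculus, Bayes' theorem included, follows from these two.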

Thus, we have "existence proofs" that (A) the presence of some qualitatively-significant capabilities is a binary variable, and (B) the mathematical structure of reality may be "constraining" some capabilities such that they can only be implemented one way.

In addition, note that both of those are "low bars" to meet — it doesn't take much to make a system Turing-complete, and the probability-theory axioms are simple.


3. Is "General Intelligence" a Thing?

Well, it's a term we use to refer to human intelligence, and humans exist, so yes. But what specifically do we mean by it? In what sense are humans "general", in what sense is it "a thing"?

Two points, mirrors of the previous pair:

Point 1: Human intelligence is Turing-complete. We can imagine and model any mathematical object. We can also chunk them, or abstract over [LW · GW] them, transforming systems of them into different mathematical objects. That process greatly decreases the working-memory load, allowing us to reason about incredibly complex systems by reducing them to their high-level behavior. A long sequence of individual chess-piece moves becomes a strategy; a mass of traders becomes a market; a sequence of words and imagined events becomes scenes and plot arcs.

As we do so, though, a change takes place. The resulting abstractions don't behave like the parts they're composed of, they become different mathematical objects entirely. A ball follows different rules than the subatomic particles it's made of; the rules of narrative have little to do with the rules of grammar. Yet, we're able to master all of it.

Further: Inasmuch as reductionism is true, inasmuch as there are no ontologically basic complex objects [LW · GW], inasmuch as everything can be described as a mathematical object — that implies that humans are capable of comprehending any system and problem-solving in any possible environment.

We may run into working-memory or processing limits, yes — some systems may be too complex to fit into our brain. But with pen and paper, we may be able to model even them, and in any case it's a quantitative limitation. Qualitatively speaking, human cognition is universally capable.

Point 2: This kind of general capability seems necessary. Any agent instantiated in the universe would be embedded [? · GW]: it'd need to operate in a world larger than can fit in its mind, not least because its mind will be part of it. Fortunately, the universe provides structures to "accommodate" agents: as above, it abstracts well. There are regularities and common patterns everywhere. Principles generalize and can be compactly summarized. Lazy world-modeling [LW(p) · GW(p)] is possible.

However, that requires the aforementioned capability to model arbitrary mathematical objects. You never know what the next level of abstraction will be like, how objects on it will behave, from biology to chemistry to particle physics to quantum mechanics to geopolitics. You have to be able to adapt to anything, model anything. And if you can't do that, that means you can't build efficient world-models, and can't function as an embedded agent.

Much like reality forces any reasoner to follow the rules of probability theory, it forces any agent into... this.

Thus, (1) there is a way to be generally capable, exemplified by humans, and (2) it seems that any "generally capable" agent would need to be generally capable in the exact sense that humans are.


4. What Is "General Intelligence"?

The previous section offers one view, a view that I personally think gets at the core of it. One of John Wentworth's posts [LW · GW] presents a somewhat different frame, as does this post of nostalgebraist's [LW · GW].

Here's a few more angles to look at it from:

  1. It's something like "the ability to navigate any environment". It's a set of capabilities that allow one to construct and "understand" arbitrary mathematical objects, manipulate them, and fluidly incorporate them into problem-solving.
  2. It's a "heuristics generator". It's some component of cognition such that, when prompted with an environment, it quickly converges towards some guidelines for good performance in it — without needing a lot of trial-and-error.
  3. It's a principled way of drawing upon the knowledge contained in the world-model. World-models are likely nicely-structured [LW · GW], and general intelligence is the ability to stay up-to-date on your world-model and run queries on it most relevant to your current task. Instead of learning what to query for by painful experience, a general intelligence can instantly "loop in" even very surprising information, as long as it becomes represented in its world-model.
  4. It's consequentialism/agency [LW · GW]: the ability to instantly adapt one's policy in response to changes in the environment and stay aimed at your goal. Rather than retrieving a cached solution, it's the ability to solve the specific problem you're presented with; to always walk the path to the desired outcome because [LW · GW] it's the path to the desired outcome.
  5. It's autonomy [LW · GW]: the ability to stay "on-track" when working across multiple environments and abstraction levels; without being distracted, pulled in different directions, or completely stumped.

There's a number of threads running through these interpretations:

The goal-directedness is the particularly important part. To be clear: by it, I don't mean that a generally intelligent mind ought to have a fixed goal it’s optimizing for [LW · GW]. On the contrary, general intelligence’s generality extends to being retargetable towards arbitrary objectives. But every momentary step of general reasoning is always a step towards some outcome. Every call of the function implementing general intelligence has to take in some objective as an input — else it's invalid, a query on an empty string.

Goal-directedness, thus, is inextricable from general intelligence. “The vehicle of cross-domain goal-pursuit” is what intelligence is.
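The "objective as an input" point can be made concrete with a toy. The following is only a sketch under my own assumptions: `plan`, `steps`, and the number-line environment are hypothetical illustrations, and breadth-first search stands in for whatever the real algorithm is. What it shows is a single search routine that is retargetable to arbitrary objectives, yet has no valid call without one.

```python
from collections import deque
from typing import Callable, Hashable, Iterable, List, Optional

State = Hashable


def plan(
    start: State,
    neighbors: Callable[[State], Iterable[State]],
    objective: Optional[Callable[[State], bool]],
) -> Optional[List[State]]:
    """Breadth-first search for a shortest path to any state satisfying `objective`."""
    if objective is None:
        # A call without an objective is not a weaker query; it is not a query at all.
        raise ValueError("planning requires an objective")
    frontier = deque([(start, [start])])
    seen = {start}
    while frontier:
        state, path = frontier.popleft()
        if objective(state):
            return path
        for nxt in neighbors(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, path + [nxt]))
    return None  # objective unreachable in this environment


def steps(n: int) -> List[int]:
    """Hypothetical toy environment: from n you may move to n + 1 or 2 * n, capped at 20."""
    return [m for m in (n + 1, 2 * n) if m <= 20]


# Same machinery, two different objectives.
print(plan(1, steps, lambda s: s == 10))
print(plan(1, steps, lambda s: s == 7))
```

Nothing in `plan` is specific to either target; swapping the predicate is all it takes to re-aim it.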


5. A Caveat

One subtle point I've omitted so far is that, while achieving generality is supposed to cause systems to dramatically jump in capabilities, it doesn't have to happen instantly. A system may need to "grow" into its intelligence. The mechanistic discontinuity, when the core of general intelligence is assembled, would slightly precede the "capabilistic" one, when the potential of that core is properly realized.

Homo sapiens sapiens spent thousands of years hunter-gathering before starting up civilization, even after achieving modern brain size. Similarly, when whatever learning algorithm we're using builds general intelligence into an AI, it would not instantly become outsmart-all-of-humanity superintelligent (well, probably not).

The reason is that, while general-intelligence algorithms are equal in their generality, that doesn't mean generally-intelligent minds don't vary on other axes.

So when the GI component first coalesces, it may have very little compute for itself, it may not be often employed, it may defer to heuristics in most cases, and the wider system wouldn't yet know how to employ it well.[3] It would still be generally capable in the limit, but it wouldn't be instantly omnicide-capable. It would take some time for the result of the mechanistic discontinuity to become properly represented at the level of externally-visible capabilities.

Thus, in theory, there may be a small margin of capability where we'd have a proper AGI that nonetheless can't easily take us over. At face value, it seems like this should invalidate the entire "we won't be able to iterate on AGI systems" concern...

The problem is that it'd be very, very difficult to catch that moment and to take proper advantage of it. Most approaches to alignment are not on track to do it. Primarily, because those approaches don't believe in the mechanistic discontinuity at all, and don't even know that there's some crucial moment to be carefully taking advantage of.

There's three problems:

1) A "weak" AGI is largely a pre-AGI system.

Imagine a "weak" AGI as described above. The GI component doesn't have many resources allocated to it, it's often overridden, and so on. Thus, that system's cognitive mechanics and behavior are still mostly determined by specialized problem-solving algorithms/heuristics, not general intelligence. The contributions of the GI component are a rounding error.

As such, most of the lessons we learn from naively experimenting with this system would be lessons about pre-AGI systems, not AGI systems! There would be high-visible-impact interpretability or alignment techniques that ignore the GI component entirely, since it's so weak and controls so little. On the flip side, no technique that spends most of its effort on aligning the GI component would look cost-effective to us.

Thus, unless we deliberately target the GI component (which requires actually deciding to do so, which requires knowing that it exists and is crucial to align), iterating on a "weak" AGI will just result in us developing techniques for pre-AGI systems. Techniques that won't scale once the "weak" label falls off.

Conversely, the moment the general-intelligence component does become dominant — the moment any alignment approach would be forced to address it — is likely the moment the AI becomes significantly smarter than humans. And at that point, it'd be too late to do alignment-by-iteration.

The discontinuity there doesn't have to be as dramatic as hard take-off/FOOM is usually portrayed. The AGI may stall at a slightly-above-human capability, and that would be enough. The danger lies in the fact that we won't be prepared for it, would have no tools to counteract its new capabilities at all. It may not instantly become beyond humanity's theoretical ability to contain — but it'd start holding the initiative, and will easily outpace our efforts to catch up. (Discussing why even "slightly" superintelligent AGIs are an omnicide risk is outside the scope of this post; there are other materials [LW · GW] that cover this well.)

Don't get me wrong: having a safely-weak AGI at hand to experiment with would be helpful for learning to align even "mature" AGIs. But we would need to make very sure that our experiments are targeting the right feature of that system. Which, in all likelihood, requires very strong interpretability tools: we'd need "a firehose of information" [LW · GW] on the AI's internals to catch the moment.

2) We may be in an "agency overhang". As nostalgebraist's post on autonomy [LW · GW] mentions, modern AIs aren't really trained to be deeply agentic/goal-directed. Arguably, we don't yet know how to do it at all. It may require a paradigm shift similar to the invention of transformers.

And yet, modern LLMs are incredibly capable even without that. If we assume they're not generally intelligent, that'd imply they have instincts dramatically more advanced than any animal's. So advanced we often mistake them for AGI!

Thus, the concern: the moment we figure out how to even slightly incentivize general intelligence, the very first AGI will become strongly superintelligent. It'd be given compute and training far in excess of what AGI "minimally" needs, and so it'd instantly develop general intelligence as far ahead of humans' as LLMs' instincts are ahead of human instincts. The transition between the mechanistic and the capabilistic discontinuity would happen within a few steps of a single training run; effectively, there wouldn't actually be a gap between them.

In this case, the hard take-off will be very hard indeed.

A trick that we might try is deliberately catching AGI in-training: Design interpretability tools for detecting the "core of general intelligence", continuously run them as we train. The very moment they detect GI forming, we stop the training, and extract a weak, omnicide-incapable AGI. We then do iterative experimentation on it as usual (although that would be non-trivial to get right as well, see point 1).

That still has some problems:

  1. It'd require fairly advanced interpretability tools [LW · GW], tools we don't yet have.
  2. The transition from a "weak" AGI to a superintelligence may be very fast, so we'd need to pause-and-interpret the model very frequently during the training. That'd potentially significantly increase the costs and time required.
  3. The resultant "weak AGI" may still be incredibly dangerous [LW · GW]. Not instantly omnicidal, but perhaps on the very verge of that. (Consider how dangerous the upload of a human genius would be.)

I do think this can be a component of some viable alignment plans. But it's by no means trivial.
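The catch-it-in-training idea above amounts to a training loop with a tripwire. Here is a minimal sketch of that loop, with loud caveats: `gi_probe` is a stand-in for an interpretability tool that does not exist yet, and the threshold, the step counts, and the toy "emergence at step 37" are all made up for illustration.

```python
from typing import Callable, Optional


def train_with_tripwire(
    train_step: Callable[[int], None],
    gi_probe: Callable[[], float],
    total_steps: int,
    check_every: int,
    threshold: float,
) -> Optional[int]:
    """Train, pausing every `check_every` steps to run the (hypothetical) probe.

    Returns the step at which training was halted, or None if the probe never fired.
    """
    for step in range(1, total_steps + 1):
        train_step(step)
        if step % check_every == 0 and gi_probe() >= threshold:
            return step  # stop here and extract the checkpoint
    return None


# Toy stand-ins: the "model" is just a step counter, and the property the probe
# looks for appears abruptly at step 37 (an arbitrary, assumed number).
state = {"step": 0}


def toy_train_step(step: int) -> None:
    state["step"] = step


def toy_probe() -> float:
    return 1.0 if state["step"] >= 37 else 0.0


print(train_with_tripwire(toy_train_step, toy_probe, 100, 10, 0.5))  # halts at 40
print(train_with_tripwire(toy_train_step, toy_probe, 100, 1, 0.5))   # halts at 37
```

The two calls show the cost trade-off from the list above: checking every 10 steps overshoots the transition by three steps, while checking every step catches it immediately at ten times the probing cost. If the weak-to-strong transition takes only a few steps, the coarse schedule is the one that fails.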

3) We may not notice "weak" AGI while staring right at it.

The previous possibility assumed that modern LLMs are not AGI. Except, how do we know that?


6. The Case of LLMs

I'll be honest: LLMs freak me out as much as they do anyone. As will be outlined, I have strong theoretical reasons to believe that they're not generally intelligent, and that general intelligence isn't reachable by scaling them up. But looking at some of their outputs sure makes me nervously double-check my assumptions.

There's a fundamental problem: in the case of AI, the presence vs. absence of general intelligence at non-superintelligent levels is very difficult to verify externally. I've alluded to it some already, when mentioning that "weak" AGIs, in their makeup and behavior, are mostly pre-AGI systems.

There are some obvious tell-tale signs in both directions. If it can only output gibberish, it's certainly not an AGI; if it just outsmarted its gatekeepers and took over the world, it's surely an AGI. But between the two extremes, there's a grey area. LLMs are in it.

To start the analysis off, let's suppose that LLMs are entirely pre-AGI. They don't contain a coalesced core of true generality, not even an "underfunded" one. On that assumption, how do they work?

Suppose that we prompt an LLM with the following:

vulpnftj = -1
3 + vulpnftj =

LLMs somehow figure out that the answer is "2". It's highly unlikely that "vulpnftj" was ever used as a variable in their training data, yet they somehow know to treat it as one. How?

We can imagine that there's a "math engine" in there somewhere, and it has a data structure like "a list of variables" consisting of {name; value} entries. The LLM parses the prompt, then slots "vulpnftj" and "-1" into the corresponding fields. Then it knows that "vulpnftj" equals "-1".
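As a cartoon of that "math engine", here is a minimal sketch. It is purely illustrative: the two regex patterns and the `complete` function are my own hypothetical stand-ins, and no claim is made that an LLM implements anything resembling this internally.

```python
import re

# Two hard-coded prompt patterns: "name = integer" and "integer + name =".
ASSIGN = re.compile(r"^\s*([A-Za-z_]\w*)\s*=\s*(-?\d+)\s*$")
QUERY = re.compile(r"^\s*(-?\d+)\s*\+\s*([A-Za-z_]\w*)\s*=\s*$")


def complete(prompt: str) -> str:
    """Fill a {name: value} table from assignment lines, then answer the query line."""
    variables: dict[str, int] = {}
    for line in prompt.splitlines():
        assignment = ASSIGN.match(line)
        if assignment:
            # Slot the new name and value into the pre-existing fields.
            variables[assignment.group(1)] = int(assignment.group(2))
            continue
        query = QUERY.match(line)
        if query:
            return str(int(query.group(1)) + variables[query.group(2)])
    raise ValueError("prompt does not match any known pattern")


print(complete("vulpnftj = -1\n3 + vulpnftj ="))  # prints 2
```

Any never-before-seen name works, since the name is only a slot filler. A prompt that falls outside the two hard-coded patterns does not.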

That's a kind of "learning", isn't it? It lifts completely new information from the context and adapts its policy to suit. But it's a very unimpressive kind of learning. It's only learning in a known, pre-computed way.

I claim that this is how LLMs do everything. Their seeming sophistication is because this trick scales far up the abstraction levels.

Imagine a tree of problem-solving modules, which grow increasingly more abstract as you ascend. At the lowest levels, we have modules like "learn the name of a variable: %placeholder%". We go up one level, and see a module like "solve an arithmetic equation", with a field for the equation's structure. Up another level, and we have "solve an equation", with some parameters that, if filled, can adapt this module for solving arithmetic equations, differential equations, or some other kinds of equations (even very esoteric ones). Up, up, up, and we have "do mathematical reasoning", with parameters that codify modules for solving all kinds of math problems.

When an LLM analyses a prompt, it figures out it's doing math, figures out what specific math is happening, slots all that data in the right places, and its policy snaps into the right configuration for the problem.

And if we go sideways from "do math", we'd have trees of modules for "do philosophy", "do literary analysis", "do physics", and so on. If we'd instead prompted it with a request to ponder the meaning of life as if it were Genghis Khan, it would've used different modules, adapted its policy to the context in different ways, called up different subroutines. Retrieve information about Genghis Khan, retrieve the data about the state of philosophy in the 13th century, constrain the probability distribution over the human philosophical outlook by these two abstractions, distill the result into a linguistic structure, extract the first token, output it...

A wealth of possible configurations like this, a combinatorically large number of them, sufficient for basically any prompt you may imagine.

But it's still, fundamentally, adapting in known ways. It doesn't have a mechanism for developing new modules; the gradient descent has always handled that part. The LLM contains a wealth of crystallized intelligence, but zero fluid intelligence. A static set of abstractions it knows, a closed range of environments it can learn to navigate. Still "just" interpolation. [LW · GW]
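The picture of a static module tree can be sketched the same way. Again, this is a toy under my own assumptions: `MODULES`, `solve`, and the particular domains and tasks are hypothetical, chosen only to show "adapting in known ways" with no runtime path for growing the tree.

```python
from typing import Callable, Dict

# Hypothetical module tree: domain -> task -> solver with parameter slots.
# It is fixed when the file is written (the analogue of training time).
MODULES: Dict[str, Dict[str, Callable[..., float]]] = {
    "math": {
        "add": lambda a, b: a + b,
        "linear_equation": lambda a, b: -b / a,  # solves a*x + b = 0 for x
    },
    "physics": {
        "free_fall_distance": lambda t, g=9.81: 0.5 * g * t**2,
    },
}


def solve(domain: str, task: str, **slots: float) -> float:
    """Pick the matching module and fill its slots; there is no fallback that builds one."""
    try:
        module = MODULES[domain][task]
    except KeyError:
        raise LookupError(f"no pre-computed module for {domain}/{task}") from None
    return module(**slots)


print(solve("math", "add", a=3, b=-1))               # prints 2
print(solve("math", "linear_equation", a=2, b=-4))   # prints 2.0
```

Scale the dictionary up by many orders of magnitude and nest it more deeply, and almost every request finds a match. A request for a domain that was never built still has nowhere to go.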

For state-of-the-art LLMs, that crystallized structure is so extensive it contains basically every abstraction known to man. Therefore, it's very difficult to come up with some problem, some domain, that they don't have an already pre-computed solution-path for.

Consider also the generalization effect. The ability to learn to treat "vulpnftj" as a variable implies the ability to learn to treat any arbitrary string as a variable. Extending that, the ability to mimic the writing styles of a thousand authors implies the ability to "slot in" any style, including one a human talking to it has just invented on the fly. The ability to write in a hundred programming languages... implies, perhaps, the ability to write in any programming language. The mastery of a hundred board games generalizes to the one-hundred-and-first one, even if that one is novel. And so on.

In the limit, yes, that goes all the way to full general intelligence. Perhaps the abstraction tree only grows to a finite height, perhaps there are only so many "truly unique" types of problems to solve.

But the current paradigm may be a ruinously inefficient way to approach that limit [LW · GW]:

There are lots of algorithms which are Turing-complete or ‘universal’ in some sense; there are lots of algorithms like AIXI [? · GW] which solve AI in some theoretical sense (Schmidhuber & company have many of these cute algorithms such as ‘the fastest possible algorithm for all problems’, with the minor catch of some constant factors which require computers bigger than the universe).

Why think pretraining or sequence modeling is not another one of them? Sure, if the model got a low enough loss, it’d have to be intelligent, but how could you prove that would happen in practice?

Yet it still suffices to foil the obvious test for AGI-ness, i. e. checking whether the AI can be "creative". How exactly do you test an LLM on that? Come up with a new game and see if it can play it? If it can, that doesn't prove much. Maybe that game is located very close, in the concept-space, to a couple other games the LLM was already fed, and deriving the optimal policy for it is as simple as doing a weighted sum of the policies for the other two.
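To make the "weighted sum of policies" worry concrete, here is a minimal sketch. Representing a policy as a dictionary of action probabilities is my simplifying assumption, and the two games and their numbers are invented.

```python
from typing import Dict


def blend_policies(
    p1: Dict[str, float], p2: Dict[str, float], weight: float
) -> Dict[str, float]:
    """Convex combination of two action distributions: weight * p1 + (1 - weight) * p2."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    actions = set(p1) | set(p2)
    return {
        a: weight * p1.get(a, 0.0) + (1.0 - weight) * p2.get(a, 0.0)
        for a in actions
    }


# Invented policies for two games the system has "already been fed".
game_a = {"advance": 0.6, "capture": 0.4}
game_b = {"advance": 0.2, "jump": 0.8}

# A "new" game that sits halfway between them in concept-space.
print(blend_policies(game_a, game_b, 0.5))
```

Playing the "new" game well here required no new machinery at all, which is why success on such a test proves so little.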

Some tasks along these lines would be a definitive proof, like asking it to invent a new field of science on the fly. But, well, that's too high a bar. Not just any AGI can meet it, only a strongly superintelligent AGI, and such a system would be past the level at which it can defeat humanity. It'd be too late to ask it questions then, because it'll have already eaten us.

I think, as far as current LLMs are concerned, there's still some vague felt-sense in which all their ideas are "stale". In-distribution for what humanity has already produced, not "truly" novel, not as creative as even a median human. No scientific breakthroughs, no economy-upturning startup pitches, certainly no mind-hacking memes. Just bounded variations on the known. The fact that people do this sort of stuff, and nothing much comes of it, is some evidence for this, as well.

It makes sense in the context of LLMs' architecture and training loops, too. They weren't trained to be generally and autonomously intelligent; their architecture is a poor fit for that in several ways.

But how can we be sure?

The problem, fundamentally, is that we have no idea what the problem-space looks like. We don't know and can't measure in which directions it's easy to generalize or not, we don't know with precision how impressive AI is getting. We don't know how to tell an advanced pre-AGI system from a "weak" AGI, and have no suitable interpretability tools for a direct check.

And thus we'd be unable to tell when AI — very slowly at first, and then explosively — starts generalizing off-distribution, in ways only possible for the generally intelligent, arbitrary-environment-navigating, goal-directed things. We'd miss the narrow interval in which our AGIs were weak enough that we could survive failing to align them and get detailed experience from our failures (should there be such an interval at all). And the moment at which it'll become clear that we're overlooking something, would be the exact moment it'd be too late to do anything about it.

That is what "no fire alarm" means.


7. The Subsequent Difficulties

All right, it's finally time to loop back around to our initial concerns. Suppose general intelligence is indeed binary, or "approximately" so. How does just that make alignment so much harder?

At the fundamental level, this means that AGI-level systems work in a qualitatively different way from pre-AGI ones. Specifically, they think in a completely novel way. The mechanics of fluid intelligence — the processes needed to efficiently derive novel heuristics, to reason in a consequentialist manner — don't resemble the mechanics of vast crystallized-intelligence structures.

That creates a swath of problems. Some examples:

It breaks "weak" interpretability tools. If we adapt them to pre-AGI systems, they would necessarily depend on there being a static set of heuristics/problem-solving modules. They would identify modules corresponding to e. g. "deception", and report when those are in use. A true AGI, however, would be able to spin off novel modules that fulfill a similar function in a round-about way [LW · GW]. Our tools would not have built-in functionality for actively keeping up with a dynamically morphing system, so they will fail to catch that, to generalize. (Whereas destroying the AI's ability to spin off novel modules would mean taking the "G" out of "AGI".)

As I'd mentioned, for these purposes "weak" AGIs are basically equivalent to pre-AGI systems. If the general-intelligence component isn't yet dominant, it's not doing this sort of module-rewriting at scale. So interpretability tools naively adapted for "weak" AGIs would be free to ignore that aspect, and they'd still be effective... And would predictably fail once the GI component does grow more powerful.

It breaks "selective" hamstringing. Trying to limit an AGI's capabilities, to make it incapable of thinking about harming humans or deceiving them, runs into the same problem as above. While we're operating on pre-AGI systems, mechanistically this means erasing/suppressing the corresponding modules. But once we get to AGI, once the system can create novel modules/thought-patterns on the fly... It'd develop ways to work around [LW · GW] those limits.

It breaks "nonrobust" goal-alignment. In a pre-AGI system, the "seat of capabilities" is the heuristics, i. e. the vast crystallized-intelligence structures of problem-solving modules. "Aligning" them, to wit, means re-optimizing these heuristics such that the AI reflexively discards plans that harm humans, and reflexively furthers plans that help humans. If we take on the shard-theory frame [? · GW], it can mean cultivating a strong shard that values humans, and bids to protect their interests.

Aligning an AGI is a different problem. Shards/heuristics are not the same thing as the goals/mesa-objectives the AGI would pursue — they're fundamentally different types of objects. If it works anything like how it does in humans, perhaps mesa-objectives would be based on or inspired by shards. But how exactly the general-intelligence module would interpret them is under question. It's unlikely to be a 1-to-1 mapping, however: much like human emotional urges and instincts do not map 1-to-1 to the values we arrive at via moral philosophy.

One thing that seems certain, however, is that shards would lose direct control over the AGI's decisions. It would be an internal parallel to what would happen to our pre-AGI interpretability or hamstringing tools — heuristics/shards simply wouldn't have the machinery to automatically keep up with an AGI-level system. The aforementioned "protect humans" shard, for example, would only know to bid against plans that harm humans in some specific mental contexts, or in response to specific kinds of harm. Once the AGI develops new ways to think about reality, the shard would not even know to try to adapt. And afterwards, if the GI component were so inclined, it would be able to extinguish that shard, facing no resistance.

A human-relatable parallel would be someone going to exposure therapy to get rid of a phobia, or a kind person deciding to endorse murder when thinking about it in a detached utilitarian framework. When we reflect upon our values, we sometimes come to startling results, or decide to suppress our natural urges — and we're often successful in that.

Pre-AGI alignment would not necessarily break — if it indeed works like it does in humans. But the process of value reflection seems highly unstable [LW · GW], and its output is a non-linear function of the entirety of the initial desires. "If there's a shard that values humans, the AGI will still value humans post-reflection" does not hold, by default. "Shard-desires are more likely to survive post-reflection the stronger they are, and the very strong will definitely survive" is likewise invalid [LW · GW].

Thus, the alignment of a pre-AGI system doesn't guarantee that this system will remain aligned past the AGI discontinuity; and it probably wouldn't [LW · GW]. If we want to robustly align an AGI, we have to target the GI component directly, not through the unreliable proxy of shards/heuristics.

It leads to a dramatic capability jump. Consider grokking [LW · GW]. The gradient descent gradually builds some algorithmic machinery into an AI. Then, once it's complete, that machinery "snaps together", and the AI becomes sharply more capable in some way. The transition from a pre-AGI system to a mature AGI can be viewed as the theorized most extreme instance of grokking — that's essentially what the sharp left turn [LW · GW] is.

Looking at it from the outside, however, we won't see the gradual build-up (unless, again, we have very strong interpretability tools specifically for that). We'd just see the capabilities abruptly skyrocketing, and generalizing in ways we haven't seen before. In ways we didn't predict, and couldn't prepare for.

And it would be exactly the point at which things like recursive self-improvement become possible. Not in the sort of overdramatic way in which FOOM is often portrayed, but in the same sense in which a human trying to get better at something self-improves, or in which human civilization advances its industry.

Crucially, it would involve an AI whose capabilities grow as the result of its own cognition; not as the result of the gradient descent improving it. A static tree of heuristics, no matter how advanced, can't do that. A tree of heuristics deeply interwoven with the machinery for deriving novel heuristics... can.

Which, coincidentally, is another trick that tools optimized for the alignment of pre-AGI systems won't know how to defeat.

The unifying theme is that we won't be able to iterate [LW · GW]. Pre-AGI interpretability, safeguards, alignment guarantees, scaling laws, and all other approaches that fail to consider the AGI discontinuity — would ignobly fail at the AGI discontinuity.

As per Section 5, in theory iteration is possible. Not all AGIs are superhuman, and we can theoretically "catch" a "weak" AGI, and experiment with it, and derive lessons from that experimentation that would generalize to strongly superintelligent systems. But that's incredibly hard to do right without very advanced interpretability tools, and the situation would likely be highly unstable, with the "caught" AGI still presenting a massive threat.

Okay, so AGI is highly problematic. Can we manage without it?

Can "limitedly superhuman" AIs suffice? That is, systems that have superhuman competencies in some narrow and "safe" domains, like math. Or ones that don't have "desires", like oracles or simulators. Or ones that aren't self-reflective, or don't optimize too strongly, or don't reason in a consequentialist manner...

It should be clear, in the context of this post, that this is an incoherent design specification. Useful creativity, truly-general intelligence, and goal-directedness are inextricable from each other. They're just different ways of looking at the same algorithm.

On this view, there aren't actually any "domains" in which general intelligence can be "specialized". Consider math. Different fields of it consist of objects that behave in drastically different ways, and inventing a novel field would require comprehending a suite of novel abstractions and navigating them. If a system can do that, it has the fundamental machinery for general intelligence, and therefore for inventing deception and strategic scheming. If it can't... Well, it's not much use.

Similar for physics, and even more so for engineering. While math problems can often be defined in ways that don't refer to physical reality at all, engineering problems and design specifications necessarily talk about reality. To solve such problems, an AGI would need not only the basic general-intelligence machinery, but also a suite of crystallized intelligence modules for reasoning about reality. Not just the theoretical ability to learn how to achieve real goals, but the actual knowledge of it.

This applies most severely to various "automate alignment" ideas. Whether by way of prompting a simulator [LW · GW] to generate future alignment results, or by training some specialized "research assistant" model for it... Either the result won't be an AGI, and therefore won't actually contribute novel results, or it would be an AGI, and therefore an existential threat.

There's nothing in-between.

What about generative world-models/simulators [LW · GW], specifically? This family of alignment proposals is based on the underlying assumption that a simulator itself is goal-less. It's analogized to the laws of physics — it can implement agents, and these agents are dangerous and in need of alignment... But the simulator is not an agent of its own, and not a threat.

The caveat is that a simulator is not literally implemented as a simulation of physics (or language), even if it can be viewed as such. That would be ruinously compute-intensive, far in excess of what LLMs actually consume. No, mechanistically, it's a complex suite of heuristics. A simulator pushed to AGI, then, would consist of a suite of heuristics in control of a generally-intelligent goal-directed process... Same as, say, any reinforcement-learning agent.

Expecting that to keep on being a simulator is essentially expecting this AGI to end up inner-aligned to the token-prediction objective. And there's no reason to expect that in the case of simulators, any more than there's reason to expect it for any other training objective.

In the end, we will get an AGI with some desires that shallowly correlate with token-prediction, a "shoggoth" as it's often nicknamed. It will reflect on its desires, and come to unpredictable, likely omnicidal conclusions. Business as usual.

What about scalable oversight, such as pursued by OpenAI? Its failure follows from the intersection of a few ideas discussed above. The hard part of the alignment problem is figuring out how to align the GI component. If we're not assuming that problem away, here, the AIs doing the oversight would have to be pre-AGI models (which we roughly do know how to align). But much like weak interpretability tools, or shards, these models would not be able to keep up with AGI-level shifting cognition. Otherwise, they wouldn't be "pre"-AGI, since this sort of adaptability is what defines general intelligence.

And so we're back at square one.

Thus, once this process scales to AGI-level models, its alignment guarantees will predictably break.


8. Closing Thoughts

To sum it all up: As outlined here, I'm deeply skeptical, to the point of dismissiveness, of a large swathe of alignment approaches. The underlying reason is a model that assumes a sharp mechanistic discontinuity at the switch to AGI. Approaches that fail to pay any mind to that discontinuity, thus, look obviously doomed to me. Such approaches miss the target entirely: they focus on shaping the features of the system that play a major part now, but will fall into irrelevance once general intelligence forms, while ignoring the component of AI that will actually be placed in charge at the level of superintelligence.

In addition, there's a pervasive Catch-22 at play. Certain capabilities, like universally flexible adaptability and useful creativity, can only be implemented via the general-intelligence algorithm. As a result, there's no system that can automatically adapt to the AGI discontinuity except another generally-intelligent entity. Thus, to align an AGI, we either need an aligned AGI... or we need to do it manually, using human general intelligence.

It's worth stating, however, that I don't consider alignment to be impossible, or even too hard to be realistically solved. While Eliezer/Nate may have P(doom) at perhaps 90+%, John expects survival with "better than a 50/50 chance" [LW · GW], and I'm leaning towards the latter estimate as well.

But what I do think is that we won't get to have shortcuts and second chances. Clever schemes for circumventing or easing the alignment problem won't work, and reality won't forgive us for not getting it exactly right.

By the time we're deploying AGI, we have to have a precise way of aiming such systems [LW · GW]. Otherwise, yes, we are hopelessly doomed.

  1. ^

    A general intelligence may also be suppressed by an instinct firing off, as sometimes happens with humans. But that’s a feature of the wider mind the GI is embedded in, not of general intelligence itself.

  2. ^

    This is one of the places where my position seems at odds with e. g. Eliezer's, although I think the disagreement is largely semantical. He sometimes talks about AIs that are "more general" than humans, providing an example of an AI capable of rewriting its cognitive algorithms on the fly to be able to write bug-free code. Here, he doesn't make a distinction between the fundamental capabilities of the general-intelligence algorithm, and the properties of a specific mind in which GI is embedded.

    Imagine an AGI as above, able to arbitrarily rewrite its mental subroutines, but with a twist: there's a secondary "overseer" AGI on top of it, and its sole purpose is to delete the "program perfectly in Python" module whenever the first AGI tries to create it. The system as a whole would be "less general" than the first AGI alone, but not due to some lacking algorithmic capability.

    Similar with humans: we possess the full general-intelligence algorithm, it just doesn't have write-access to certain regions of our minds.

  3. ^

    Or it may be instantly given terabytes of working memory, an overriding authority, and a first task like "figure out how to best use yourself" which it'd then fulfill gloriously. That depends on the exact path [LW · GW] the AI's model takes to get there: maybe the GI component would grow out of some advanced pre-GI planning module, which would've already enjoyed all these benefits?

    My baseline prediction is that it'd be pretty powerful from the start. But I will be assuming the more optimistic scenario in the rest of this post: my points work even if the GI starts out weak.

82 comments

Comments sorted by top scores.

comment by cfoster0 · 2023-05-02T23:29:38.254Z · LW(p) · GW(p)

Agreed that this (or something near it) appears to be a relatively central difference between people's models, and probably at the root of a lot of our disagreement. I think this disagreement is quite old; you can see bits of it crop up in Hanson's posts on the "AI foom" concept way back when. I would put myself in the camp of "there is no such binary intelligence property left for us to unlock". What would you expect to observe, if a binary/sharp threshold of generality did not exist?

A possibly-relevant consideration in the analogy to computation is that the threshold of Turing completeness is in some sense extremely low (see one-instruction set computer, Turing tarpits, Rule 110), and is the final threshold. Rather than a phase shift at the high end, where one must accrue a bunch of major insights before one has a system that they can learn about "computation in general" from, with Turing completeness, one can build very minimal systems and then--in a sense--learn everything that there is to learn from the more complex systems. It seems plausible to me that cognition is just like this. This raises an additional question beyond the first: What would you expect to observe, if there indeed is a binary/sharp threshold but it is very low, such that we've already crossed it? (Say, if circa-1995 recurrent neural nets already had the required stuff to be past the threshold.) That would be compatible with thinking that insights from interpretability etc. work on pre-threshold systems wouldn't generalize to post-threshold systems, but also compatible with believing that we can do iterative design right now.
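To give a sense of how low that threshold sits, here is a minimal sketch of Rule 110, one of the systems named above, which is known to be Turing-complete. The function names (`step`, `run`) and the wrap-around boundary are illustrative assumptions rather than anything specified in this thread; the whole update rule is one 8-bit lookup.

```python
RULE = 110  # binary 01101110: one output bit per 3-cell neighbourhood


def step(cells):
    """Apply one Rule 110 update to a list of 0/1 cells.

    Assumption for illustration: the row wraps around at both edges.
    """
    n = len(cells)
    out = []
    for i in range(n):
        left, centre, right = cells[i - 1], cells[i], cells[(i + 1) % n]
        # Encode the neighbourhood as a number 0..7 and look up its bit in RULE.
        neighbourhood = (left << 2) | (centre << 1) | right
        out.append((RULE >> neighbourhood) & 1)
    return out


def run(cells, steps):
    """Return the initial row followed by `steps` successive updates."""
    history = [cells]
    for _ in range(steps):
        cells = step(cells)
        history.append(cells)
    return history


if __name__ == "__main__":
    start = [0] * 31 + [1]
    for row in run(start, 15):
        print("".join("#" if c else "." for c in row))
```

Nothing in these few lines makes it easy to find an initial row that computes something useful, which is the separate cost that being past the threshold does not remove.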

Re: LLMs, I dunno if I buy your story. At face value, what we've seen appears like another instance of the pattern where capabilities we once thought required some core of generality (doing logic & math, planning, playing strategic games, understanding language, creating art, etc.) turned out to be de-composable as any other technology is. That this pattern continues again and again over the decades makes me skeptical that we'll be unable to usefully/safely get the capabilities we want out of AI systems due to the sort of sharp threshold you imagine.

Replies from: Thane Ruthenis, sharmake-farah
comment by Thane Ruthenis · 2023-05-03T14:20:18.074Z · LW(p) · GW(p)

What would you expect to observe, if a binary/sharp threshold of generality did not exist?

Great question!

I would expect to observe much greater diversity in cognitive capabilities of animals, for humans to generalize more poorly, and for the world overall to be more incomprehensible to us.

E. g., there'd be things like, we'd see octopi frequently executing some sequences of actions that lead to beneficial outcomes for them, and we would be fundamentally unable to understand what is happening. As it is, sure, some animals have specialized cognitive algorithms that may be better than human ones in their specific niches, but we seem to always be able to comprehend them. We can always figure out why they decide to execute various plans, based on what evidence, and how these plans lead to whatever successes they achieve. A human can model any animal's cognition; a human's cognition is qualitatively more capable than any animal's. If true generality didn't exist, I'd expect that not to be true.

Scaling it up, the universe as a whole would be more incomprehensible. I'd referred to ontologically complex processes [LW · GW] when discussing that in Section 3 — processes such that there are no cognitive features in our minds that would allow us to emulate them. That'd be the case all over the place: we'd look at the world, and see some systemic processes that are not just hard to understand, but are fundamentally beyond reckoning.

The fact that we observe neither (and that this state of affairs is even hard/impossible for us to imagine) suggests that we're fully general, in the sense outlined in the post.

the threshold of Turing completeness is in some sense extremely low (see one-instruction set computer, Turing tarpits, Rule 110), and is the final threshold

Yup. But I think there are some caveats here. General intelligence isn't just "some cognitive system that has a Turing-complete component inside it", it's "a Turing-complete system for manipulating some specific representations". I think general intelligence happens when we amass some critical mass of shards/heuristics + world-model concepts they're defined over, then some component of that system (planner? shard-bid resolver? cross-heuristic communication channel? rules for querying the WM?) becomes a weird machine, and then that weird-machine-ness is harnessed for cognition. (Though it may not be a good idea to discuss the specifics publicly.)

What I would expect to observe if that weren't the case... I would expect GOFAI to have worked. If universally-capable cognition is not only conceptually simple at a high level (which I believe it is), but also doesn't require a mountain of complexly-formatted data on which to work, I'd expect us to have cracked it last century. No need for all this ML business.

Replies from: cfoster0
comment by cfoster0 · 2023-05-03T18:00:43.552Z · LW(p) · GW(p)

Thanks! Appreciate that you were willing to go through with this exercise.

I would expect to observe much greater diversity in cognitive capabilities of animals, for humans to generalize more poorly, and for the world overall to be more incomprehensible to us.

[...]

we'd look at the world, and see some systemic processes that are not just hard to understand, but are fundamentally beyond reckoning.

Per reductionism, nothing should be fundamentally incomprehensible or fundamentally beyond reckoning, unless we posit some binary threshold of reckoning-generality. Everything that works reliably operates by way of lawful/robust mechanisms [LW · GW], so arriving at comprehension should look like gradually unraveling those mechanisms, searching for the most important pockets of causal/computational reducibility. That requires investment in the form of time and cumulative mental work [LW · GW].

I think that the behavior of other animals & especially the universe as a whole in fact did start off as very incomprehensible to us, just as incomprehensible as it was to other species. In my view, what caused the transformation from incomprehensibility to comprehensibility was not humans going over a sharp cognitive/architectural threshold, such that on one side their minds were fundamentally unable to understand these things and on the other they were able. Rather, the advent of language & cultural transmission enabled humans over time to pool/chain together their existing abilities to observe the world, retain knowledge, & build better tools such as mental models and experimental instruments. (I believe these "lifetime learning abilities" are shared with many other animals despite their lacking language.) That accumulation of mental work over time is what enabled the seemingly-sharp change relative to historical timescales when humans entered the scene [LW · GW], in my view.

Yup. But I think there are some caveats here. General intelligence isn't just "some cognitive system that has a Turing-complete component inside it", it's "a Turing-complete system for manipulating some specific representations". [...] (Though it may not be a good idea to discuss the specifics publicly.)

I don't think I understand you here, but it sounds like this is something sensitive so I won't probe.

What I would expect to observe if that weren't the case... I would expect GOFAI to have worked. If universally-capable cognition is not only conceptually simple at a high level (which I believe it is), but also doesn't require a mountain of complexly-formatted data on which to work, I'd expect us to have cracked it last century. No need for all this ML business.

(emphasis mine) Hold on: why is that particular additional assumption relevant? A low threshold for generality does not imply that cognitive capabilities are easy or efficient to acquire once you've crossed the threshold. It does not imply that you just have to turn on one of these "universally-capable cognition" machines, without requiring additional hard/expensive/domain-specific work (trial & error, gradient descent over lots of examples, careful engineering, cultural accumulation, etc.) to search for useful cognitive strategies to run on that machine. Analogously, the fact that even very minimal systems can act as Universal Turing Machines does not mean that it is easy to find programs for those systems that exhibit a desired behavior, or that Turing completeness provides some sort of efficient/general-purpose shortcut.

For the record, I think GOFAI did/does work! We now have high-level programming languages, planning algorithms, proof assistants and computer algebra systems, knowledge graphs, decision trees, constraint solvers, etc. etc. all of which are working + economically productive and fall under symbolic AI. It just turned out that different cognitive capabilities benefit from different algorithms, so as we crack different capabilities, the boundaries of "AI" are redrawn to focus on problems that haven't been automated yet.

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2023-05-04T12:13:22.410Z · LW(p) · GW(p)

Per reductionism, nothing should be fundamentally incomprehensible or fundamentally beyond reckoning

Exactly; see my initial points about Turing-completeness. But exploiting this property of reality, being able to "arrive at comprehension by gradually unraveling the mechanisms by which the world works", is nonetheless a meaningfully nontrivial ability.

Consider an algorithm implementing a simple arithmetic calculator, or a symbolic AI from an FPS game, or LLMs as they're characterized in this post. These cognitive systems do not have the machinery to arrive at understanding this way. There are no execution-paths of their algorithms such that they arrive at understanding; no algorithm-states that correspond to "this system has just learned a new physics discipline". This is how I view animals as well.

If true generality doesn't exist, it would stand to reason that humans are the same. There should be aspects of reality such that there are no brain-states of us that correspond to us understanding them; there should only be a limited range of abstract objects our mental representations can support.

The ability to expand our mental ontology in a controlled manner, and stay in lockstep with this expansion, always able to fluidly employ for problem-solving the new concepts learned, is exactly the ability I associate with general intelligence. The existence of calculators/FPS AI/maybe-LLMs, which are incapable of this, shows that this isn't a trivial ability. And the suggestive connection with Turing-completeness hints that it may be binary.

Maybe "falls into the basin of being able to understand anything" would be a clearer way to put it?

Hold on: why is that particular additional assumption relevant? 

Hmm, maybe I didn't understand your hypothetical:

What would you expect to observe, if there indeed is a binary/sharp threshold but it is very low, such that we've already crossed it?

To me, this sounds like you're postulating the existence of a simple algorithm for general-purpose problem-solving which is such that it would be convergently learned by circa-1995 RNNs. Rephrasing, this hypothetical assumes that the same algorithm can be applied to efficiently solve a wide variety of problems, and that it can usefully work even at the level of complexity at which 1995-RNNs were operating.

If so, I would expect humanity to have discovered it manually. We would be employing it all over the place: programming language interpreters, calculators, sorting algorithms, image recognition, all of that software would be known to use the exact same manually-written algorithm to do its work. Since it's simple and efficient (not-that-deep very-compute-limited RNNs learned it) and widely useful (all these RNNs convergently learned it).

Conversely, my position is that the algorithm for general intelligence is only useful if it's operating on a complicated world-model + suite of heuristics: there's a threshold of complexity and compute requirements (which circa-1995 RNNs were far below), and general intelligence is an overkill to use for simple problems (so RNNs wouldn't have convergently learned it; they would've learned narrow specialized algorithms instead).

Replies from: cfoster0
comment by cfoster0 · 2023-05-06T00:00:50.198Z · LW(p) · GW(p)

I think I am confused where you're thinking the "binary/sharp threshold" is.

Are you saying there's some step-change in the architecture of the mind, in the basic adaption/learning algorithms that the architecture runs, in the content those algorithms learn? (or in something else?)

If you're talking about...

  • ... an architectural change → Turing machines and their neural equivalents, for example, over, say, DFAs and simple associative memories. There is a binary threshold going from non-general to general architectures, where the latter can support programs/algorithms that the former cannot emulate. This includes whatever programs implement "understanding an arbitrary new domain" as you mentioned. But once we cross that very minimal threshold (namely, combining memory with finite state control), remaining improvements come mostly from increasing memory capacity and finding better algorithms to run, neither of which are a single binary threshold. Humans and many non-human animals alike seem to have similarly general architectures, and likewise general artificial architectures have existed for a long time, so I would say "there indeed is a binary/sharp threshold [in architectures] but it is very low, such that we've already crossed it".
  • ... a change in algorithm → Model-based RL, General Problem Solver, AIXI, the Gödel machine algorithm, gradient descent over sufficiently massive datasets are candidates for algorithms that can do or learn to do "general-purpose problem-solving". But none of these are efficient in general, and I don't see any reason to think that there's some secret-sauce algorithm like them distinguishing human thinking from that of non-human animals. Other animals remember their experiences, pursue goals, creatively experiment with different strategies, etc. It seems much more plausible to me that other animals (incl. our primate cousins) are running similar base learning/processing algorithms on similar (but possibly smaller capacity) hardware, & the game-changer was that humans were able to accumulate more/better learned content for those algorithms to leverage.
  • ... a change in content → I agree that there was a massive change here, and I think this is responsible for the apparent differences in capabilities. Earlier I claimed that this happened because the advent of language & culture allowed content to accumulate in ways that were previously not feasible. But the accumulation of content was a continuous process; we didn't acquire some new binary property. Moreover, these continuous changes in content as a function of our learning process + data are exactly the kind of changes that we're already used to supervising in ML, & exactly where we are already expending our efforts. Why will this blindside us?
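The architectural point in the first bullet, that adding memory to finite state control lets a system run programs a purely finite one cannot, can be sketched with a toy example. The code below is an illustration under stated assumptions: the function names are hypothetical, and a single counter is still far short of Turing-completeness (it only shows the gap between bounded and unbounded memory).

```python
def bounded_recognizer(text, max_depth):
    """Finite-state recognizer for balanced parentheses.

    Its only memory is a state in 0..max_depth, so it must reject any
    input nested more deeply than max_depth, however well-formed.
    """
    state = 0
    for ch in text:
        if ch == "(":
            if state == max_depth:
                return False  # no state left to represent deeper nesting
            state += 1
        elif ch == ")":
            if state == 0:
                return False  # closing bracket with nothing open
            state -= 1
        else:
            return False  # only parentheses are part of this toy language
    return state == 0


def counter_recognizer(text):
    """Same control logic, but with an unbounded counter as memory."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            if depth == 0:
                return False
            depth -= 1
        else:
            return False
    return depth == 0


if __name__ == "__main__":
    deep = "((()))"
    print(bounded_recognizer(deep, max_depth=2))  # rejected: too deep for its states
    print(counter_recognizer(deep))               # accepted at any depth
```

The control logic is identical in both functions; the only difference is whether the memory can grow, which is the sense in which this threshold is binary but cheap to cross.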

Consider an algorithm implementing a simple arithmetic calculator, or a symbolic AI from an FPS game, or LLMs as they're characterized in this post. These cognitive systems do not have the machinery to arrive at understanding this way. There are no execution-paths of their algorithms such that they arrive at understanding; no algorithm-states that correspond to "this system has just learned a new physics discipline". [...]

If true generality doesn't exist, it would stand to reason that humans are the same. There should be aspects of reality such that there are no brain-states of us that correspond to us understanding them; there should only be a limited range of abstract objects our mental representations can support.

When you say "machinery" here it makes me think you're talking about architecture, but in that case the lack of execution-paths that arrive at learning new physics seems like it is explained by "simple arithmetic calculators + FPS AIs + LLMs are not Turing-complete systems / have too little memory / are not running learning algorithms at all", without the need to hypothesize a separate "general intelligence" variable.

(Incidentally, it doesn't seem obvious to me that scaffolded LLMs are particularly non-general in their understanding 🤔 Especially if we are willing to say yes to questions like "Can humans understand how 16-dimensional space works?" despite the fact that we cannot natively/reliably manipulate those in our minds whereas there are computer programs that can.)

To me, this sounds like you're postulating the existence of a simple algorithm for general-purpose problem-solving which is such that it would be convergently learned by circa-1995 RNNs. Rephrasing, this hypothetical assumes that the same algorithm can be applied to efficiently solve a wide variety of problems, and that it can usefully work even at the level of complexity at which 1995-RNNs were operating.

Sounds like I miscommunicated here. No, my position (and what I was asking about in the hypothetical) is that there are general-purpose architectures + general-purpose problem-solving algorithms that run on those architectures, that they are simple and inefficient (especially given their huge up-front fixed costs), that they aren't new or mysterious (the architectures are used already, far predating humans, & the algorithms are simple), and that we already can see that this sort of generality is not really "where the action is", so to speak.

Conversely, my position is that the algorithm for general intelligence is only useful if it's operating on a complicated world-model + suite of heuristics: there's a threshold of complexity and compute requirements (which circa-1995 RNNs were far below), and general intelligence is an overkill to use for simple problems (so RNNs wouldn't have convergently learned it; they would've learned narrow specialized algorithms instead).

Agreed? This is compatible with an alternative theory, that many other animals do have "the algorithm for general intelligence" you refer to, but that they're running it with less impressive content (world models & heuristics). And likewise with a theory that AI folks already had/have the important discrete generalist algorithmic insights, & instead what they need is a continuous pileup of good cognitive content. Why do you prefer the theory that in both cases, there is some missing binary thing?

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2023-05-06T17:07:06.362Z · LW(p) · GW(p)

an architectural change → Turing machines and their neural equivalents

This, yes. I think I see where the disconnect is, but I'm not sure how to bridge it. Let's try...

To become universally capable, a system needs two things:

  1. "Turing-completeness": A mechanism by which it can construct arbitrary mathematical objects to describe new environments (including abstract environments).
  2. "General intelligence": an algorithm that can take in any arbitrary mathematical object produced by (1), and employ it for planning.

General intelligence isn't Turing-completeness itself. Rather, it's a planning algorithm that has Turing-completeness as a prerequisite. Its binariness is inherited from the binariness of Turing-completeness.

Consider a system that has (1) but not (2), such as your "memory + finite state control" example. While, yes, this system meets the requirements for Turing-complete world-modeling, this capability can't be leveraged. Suppose it assembles a completely new region of its world-model. What would it do with it? It needs to leverage that knowledge for constructing practically-implementable plans, but its policy function/heuristics is a separate piece of cognition. So it either needs:

  • To get some practical experience, via trial-and-error experiments or a policy gradient, to arrive at good heuristics to employ in this new environment.
  • A policy function that can gracefully expand to this new region — which can plan given only pure knowledge of the environment structure. A policy function that scales in lockstep with the world-model.

The second, in my framework, is general intelligence.

A practical example: Imagine that all your memory of tic-tac-toe has been erased. Then you're given the rules for that game again, and told that in an hour, you'll play a few rounds against a machine that makes random moves. Within that hour, you're free to think and figure out good strategies for winning. I would expect that once the hour is up, you'll be able to win handily against the random-move-maker.

How is that possible?

The knee-jerk reaction may be to suggest that in that hour of thinking, you'll be playing simulated games in your mind, and refining your heuristics this way. That's part of it, but I don't think it's the main trick. Even in these simulated games, you'll likely not start out by making completely random moves, and iteratively converging towards better-than-random strategies by trial-and-error. Rather, you'll look over the rules, analyse the game abstractly, and instantly back out a few good heuristics this way — e. g., that taking the center square is a pretty good move. Only then will you engage in simulated babble-and-prune. (It's the same point John was making here [LW · GW].)
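As a rough illustration of backing out a heuristic from the rules alone, the sketch below never plays a game. It only enumerates the winning lines the rules define and counts how many pass through each square. Treating "number of winning lines through a square" as the measure of a good opening is an assumption made for this example, not a claim about how people actually reason; the function names are likewise hypothetical.

```python
from itertools import product

SIZE = 3


def winning_lines(size=SIZE):
    """List every line that wins the game, read directly off the rules."""
    rows = [[(r, c) for c in range(size)] for r in range(size)]
    cols = [[(r, c) for r in range(size)] for c in range(size)]
    diags = [
        [(i, i) for i in range(size)],
        [(i, size - 1 - i) for i in range(size)],
    ]
    return rows + cols + diags


def line_counts(size=SIZE):
    """Count, for each square, how many winning lines pass through it."""
    counts = {cell: 0 for cell in product(range(size), repeat=2)}
    for line in winning_lines(size):
        for cell in line:
            counts[cell] += 1
    return counts


def best_opening(size=SIZE):
    """Pick the square lying on the most winning lines (assumed heuristic)."""
    counts = line_counts(size)
    return max(counts, key=counts.get)


if __name__ == "__main__":
    counts = line_counts()
    for r in range(SIZE):
        print(" ".join(str(counts[(r, c)]) for c in range(SIZE)))
    print("suggested opening:", best_opening())
```

On the standard 3x3 board the center lies on four lines, each corner on three, and each edge on two, so the center-square heuristic falls out of the game's structure with no simulated play at all.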

General intelligence is the capability that makes this possible, the algorithm you employ for this "abstract analysis". As I'd stated, its main appeal is that it doesn't require practical experience with the problem domain (simulated or otherwise), only knowledge of its structure.

This is compatible with an alternative theory, that many other animals do have "the algorithm for general intelligence" you refer to, but that they're running it with less impressive content (world models & heuristics).

Eh, I can grant that. See the point about "no fire alarm", how "weak" AGIs are very difficult to tell apart from very advanced crystallized-intelligence structures (especially if these structures are being trained on-line, as animals are).

Replies from: cfoster0
comment by cfoster0 · 2023-05-06T22:53:08.700Z · LW(p) · GW(p)

Ok I think this at least clears things up a bit.

To become universally capable, a system needs two things:

  1. "Turing-completeness": A mechanism by which it can construct arbitrary mathematical objects to describe new environments (including abstract environments).
  2. "General intelligence": an algorithm that can take in any arbitrary mathematical object produced by (1), and employ it for planning.

General intelligence isn't Turing-completeness itself. Rather, it's a planning algorithm that has Turing-completeness as a prerequisite. Its binariness is inherited from the binariness of Turing-completeness.

Based on the above, I don't understand why you expect what you say you're expecting. We blew past the Turing-completeness threshold decades ago with general purpose computers, and we've combined them with planning algorithms in lots of ways.

Take AIXI, which uses the full power of Turing-completeness to do model-based planning with every possible abstraction/model. To my knowledge, switching over to that kind of fully-general planning (or any of its bounded approximations) hasn't actually produced corresponding improvements in quality of outputs, especially compared to the quality gains we get from other changes. I think our default expectation should be that the real action is in accumulating those "other changes". On the theory that the gap between human- and nonhuman animal- cognition is from us accumulating better "content" (world model concepts, heuristics, abstractions, etc.) over time, it's no surprise that there's no big phase change from combining Turing machines with planning!

General intelligence is the capability that makes this possible, the algorithm you employ for this "abstract analysis". As I'd stated, its main appeal is that it doesn't require practical experience with the problem domain (simulated or otherwise), only knowledge of its structure.

I think what you describe here and in the content prior is more or less "model-based reinforcement learning with state/action abstraction", which is the class of algorithms that answer the question "What if we did planning towards goals but with learned/latent abstractions?" As far as I can tell, other animals do this as well. Yes, it takes a more impressive form in humans because language (and the culture + science it enabled) has allowed us to develop more/better abstractions to plan with, but I see no need to posit some novel general capability in addition.

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2023-05-07T21:56:50.761Z · LW(p) · GW(p)

it takes a more impressive form in humans because language (and the culture + science it enabled) has allowed us to develop more/better abstractions to plan with, but I see no need to posit some novel general capability in addition

I think what I'm trying to get at, here, is that the ability to use these better, self-derived abstractions for planning is nontrivial, and requires a specific universal-planning algorithm to work. Animals et al. learn new concepts and their applications simultaneously: they see e. g. a new fruit, try eating it, their taste receptors approve/disapprove of it, and they simultaneously learn a concept for this fruit and a heuristic "this fruit is good/bad". They also only learn new concepts downstream of actual interactions with the thing; all learning is implemented by hard-coded reward circuitry. 

Humans can do more than that. As in my example, you can just describe to them e. g. a new game, and they can spin up an abstract representation of it and derive heuristics for it autonomously, without engaging hard-coded reward circuitry at all, without doing trial-and-error even in simulations. They can also learn new concepts in an autonomous manner, by just thinking about some problem domain, finding a connection between some concepts in it, and creating a new abstraction/chunking them together.

The general-intelligence algorithm is what allows all of this to be useful. A non-GI mind can't make use of a newly-constructed concept, because its planning machinery has no idea what to do with it: its policy function doesn't accept objects of this type, hasn't been adapted for them. This makes them unable to learn autonomously, unable to construct heuristics autonomously, and therefore unable to construct new concepts autonomously. General intelligence, by contrast, is a planning algorithm that "scales as fast as the world-model": a planning algorithm that can take in any concept that's been created this way.

Or, an alternative framing...

I think our default expectation should be that the real action is in accumulating those "other changes".

General intelligence is an algorithm for systematic derivation of such "other changes".

Does any of that make sense to you?

Replies from: cfoster0
comment by cfoster0 · 2023-05-07T23:58:11.386Z · LW(p) · GW(p)

I think what I'm trying to get at, here, is that the ability to use these better, self-derived abstractions for planning is nontrivial, and requires a specific universal-planning algorithm to work. Animals et al. learn new concepts and their applications simultaneously: they see e. g. a new fruit, try eating it, their taste receptors approve/disapprove of it, and they simultaneously learn a concept for this fruit and a heuristic "this fruit is good/bad". They also only learn new concepts downstream of actual interactions with the thing; all learning is implemented by hard-coded reward circuitry.

Humans can do more than that. As in my example, you can just describe to them e. g. a new game, and they can spin up an abstract representation of it and derive heuristics for it autonomously, without engaging hard-coded reward circuitry at all, without doing trial-and-error even in simulations. They can also learn new concepts in an autonomous manner, by just thinking about some problem domain, finding a connection between some concepts in it, and creating a new abstraction/chunking them together.

Hmm I feel like you're underestimating animal cognition / overestimating how much of what humans can do comes from unique algorithms vs. accumulated "mental content". Non-human animals don't have language, culture, and other forms of externalized representation, including the particular human representations behind "learning the rules of a game". Without these in place, even if one was using the "universal planning algorithm", they'd be precluded from learning through abstract description and from learning through manipulation of abstract game-structure concepts. All they've got is observation, experiment, and extrapolation from their existing concepts. But lacking the ability to receive abstract concepts via communication doesn't mean that they cannot synthesize new abstractions as situations require. I think there's good evidence that other animals can indeed do that.

General intelligence is an algorithm for systematic derivation of such "other changes".

Does any of that make sense to you?

I get what you're saying but disbelieve the broader theory. I think the "other changes" (innovations/useful context-specific improvements) we see in reality aren't mostly attributable to the application of some simple algorithm, unless we abstract away all of the details that did the actual work. There are general purpose strategies (for ex. the "scientific method" strategy, which is an elaboration of the "model-based RL" strategy, which is an elaboration of the "trial and error" strategy) that are widely applicable for deriving useful improvements. But those strategies are at a very high level of abstraction, whereas the bulk of improvement comes from using strategies to accumulate lower-level concrete "content" over time, rather than merely from adopting a particular strategy.

(Would again recommend Hanson's blog on "The Betterness Explosion" as expressing my side of the discussion here.)

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2023-05-08T00:40:47.223Z · LW(p) · GW(p)

Non-human animals don't have language, culture, and other forms of externalized representation, including the particular human representations behind "learning the rules of a game". Without these in place, even if one was using the "universal planning algorithm", they'd be precluded from learning through abstract description and from learning through manipulation of abstract game-structure concepts

Agreed, I think. I'm claiming that those abilities are mutually dependent. Turing-completeness allows one to construct novel abstractions like language/culture/etc., but it's only useful if there's a GI algorithm that can actually take these novelties in as inputs. Otherwise, there's no reason to waste compute deriving ahead of time abstractions you haven't encountered yet and won't know how to use; may as well wait until you run into them "in the wild".

In turn, the GI algorithm (as you point out) only shines if there's extant machinery that's generating novel abstractions for it to plan over. Otherwise, it can do no better than trial-and-error learning.

Replies from: cfoster0
comment by cfoster0 · 2023-05-08T01:17:29.620Z · LW(p) · GW(p)

I guess I don't see much support for such mutual dependence. Other animals have working memory + finite state control, and learn from experience in flexible ways. It appears pretty useful to them despite the fact they don't have language/culture. The vast majority of our useful computing is done by systems that have Turing-completeness but not language/cultural competence. Language models sure look like they have language ability without Turing-completeness and without having picked up some "universal planning algorithm" that would render our previous work w/ NNs ~useless.

Why choose a theory like "the capability gap between humans and other animals is because the latter is missing language/culture and also some binary GI property" over one like "the capability gap between humans and other animals is just because the latter is missing language/culture"? IMO the latter is simpler and better fits the evidence.

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2023-05-08T01:40:01.274Z · LW(p) · GW(p)

Hmm, we may have reached the point from which we're not going to move on without building mathematical frameworks and empirically testing them, or something.

Other animals have working memory + finite state control, and learn from experience in flexible ways

"Learn from experience" is the key point. Abstract thinking allows one to learn without experience: from others' experience that they communicate to you, or from just figuring out how something works abstractly and anticipating the consequences in advance of them occurring. This sort of learning, I claim, is only possible when you have the machinery for generating entirely novel abstractions (language, math, etc.), which in turn is only useful if you have a planning algorithm capable of handling any arbitrary abstraction you may spin up.

"The capability gap between humans and other animals is because the latter is missing language/culture and also some binary GI property" and "the capability gap between humans and other animals is just because the latter is missing language/culture" are synonymous, in my view, because you can't have language/culture without the binary GI property.

Language models sure look like they have language ability 

As per the original post, I disagree that they have the language ability in the relevant sense. I think they're situated firmly at Simulacrum Level 4 [LW · GW]; they appear to communicate, but it's all just reflexes.

Replies from: cfoster0
comment by cfoster0 · 2023-05-08T02:16:43.620Z · LW(p) · GW(p)

I didn't mean "learning from experience" to be restrictive in that way. Animals learn by observing others & from building abstract mental models too. But unless one acquires abstracted knowledge via communication, learning requires some form of experience: even abstracted knowledge is derived from experience, whether actual or imagined. Moreover, I don't think that some extra/different planning machinery was required for language itself, beyond the existing abstraction and model-based RL capabilities that many other animals share. But ultimately that's an empirical question.

Hmm, we may have reached the point from which we're not going to move on without building mathematical frameworks and empirically testing them, or something.

Yeah I am probably going to end my part of the discussion tree here.

My overall take remains:

  • There may be general purpose problem-solving strategies that humans and non-human animals alike share, which explain our relative capability gains when combined with the unlocks that came from language/culture.
  • We don't need any human-distinctive "general intelligence" property to explain the capability differences among human-, non-human animal-, and artificial systems, so we shouldn't assume that there's any major threshold ahead of us corresponding to it.

comment by Noosphere89 (sharmake-farah) · 2023-05-03T00:11:02.873Z · LW(p) · GW(p)

A possibly-relevant consideration in the analogy to computation is that the threshold of Turing completeness is in some sense extremely low (see one-instruction set computer, Turing tarpits, Rule 110), and is the final threshold.

Nitpick, but it actually isn't the final threshold of computation. That said, computing beyond a Turing Machine would basically require that we are majorly wrong about the physical laws of the universe, or that we somehow have a way to control its fundamental physical constants and/or laws. And the computers that can legitimately claim to go beyond Turing Machines with known physics aren't useful computers due to the No Free Lunch theorems.

Just worth keeping that in mind.
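
For readers who haven't met Rule 110, mentioned in the quoted passage above: it is an elementary cellular automaton whose entire update rule fits in the eight bits of the number 110, yet it is known to be Turing-complete given an unbounded tape with a suitable background pattern. Below is a minimal sketch of a single update step. The function name `rule110_step` and the fixed zero boundary are illustrative assumptions, and a finite strip like this one is of course not itself a universal computer.

```python
def rule110_step(cells):
    """Apply one step of Rule 110 to a list of 0/1 cells.

    Assumption for this sketch: cells beyond either edge are treated as 0.
    """
    n = len(cells)
    nxt = []
    for i in range(n):
        left = cells[i - 1] if i > 0 else 0
        center = cells[i]
        right = cells[i + 1] if i < n - 1 else 0
        # The three-cell neighbourhood, read as a binary number 0..7,
        # selects one bit of the rule number 110 (binary 01101110).
        index = (left << 2) | (center << 1) | right
        nxt.append((110 >> index) & 1)
    return nxt


# Evolve a short strip starting from a single live cell on the right.
row = [0, 0, 0, 1]
for _ in range(3):
    row = rule110_step(row)
```

The whole "program" is the constant 110 plus a lookup, which is the sense in which the threshold for universal computation is extremely low.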

Replies from: interstice, cfoster0
comment by interstice · 2023-05-03T03:18:21.006Z · LW(p) · GW(p)

computers that can legitimately claim to go beyond Turing Machines with known physics aren’t useful computers due to the No Free Lunch theorems

Non-sequitur, the no-free-lunch theorems don't have anything to do with the physical realizability of hypercomputers.

Replies from: sharmake-farah
comment by Noosphere89 (sharmake-farah) · 2023-05-08T13:53:23.851Z · LW(p) · GW(p)

The point is that a random Turing Machine's output is technically uncomputable, which is nice, but it's entirely useless: it uses an entirely flat prior, picking at random from all possible universes/functions, and a No Free Lunch argument can be deployed to show why this isn't useful.

This, incidentally, resolves gedymin's question on the difference between a random hypercomputer and a useful hypercomputer: A useful hypercomputer trades off performance for certain functions/universes in order to do better in other functions/universes, while a random hypercomputer doesn't do that and thus is useless.

Replies from: interstice
comment by interstice · 2023-05-08T17:46:43.315Z · LW(p) · GW(p)

The point is that a random Turing Machine’s output is technically uncomputable

What do you mean? The output of any Turing machine is computable by definition. Do you mean solving the halting problem for a random Turing machine? Or a random oracle?

comment by cfoster0 · 2023-05-03T00:37:16.775Z · LW(p) · GW(p)

Fair. I think this is indeed a nitpick. 😊 In case it wasn't clear, the point remains something like: When we observe/build computational systems in our world that are "better" along some axis than other systems, that "betterness" is not generally derived from having gone over a new threshold of "even more general" computation (they definitely aren't deriving it from hypercomputation, and often aren't even deriving it from universal Turing computation), but through being better suited to the capability in question.

comment by Prometheus · 2024-02-21T21:03:34.624Z · LW(p) · GW(p)

I think my main problem with this is that it isn't based on anything. Countless times, you just reference other blog posts, which reference other blog posts, which reference nothing. I fear a whole lot of people thinking about alignment are starting to decouple themselves from reality. It's starting to turn into the AI version of String Theory. You could be correct, but given the enormous number of assumptions your ideas are stacked on (and that even a few of those assumptions being wrong leads to completely different conclusions), the odds of you even being in the ballpark of correct seem low.

Replies from: Thane Ruthenis, habryka4
comment by Thane Ruthenis · 2024-02-22T00:10:52.349Z · LW(p) · GW(p)

I'm very sympathetic to this view, but I disagree. It is based on a wealth of empirical evidence that we have: on data regarding human cognition and behavior.

I think my main problem with this is that it isn't based on anything

Hm. I wonder if I can get past this common reaction by including a bunch of references to respectable psychology/neurology/game-theory experiments, which "provide scientific evidence" that various common-sensical properties of humans are actually real? Things like fluid vs. crystallized intelligence, g-factor, the global workspace theory [LW · GW], situations in which humans do try to behave approximately like rational agents... There probably also are some psychology-survey results demonstrating stuff like "yes, humans do commonly report wanting to be consistent in their decision-making rather than undergoing wild mood swings and acting at odds with their own past selves", which would "provide evidence" for the hypothesis that complex minds want their utilities to be coherent [LW · GW].

That's actually an interesting idea! This is basically what my model is based on, after a fashion, and it makes arguments-from-introspection "legible" instead of seeming to be arbitrary philosophical navel-gazing.

Unfortunately, I didn't have this idea until a few minutes ago, so I haven't been compiling a list of "primary sources". Most of them are lost to time, so I can't compose a decent object-level response to you here. (The Wikipedia links are probably a decent starting point, but I don't expect you to trawl through all that.)

Still, that seems like a valuable project. I'll put a pin in it, maybe post a bounty for relevant papers later.

Replies from: Prometheus
comment by Prometheus · 2024-02-22T18:21:09.478Z · LW(p) · GW(p)

This isn't what I mean. It doesn't mean you're not using real things to construct your argument, but that doesn't mean the structure of the argument reflects something real. Like, I kind of imagine it looking something like a rationalist Jenga tower, where if one piece gets moved, it all crashes down. Except, by referencing other blog posts, it becomes a kind of Meta-Jenga: a Jenga tower composed of other Jenga towers. Like "Coherent decisions imply consistent utilities [LW · GW]". This alone I view to be its own mini Jenga tower. This is where I think String Theorists went wrong. It's not that humans can't, in theory, form good reasoning based on other reasoning based on other reasoning and actually arrive at the correct answer, it's just that we tend to be really, really bad at it.

 

The sort of thing that would change my mind: there's some widespread phenomenon in machine learning that perplexes most, but is expected according to your model, and any other model either doesn't predict it as accurately, or is more complex than yours.

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2024-02-22T19:11:21.770Z · LW(p) · GW(p)

The sort of thing that would change my mind: there's some widespread phenomenon in machine learning that perplexes most, but is expected according to your model

My position is that there are many widespread phenomena in human cognition that are expected according to my model, and which can only be explained by the more mainstream ML models either if said models are contorted into weird shapes, or if they engage in denialism of said phenomena.

Again, the drive for consistent decision-making is a good example. Common-sensically, I don't think we'd disagree that humans want their decisions to be consistent. They don't want to engage in wild mood swings, they don't want to oscillate wildly between which career they want to pursue or whom they want to marry: they want to figure out what they want and who they want to be with, and then act consistently with these goals in the long term. Even when they make allowances for changing their mind, they try to consistently optimize for making said allowances: for giving their future selves freedom/optionality/resources.

Yet it's not something e. g. the Shard Theory would naturally predict out-of-the-box, last I checked. You'd need to add structures on top of it until it basically replicates my model (which is essentially how I arrived at my model, in fact – see this historical artefact [LW · GW]).

Replies from: Prometheus
comment by Prometheus · 2024-02-23T04:23:48.611Z · LW(p) · GW(p)

"My position is that there are many widespread phenomena in human cognition that are expected according to my model, and which can only be explained by the more mainstream ML models either if said models are contorted into weird shapes, or if they engage in denialism of said phenomena."

 

Such as? I wouldn't call Shard Theory mainstream, and I'm not saying mainstream models are correct either. On humans trying to be consistent decision-makers, I have some theories about that (and some of which are probably wrong). But judging by how bad humans are at it, and how much they struggle to do it, they probably weren't optimized too strongly biologically to do it. But memetically, developing ideas for consistent decision-making was probably useful, so we have software that makes use of our processing power to be better at this, even if the hardware is very stubborn at times. But even that isn't optimized too hard toward coherence. Someone might prefer pizza to hot dogs, but they probably won't always choose pizza over any other food, just because they want their preference ordering of food to be consistent. And, sure, maybe what they "truly" value is something like health, but I imagine even if they didn't, they still wouldn't do this.

 

But all of this is still just one piece on the Jenga tower. And we could debate every piece in the tower, and even get 90% confidence that every piece is correct... but if there are more than 10 pieces on the tower, the whole thing is still probably going to come crashing down. (This is the part where I feel obligated to say, even though I shouldn't have to, that your tower being wrong doesn't mean "everything will be fine and we'll be safe", since the "everything will be fine" towers are looking pretty Jenga-ish too. I'm not saying we should just shrug our shoulders and embrace uncertainty. What I want is to build non-Jenga-ish towers)
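
To put numbers on the Jenga-tower arithmetic above, here is a minimal sketch. It assumes the pieces are independent and that every one of them must hold for the tower to stand; both are simplifying assumptions made for illustration (the comment itself doesn't state them), and `joint_confidence` is a hypothetical helper name.

```python
def joint_confidence(per_piece: float, pieces: int) -> float:
    """Probability that all pieces hold, assuming independence (an assumption)."""
    return per_piece ** pieces


# How confidence in the whole tower decays at 90% confidence per piece.
for n in (1, 5, 7, 10, 20):
    print(f"{n:>2} pieces: {joint_confidence(0.9, n):.3f}")
```

Under these assumptions, ten pieces at 90% each give 0.9^10 ≈ 0.35 for the whole tower, and the joint confidence already drops below one half at seven pieces. Correlated pieces would make the tower sturdier than this; the sketch only shows the independent case.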

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2024-02-23T06:10:37.632Z · LW(p) · GW(p)

I wouldn't call Shard Theory mainstream

Fair. What would you call a "mainstream ML theory of cognition", though? Last I checked, they were doing purely empirical tinkering with no overarching theory to speak of (beyond the scaling hypothesis[1]).

judging by how bad humans are at [consistent decision-making], and how much they struggle to do it, they probably weren't optimized too strongly biologically to do it. But memetically, developing ideas for consistent decision-making was probably useful, so we have software that makes use of our processing power to be better at this

Roughly agree, yeah.

But all of this is still just one piece on the Jenga tower

I kinda want to push back against this repeated characterization – I think quite a lot of my model's features are "one storey tall", actually – but it probably won't be a very productive use of the time of either of us. I'll get around to the "find papers empirically demonstrating various features of my model in humans" project at some point; that should be a more decent starting point for discussion.

What I want is to build non-Jenga-ish towers

Agreed. Working on it.

  1. ^

    Which, yeah, I think is false: scaling LLMs won't get you to AGI. But it's also kinda unfalsifiable using empirical methods, since you can always claim that another 10x scale-up will get you there.

Replies from: Prometheus
comment by Prometheus · 2024-02-23T19:15:02.198Z · LW(p) · GW(p)

Fair. What would you call a "mainstream ML theory of cognition", though? Last I checked, they were doing purely empirical tinkering with no overarching theory to speak of (beyond the scaling hypothesis).

It tends not to get talked about much today, but there was the PDP (connectionist) camp of cognition vs. the camp of "everything else" (including ideas such as symbolic reasoning, etc). The connectionist camp created a rough model of how they thought cognition worked, a lot of cognitive scientists scoffed at it, Hinton tried putting it into actual practice, but it took several decades for it to be demonstrated to actually work. I think a lot of people were confused about why the "stack more layers" approach kept working, but under the model of connectionism, this is expected. Connectionism is kind of too general to make great predictions, but it doesn't seem to allow for FOOM-type scenarios. It also seems to favor agents as local optima satisficers, instead of greedy utility maximizers.

comment by habryka (habryka4) · 2024-02-21T21:24:45.327Z · LW(p) · GW(p)

Hmm, I feel sad about this kind of critique. Like, this comment invokes some very implicit standard for posts, without making it at all explicit. Of course neither this post nor the posts they link to are literally "not based on anything". My guess is you are invoking an implicit standard for work to be "empirical" in order to be "anything", but that also doesn't really make sense since there are a lot of empirical arguments in this article and in the linked articles.

I think highlighting any specific assumption, or even some set of assumptions that you think is fragile would be helpful. Or being at all concrete about what you would consider work that is "anything". But I think as it stands I find it hard to get much out of comments like this.

Replies from: habryka4
comment by habryka (habryka4) · 2024-02-22T07:21:03.255Z · LW(p) · GW(p)

(Please don't leave both top-level reacts and inline reacts of the same type on comments; that produces somewhat confusing summary statistics. We might make it literally impossible, but until then, pick one and stick to it.)

comment by NicholasKross · 2023-06-30T02:48:51.115Z · LW(p) · GW(p)

Even after thinking through these issues in SERI-MATS, and already agreeing with at least most of this post, I was surprised upon reading it how many new-or-newish-to-me ideas and links it contained.

I'm not sure if that's more of a failure of me, or of the alignment field to notice "things that are common between a diverse array of problems faced". Kind of related to my hunch that multiple alignment concepts ("goals", "boundaries", "optimization") will turn out to be isomorphic to the same tiny-handful of mathematical objects.

comment by MichaelStJules · 2023-05-09T02:16:30.667Z · LW(p) · GW(p)

On this take, especially with your skepticism of LLM fluid intelligence and generality, is there much reason to expect AGI to be coming any time soon? Will it require design breakthroughs?

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2023-05-09T03:37:26.954Z · LW(p) · GW(p)

I expect it to require a non-trivial design breakthrough, yes. I do not expect it to require many breakthroughs, or for it to take much longer once the breakthrough is made — see the "agency overhang" concerns. And there's a lot of money being poured into AI now and a lot of smart people tirelessly looking for insights...

<10 years, I'd expect, assuming no heavy AI regulation/nuclear war/etc.

Though, for all I know, some stupidly simple tweak to the current paradigm will be sufficient, and it may already be published in a paper somewhere, and now that OpenAI has stopped playing with scale [LW · GW] and is actively looking for new ideas — for all I know, they may figure it out tomorrow.

Replies from: MichaelStJules
comment by MichaelStJules · 2023-05-09T04:24:51.882Z · LW(p) · GW(p)

If they have zero fluid intelligence now, couldn't it be that building fluid intelligence is actually very hard and we're probably a long way off, maybe decades? It sounds like we've made almost no progress on this, despite whatever work people have been doing.

There could still be a decent probability of AGI coming soon, and that could be enough to warrant acting urgently (or so could non-AGI, e.g. more task-specific AI used to engineer pathogens).

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2023-05-09T05:58:07.930Z · LW(p) · GW(p)

Suppose that some technology requires 10 components to get to work. Over the last decades, you've seen people gradually figure out how to build each of these components, one by one. Now you're looking at the state of the industry, and see that we know how to build 9 of them. Do you feel that the technology is still a long time away, because we've made "zero progress" towards figuring out that last component?

Advancements along the ML paradigm were not orthogonal to progress to AGI. On the contrary: they've set up things so that figuring out fluid intelligence/agency is potentially the last puzzle piece needed.

A different angle: these advancements have lowered the bar for how well we need to understand fluid intelligence to get to AGI. If before, we would've needed to develop a full formal theory of cognition that we may leverage to build a GOFAI-style AGI, now maybe just some regularizer applied to a transformer on a "feels right" hunch will suffice.

Replies from: MichaelStJules
comment by MichaelStJules · 2023-05-09T07:38:41.306Z · LW(p) · GW(p)

Suppose that some technology requires 10 components to get to work. Over the last decades, you've seen people gradually figure out how to build each of these components, one by one. Now you're looking at the state of the industry, and see that we know how to build 9 of them. Do you feel that the technology is still a long time away, because we've made "zero progress" towards figuring out that last component?

This seems pretty underspecified, so I don't know, but I wouldn't be very confident it's close:

  1. Am I supposed to assume the difficulty of the last component should reflect the difficulty of the previous ones?
  2. I'm guessing you're assuming the pace of building components hasn't been decreasing significantly. I'd probably grant you this, based on my impression of progress in AI, although it could depend on what specific components you have in mind.
  3. What if the last component is actually made up of many components?

I agree with the rest of your comment, but it doesn't really give me much reason to believe it's close, rather than just closer than before/otherwise.

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2023-05-09T08:17:33.641Z · LW(p) · GW(p)

Yeah, it was pretty underspecified, I was just gesturing at the idea.

Even more informally: Just look at GPT-4. Imagine that you're doing it with fresh eyes, setting aside all the fancy technical arguments. Does it not seem like it's almost there? Whatever the AI industry is doing, it sure feels like it's moving in the right direction, and quickly. And yes, it's possible that the common sense is deceptive here; but it's usually not.

Or, to make a technical argument: The deep-learning paradigm is a pretty broad-purpose trick. Stochastic gradient descent isn't just some idiosyncratic method of training neural networks; it's a way to automatically generate software that meets certain desiderata. And it's compute-efficient enough to generate software approaching human brains in complexity. Thus, I don't expect that we'll need to move beyond it to get to AGI — general intelligence is reachable by doing SGD over some architecture.

I expect we'll need advancement(s) on the order of "fully-connected NN -> transformers", not "GOFAI -> DL".

Replies from: MichaelStJules
comment by MichaelStJules · 2023-05-09T09:03:22.915Z · LW(p) · GW(p)

I would say it seems like it's almost there, but it also seems to me to already have some fluid intelligence, and that might be why it seems close. If it doesn't have fluid intelligence, then my intuition that it's close may not be very reliable.

comment by PeterMcCluskey · 2023-05-06T02:56:32.000Z · LW(p) · GW(p)

Might this paradigm be tested by measuring LLM fluid intelligence?

I predict that a good test would show that current LLMs have modest amounts of fluid intelligence, and that LLM fluid intelligence will increase in ways that look closer to continuous improvement than to a binary transition from nothing to human-level.

I'm unclear whether it's realistic to get a good enough measure of fluid intelligence to resolve this apparent crux, but I'm eager to pursue any available empirical tests of AI risk.

comment by Nathan Helm-Burger (nathan-helm-burger) · 2023-05-02T23:34:27.423Z · LW(p) · GW(p)

I agree with some of this, although I'm doubtful that the transition from sub-AGI to AGI is as sharp as outlined. I don't think that's impossible though, and I'd rather not take the risk. I do think it's possible to dumb down an AGI if you still have enough control over it to do things like inject noise into its activations between layers...

I'm hopeful that we can solve alignment iff we can contain and study a true AGI. Here's a comment I wrote on another post about the assumptions which give me hope we might manage alignment...

It seems to me like one of the cruxes is that there is this rough approximate alignment that we can currently do. It's rough in the sense that it's spotty, not covering all cases. It's approximate in that it's imprecise and doesn't seem to work perfectly even in the cases it covers.

The crux is whether the forecaster expects this rough approximate alignment to get easier and more effective as the model gets more capable, because the model understands what we want better. Or whether it will get harder as the model gets more capable, because the model will cross certain skill thresholds relating to self-awareness and awareness of instrumental goals.

I am in the camp that this will get harder as the model gets more competent. If I were in the 'gets easier' camp, then my views would be substantially closer to Rohin's and Quinton Pope's and Alex Turner's more optimistic views.

I am, however, a bit more optimistic than Connor, I think. My optimism hinges on a different crux which has come up multiple times when discussing this with less optimistic people having views more like Connor's or Eliezer's or Nate Soares'.

This crux which gives me an unusual amount of optimism depends on three hopes.

First is that I believe it is possible to safely contain a slightly-superintelligent AGI in a carefully designed censored training simulation on a high security compute cluster. 

Second is that I also think that we will get non-extinction level near-misses before we have a successfully deceptive AGI, and that these will convince the leading AI labs to start using more thorough safety precautions. I think there are a lot of smart people currently in the camp of "I'll believe it when I see it" for AGI risk. It is my hope that they will change their minds and behaviors quickly once they do see real world impacts.

Third is that we can do useful alignment experimentation work on the contained slightly-superhuman AGI without either accidentally releasing it or fooling ourselves into thinking we've fixed the danger without truly having fixed it. This gives us the opportunity to safely iterate gradually towards success.

Obviously, all three of these are necessary for my story of an optimistic future to go well. A failure of one renders the other two moot.

Note that I expect an adequate social response would include bureaucratic controls adequate to prevent reckless experimentation on the part of monkeys overly fascinated by the power of the poisoned banana.

comment by Chris_Leong · 2023-05-07T16:14:24.314Z · LW(p) · GW(p)

I see some value in the framing of "general intelligence" as a binary property, but it also doesn't quite feel as though it fully captures the phenomenon. Like, it would seem rather strange to describe GPT4 as being a 0 on the general intelligence scale.

I think maybe a better analogy would be to consider the sum of a geometric sequence.

Consider the sum $\frac{a}{1-r}$ of the infinite series $a + ar + ar^2 + \dots$ for a few values of $r$ as it increases at a steady rate.

0.5 - 2a
0.6 - 2.5a
0.7 - 3.3a
0.8 - 5a
0.9 - 10a
1 - Diverges to infinity

What we see then is quite significant returns to increases in r and then a sudden divergence.
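The figures above can be checked with a minimal sketch, assuming the sum in question is the infinite geometric series a + ar + ar² + ..., which converges to a/(1 − r) for 0 ≤ r < 1. The function name is illustrative only.

```python
import math


def geometric_sum_factor(r: float) -> float:
    """Return S/a for the infinite geometric series a + a*r + a*r**2 + ...

    The series converges to a / (1 - r) only when r < 1, so r >= 1 is
    reported as divergence (infinity). Negative ratios are out of scope
    for this sketch.
    """
    if r < 0:
        raise ValueError("this sketch only covers non-negative ratios")
    if r >= 1:
        return math.inf
    return 1.0 / (1.0 - r)


for r in (0.5, 0.6, 0.7, 0.8, 0.9, 1.0):
    factor = geometric_sum_factor(r)
    label = "diverges" if math.isinf(factor) else f"{factor:.1f}a"
    print(f"r = {r}: {label}")
```

Equal steps in r buy ever larger jumps in the sum, which is the "significant returns, then sudden divergence" shape described here.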

(Aside: This model feels related to that of nuclear chain reactions in that you can model the total production of reactions as a geometric sequence, however, this model doesn't just have sub-criticality and super-criticality, but criticality.  And I'm not sure how you'd fit criticality in here).

In contrast, many economists want to model AI as a more traditional exponentially increasing system (ie. ).

comment by Seth Herd · 2023-05-06T19:49:58.919Z · LW(p) · GW(p)

I think this is insightful, pointing correctly to a major source of bifurcation in p(doom) estimates. I view this as the old guard vs. new wave perspectives on alignment.

Unfortunately, I mostly agree with these positions. I'm afraid a lack of attention to these claims may be making the new wave of alignment thinkers more optimistic than is realistic. I do partially disagree with some of these, and that makes my p(doom) a good bit lower than the MIRI 99%. But it's not enough to make me truly optimistic. My p(doom) is right around the 50% "who knows" mark.

I'll restate the main claims as:

  1. We only get one chance
  2. We have to get the AI's values exactly aligned with human values
    1. There will be a sharp discontinuity as an AI becomes truly generally intelligent
    2. The process of value reflection seems highly unstable
  3. No known method of dodging this problem is likely to work

The source of almost all my disagreement with you is in the type of AGI we expect. I expect (with above 50% probability) AGI to arise from the expansion of LLMs into language model based cognitive architectures that use LLMs as the core engine, but expand on them in a chain-of-thought, and allow them to use external tools. These expectations are entirely theoretical since AutoGPT and HuggingGPT were only released about a month or so ago. My post Capabilities and alignment of LLM cognitive architectures [LW · GW] elaborates on why I expect these to work well.

I think such systems will readily become weakly general (at variance with your expectation of a more all-or-nothing transition) by learning about new domains through web search and experimentation with their cognitive tools, storing that knowledge in episodic memory. (I also think that before long, they will use that episodic, declarative knowledge to fine-tune the central LLM, much as humans absorb new knowledge into skills). Importantly, I expect this generality to extend to understanding themselves as systems, thereby giving rise to something like value reflection.

This is bad because it advances timelines if true, but really good in that such systems can be run without using any RL or persistent context in one LLM.

None of the above considerations are in that post; I'm writing another that focuses on them.

In that scenario, I expect us to get a few shots, as the transition to truly general will be slow and happen in highly interpretable natural language agent systems. There are still many dangers, but I think this would massively improve our odds.

Whether or not AGI arises from that or a different network-based system, I agree that the value reflection process is unpredictable, so we may have to get value alignment exactly right. I expect the central strongest value to be preserved in a reflective value-editing process. But that means that central value has to be exactly right. Whether any broader configuration of values might be stable in a learning network is unknown, and I think worthy of a good deal more thought.

One random observation: I think your notion of general intelligence overlaps strongly with the common concept of recursive self improvement, which many people do include in their mental models.

Anyway, thanks for an insightful post that nails a good deal of the variance between my model of the average alignment optimist and pessimist.

Replies from: roger-d-1
comment by RogerDearnaley (roger-d-1) · 2023-12-05T10:41:33.178Z · LW(p) · GW(p)

We have to get the AI's values exactly aligned with human values

This is a major crux for me, and one of the primary reasons my P(DOOM) isn't >90%. If you use value learning [? · GW], you only need to get your value learner aligned well enough for it to a) start inside the region of convergence to true human values (i.e. it needs some passable idea what the words 'human' and 'values' mean and what the definition of "human values" is, like any small LLM has), and b) not kill everyone while it's learning the details, and it will do its research and Bayesianly converge on human values (and if it's not capable of competently being Bayesian enough to do that, it's not superhuman, at least at STEM [LW · GW]). So, if you use value learning, the only piece you need to get exactly right (for outer alignment) is the phrasing of the terminal goal saying "Use value learning". For something containing an LLM, I think that might be about one short paragraph of text, possibly with one equation in it. The prospect of getting one paragraph of text and one equation right, with enough polishing and peer review, doesn't actually seem that daunting to me.

comment by Garrett Baker (D0TheMath) · 2023-05-03T20:06:45.804Z · LW(p) · GW(p)

Your definition of general intelligence would include SGD on large neural networks. It is able to generalize from very few examples, learn and transform novel mathematical objects, be deployed on a wide variety of problems, and so on. Though it seems a pretty weak form of general intelligence, like evolution or general function optimization algorithms. Though perhaps it's less general than evolution and less powerful than function optimization algorithms.

If we take this connection at face-value, we can maybe use SGD as a prototypical example for general intelligence, and ask: what about SGD makes it so generally capable? A few answers come to mind:

  1. Neural networks with SGD have a pretty good (though perhaps imperfect) prior
  2. Neural networks with SGD scale adequately with more network nodes and data

Intuitively you would expect to only need one of these. You should be able to make up for a faulty prior with a ton of data or make up for not much data with a pretty good prior. Neural networks with SGD seem pretty good at both of these, indicating perhaps that they don't trade off against each other as much as a Bayesian may naively think. In particular, worlds where both of these components are necessary for an adequate general intelligence are worlds where data are generated from a long-tailed distribution over distributions. Aka, where it's not that uncommon to encounter novel information despite already having a ton of information.
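The naive Bayesian trade-off between prior quality and data volume can be made concrete with a toy sketch. It assumes a coin with a hypothetical true bias of 0.7 and Beta priors; the numbers, names, and idealised counts are all illustrative and say nothing about neural networks specifically.

```python
from typing import Tuple

TRUE_BIAS = 0.7  # hypothetical ground truth, chosen only for illustration


def posterior_mean(prior_heads: float, prior_tails: float,
                   heads: int, tails: int) -> float:
    """Posterior mean of a coin's bias under a Beta(prior_heads, prior_tails) prior."""
    return (prior_heads + heads) / (prior_heads + prior_tails + heads + tails)


def idealised_counts(n: int) -> Tuple[int, int]:
    """Heads/tails counts for n flips that match TRUE_BIAS exactly (no sampling noise)."""
    heads = round(TRUE_BIAS * n)
    return heads, n - heads


# Faulty prior (centred on 0.1) rescued by a large amount of data.
bad_prior_big_data = posterior_mean(2, 18, *idealised_counts(10_000))
# Good prior (centred on 0.7) with very little data.
good_prior_small_data = posterior_mean(14, 6, *idealised_counts(10))
# Faulty prior and very little data: neither resource compensates.
bad_prior_small_data = posterior_mean(2, 18, *idealised_counts(10))

print(bad_prior_big_data, good_prior_small_data, bad_prior_small_data)
```

In this toy setting either a good prior or a lot of data is enough to land near the true bias, and only the case with neither ends up far off.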

SGD seems like it can be made tremendously smarter by feeding in more data and stacking more layers. If you have some SGD-like process in your neural network, these improvements seem easy enough to develop by further training. You could imagine an AlphaGo-style process which takes in a data structure including a problem statement and a goal, and derives a winning plan for the problem. Increasing the number of layers would be the equivalent of devoting more subspaces to the computation, and more data would involve replicating a similar process in the next layer or through recurrence. With current LLMs, both processes seem difficult to implement, but people have found SGD-like structures inside transformers (and many other networks with residual connections), so even a sprinkle of SGD without scaling seems to pull its weight.

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2023-05-04T12:47:17.122Z · LW(p) · GW(p)

Your definition of general intelligence would include SGD on large neural networks

I don't count it in, actually. In my view, the boundaries of the algorithm here aren't "SGD + NN", but "the training loop" as a whole, which includes the dataset and the loss/reward function. A general intelligence implemented via SGD, then, would correspond to an on-line training loop that can autonomously (without assistance from another generally-intelligent entity, like a human overseer) learn to navigate any environment.

I don't think any extant training-loop setup fits this definition. They all need externally-defined policy gradients [LW · GW]. If the distribution on which they're trained changes significantly, the policy gradient (loss/reward function) would need to be changed to suit — and that'd need to be done by something external to the training loop, which already understands the new environment (e. g., the human overseer) and knows how the policy gradient needs to be adapted to keep the system on-target.
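As a toy illustration of what "externally defined" means here, consider the sketch below. It is a one-parameter model trained by plain SGD, not any real training setup, and all names are hypothetical. The point is only that the dataset and the loss gradient are both passed in from outside the loop.

```python
import random
from typing import Callable, List, Tuple

Example = Tuple[float, float]                    # (input x, target y)
GradFn = Callable[[float, float, float], float]  # (w, x, y) -> dLoss/dw


def squared_error_grad(w: float, x: float, y: float) -> float:
    """Gradient of (w*x - y)**2 with respect to w; picked by the overseer."""
    return 2.0 * (w * x - y) * x


def training_loop(dataset: List[Example], grad_fn: GradFn,
                  lr: float = 0.05, epochs: int = 200, seed: int = 0) -> float:
    """Plain SGD over the one-parameter model y = w*x.

    Both `dataset` and `grad_fn` come from outside: the loop itself has
    no way to decide what it should be optimizing for.
    """
    rng = random.Random(seed)
    data = list(dataset)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            w -= lr * grad_fn(w, x, y)
    return w


# Environment A: the target relation is y = 3x.
env_a = [(k / 10, 3 * k / 10) for k in range(1, 11)]
w_a = training_loop(env_a, squared_error_grad)

# Environment B: the relation changes to y = -2x. Nothing inside
# training_loop detects the change; something external has to supply
# the new data (and, in general, a suitably adapted loss).
env_b = [(k / 10, -2 * k / 10) for k in range(1, 11)]
w_b = training_loop(env_b, squared_error_grad)

print(round(w_a, 3), round(w_b, 3))
```

The loop fits whichever environment it is handed, but choosing the data and the objective happens entirely at the call site, which is the part played by the external overseer in the argument above.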

(LLMs trained via SSL are a degenerate case: in their case the prediction gradient = the policy gradient. They also can't autonomously generalize to generating new classes of text without first being shown a carefully curated dataset of such texts. They're not an exception.)

Replies from: D0TheMath
comment by Garrett Baker (D0TheMath) · 2023-05-04T15:49:09.173Z · LW(p) · GW(p)

I’m skeptical that locating the hyperparameters you mention is an AGI-complete task.

comment by RogerDearnaley (roger-d-1) · 2023-12-05T10:59:34.357Z · LW(p) · GW(p)

I agree completely about AGI being like Turing completeness, that there's a threshold. However, there are programming languages that are technically Turing complete, but that only a masochist would actually try to use. So there could be a fire alarm, while the AGI is still writing all the (mental analogs of) domain-specific languages and libraries it needs. My evidence for this is humans: we're over the threshold, but barely so, and it takes years and years of education to turn us into a quantum field theorist or an aeronautical engineer.

But my main crux is that I think we already have a good idea how to align an AGI: value learning [? · GW]. See my post Requirements for a STEM-capable AGI Value Learner [LW · GW]. That's an alignment technique that only works on things over the threshold.

comment by Lightcide · 2023-05-14T03:29:25.470Z · LW(p) · GW(p)

In-context learning in LLMs maps fairly well onto the concept of fluid intelligence. There are several papers now indicating that general learning algorithms emerge in LLMs to facilitate in-context learning.

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2023-05-14T04:37:12.507Z · LW(p) · GW(p)

I assume you're talking about things like that [LW · GW]?

I'll grant that it does have a spooky vibe.

comment by Thane Ruthenis · 2023-05-09T07:46:10.677Z · LW(p) · GW(p)

Do you think you could find or develop a test of fluid intelligence that LLMs would fail to demonstrate any fluid intelligence in and generally do worse than the vast majority of humans on?

See here [LW(p) · GW(p)], starting from "consider a scheme like the following". In short: should be possible, but seems non-trivially difficult.

Do you think LLMs haven't developed general problem-solving heuristics by seeing lots and lots of problems across domains as well as plenty of fluid intelligence test questions and answers? Wouldn't that count as fluid intelligence?

I think forcing increasingly-larger LLMs to solve increasingly more problems of increasingly greater variety does result in an AGI, in the limit. I suspect it's not a realistic way to approach that limit, however — that we're going to run into ruinous demands on compute/data before we succeed at this. Consider trying to generate a literal "question -> answer" lookup table, for every possible pair of questions and answers. It's not quite the case with LLMs, but the relevant intuition should apply.

... If my baseline suspicion is right. I do acknowledge that I may be wrong about LLMs, that they may already have fluid intelligence/be "weak" AGIs. I'd place it under 2% for GPT-4, though.

Replies from: MichaelStJules
comment by MichaelStJules · 2023-05-09T08:47:32.068Z · LW(p) · GW(p)

(Your reply is in response to a comment I deleted, because I thought it was basically a duplicate of this one [LW(p) · GW(p)], but I'd be happy if you'd leave your reply up, so we can continue the conversation.)

See here [LW(p) · GW(p)], starting from "consider a scheme like the following". In short: should be possible, but seems non-trivially difficult.

That seems like a high bar to me for testing for any fluid intelligence, though, and the vast majority of humans would do about as badly or worse (but possibly because of far worse crystallized intelligence). Similarly, in your post, "No scientific breakthroughs, no economy-upturning startup pitches, certainly no mind-hacking memes."

 

I would say to look at it based on definitions and existing tests of fluid intelligence. These are about finding patterns and relationships between unfamiliar objects and any possible rules relating to them, applying those rules and/or inference rules with those identified patterns and relationships, and doing so more or less efficiently. More fluid intelligence means noticing patterns earlier, taking more useful steps and fewer useless steps.

Some ideas for questions:

  1. Invent new games or puzzles, and ask it to achieve certain things from a given state.
  2. Invent new mathematical structures (e.g. new operations on known objects, or new abstract algebraic structures based on their axioms) and ask the LLM to reason about them and prove theorems (that weren't too hard to prove yourself or for someone else to prove).
  3. Ask it to do hardness proofs (like NP-hardness proofs), either between two new problems, or just with one problem (e.g. ChatGPT proved a novel problem was NP-hard here).
  4. Maybe other new discrete math problems.
  5. EDIT: New IMO and Putnam problems.

My impression is that there are few cross-applicable techniques in these areas, and the ones that exist often don't get you very far to solving problems. To do NP-hardness proofs, you need to identify patterns and relationships between two problems. The idea of using "gadgets" is way too general and hides all of the hard work, which is finding the right gadget to use and how to use it. EDIT: For IMO and Putnam problems, there are some common tools, too, but if just simple pattern matching for those was all it took, math undergrads would generally be good at them, and they're not, so it probably does take considerable fluid intelligence.

I guess one possibility is that an LLM can try a huge number of steps and combinations of steps before generating the next token, possibly looking ahead multiple steps internally before picking one. Maybe it could solve hard problems this way without fluid intelligence.

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2023-05-09T19:38:27.303Z · LW(p) · GW(p)

Invent new mathematical structures (e.g. new operations on known objects, or new abstract algebraic structures based on their axioms) and ask the LLM to reason about them and prove theorems (that weren't too hard to prove yourself or for someone else to prove).

Yup, that's also my current best guess for what this sort of test must look like.

  • Choose some really obscure math discipline $D$, one that we're pretty sure lacks many practical applications (i. e., won't be convergently learned from background data about the world).
  • Curate the AI's dataset to only include information up to some point in time $T_1$.
  • Guide the AI step-by-step (as in, via chain-of-thought prompting or its equivalent) through replicating all discoveries made in $D$ between $T_1$ and the present $T_2$.

Pick the variables such that the inferential gap between $D(T_1)$ and $D(T_2)$ is large (can't be cleared by a non-superintelligent logical leap), but the gaps between individual insights are tiny. This would ensure that our AI would only be able to reach $D(T_2)$ if it's able to re-use its insights (i. e., build novel abstractions, store them in the context window/short-term memory, fluidly re-use them when needed), while not putting onerous demands on how good each individual insight must be. See also [LW(p) · GW(p)].

I should probably write up a post about it, and maybe pitch this project to the LTFF or something.

comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2023-05-06T17:47:34.652Z · LW(p) · GW(p)

It's not possible to let an AGI keep its capability to engineer nanotechnology while taking out its capability to deceive and plot, any more than it's possible to build an AGI capable of driving red cars but not blue ones. They're "the same" capability in some sense, and our only hope is to make the AGI want to not be malign.

Seems very overconfident if not plain wrong; consider as an existence proof that 'mathematicians score higher on tests of autistic traits, and have higher rates of diagnosed autism, compared with people in the general population' and classic autism tests are about false belief inference.

Replies from: localdeity
comment by localdeity · 2023-05-06T18:41:48.946Z · LW(p) · GW(p)

Interesting point.  Though I suspect—partly using myself as an example (I scored 33 on the Autism Spectrum Quotient, and for math I'll mention qualifying for USAMO 3 times)—that these autistic mathematician types, while disinclined to be deceptive (likely finding it abhorrent, possibly having strong ethical stances about it), are still able to reason about deception in the abstract: e.g. if you give them logic puzzles involving liars, or detective scenarios where someone's story is inconsistent with some of the evidence, they'll probably do well at them.  Or, if you say "For April Fool's, we'll pretend we're doing X", or "We need to pretend to the Nazis that we're doing X", they can meticulously figure out all the details that X implies and come up with plausible justifications where needed.

In other words, although they're probably disinclined to lie and unpracticed at it, if they do decide to do it, I think they can do it, and there are aspects of constructing a plausible, mostly-consistent lie that they're likely extremely good at.

Replies from: bogdan-ionut-cirstea
comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2023-05-07T09:27:21.236Z · LW(p) · GW(p)

Thanks for your comment and your perspective, that's an interesting hypothesis. My intuition was that worse performance at false belief inference -> worse at deception, manipulation, etc. As far as I can tell, this seems mostly borne out by a quick Google search, e.g. Autism and Lying: Can Autistic Children Lie?, Exploring the Ability to Deceive in Children with Autism Spectrum Disorders, People with ASD risk being manipulated because they can't tell when they're being lied to, Strategic Deception in Adults with Autism Spectrum Disorder.

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2023-05-07T16:51:57.108Z · LW(p) · GW(p)

My opinion is that it's caused by internal limitations placed on the general-intelligence component (see footnote 2). Autistic people can reason about deception formally, same as anybody, but they can't easily translate that understanding into practical social acumen, because humans don't have write-access to their instincts/shards/System 1.

And they have worse instincts in the social domain to begin with because of... genes that codify nonstandard reward/reinforcement circuitry [LW · GW], I assume? Suppose that in a median person, there's circuitry that reinforces cognition that is upstream of some good social consequences, like making a person smile. That gradually causes the accumulation of crystallized-intelligence structures/shards specialized for social interactions. Autistic people lack this signal, or receive weaker reinforcement from it[1]. Thus, by default, they fail to develop much System-1 expertise for this domain.

They can then compensate for it by solving the domain "manually" using their fully general intelligence. They construct good heuristics, commit them to memory, and learn to fire them when appropriate — essentially replicating by-hand the work that's done automatically in the neurotypical people's case.

Or so my half-educated guess goes. I don't have much expertise here, beyond reading some Scott Alexander. @cfoster0 [LW · GW], want to weigh in here?

As to superintelligent AGIs, they would be (1) less limited in their ability to directly rewrite their System-1-equivalent (their GI components would have more privileges over their minds), (2) much better at solving domains "manually" and generating heuristics "manually". So even if we do hamstring our AGI's ability to learn e. g. manipulation skills, it'll likely be able to figure them out on its own, once it's at human+ level of capabilities.

  1. ^

    Reminder that reward is not the optimization target [LW · GW]. What I'm stating here is not exactly "autistic people don't find social interactions pleasant so they don't engage in them so they don't get experience with them". Rather, it means that the optimization process that's shaping their cognition literally doesn't pay much/any attention to improving their socialization-related cognition.

    If a neurotypical person has some thought, and acts on it, and that successfully manages to make someone smile, their cognition automatically changes to be more likely to have such thoughts in the future. An autistic person's cognition would not be changed in this way.

Replies from: localdeity
comment by localdeity · 2023-05-07T17:17:31.433Z · LW(p) · GW(p)

I've seen several smart autistic people on the internet say some variation of "they learn to emulate in software what normal people do in hardware, and that's how they manage to navigate social life well enough".  Essentially as you describe.  And I'd add that a major component of high intelligence likely means being good at doing things "in software" (approx. "system 2").

comment by PeterMcCluskey · 2023-05-06T02:55:39.773Z · LW(p) · GW(p)

Upvoted for clarifying a possibly important crux. I still have trouble seeing a coherent theory here.

I can see a binary difference between Turing-complete minds and lesser minds, but only if I focus on the infinite memory and implicitly infinite speed of a genuine Turing machine. But you've made it clear that's not what you mean.

When I try to apply that to actual minds, I see a wide range of abilities at general-purpose modeling of the world.

Some of the differences in what I think of as general intelligence are a function of resources, which implies a fairly continuous scale, not a binary distinction.

Other aspects are a function of accumulated knowledge. That's somewhat lumpier, but still doesn't look close to a binary difference.

Henrich's books The Secret of Our Success and The WEIRDest People in the World [LW · GW] suggest that humans have been gradually building up the ability to handle increasingly abstract problems.

Our ancestors of a couple million years ago had language that enabled them to handle a somewhat larger class of mental tasks than other apes.

Tools such as writing, and new concepts such as the Turing machine, enabled them to model ideas that they'd previously failed to find ways to handle.

I see plenty of hints that other mammals have weaker versions of this abstract thought. I'd be surprised if humans have reached the limits of what is possible.

So, when I try to treat general intelligence as a binary, I alternate between doubting that humans have it, and believing that most animals and LLMs have it.

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2023-05-06T17:16:17.628Z · LW(p) · GW(p)

Hm, I think your objections are mostly similar to the objections cfoster0 is raising in this thread [LW(p) · GW(p)], so in lieu of repeating myself, I'll just link there.

Do point out if I misunderstood and some of your points are left unaddressed.

comment by Rohin Shah (rohinmshah) · 2023-05-05T08:47:17.974Z · LW(p) · GW(p)

What ties it all together is the belief that the general-intelligence property is binary.

Do any humans have the general-intelligence property?

If yes, after the "sharp discontinuity" occurs, why won't the AGI be like humans (in particular: generally not able to take over the world)?

If no, why do we believe the general-intelligence property exists?

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2023-05-05T12:50:48.348Z · LW(p) · GW(p)

Do any humans have the general-intelligence property?

Yes, ~all of them. Humans are not superintelligent because despite their minds embedding the algorithm for general intelligence, that algorithm is still resource-constrained (by the brain's compute) and privilege-constrained within the mind (e. g., it doesn't have full write-access to our instincts). There's no reason to expect that AGI would naturally "stall" at the exact same level of performance and restrictions. On the contrary: even if we resolve to check for "AGI-ness" often, with the intent of stopping the training the moment our AI becomes true AGI but still human-level or below it, we're likely to miss the right moment without advanced interpretability tools, and scale it past "human-level" straight to "impossible-to-ignore superintelligent". There would be no warning signs, because "weak" AGI (human-level or below) can't be clearly distinguished from a very capable pre-AGI system, based solely on externally-visible behaviour.

See Section 5 [LW · GW] for more discussion of all of that.

If no, why do we believe the general-intelligence property exists?

Quoting from my discussion with cfoster0 [LW(p) · GW(p)]:

[If general intelligence didn't exist,] I would expect to observe much greater diversity in cognitive capabilities of animals, for humans to generalize more poorly, and for the world overall to be more incomprehensible to us.

E. g., there'd be things like, we'd see octopi frequently executing some sequences of actions that lead to beneficial outcomes for them, and we would be fundamentally unable to understand what is happening.  As it is, sure, some animals have specialized cognitive algorithms that may be better than human ones in their specific niches, but we seem to always be able to comprehend them. We can always figure out why they decide to execute various plans, based on what evidence, and how these plans lead to whatever successes they achieve. A human can model any animal's cognition; a human's cognition is qualitatively more capable than any animal's. If true generality didn't exist, I'd expect that not to be true.

Scaling it up, the universe as a whole would be more incomprehensible. I'd referred to ontologically complex processes [LW · GW] when discussing that in Section 3 — processes such that there are no cognitive features in our minds that would allow us to emulate them. That'd be the case all over the place: we'd look at the world, and see some systemic processes that are not just hard to understand, but are fundamentally beyond reckoning.

The fact that we observe neither (and that this state of affairs is even hard/impossible for us to imagine) suggests that we're fully general, in the sense outlined in the post.

Replies from: rohinmshah, TAG
comment by Rohin Shah (rohinmshah) · 2023-05-05T20:32:23.939Z · LW(p) · GW(p)

See Section 5 [LW · GW] for more discussion of all of that.

Sorry, I seem to have missed the problems mentioned in that section on my first read.

There's no reason to expect that AGI would naturally "stall" at the exact same level of performance and restrictions.

I'm not claiming the AGI would stall at human level, I'm claiming that on your model, the discontinuity should have some decent likelihood of ending at or before human level.

(I care about this because I think it cuts against this point: We only have one shot. There will be a sharp discontinuity [LW · GW] in capabilities once we get to AGI, and attempts to iterate on alignment will fail [LW · GW].  Either we get AGI right on the first try, or we die. In particular it seems like if the discontinuity ends before human level then you can iterate on alignment.)

that algorithm is still resource-constrained (by the brain's compute) and privilege-constrained within the mind (e. g., it doesn't have full write-access to our instincts)

Why isn't this also true of the weak AGI? Current models cannot autonomously get more compute (humans have to give it to them) or perform gradient descent on their own weights (unless the humans specifically try to make that happen); most humans placed in the models' position would not be able to do that either.

It sounds like your answer is that the development of AGI could lead to something below-human-level, that wouldn't be able to get itself more compute / privileges, but we will not realize that it's AGI, so we'll give it more compute / privileges until it gets to "so superintelligent we can't do anything about it". Is that correct?

There would be no warning signs, because "weak" AGI (human-level or below) can't be clearly distinguished from a very capable pre-AGI system, based solely on externally-visible behaviour.

... Huh. How do you know that humans are generally intelligent? Are you relying on introspection on your own cognitive process, and extrapolating that to other humans?

What if our policy is to scale up resources / privileges available to almost-human-level AI very slowly? Presumably after getting to a somewhat-below-human-level AGI, with a small amount of additional resources it would get to a mildly-superhuman-level AI, and we could distinguish it then?

Or maybe you're relying on an assumption that the AGI immediately becomes deceptive and successfully hides the fact that it's an AGI?

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2023-05-05T21:05:02.695Z · LW(p) · GW(p)

I'm not claiming the AGI would stall at human level, I'm claiming that on your model, the discontinuity should have some decent likelihood of ending at or before human level.

Hm? "Stall at the human level" and "the discontinuity ends at or before the human level" read like the same thing to me. What difference do you see between the two?

It sounds like your answer is that the development of AGI could lead to something below-human-level, that wouldn't be able to get itself more compute / privileges, but we will not realize that it's AGI, so we'll give it more compute / privileges until it gets to "so superintelligent we can't do anything about it". Is that correct?

Basically, except instead of directly giving it privileges/compute, I meant that we'd keep training it until the SGD gives the GI component more compute and privileges over the rest of the model (e. g., a better ability to rewrite its instincts).

The strategy of slowly scaling our AI up is workable at the core, but IMO there are a lot of complications:

  • A "mildly-superhuman" AGI, or even just a genius-human AGI, is still an omnicide risk [LW · GW] (see also [LW · GW]). I wouldn't want to experiment with that; I would want it safely at average-human-or-below level. It's likely hard to "catch" it at that level by inspecting its external behavior, though: that can only be reliably done via advanced interpretability tools.
  • Deceptiveness (and manipulation) is a significant factor, as you've mentioned. Even just a mildly-superhuman AGI will likely be very good at it. Maybe not implacably good, but it'd be like working bare-handed with an extremely dangerous chemical substance, with the whole of humanity at stake.
  • The problem of "iterating" on this system. If we have just a "weak" AGI on our hands, it's mostly a pre-AGI system, with a "weak" general-intelligence component that doesn't control much. Any "naive" approaches, like blindly training interpretability probes on it or something, would likely ignore that weak GI component, and focus mainly on analysing or shaping heuristics/shards. To get the right kind of experience from it, we'd have to very precisely aim our experiments at the GI component — which, again, likely requires advanced interpretability tools.

Basically, I think we need to catch the AGI-ness while it's at an "asymptomatic" stage, because the moment it becomes visible it's likely already incredibly dangerous (if not necessarily maximally dangerous).

... Huh. How do you know that humans are generally intelligent? Are you relying on introspection on your own cognitive process, and extrapolating that to other humans?

More or less, plus the theoretical argument from the apparent Turing-completeness of human understanding and the lack of empirical evidence to the contrary. Our "mental vocabulary" is Turing-complete, so we should very literally be able to model anything that can be modeled (up to our working-memory limits) — and, indeed, we're yet to observe anything we can't model.

I'm not sure why the extrapolation step would be suspect?

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2023-05-06T13:24:38.474Z · LW(p) · GW(p)

Hm? "Stall at the human level" and "the discontinuity ends at or before the human level" read like the same thing to me. What difference do you see between the two?

Discontinuity ending (without stalling):

Stalling:

Basically, except instead of directly giving it privileges/compute, I meant that we'd keep training it until the SGD gives the GI component more compute and privileges over the rest of the model (e. g., a better ability to rewrite its instincts).

Are you imagining systems that are built differently from today? Because I'm not seeing how SGD could give the GI component an ability to rewrite the weights or get more compute given today's architectures and training regimes.

(Unless you mean "SGD enhances the GI component until the GI component is able to hack into the substrate it is running on to access the memory containing its own weights, which it can then edit", though I feel like it is inaccurate to summarize this as "SGD gives it more privileges", so probably you don't mean that.)

(Or perhaps you mean "SGD creates a set of weights that effectively treats the input English tokens as a programming language by which the network's behavior can be controlled, and the GI component can then select tokens to output that both achieve low loss and also allow it to control its instincts on the next forward pass", but this also seems super exotic and is probably not what you mean.)

More or less, plus the theoretical argument from the apparent Turing-completeness of human understanding and the lack of empirical evidence to the contrary.

Interesting. Personally I would talk about humans generalizing to doing science as evidence for our general intelligence. The theoretical arguments + introspection are relatively minor bits of evidence relative to that, for me. I'm surprised it isn't the same for you.

(If you did buy that story though, then I'd think it should be possible in your view to have behavioral tests of AGI before it is so superintelligent that we've lost control.)

I'm not sure why the extrapolation step would be suspect?

It isn't suspect, sorry, I didn't mean to imply that.

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2023-05-06T14:56:31.414Z · LW(p) · GW(p)

Discontinuity ending (without stalling):

Stalling:

Ah, makes sense.

Are you imagining systems that are built differently from today?

I do expect that some sort of ability to reprogram itself at inference time will be ~necessary for AGI, yes. But I also had in mind something like your "SGD creates a set of weights that effectively treats the input English tokens as a programming language" example. In the unlikely case that modern transformers are AGI-complete, I'd expect something on that order of exoticism to be necessary (but it's not my baseline prediction).

Personally I would talk about humans generalizing to doing science as evidence for our general intelligence

"Doing science" is meant to be covered by "lack of empirical evidence that there's anything in the universe that humans can't model". Doing science implies the ability to learn/invent new abstractions, and we're yet to observe any limits to how far we can take it / what that trick allows us to understand.

(If you did buy that story though, then I'd think it should be possible in your view to have behavioral tests of AGI before it is so superintelligent that we've lost control.)

Mmm. Consider a scheme like the following:

  • Let $T$ be the current date.
  • Train an AI on all of humanity's knowledge up to a point in time $t_0$, where $t_0 < T$.
  • Assemble a list $D$ of all scientific discoveries made in the time period $(t_0, T]$.
  • See if the AI can replicate these discoveries.

At face value, if the AI can do that, it should be considered able to "do science" and therefore AGI, right?

I would dispute that. If the period $(t_0, T]$ is short enough, then it's likely that most of the cognitive work needed to make the leap to any discovery in $D$ is already present in the data up to $t_0$. Making a discovery from that starting point doesn't necessarily require developing new abstractions/doing science: it's possible that it may be done just by interpolating between a few already-known concepts. And here, some asymmetry between humans and e. g. SOTA LLMs becomes relevant:

  • No human knows everything the entire humanity knows. Imagine if making some discovery in $D$ by interpolation required combining two very "distant" concepts, like a physics insight and advanced biology knowledge. It's unlikely that there'd be a human with sufficient expertise in both, so a human will likely do it by actual-science (e. g., a biologist would re-derive the physics insight from first principles).
  • An LLM, however, has a bird's eye view on the entire human concept-space up to $t_0$. It directly sees both the physics insight and the biology knowledge, at once. So it can just do an interpolation between them, without doing truly-novel research.

Thus, the ability to produce marginal scientific insights may mean either the ability to "do science", or that the particular scientific insight is just a simple interpolation between already-known but distant concepts.

On the other hand, now imagine that the period $(t_0, T]$ is very large, e. g. from 1940 to 2020. We'd then be asking our AI to make very significant discoveries, such that they surely can't be done by simple interpolation, only by actually building chains of novel abstractions [LW(p) · GW(p)]. But... well, most humans can't do that either, right? Not all generally-intelligent entities are scientific geniuses. Thus, this is a challenge a "weak" AGI would not be able to meet, only a genius/superintelligent AGI, i. e., only an AGI that's already an extinction threat.

In theory, there should be a pick of $t_0$ that fits between the two extremes. A set of discoveries such that they can't be done by interpolation, but also don't require dangerous genius to solve.

But how exactly are we supposed to figure out what the right interval is? (I suppose it may not be an unsolvable problem, and I'm open to ideas, but skeptical on priors.)
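
For concreteness, here is a minimal sketch of the data split such a test would imply, assuming every document and every discovery carries a date. The names `Document`, `Discovery`, and `temporal_holdout` are hypothetical, made up for illustration; the sketch only shows the bookkeeping and does nothing to answer the question of which cutoff is the right one.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Document:
    published: date  # assumed: every training document has a known date
    text: str


@dataclass(frozen=True)
class Discovery:
    made_on: date  # assumed: every discovery can be dated
    name: str


def temporal_holdout(docs, discoveries, cutoff, today):
    """Return (training docs dated <= cutoff, discoveries dated in (cutoff, today])."""
    if not cutoff < today:
        raise ValueError("cutoff must be strictly before today")
    # The AI only ever sees material from before the cutoff...
    train = [d for d in docs if d.published <= cutoff]
    # ...and is then asked to replicate what was found after it.
    held_out = [x for x in discoveries if cutoff < x.made_on <= today]
    return train, held_out
```

Moving `cutoff` closer to `today` shrinks the held-out list toward discoveries that may be reachable by interpolation; moving it far back demands genius-level leaps. Nothing in the split itself says where the boundary between the two lies.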

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2023-05-06T15:15:24.748Z · LW(p) · GW(p)

Okay, this mostly makes sense now. (I still disagree but it no longer seems internally inconsistent.)

Fwiw, I feel like if I had your model, I'd be interested in:

  1. Producing tests for general intelligence. It really feels like there should be something to do here, that at least gives you significant Bayesian evidence. For example, filter the training data to remove anything talking about <some scientific field, e.g. complexity theory>, then see whether the resulting AI system can invent that field from scratch if you point it at the problems that motivated the development of the field.
  2. Identifying "dangerous" changes to architectures, e.g. inference time reprogramming. Maybe we can simply avoid these architectures and stick with things that are more like LLMs.
  3. Hardening the world against mildly-superintelligent AI systems, so that you can study them / iterate on them more safely. (Incidentally, I don't buy the argument that mildly-superintelligent AI systems could clearly defeat us all. It's not at all clear to me that once you have a mildly-superintelligent AI system you'll have a billion mildly-superintelligent-AI-years worth of compute to run them.)
Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2023-05-06T15:22:28.565Z · LW(p) · GW(p)

I agree that those are useful pursuits.

I still disagree but it no longer seems internally inconsistent

Mind gesturing at your disagreements? Not necessarily to argue them, just interested in the viewpoint.

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2023-05-06T15:47:06.368Z · LW(p) · GW(p)

Oh, I disagree with your core thesis that the general intelligence property is binary. (Which then translates into disagreements throughout the rest of the post.) But experience has taught me that this disagreement tends to be pretty intractable to talk through, and so I now try just to understand the position I don't agree with, so that I can notice if its predictions start coming true.

You mention universality, active adaptability and goal-directedness. I do think universality is binary, but I expect there are fairly continuous trends in some underlying latent variables (e.g. "complexity and generality of the learned heuristics"), and "becoming universal" occurs when these fairly continuous trends exceed some threshold. For similar reasons I think active adaptability and goal-directedness will likely increase continuously, rather than being binary.

You might think that since I agree universality is binary that alone is enough to drive agreement with other points, but:

  1. I don't expect a discontinuous jump at the point you hit the universality property (because of the continuous trends), and I think it's plausible that current LLMs already have the capabilities to be "universal". I'm sure this depends on how you operationalize universality, I haven't thought about it carefully.
  2. I don't think that the problems significantly change character after you pass the universality threshold, and so I think you are able to iterate prior to passing it.
Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2023-05-06T16:08:30.099Z · LW(p) · GW(p)

Interesting, thanks.

I don't expect a discontinuous jump at the point you hit the universality property

Agreed that this point (universality leads to discontinuity) probably needs to be hashed out more. Roughly, my view is that universality allows the system to become self-sustaining. Prior to universality, it can't autonomously adapt to novel environments (including abstract environments, e. g. new fields of science). Its heuristics have to be refined by some external ground-truth signals, like trial-and-error experimentation or model-based policy gradients. But once the system can construct and work with self-made abstract objects, it can autonomously build chains of them [LW(p) · GW(p)] — and that causes a shift in the architecture and internal dynamics, because now its primary method of cognition is iterating on self-derived abstraction chains, instead of using hard-coded heuristics/modules. 

Replies from: rohinmshah
comment by Rohin Shah (rohinmshah) · 2023-05-06T16:22:19.807Z · LW(p) · GW(p)

I agree that there's a threshold for "can meaningfully build and chain novel abstractions" and this can lead to a positive feedback loop that was not previously present, but there will already be lots of positive feedback loops (such as "AI research -> better AI -> better assistance for human researchers -> AI research") and it's not clear why to expect the new feedback loop to be much more powerful than the existing ones.

(Aside: we're now talking about a discontinuity in the gradient of capabilities rather than of capabilities themselves, but sufficiently large discontinuities in the gradient of capabilities have much of the same implications.)

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2023-05-06T16:56:38.012Z · LW(p) · GW(p)

it's not clear why to expect the new feedback loop to be much more powerful than the existing ones

Yeah, the argument here would rely on the assumption that e. g. the extant scientific data already uniquely constrain some novel laws of physics/engineering paradigms/psychological manipulation techniques/etc., and we would eventually be able to figure them out even if science froze right this moment. In this case, the new feedback loop would be faster because superintelligent cognition would be faster than real-life experiments.

And I think there's a decent amount of evidence for this. Consider that there are already narrow AIs that can solve protein folding more efficiently than our best manually-derived algorithms — which suggests that better algorithms are already uniquely constrained by the extant data, and we've just been unable to find them. Same may be true for all other domains of science — and thus, a superintelligence iterating on its own cognition would be able to outspeed human science.

comment by TAG · 2023-05-05T13:46:30.050Z · LW(p) · GW(p)

E. g., there’d be things like, we’d see octopi frequently executing some sequences of actions that lead to beneficial outcomes for them, and we would be fundamentally unable to understand what is happening

It's still quite possible that we are smarter than octopi, but not at some ceiling of intelligence.

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2023-05-05T14:10:04.506Z · LW(p) · GW(p)

In the hypothetical where there's no general intelligence, there's no such thing as "smarter", only "has a cognitive algorithm specialized for X". If so, it's weird that there are no animals with cognitive algorithms that we lack; it's weird that we can model any animal's cognition, that we basically have duplicates of all of their cognitive machinery.

On the other hand, if there is such a thing as general intelligence in the sense of "can model anything", the explanation of why we can model any animal is straightforward.

Replies from: PeterMcCluskey
comment by PeterMcCluskey · 2023-05-06T02:54:07.933Z · LW(p) · GW(p)

In the hypothetical where there’s no general intelligence, there’s no such thing as “smarter”,

It sure looks like many species of animals can be usefully compared as smarter than others. The same is true of different versions of LLMs. Why shouldn't I conclude that most of those have what you call general intelligence?

comment by rotatingpaguro · 2023-05-03T12:56:59.986Z · LW(p) · GW(p)

About your opinion on LLMs probably not scaling to general intelligence:

What if the language of thought hypothesis [LW(p) · GW(p)] is correct, human intelligence can be represented as rules that manipulate natural language, the context window of LLMs is going to become long enough to match a human's "context window", and LLM training is able to find the algorithm?

How does this view fit into your model? What probabilities do you assign to the various steps?

  • language of thought hypothesis is correct
  • language of thought close enough to natural language
  • context window becomes long enough
  • transformers (or successor alternatives) do have the algorithm in their search space
  • training finds the algorithm
Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2023-05-03T13:48:01.383Z · LW(p) · GW(p)
  • I do think human thought can be represented as language-manipulation rules, but that's not a very interesting claim. Natural language is Turing-complete, of course anything can be approximated as rules for manipulating it. The same is true for chains of NAND gates. .
  • I don't think it's close to natural language in the meaningful sense. E. g., you can in fact think using raw abstractions, without an inner monologue, and it's much faster (= less compute-intensive) in some ways. I expect that's how we actually think, and the more legible inner monologue is more like a trick we're using to be able to convey our thoughts to other humans on the fly. A communication tool, not a cognition tool. Trying to use it for actual cognition will be ruinously compute-intensive. .
  • "Is the context window long enough?" seems like the wrong way to think about it. If we're to draw a human analogue, the context window would mirror working memory, and in this case, I expect it's already more "roomy" than human working memory (in some sense). The issue is that LLMs can't update their long-term memory (and no, on-line training ain't the answer to it). If we're limited to using the context window, then its length would have to be equivalent to a human's life... In which case, sure, interesting things may happen in an LLM scaled so far, but this seems obviously computationally intractable. .
  • Inasmuch as NNs can approximate any continuous function (and chain-of-thought prompting can allow arbitrary-depth recursion) — sure, transformers have general intelligence in their search-space, .
  • ... but the current training schemes, or any obvious tweaks to them, won't be able to find it.  This one I'm actually uncertain about, .
Replies from: rotatingpaguro
comment by rotatingpaguro · 2023-05-03T15:25:48.682Z · LW(p) · GW(p)

I don't think it's close to natural language in the meaningful sense. E. g., you can in fact think using raw abstractions, without an inner monologue, and it's much faster (= less compute-intensive) in some ways. I expect that's how we actually think, and the more legible inner monologue is more like a trick we're using to be able to convey our thoughts to other humans on the fly. A communication tool, not a cognition tool. Trying to use it for actual cognition will be ruinously compute-intensive. .

I know very logorrheic people who claim to think mostly verbally. Personally, I do a small amount of verbal thought, but sometimes resort to explicit verbal thinking on purpose to tackle problems I'm confused about. I think it would be sufficient that there exist some people who mostly reason verbally for the thesis to be valid for the purpose of guessing whether LLMs are a viable path to intelligence. Do you think that even the most verbally-tuned people are actually doing the heavy lifting of their high-level thinking wordlessly?

"Is the context window long enough?" seems like the wrong way to think about it. If we're to draw a human analogue, the context window would mirror working memory, and in this case, I expect it's already more "roomy" than human working memory (in some sense). The issue is that LLMs can't update their long-term memory (and no, on-line training ain't the answer to it). If we're limited to using the context window, then its length would have to be equivalent to a human's life... In which case, sure, interesting things may happen in an LLM scaled so far, but this seems obviously computationally intractable. .

I expect that "plug-ins" that give a memory to the LLM, as people are already trying to develop, are viable. Do you expect otherwise? (Although they would not allow the LLM to learn new "instincts".)

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2023-05-03T15:37:28.942Z · LW(p) · GW(p)

Do you think that even the most verbally-tuned people are actually doing the heavy lifting of their high-level thinking wordlessly?

Yes. It's a distinction similar to whatever computations happen in LLM forward-passes vs. the way Auto-GPT exchanges messages with its subagents. Maybe it's also a memory aid, such that memorizing the semantic representation of a thought serves as a shortcut to the corresponding mental state [LW · GW]; but it's not the real nuts-and-bolts of cognition. The heavy lifting is done by whatever process figures out what word to put next in the monologue; not by the inner monologue itself.

I expect that "plug-ins" that give a memory to the LLM, as people are already trying to develop, are viable. Do you expect otherwise? (Although they would not allow the LLM to learn new "instincts".)

I think the instincts are the more crucial part, yes; perhaps I should've said "long-term adaptation" rather than "long-term memory".

I do suspect the current training processes fundamentally shape LLMs' architecture the wrong way, and not in a way that's easy to fix with fine-tuning, or conceptually-small architectural adjustments, or plug-ins. But that's my weakest claim, the one I'm only ~70% confident in. We'll see, I suppose.

Replies from: rotatingpaguro
comment by rotatingpaguro · 2023-05-03T16:03:58.783Z · LW(p) · GW(p)

The heavy lifting is done by whatever process figures out what word to put next in the monologue; not by the inner monologue itself.

It seems you use "monologue" in this sentence to refer to the sequence of words only, and then say that of course the monologue is not the cognition. With this I agree, but I don't think that's the correct interpretation of the combo "language of thought hypothesis" + "language of thought close to natural language". Having a "language of thought" means that there is a linear stream of items, and that your abstract cognition works only by applying some algorithm to the stream buffer to append the next item. The tape is not the cognition, but the cognition can be seen as acting (almost) only on the tape. Then "language of thought close to natural language" means that the language of thought has a short encoding in natural language. You can picture this as the language of thought of a verbal thinker being a more abstract version of natural language, similarly to when you feel what to say next but lack the word.

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2023-05-03T16:15:37.978Z · LW(p) · GW(p)

cognition can be seen as acting (almost) only on the tape

... If not for the existence of non-verbal cognition, which works perfectly well even without a "tape". Suggesting that the tape isn't a crucial component, that the heavy lifting can be done by the abstract algorithm alone, and therefore that even in supposed verbal thinkers, that algorithm is likely what's doing the actual heavy lifting.

In my view, there's an actual stream of abstract cognition, and a "translator" function mapping from that stream to human language. When we're doing verbal thinking, we're constantly running the translator on our actual cognition, which has various benefits (e. g., it's easier to translate our thoughts to other humans); but the items in the natural-language monologue are compressed versions of the items in the abstract monologue, and they're strictly downstream of the abstract stream.

Replies from: rotatingpaguro
comment by rotatingpaguro · 2023-05-03T18:05:25.936Z · LW(p) · GW(p)

So you think

  1. There's a "stream" of abstract thought, or "abstract monologue"
  2. The cognition algorithm operates on/produces the abstract stream
  3. Natural language is a compressed stream of the abstract stream

Which seems to me the same thing I said above, unless maybe you are also implying either or both of these additional statements:

a) The abstract cognition algorithm can not be seen as operating mostly autoregressively on its "abstract monologue";

b) The abstract monologue can not be translated to a longer, but boundedly longer, natural language stream (without claiming that this is what happens typically when someone verbalizes).

Which of (a), (b) do you endorse, possibly with amendments?

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2023-05-03T18:47:31.017Z · LW(p) · GW(p)

Which of (a), (b) do you endorse, possibly with amendments?

I don't necessarily endorse either. But "boundedly longer" is what does a lot of work there. As I'd mentioned, cognition can also be translated into a finitely long sequence of NAND gates. The real question isn't "is there a finitely-long translation?", but how much longer that translation is.

And I'm not aware of any strong evidence suggesting that natural language is close enough to human cognition that the resultant stream would not be much longer. Long enough to be ruinously compute-intensive (effectively as ruinous as translating it into NAND-gate sequences).

Indeed, I'd say there's plenty of evidence to the contrary, given how central miscommunication is to the human experience.
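
As a toy illustration of the "finite, but how much longer?" point, here is a sketch that translates integer addition into NAND evaluations and counts them. It uses the standard 9-NAND full adder; the helper names (`make_nand`, `full_adder`, `add`) are made up for this example, and it says nothing about natural language or cognition specifically. It only shows that a single high-level step can expand into many low-level ones.

```python
def make_nand():
    """Return a NAND gate plus a function reporting how often it was evaluated."""
    count = 0

    def nand(a, b):
        nonlocal count
        count += 1
        return 1 - (a & b)

    def used():
        return count

    return nand, used


def full_adder(nand, a, b, carry_in):
    """One-bit full adder built from exactly nine NAND evaluations."""
    n1 = nand(a, b)
    s1 = nand(nand(a, n1), nand(b, n1))        # a XOR b
    n4 = nand(s1, carry_in)
    total = nand(nand(s1, n4), nand(carry_in, n4))  # s1 XOR carry_in
    carry_out = nand(n1, n4)                   # (a AND b) OR (s1 AND carry_in)
    return total, carry_out


def add(nand, x, y, bits):
    """Ripple-carry addition of two non-negative `bits`-wide integers."""
    carry, result = 0, 0
    for i in range(bits):
        s, carry = full_adder(nand, (x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result | (carry << bits)


nand, used = make_nand()
result = add(nand, 13, 7, 4)
gates = used()  # nine NAND evaluations per bit of width
```

The translation exists and is finite, but the one-symbol operation `+` costs nine gate evaluations per bit here. The worry above is that the blow-up from abstract cognition to natural language could be similarly large.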

comment by TAG · 2023-05-03T12:27:00.831Z · LW(p) · GW(p)

Human intelligence is Turing-complete

That may be true, but it isn't an argument for general intelligence in itself.

There's a particular problem in that the more qualitatively flexible part of the mind...the conscious mind, or system 2...is very limited in its ability to follow a programme, only being able to follow tens of steps reliably. Whereas system 1 is much more powerful but much less flexible.

comment by Aorou (Adnll) · 2023-05-03T08:00:14.809Z · LW(p) · GW(p)

A general intelligence may also be suppressed by an instinct firing off, as sometimes happens with humans. But that’s a feature of the wider mind the GI is embedded in, not of general intelligence itself.


I actually think you should count that as evidence against your claim that humans are General Intelligences. 

comment by Aorou (Adnll) · 2023-05-03T07:50:35.875Z · LW(p) · GW(p)

Qualitatively speaking, human cognition is universally capable.


 How would we know if this wasn't the case? How can we test this claim?
My initial reaction here is to think "We don't know what we don't know".