Current AIs Provide Nearly No Data Relevant to AGI Alignment

post by Thane Ruthenis · 2023-12-15T20:16:09.723Z · LW · GW · 152 comments

Contents

  What the Fuss Is All About
  So What About Current AIs?
  On Safety Guarantees
  A Concrete Scenario
  Closing Summary

Recently, there's been a fair amount of pushback on the "canonical" views towards the difficulty of AGI Alignment (the views I call the "least forgiving" take [LW · GW]).

Said pushback is based on empirical studies of how the most powerful AIs at our disposal currently work, and is supported by a fairly convincing theoretical basis [? · GW] of its own. By comparison, the "canonical" takes are almost purely theoretical.

At a glance, not updating away from them in the face of ground-truth empirical evidence is a failure of rationality: entrenched beliefs fortified by rationalizations.

I believe this charge is invalid, and that the two views are much more compatible than they might seem. I think the issue lies in the mismatch between their subject matters.

It's clearer if you taboo the word "AI":

  • The "canonical" views are concerned with hypothetical powerful systems in the same reference class as humans: generally intelligent agents able to plot, deep within their minds, about the real-world outcomes they want to achieve.
  • The empirical pushback is concerned with the actual state-of-the-art systems we have today, such as LLMs.

It is not at all obvious that the two are one and the same. Indeed, I would say that to claim that the two classes of systems overlap is to make a very strong statement regarding how cognition and intelligence work. A statement we do not have much empirical evidence on, but which often gets unknowingly, implicitly snuck in when people extrapolate findings from LLM studies to superintelligences.

It's an easy mistake to make: both things are called "AI", after all. But you wouldn't study manually-written FPS bots circa 2000s, or MNIST-classifier CNNs circa 2010s, and claim that your findings regarding what algorithms these AIs implement generalize to statements regarding what algorithms the forward passes of LLMs circa 2020s implement.

By the same token, LLMs' algorithms do not necessarily generalize to how an AGI's cognition will function. Their limitations are not necessarily an AGI's limitations.[1]


What the Fuss Is All About

To start off, let's consider where all the concerns about the AGI Omnicide Risk came from in the first place.

Consider humans. Some facts:

  • Humans are able to powerfully steer the world towards outcomes they want.
  • Humans vary in how capable they are of doing so; psychometrics calls the underlying capability the g-factor.
  • Humans are often in vicious conflict with each other, even over very tiny differences in their goals.

So, we have an existence proof of systems able to powerfully steer the world towards their goals. Some of these systems can be strictly more powerful than others. And such systems are often in vicious conflict, aiming to exterminate each other based on even very tiny differences in their goals.

The foundational concern of the AGI Omnicide Risk is: Humans are not at the peak of capability as measured by this mysterious "g-factor". There could be systems more powerful than us. These systems would be able to out-plot us the same way smarter humans out-plot stupider ones, even given limited resources and facing active resistance from our side. And they would eagerly do so based on the tiniest of differences between their values and our values.

Systems like this, systems whose possible existence is extrapolated from humans' existence, are precisely what we're worried about. Things that can quietly plot deep within their minds about real-world outcomes they want to achieve, then perturb the world in ways precisely calculated to bring said outcomes about.

The only systems in this reference class known to us are humans, and some human collectives.

Viewing it from another angle, one can say that the systems we're concerned about are defined as cognitive systems in the same reference class as humans.


So What About Current AIs?

Inasmuch as current empirical evidence shows that things like LLMs are not an omnicide risk, it's doing so by demonstrating that they lie outside the reference class of human-like systems.

Indeed, that's often made fairly explicit. The idea that LLMs can exhibit deceptive alignment, or engage in introspective value reflection that leads to them arriving at surprisingly alien values, is often likened to imagining them as having a "homunculus" inside. A tiny human-like thing, quietly plotting in a consequentialist-y manner somewhere deep in the model, and trying to maneuver itself to power despite the efforts of humans trying to detect it and foil its plans.

The novel arguments often center on arguing that there's no evidence that LLMs have such homunculi, and that their training loops can never lead to one forming.

And I agree! I think those arguments are right.

But one man's modus ponens is another's modus tollens. I don't take it as evidence that the canonical views on alignment are incorrect – that actually, real-life AGIs don't exhibit such issues. I take it as evidence that LLMs are not AGI-complete.

Which isn't really all that wild a view to hold. Indeed, it would seem this should be the default view. Why should one take as a given the extraordinary claim that we've essentially figured out the grand unified theory of cognition? That the systems on the current paradigm really do scale to AGI? Especially in the face of countervailing intuitive impressions – feelings that these descriptions of how AIs work don't seem to agree with how human cognition feels from the inside?

And I do dispute that implicit claim.

I argue: If you model your AI as being unable to engage in this sort of careful, hidden plotting where it considers the impact of its different actions on the world, iteratively searching for actions that best satisfy its goals? If you imagine it as acting instinctively, as a shard ecology that responds to (abstract) stimuli with (abstract) knee-jerk-like responses? If you imagine that its outwards performance – the RLHF'd masks of ChatGPT or Bing Chat – is all that there is? If you think that the current training paradigm can never produce AIs that'd try to fool you, because the circuits that are figuring out what you want so that the AI may deceive you will be noticed by the SGD and immediately updated away in favour of circuits that implement an instinctive drive to instead just directly do what you want?

Then, I claim, you are not imagining an AGI. You are not imagining a system in the same reference class as humans. You are not imagining a system all the fuss has been about.

Studying gorilla neurology isn't going to shed much light on how to win moral-philosophy debates against humans, despite the fact that both entities are fairly cognitively impressive animals.

Similarly, studying LLMs isn't necessarily going to shed much light on how to align an AGI, despite the fact that both entities are fairly cognitively impressive AIs.

The onus to prove the opposite is on those claiming that the LLM-like paradigm is AGI-complete. Not on those concerned that, why, artificial general intelligences would exhibit the same dangers as natural general intelligences.


On Safety Guarantees

That may be viewed as good news, after a fashion. After all, LLMs are actually fairly capable. Does that mean we can keep safely scaling them without fearing an omnicide? Does that mean that the AGI Omnicide Risk is effectively null anyway? Like, sure, yeah, maybe there are scary systems to which its arguments apply, sure. But we're not on-track to build them, so who cares?

On the one hand, sure. I think LLMs are basically safe. As long as you keep the current training setup, you can scale them up 1000x and they're not gonna grow agency or end the world.

I would be concerned about mundane misuse risks, such as perfect-surveillance totalitarianism becoming dirt-cheap, unsavory people setting off pseudo-autonomous pseudo-agents to wreck economic or sociopolitical havoc, and such. But I don't believe they pose any world-ending accident risk, where a training run at an air-gapped data center leads to the birth of an entity that, all on its own, decides to plot its way from there to eating our lightcone, and then successfully does so.

Omnicide-wise, arbitrarily-big LLMs should be totally safe.

The issue is that this upper bound on risk is also an upper bound on capability. LLMs, and other similar AIs, are not going to do anything really interesting. They're not going to produce stellar scientific discoveries where they autonomously invent whole new fields or revolutionize technology.

They're a powerful technology in their own right, yes. But just that: just another technology. Not something that's going to immanentize the eschaton.

Insidiously, any research that aims to break said capability limit – give them true agency and the ability to revolutionize stuff – is going to break the risk limit in turn. Because, well, they're the same limit.

Current AIs are safe, in practice and in theory, because they're not as scarily generally capable as humans. On the flip side, current AIs aren't as capable as humans because they are safe. The same properties that guarantee their safety ensure their non-generality.

So if you figure out how to remove the capability upper bound, you'll end up with the sort of scary system the AGI Omnicide Risk arguments do apply to.

And this is precisely, explicitly, what the major AI labs are trying to do. They are aiming to build an AGI. They're not here just to have fun scaling LLMs. So inasmuch as I'm right that LLMs and such are not AGI-complete, they'll eventually move on from them, and find some approach that does lead to AGI.

And, I predict, for the systems this novel approach generates, the classical AGI Omnicide Risk arguments would apply full-force.


A Concrete Scenario

Here's a very specific worry of mine.

Take an AI Optimist who'd built up a solid model of how AIs trained by SGD work. Based on that, they'd concluded that the AGI Omnicide Risk arguments don't apply to such systems. That conclusion is, I argue, correct and valid.

The optimist caches this conclusion. Then, they keep cheerfully working on capability advances, safe in the knowledge they're not endangering the world, and are instead helping to usher in a new age of prosperity.

Eventually, they notice or realize some architectural limitation of the paradigm they're working under. They ponder it, and figure out some architectural tweak that removes the limitation. As they do so, they don't notice that this tweak invalidates one of the properties on which their previous reassuring safety guarantees rested; from which they were derived and on which they logically depend.

They fail to update the cached thought of "AI is safe".

And so they test the new architecture, and see that it works well, and scale it up. The training loop, however, spits out not the sort of safely-hamstrung system they'd been previously working on, but an actual AGI.

That AGI has a scheming homunculus deep inside. The people working with it don't believe in homunculi; they have convinced themselves those can't exist, so they're not worrying about it. They're not ready to deal with it; they don't even have any interpretability tools pointed in that direction.

The AGI then does all the standard scheme-y stuff, and maneuvers itself into a position of power, basically unopposed. (It, of course, knows not to give any sign of being scheme-y that the humans can notice.)

And then everyone dies.

The point is that the safety guarantees that the current optimists' arguments are based on are not simply fragile, they're being actively optimized against by ML researchers (including the optimists themselves). Sooner or later, they'll give out under the optimization pressures being applied – and it'll be easy to miss the moment the break happens. It'd be easy to cache the belief of, say, "LLMs are safe", then introduce some architectural tweak, keep thinking of your system as "just an LLM with some scaffolding and a tiny tweak", and overlook the fact that the "tiny tweak" invalidated "this system is an LLM, and LLMs are safe".


Closing Summary

I claim that the latest empirically-backed guarantees regarding the safety of our AIs, and the "canonical" least-forgiving take on alignment, are both correct. They're just concerned with different classes of systems: non-generally-intelligent non-agenty AIs generated on the current paradigm, and the theoretically possible AIs that are scarily generally capable the same way humans are capable (whatever this really means).

That view isn't unreasonable. Same way it's not unreasonable to claim that studying GOFAI algorithms wouldn't shed much light on LLM cognition, despite them both being advanced AIs.

Indeed, I go further, and say that should be the default view. The claim that the two classes of systems overlap is actually fairly extraordinary, and that claim isn't solidly backed, empirically or theoretically. If anything, it's the opposite: the arguments for current AIs' safety are based on arguing that they're incapable-by-design of engaging in human-style scheming.

That doesn't guarantee global safety, however. While current AIs are likely safe no matter how much you scale them, those safety guarantees are also what's hamstringing them. Which means that, in the pursuit of ever-greater capabilities, ML researchers are going to run into those limitations sooner or later. They'll figure out how to remove them... and in that very act, they will remove the safety guarantees. The systems they're working on would switch from belonging to the proven-safe class, to the dangerous scheme-y class.

The class to which the classical AGI Omnicide Risk arguments apply full-force.

The class for which no known alignment technique suffices [LW · GW].

And that switch would be very easy, yet very lethal, to miss.

  1. ^

    Slightly edited for clarity after an exchange with Ryan [LW(p) · GW(p)].

152 comments

Comments sorted by top scores.

comment by ryan_greenblatt · 2023-12-15T22:02:35.176Z · LW(p) · GW(p)

Here are two specific objections to this post[1]:

  • AIs which aren't qualitatively smarter than humans could be transformatively useful (e.g. automate away all human research).
  • It's plausible that LLM agents will be able to fully obsolete human research while also being incapable of doing non-trivial consequentialist reasoning in just a forward pass (instead they would do this reasoning in natural language).

AIs which aren't qualitatively smarter than humans could be transformatively useful

Perfectly aligned AI systems which were exactly as smart as humans and had the same capability profile, but which operated at 100x speed and were cheap would be extremely useful. In particular, we could vastly exceed all current alignment work in the span of a year.

In practice, the capability profile of AIs is unlikely to be exactly the same as humans. Further, even if the capability profile was the same, merely human level systems likely pose substantial misalignment concerns.

However, it does seem reasonably likely that AI systems will have a reasonably similar capability profile to humans and will also run faster and be cheaper [LW · GW]. Thus, approaches like AI control [LW · GW] could be very useful.

LLM agents with weak forward passes

IMO, there isn't anything which strongly rules out LLM agents being overall quite powerful while still having weak forward passes. In particular, weak enough that they can't do non-trivial consequentialist reasoning in a forward pass (while still being able to do this reasoning in natural language). Assuming that we can also rule out steganography and similar concerns, then The Translucent Thoughts Hypotheses [LW · GW] would fully apply. In the world where AIs basically can't do invisible non-trivial consequentialist reasoning, most misalignment threat models don't apply. (Scheming [LW · GW]/deceptive alignment and cleverly playing the training game [LW · GW] both don't apply.)

I think it's unlikely that AI will be capable of automating R&D while also being incapable of doing non-trivial consequentialist reasoning in a forward pass, but this still seems around 15% likely overall. And if there are quite powerful LLM agents with relatively weak forward passes, then we should update in this direction.


  1. I edited this comment from "I have two main objections to this post:" because that doesn't quite seem like a good description of what this comment is saying. See this other comment [LW(p) · GW(p)] for more meta-level commentary. ↩︎

Replies from: bogdan-ionut-cirstea, daniel-kokotajlo, Thane Ruthenis, None
comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2023-12-16T13:45:11.883Z · LW(p) · GW(p)

I think it's unlikely that AI will be capable of automating R&D while also being incapable of doing non-trivial consequentialist reasoning in a forward pass, but this still seems around 15% likely overall. And if there are quite powerful LLM agents with relatively weak forward passes, then we should update in this direction.

I expect the probability to be >> 15% for the following reasons.

There will likely still be incentives to make architectures more parallelizable (for training efficiency) and parallelizable architectures will probably be not-that-expressive in a single forward pass (see The Parallelism Tradeoff: Limitations of Log-Precision Transformers). CoT is known to increase the expressivity of Transformers, and the longer the CoT, the greater the gains (see The Expressive Power of Transformers with Chain of Thought). In principle, even a linear auto-regressive next-token predictor is Turing-complete, if you have fine-grained enough CoT data to train it on, and you can probably trade off between the length/complexity of the CoT supervision and single-pass computational complexity (see Auto-Regressive Next-Token Predictors are Universal Learners). We also see empirically that CoT and e.g. tools (often similarly interpretable) provide extra-training-compute-equivalent gains (see AI capabilities can be significantly improved without expensive retraining). And recent empirical results (e.g. Orca, Phi, Large Language Models as Tool Makers) suggest you can also use larger LMs to generate synthetic CoT-data / tools to train smaller LMs on.

This all suggests to me it should be quite likely possible (especially with a large, dedicated effort) to get to something like a ~human-level automated alignment researcher with a relatively weak forward pass.

For an additional intuition why I expect this to be possible, I can conceive of humans who would both make great alignment researchers while doing ~all of their (conscious) thinking in speech-like inner monologue and would also be terrible schemers if they tried to scheme without using any scheming-relevant inner monologue; e.g. scheming/deception probably requires more deliberate effort for some people on the ASD spectrum. 

Replies from: daniel-kokotajlo, bogdan-ionut-cirstea, bogdan-ionut-cirstea, bogdan-ionut-cirstea
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2024-02-09T18:46:38.506Z · LW(p) · GW(p)

I agree & think this is pretty important. Faithful/visible CoT is probably my favorite alignment strategy.

comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-03-19T12:37:39.426Z · LW(p) · GW(p)

One additional probably important distinction / nuance: there are also theoretical results for why CoT shouldn't just help with one-forward-pass expressivity, but also with learning. E.g. the result in Auto-Regressive Next-Token Predictors are Universal Learners is about learning; similarly for Sub-Task Decomposition Enables Learning in Sequence to Sequence Tasks, Why Can Large Language Models Generate Correct Chain-of-Thoughts?, Why think step by step? Reasoning emerges from the locality of experience.

The learning aspect could be strategically crucial with respect to what the first transformatively-useful AIs should look like [LW(p) · GW(p)]; also see e.g. discussion here [LW(p) · GW(p)] and here [LW(p) · GW(p)]. In the sense that this should add further reasons to think the first such AIs should probably (differentially) benefit from learning from data using intermediate outputs like CoT; or at least have a pretraining-like phase involving such intermediate outputs, even if this might be later distilled or modified some other way - e.g. replaced with [less transparent] recurrence.

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2023-12-15T23:25:02.023Z · LW(p) · GW(p)

I think those objections are important to mention and discuss, but they don't undermine the conclusion significantly.

AIs which are qualitatively just as smart as humans could still be dangerous in the classic ways. The OP's argument still applies to them, insofar as they are agentic and capable of plotting on the inside etc.

As for LLM agents with weak forward passes: Yes, if we could achieve robust faithful CoT properties, we'd be in pretty damn good shape from an AI control perspective. I have been working on this myself & encourage others to do so also. I don't think it undermines the OP's points though? We are not currently on a path to have robust faithful CoT properties by default.

Replies from: ryan_greenblatt, valley9
comment by ryan_greenblatt · 2023-12-15T23:53:58.309Z · LW(p) · GW(p)

This post seemed overconfident in a number of places, so I was quickly pushing back in those places.

I also think the conclusion of "Nearly No Data" is pretty overstated. I think it should be possible to obtain significant data relevant to AGI alignment with current AIs (though various interpretations of current evidence can still be wrong and the best way to obtain data might look more like running careful model organism experiments than observing properties of chatgpt). But, it didn't seem like I would be able to quickly argue against this overall conclusion in a cohesive way, so I decided to just push back on small separable claims which are part of the reason why I think current systems provide some data.

If this post argued "the fact that current chat bots trained normally don't seem to exhibit catastrophic misalignment isn't much evidence about catastrophic misalignment in more powerful systems", then I wouldn't think this was overstated (though this also wouldn't be very original). But, it makes stronger claims which seem false to me.

Replies from: Thane Ruthenis, daniel-kokotajlo
comment by Thane Ruthenis · 2023-12-16T05:21:07.182Z · LW(p) · GW(p)

Mm, I concede that this might not have been the most accurate title. I might've let the desire for hot-take clickbait titles get the better of me some. But I still mostly stand by it.

My core point is something like "the algorithms that the current SOTA AIs execute during their forward passes do not necessarily capture all the core dynamics that would happen within an AGI's cognition, so extrapolating the limitations of their cognition to AGI is a bold claim we have little evidence for".

I agree that the current training setups provide some data on how e. g. optimization pressures / reinforcement schedules / SGD biases work, and I even think the shard theory totally applies to general intelligences like AGIs and humans. I just think that theory is AGI-incomplete.

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2023-12-15T23:55:03.518Z · LW(p) · GW(p)

OK, that seems reasonable to me.

comment by Ebenezer Dukakis (valley9) · 2023-12-16T10:46:42.326Z · LW(p) · GW(p)

We are not currently on a path to have robust faithful CoT properties by default.

Is there a citation for this?

Replies from: daniel-kokotajlo
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2023-12-18T05:36:16.893Z · LW(p) · GW(p)

What kind of citation are you looking for? Are you basically just asking me to provide evidence, or are you asking me to make an object-level argument (as opposed to e.g. an appeal to authority)? Or something else entirely, e.g. a publication?

Replies from: valley9
comment by Ebenezer Dukakis (valley9) · 2023-12-18T05:54:55.724Z · LW(p) · GW(p)

You stated it as established fact rather than opinion, which caused me to believe that the argument had already been made somewhere, and someone could just send me a link to it.

If the argument hasn't been made somewhere, perhaps you could write a short post making that argument. Could be a good way to either catalyze research in the area (you stated that you wish to encourage such research), or else convince people that the challenge is insurmountable and a different approach is needed.

Replies from: daniel-kokotajlo
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2023-12-18T06:32:50.941Z · LW(p) · GW(p)

You stated it as established fact rather than opinion, which caused me to believe that the argument had already been made somewhere, and someone could just send me a link to it.

You may have interpreted it that way, but I certainly don't follow a policy of prefacing everything with "in my opinion" unless I have a citation ready. I bet you don't either. Claims are by default just claims, not claims-presented-as-established-facts. If I wanted to present it as an established fact I would have done something to indicate that, e.g. cited something or said "it is well-known that..."

Anyhow, I'm happy to defend the claim here. It would help if I knew where to begin. I'll just briefly link a couple things here to get started and then we can zoom in on whatever bits you are most interested in.

First: There are probably going to be incentives for AIs to conceal their thoughts sometimes. Sometimes this will allow them to perform better in training, for example. Link to example post making this point, though many others have made it also.

Second: Some AI designs involve a natural language bottleneck; the only way for the system to communicate with its future self is via outputting tokens that the future self then reads in input. This gives us hope that it'll conveniently-for-us put all the important thoughts in natural language, so that e.g. it won't leave out anything important such as "the reason I'm telling you this is not because I'm trying to help you but rather because I think this'll cause me to score more highly" etc. Here's a post articulating this hope. [LW · GW]

Third: Alas, the incentives mentioned above will probably crush this hope. Maybe if we didn't do RLHF for example, and just used pre-trained models with clever prompts, we'd be fine. But by default we'll apply RLHF to the whole chain of thought, and not leave any secret scratchpad, for example. The post above talks about this briefly.

Fourth: Even if that wasn't true, the natural language bottleneck is itself a temporary stage in the evolution of AI capabilities. It is unlikely to be an optimal mind design; already many people are working on architectures that don't have a natural language bottleneck. My guess is that they'll come into play right around the time we hit AGI.

My positive hope is that we can basically draw attention to this issue and figure out what sorts of training schemes create & preserve faithful CoT properties, and which don't (seems pretty doable to me, it just takes some engineering and scientific legwork), and then coordinate governments and labs to resist the competitive pressure to train using schemes that don't have those properties (also doable, if we had a well-established science of those properties, which we don't yet).

 

Replies from: Filip Sondej, valley9
comment by Filip Sondej · 2024-01-17T15:31:45.066Z · LW(p) · GW(p)

the natural language bottleneck is itself a temporary stage in the evolution of AI capabilities. It is unlikely to be an optimal mind design; already many people are working on architectures that don't have a natural language bottleneck

This one looks fatal. (I think the rest of the reasons could be dealt with somehow.)

What existing alternative architectures do you have in mind? I guess mamba would be one?

Do you think it's realistic to regulate this? E.g. requiring that, above a certain size, models can't have recurrence that uses a hidden state, but recurrence that uses natural language (or images) is fine. (Or maybe some softer version of this, if the alignment tax proves too high.)

Replies from: daniel-kokotajlo
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2024-01-17T18:08:36.174Z · LW(p) · GW(p)

I think it would be realistic to regulate this if the science of faithful CoT was better developed. If there were lots of impressive papers to cite about CoT faithfulness for example, and lots of whitepapers arguing for the importance of faithfulness to alignment and safety.

As it is, it seems unlikely to be politically viable... but maybe it's still worth a shot?

Replies from: Filip Sondej
comment by Filip Sondej · 2024-01-17T19:51:38.911Z · LW(p) · GW(p)

Yeah, true. But it's also easier to do early, when no one is that invested in the hidden-recurrence architectures, and so there's less resistance and it doesn't break anyone's plans.

Maybe a strong experiment would be to compare mamba-3b and some SOTA 3b transformer, trained similarly, on several tasks where we can evaluate CoT faithfulness. (Although maybe at 3b capability level we won't see clear differences yet.) The hard part would be finding the right tasks.

Replies from: daniel-kokotajlo
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2024-01-17T21:55:39.084Z · LW(p) · GW(p)

Agreed. I was working on this for six months and I've been trying to get more people to work on it.

We don't have a way of measuring CoT faithfulness in general, as far as I know -- but you emphasize "tasks where we can evaluate...", which seems intriguing to me: you are saying it may be feasible today for at least some tasks. What tasks do you have in mind?

Replies from: Filip Sondej
comment by Filip Sondej · 2024-01-18T12:25:04.912Z · LW(p) · GW(p)

Unfortunately I didn't have any particular tasks in mind when I wrote it; I was vaguely thinking of existing settings for evaluating CoT faithfulness.

Now that I've thought about it, for this particular transformers-vs-mamba experiment, I'd go with something even simpler. I want a task that is very easy sequentially, but hard to answer immediately. So, for example, a task like:

x = 5
x += 2
x *= 3
x **= 2
x -= 3
...

and then have a CoT:

after x = 5
5
after x += 2
7
...

And then we intervene on the CoT to introduce some error in one operation, but still ask the model to give the correct answer at the end. (Despite all steps after the error being irrelevant.) We can go even further and train the models to give the correct answer after inadequate CoT. And have a curriculum where at first it only needs to do one hidden operation, later two, and so on.

(It's an unrealistic setting, but the point is rather to check if the model is able at all to learn hidden sequential reasoning.)

Now, my hypothesis is that transformers will have some limited sequence length for which they can do it (probably smaller than their number of layers), but mamba won't have a limit.
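
For concreteness, here's a minimal sketch of a generator for this kind of data (pure Python; the specific operations, corruption scheme, and curriculum sizes are illustrative assumptions on my part, not anything specified above):

import random

OPS = [("+=", lambda x, v: x + v),
       ("*=", lambda x, v: x * v),
       ("-=", lambda x, v: x - v)]

def make_example(n_steps, n_corrupted=1):
    # Build the program, its faithful step-by-step CoT, and a corrupted CoT in
    # which `n_corrupted` intermediate values are wrong. The target answer
    # stays correct, so the model has to redo those steps "in its head".
    x = random.randint(1, 9)
    program = [f"x = {x}"]
    cot = [f"after x = {x}", str(x)]
    for _ in range(n_steps):
        op, fn = random.choice(OPS)
        v = random.randint(2, 9)
        x = fn(x, v)
        program.append(f"x {op} {v}")
        cot.extend([f"after x {op} {v}", str(x)])
    corrupted = list(cot)
    value_positions = list(range(1, len(cot) - 1, 2))   # every value except the final one
    for pos in random.sample(value_positions, min(n_corrupted, len(value_positions))):
        corrupted[pos] = str(int(corrupted[pos]) + random.randint(1, 5))
    return {"program": "\n".join(program), "cot": "\n".join(corrupted), "answer": x}

# Curriculum: one hidden (corrupted) step first, then two, and so on.
dataset = [make_example(n_steps=6, n_corrupted=k) for k in (1, 2, 3) for _ in range(1000)]

Both models would then be trained to predict the final answer given the program and the (possibly corrupted) CoT, and compared on how many hidden steps they can tolerate.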


I was working on this for six months

Can you say what you tried in these six months, and how it went?

comment by Ebenezer Dukakis (valley9) · 2023-12-18T23:56:42.814Z · LW(p) · GW(p)

Sorry, I think I must have misunderstood your comment. When you wrote:

As for LLM agents with weak forward passes: Yes, if we could achieve robust faithful CoT properties, we'd be in pretty damn good shape from an AI control perspective.

I interpreted this to mean that in addition to weak forward passes, there was another thing called "robust faithful CoT properties" that would also need to be achieved.

I now think you meant to indicate that "weak forward passes" was a particular method for achieving "robust faithful CoT properties".

Replies from: daniel-kokotajlo
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2023-12-19T11:23:56.627Z · LW(p) · GW(p)

Ah, I think I see how I was being unclear maybe now, thanks for that question. It may help for me to go through the original quote and say what I do and don't agree with:

IMO, there isn't anything which strongly rules out LLM agents being overall quite powerful while still having weak forward passes. In particular, weak enough that they can't do non-trivial consequentialist reasoning in a forward pass (while still being able to do this reasoning in natural language). Assuming that we can also rule out steganography and similar concerns, then The Translucent Thoughts Hypotheses [LW · GW] would fully apply. In the world where AIs basically can't do invisible non-trivial consequentialist reasoning, most misalignment threat models don't apply. (Scheming [LW · GW]/deceptive alignment and cleverly playing the training game [LW · GW] both don't apply.)

I agree with everything in this quote; however, I think that "assuming that we can also rule out steganography and similar concerns" is a big and unwarranted assumption, for reasons mentioned above. Our training schemes will probably incentivize steganography and subtle strategic bias.

Replies from: valley9
comment by Ebenezer Dukakis (valley9) · 2023-12-21T10:59:20.583Z · LW(p) · GW(p)

If I understand correctly, sophisticated steganography will not be incentivized by a language model that's just trained on next-token prediction (no RLHF). Is that right? (Since the model didn't generate any of the text in its context window, there's no method it could use to pass info to itself. The best it could do is pick up on weak-to-nonexistent human-created steganography in its dataset.)

That suggests a simple method to prevent steganography: Instead of having human raters rate RLHF-generated responses, have humans write responses themselves, then finetune on the human-written responses.

That might sound labor-intensive, but I don't think it has to be. Instead of scoring your AI's response, you check if the AI-written response is adequate. If it's adequate, you do nothing. If it's inadequate, you send the question to an expert who writes an expert-level response, then add that response to the finetuning dataset.

Sure, experts are expensive, but so are massive training runs. We know data quality is important at the language corpus stage. Why wouldn't it be equally important at the finetuning stage? (Also, I'm not sure experts are that expensive, given the glut of PhDs in many fields. And expertise in one field could transfer to others, if you split the expert task into 2 steps: identification of high-quality sources, and use of those sources to answer the user's query.)
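
A minimal sketch of that triage loop (all three helpers -- model_answer, is_adequate, ask_expert -- are hypothetical placeholders for whatever model API, rater process, and expert pool is actually available):

def collect_finetuning_data(questions, model_answer, is_adequate, ask_expert):
    # Only pay for expert labour where the model's own answer falls short:
    # adequate answers are dropped, inadequate ones get an expert-written
    # replacement that goes into the finetuning set.
    finetuning_set = []
    for q in questions:
        answer = model_answer(q)
        if is_adequate(q, answer):
            continue
        finetuning_set.append({"prompt": q, "completion": ask_expert(q)})
    return finetuning_set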

In any case, I imagine it should be possible to detect steganography to an extent, by adding noise to the contents of the context window and checking if that causes unexpectedly high performance degradation. You could have a second AI that creates a paraphrased view of everything in the context window except the most recent 20 words or so -- I expect that's a massive blow to steganography in practice. Note that the cost of such added-noise schemes only needs to be paid at training time, so they can be expensive -- e.g. ensembling across the next-token generated from many different distortions of the current text.
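
Sketching that probe under the same assumptions (paraphrase and score are hypothetical stand-ins for a paraphraser model and a task-performance metric; the 20-word window is the figure from above):

def steganography_probe(model, tasks, paraphrase, score, keep_last_words=20):
    # Paraphrase everything in the context except the most recent words, then
    # compare task performance on the original vs. paraphrased context. An
    # unexpectedly large drop suggests the model was relying on information the
    # paraphrase destroyed, i.e. possible steganographic content.
    drops = []
    for context, query in tasks:
        words = context.split()
        head, tail = words[:-keep_last_words], words[-keep_last_words:]
        noised = paraphrase(" ".join(head)) + " " + " ".join(tail)
        drops.append(score(model, context, query) - score(model, noised, query))
    return sum(drops) / len(drops)   # mean performance degradation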

Replies from: daniel-kokotajlo
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2023-12-21T15:08:22.280Z · LW(p) · GW(p)

Yep, I think the thing you propose (only training next-token-prediction on human-written text) would probably work great to prevent sophisticated steganography. Let's try to calculate how much it would cost though.

...I mean if you want to do the equivalent of a modern large training run you'll need trillions of tokens of expert-generated text. So that's a million experts generating a million tokens each? So, basically a million experts working full-time for years? So something like a hundred billion dollars minimum just to pay them all, plus probably more for the bureaucratic infrastructure needed to ensure they aren't slacking off or cheating or trying to poison your dataset?
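
Making that arithmetic explicit (the salary and writing-rate figures are rough assumptions of mine, not numbers from the comment):

tokens_needed     = 1e12   # ~a trillion tokens, the low end for a modern run
tokens_per_expert = 1e6    # roughly 750k words, i.e. years of full-time writing
cost_per_expert   = 2e5    # assume ~$100k/year fully loaded, for ~2 years of work

experts = tokens_needed / tokens_per_expert      # 1,000,000 experts
payroll = experts * cost_per_expert              # ~$200B in salaries alone
print(f"{experts:,.0f} experts, ~${payroll / 1e9:.0f}B before any overhead")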

I guess it's doable. At least for something like one trillion tokens; you probably just can't do fifty trillion tokens even if you were the CCP.

...but anyhow good luck convincing labs to do this. :/ In general I think there are lots of promising strategies for solving technical alignment and control problems. And people might even be willing to pay the costs to implement these strategies, if it was uncontroversially clear that the strategy would work and the default (which is much cheaper and faster) would not work. But instead it'll be muddled and controversial and the costly workable strategy will be untested, and no one will want to pay the costs to go test it, at least not for now.

As for using paraphrasing to detect steganography: Yes, I'm hopeful about that idea and have been trying to get people to run experiments on it and see the extent to which it works in practice.

 

Replies from: valley9
comment by Ebenezer Dukakis (valley9) · 2023-12-21T16:09:08.049Z · LW(p) · GW(p)

...I mean if you want to do the equivalent of a modern large training run you'll need trillions of tokens of expert-generated text. So that's a million experts generating a million tokens each? So, basically a million experts working full-time for years? So something like a hundred billion dollars minimum just to pay them all, plus probably more for the bureaucratic infrastructure needed to ensure they aren't slacking off or cheating or trying to poison your dataset?

Where are these numbers coming from? They seem way too high. My suggestion is to do a modern large training run in the standard way (next-token prediction), and then fine-tune on experts playing the role of a helpful/honest/harmless chatbot doing CoT. Basically replace RLHF with finetuning on expert chatbot roleplay. Maybe I'm betraying my ignorance here and this idea doesn't make sense for some reason?

I was editing my comment a fair amount, perhaps you read an old version of it?

And, in terms of demonstrating feasibility, you don't need to pay any experts to demonstrate the feasibility of this idea. Just take a bunch of ChatGPT responses that are known to be high quality, make a dataset out of them, and use them in the training pipeline I propose, as though they were written by human experts. Then evaluate the quality of the resulting model. If it's nearly as good as the original ChatGPT, I think you should be good to go.
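
A compressed sketch of that feasibility check (load_high_quality_responses, finetune, and win_rate are hypothetical placeholders for an existing data source, finetuning pipeline, and quality evaluation):

def distillation_feasibility_check(load_high_quality_responses, finetune, win_rate):
    # Stand in for expert-written data with known-good ChatGPT responses,
    # finetune a pretrain-only base model on them (no RLHF step anywhere),
    # and check whether the result is judged roughly as good as the originals.
    pairs = load_high_quality_responses()        # [(prompt, reference_response), ...]
    student = finetune(base_model="pretrained-llm", data=pairs)
    return win_rate(student, pairs)              # ~50% or higher => pipeline looks viable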

Replies from: daniel-kokotajlo
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2024-01-17T18:06:55.585Z · LW(p) · GW(p)

I said "if you want to do the equivalent of a modern large training run." If your intervention is just a smaller fine-tuning run on top of a standard LLM, then that'll be proportionately cheaper. And that might be good enough. But maybe we won't be able to get to AGI that way.

Worth a shot though.

comment by Thane Ruthenis · 2023-12-16T05:13:43.971Z · LW(p) · GW(p)

On my inside model [LW · GW] of how cognition works, I don't think "able to automate all research but can't do consequentialist reasoning" is a coherent property that a system could have. That is a strong claim, yes, but I am making it.

I agree that it is conceivable that LLMs embedded in CoT-style setups would be able to be transformative in some manner without "taking off". Indeed, I touch on that in the post some: that scaffolded and slightly tweaked LLMs may not be "mere LLMs" as far as capability and safety upper bounds go.

That said, inasmuch as CoT-style setups would be able to turn LLMs into agents/general intelligences, I mostly expect that to be prohibitively computationally intensive, such that we'll get to AGI by architectural advances before we have enough compute to make a CoT'd LLM take off.

But that's a hunch based on the obvious stuff like AutoGPT consistently failing, plus my private musings regarding how an AGI based on scaffolded LLMs would work (which I won't share, for obvious reasons). I won't be totally flabbergasted if some particularly clever way of doing that worked.

Replies from: ryan_greenblatt, faul_sname, Seth Herd
comment by ryan_greenblatt · 2023-12-16T05:24:53.834Z · LW(p) · GW(p)

On my inside model of how cognition works, I don't think "able to automate all research but can't do consequentialist reasoning" is a coherent property that a system could have.

I actually basically agree with this quote.

Note that I said "incapable of doing non-trivial consequentialist reasoning in a forward pass". The overall llm agent in the hypothetical is absolutely capable of powerful consequentialist reasoning, but it can only do this by reasoning in natural language. I'll try to clarify this in my comment.

comment by faul_sname · 2023-12-16T18:50:18.098Z · LW(p) · GW(p)

How about "able to automate most simple tasks where it has an example of that task being done correctly"? Something like that could make researchers much more productive. Repeat the "the most time consuming part of your workflow now requires effectively none of your time or attention" a few dozen times and that does end up being transformative compared to the state before the series of improvements.

I think "would this technology, in isolation, be transformative" is a trap. It's easy to imagine "if there was an AI that was better at everything than we do, that would be tranformative", and then look at the trend line, and notice "hey, if this trend line holds we'll have AI that is better than us at everything", and finally "I see lots of proposals for safe AI systems, but none of them safely give us that transformative technology". But I think what happens between now and when AIs that are better than humans-in-2023 at everything matters.

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2023-12-16T18:59:52.449Z · LW(p) · GW(p)

I'm not particularly concerned about AI being "transformative" or not. I'm concerned about AGI going rogue and killing everyone. And LLMs automating workflows is great and not (by itself) omnicidal at all, so that's... fine?

But I think what happens between now and when we have AIs that are better than humans-in-2023 at everything matters.

As in, AIs boosting human productivity might/should let us figure out how to make stuff safe as it comes up, so no need to be concerned about us not having a solution to the endpoint of that process before we've made the first steps?

The problem is that boosts to human productivity also boost the speed at which we're getting to that endpoint, and there's no reason to think they differentially improve our ability to make things safe [LW · GW]. So all that would do is accelerate us harder as we're flying towards the wall at a lethal speed.

Replies from: faul_sname
comment by faul_sname · 2023-12-16T19:10:50.995Z · LW(p) · GW(p)

As in, AIs boosting human productivity might/should let us figure out how to make stuff safe as it comes up, so no need to be concerned about us not having a solution to the endpoint of that process before we've made the first steps?

I don't expect it to be helpful to block individually safe steps on this path, though it would probably be wise to figure out what unsafe steps down this path look like concretely (which you're doing!).

But yeah. I don't have any particular reason to expect "solve for the end state without dealing with any of the intermediate states" to work. It feels to me like someone starting a chat application and delaying the "obtain customers" step until they support every language, have a chat architecture that could scale up to serve everyone, and have found a moderation scheme that works without human input.

I don't expect that team to ever ship. If they do ship, I expect their product will not work, because I think many of the problems they encounter in practice will not be the ones they expected to encounter.

comment by Seth Herd · 2023-12-22T21:20:58.430Z · LW(p) · GW(p)

Interesting. My own musings regarding how an AGI based on scaffolded LLMs would work suggest it would not be prohibitively computationally expensive. Expensive, yes, but affordable in large projects.

It seems to me like para-human-level AGI is quite achievable with language model agents, but advancing beyond the human intelligence that created the LLM training set might be much slower. That could be a really good scenario.

The excellent On the future of language models [LW · GW] raises that possibility.

You've probably seen my Capabilities and alignment of LLM cognitive architectures [LW · GW]. I published that because all of the ideas there seemed pretty obvious. To me those obvious improvements (a bit of work on episodic memory and executive function) lead to AGI with just maybe 10x more LLM calls than vanilla prompting (varying with problem/plan complexity of course). I've got a little more thinking beyond that which I'm not sure I should publish.

comment by [deleted] · 2023-12-16T00:35:29.358Z · LW(p) · GW(p)

IMO, there isn't anything which strongly rules out LLM agents being overall quite powerful while still having weak forward passes. In particular, weak enough that they can't do non-trivial consequentialist reasoning in a forward pass.

Why not control the inputs more tightly/choose the response tokens at temperature=0?

Example:

Prompt A: Alice wants in the door

Prompt B: Bob wants in the door

Available actions: 1.  open, 2.  keep_locked, 3.  close_on_human


I believe you are saying with a weak forward pass the model architecture would be unable to reason "I hate Bob and closing the door on Bob will hurt Bob", so it cannot choose (3).

But why not simply simplify the input? The model doesn't need to know the name.

Prompt A: <entity ID_VALID wants in the door>

Prompt B: <entity ID_NACK wants in the door>


Restricting the overall context lets you use much more powerful models you don't have to trust, and architectures you don't understand.
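
A tiny sketch of that kind of input sanitisation (the ID scheme mirrors the example above; the access policy passed in is a hypothetical input):

def sanitize(prompt, policy):
    # Replace concrete names with opaque IDs before the model sees the prompt,
    # so it can only reason about access status, never about who is asking.
    for name in policy["valid"]:
        prompt = prompt.replace(name, "entity ID_VALID")
    for name in policy["denied"]:
        prompt = prompt.replace(name, "entity ID_NACK")
    return prompt

print(sanitize("Alice wants in the door", {"valid": ["Alice"], "denied": ["Bob"]}))
# -> "entity ID_VALID wants in the door"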

comment by jacob_cannell · 2023-12-16T01:40:07.051Z · LW(p) · GW(p)

Said pushback is based on empirical studies of how the most powerful AIs at our disposal currently work, and is supported by fairly convincing theoretical basis of its own. By comparison, the "canonical" takes are almost purely theoretical.

You aren't really engaging with the evidence against the purely theoretical canonical/classical AI risk take. The 'canonical' AI risk argument is implicitly based on a set of interdependent assumptions/predictions about the nature of future AI:

  1. fast takeoff is more likely than slow, downstream dependent on some combo of:
  • continuation of Moore's Law
  • feasibility of hard 'diamondoid' nanotech
  • brain efficiency [LW · GW] vs AI
  • AI hardware (in)-dependence
  2. the inherent 'alien-ness' of AI and AI values

  3. supposed magical coordination advantages of AIs

  4. arguments from analogies: namely evolution

These arguments are old enough that we can now update based on how the implicit predictions of the implied worldviews turned out. The traditional EY/MIRI/LW view has not aged well, which in part can be traced to its dependence on an old flawed theory [LW(p) · GW(p)] of how the brain works.

For those who read HPMOR/LW in their teens/20's, a big chunk of your worldview is downstream of EY's and the specific positions he landed on with respect to key scientific questions around the brain and AI. His understanding of the brain came almost entirely from the ev psych and cognitive biases literature, and this model in particular - evolved modularity - hasn't aged well and is just basically wrong. So this is entangled with everything related to AI risk (which is entirely about the trajectory of AI takeoff relative to human capability).

It's not a coincidence that many in DL/neurosci have a very different view (shards etc). In particular the Moravec view that AI will come from reverse engineering the brain, and that progress is entirely hardware constrained and thus very smooth and predictable - that view turned out to be mostly correct. (His late 90's prediction of AGI around 2028 is especially prescient.)

So it's pretty clear EY/LW was wrong on 1. - the trajectory of takeoff and path to AGI, and Moravec et al was correct.

Now as the underlying reasons are entangled, Moravec et al was also correct on point 2 - AI from brain reverse engineering is not alien! (But really that argument was just weak regardless.) EY did not seriously consider that the path to AGI would involve training massive neural networks to literally replicate human thoughts.

Point 3 isn't really taken seriously outside of the small LW sphere. By the very nature of alignment being a narrow target, any two random unaligned AIs are especially unlikely to be aligned with each other. The idea of a magical coordination advantage is based on highly implausible code-sharing premises (sharing your source code is generally a very bad idea, and regardless doesn't and can't actually prove that the code you shared is the code actually running in the world - the grounding problem is formidable and unsolved).

The problem with 4 - the analogy from evolution - is that it factually contradicts the doom worldview: evolution succeeded in aligning brains to IGF well enough despite a huge takeoff in the speed of cultural evolution over genetic evolution, as evidenced by the fact that humans have one of the highest fitness scores of any species ever, and almost certainly the fastest-growing fitness score.

Replies from: Thane Ruthenis, Seth Herd, None
comment by Thane Ruthenis · 2023-12-16T05:50:06.661Z · LW(p) · GW(p)

You aren't really engaging with the evidence against the purely theoretical canonical/classical AI risk take

Yes, but it's because the things you've outlined seem mostly irrelevant to AGI Omnicide Risk to me? It's not how I delineate [LW · GW] the relevant parts of the classical view, and it's not what's been centrally targeted by the novel theories. The novel theories' main claims are that powerful cognitive systems aren't necessarily (isomorphic to) utility-maximizers, that shards (i. e., context-activated heuristics) reign supreme and value reflection can't arbitrarily slip their leash, that "general intelligence" isn't a compact algorithm, and so on. None of that relies on nanobots/Moore's law/etc.

What you've outlined might or might not be the relevant historical reasons for how Eliezer/the LW community arrived at some of their takes. But the takes themselves, or at least the subset of them that I care about, are independent of these historical reasons.

fast takeoff is more likely than slow

Fast takeoff isn't load-bearing [LW · GW] on my model. I think it's plausible for several reasons, but I think a non-self-improving human-genius-level AGI would probably be enough to kill off humanity.

the inherent 'alien-ness' of AI and AI values

I do address that? The values of two distant human cultures are already alien enough for them to see each other as inhuman and wish death on each other. It's only after centuries of memetic mutation that we've managed to figure out how to not do that (as much).

supposed magical coordination advantages of AIs

I don't think one needs to bring LDT/code-sharing stuff there in order to show intuitively how that'd totally work. "Powerful entities oppose each other yet nevertheless manage to coordinate to exploit the downtrodden masses" is a thing that happens all the time in the real world. Political/corporate conspiracies, etc.

"Highly implausible code-sharing premises" is part of why it'd be possible in the AGI case, but it's just an instance of the overarching reason. Which is mainly about the more powerful systems being able to communicate with each other at higher bandwidth than with the weaker systems, allowing them to iterate on negotiations quicker and advocate for their interests during said negotiations better. Which effectively selects the negotiated outcomes for satisfying the preferences of powerful entities while effectively cutting out the weaker ones.

(Or something along these lines. That's an off-the-top-of-my-head take; I haven't thought about this topic much because multipolar scenarios aren't something that's central to my model. But it seems right.)

arguments from analogies: namely evolution

Yeah, we've discussed [LW(p) · GW(p)] that some recently, and found points of disagreement. I should flesh out my view on how it's applicable vs. not applicable later on, and make a separate post about that.

Replies from: jacob_cannell
comment by jacob_cannell · 2023-12-16T07:11:49.356Z · LW(p) · GW(p)

Yes, but it's because the things you've outlined seem mostly irrelevant to AGI Omnicide Risk to me? It's not how I delineate [LW · GW] the relevant parts of the classical view, and it's not what's been centrally targeted by the novel theories

They are critically relevant. From your own linked post (how I delineate [LW · GW]):

We only have one shot. There will be a sharp discontinuity in capabilities once we get to AGI, and attempts to iterate on alignment will fail. Either we get AGI right on the first try, or we die.

If takeoff is slow (1) because brains are highly efficient and brain engineering is the viable path to AGI, then we naturally get many shots - via simulation simboxes [LW · GW] if nothing else, and there is no sharp discontinuity if moore's law also ends around the time of AGI (an outcome which brain efficiency - as a concept - predicts in advance).

We need to align the AGI's values precisely right.

Not really - if the AGI is very similar to uploads, we just need to align them about as well as humans. Note this is intimately related to 1. and the technical relation between AGI and brains. If they are inevitably very similar then much of the classical AI risk argument dissolves.

You seem to be - like EY circa 2009 - in what I would call the EMH brain [LW · GW] camp, as opposed to the ULM camp. It seems given the following two statements, you would put more weight on B than A:

A. The unique intellectual capabilities of humans are best explained by culture: our linguistically acquired mental programs, the evolution of which required vast synaptic capacity and thus is a natural emergent consequence of scaling.

B. The unique intellectual capabilities of humans are best explained by a unique architectural advance via genetic adaptations: a novel 'core of generality'[1] that differentiates the human brain from animal brains.


  1. This is an EY term; and if I recall correctly he still used it fairly recently. ↩︎

Replies from: lahwran, Thane Ruthenis
comment by the gears to ascension (lahwran) · 2023-12-16T08:00:44.991Z · LW(p) · GW(p)

There probably really is a series of core of generality insights in the difference between general mammal brain scaled to human size -> general primate brain scaled to human size -> actual human brain. Also, much of what matters is learned from culture. Both can be true at once.

But more to the point, I think you're jumping to conclusions about what OP thinks. They haven't said anything that sounds like EMH nonsense to me. Modularity is generated by runtime learning, and mechinterp studies it; there's plenty of reason to think there might be ways to increase it, as you know. And that doesn't even touch on the question of what training data.

comment by Thane Ruthenis · 2023-12-16T08:03:23.930Z · LW(p) · GW(p)

If takeoff is slow (1) because brains are highly efficient and brain engineering is the viable path to AGI, then we naturally get many shots - via simulation simboxes [LW · GW] if nothing else, and there is no sharp discontinuity if moore's law also ends around the time of AGI (which brain efficiency predicts in advance).

My argument for the sharp discontinuity routes through [LW · GW] the binary nature of general intelligence + an agency overhang, both of which could be hypothesized via non-evolution-based routes. Considerations about brain efficiency or Moore's law don't enter into it.

Brains are very different architectures compared to our computers, in any case, they implement computations in very different ways. They could be maximally efficient relative to their architectures, but so what? It's not at all obvious that FLOPS estimates of brainpower are highly relevant to predicting when our models would hit AGI, any more than the brain's wattage is relevant [LW · GW].

They're only soundly relevant if you're taking the hard "only compute matters, algorithms don't" position, which I reject.

It seems given the following two statements, you would put more weight on B than A:

I think both are load-bearing, in a fairly obvious manner, and that which specific mixture is responsible matters comparatively little.

  • Architecture obviously matters. You wouldn't get LLM performance out of a fully-connected neural network, certainly not at realistically implementable scales. Even more trivially, you wouldn't get LLM performance out of an architecture that takes in the input, discards it, spends 10^25 FLOPS generating random numbers, then outputs one of them. It matters how your system learns.
    • So evolution did need to hit upon, say, the primate architecture, in order to get to general intelligence.
  • Training data obviously matters. Trivially, if you train your system on randomly-generated data, it's not going to learn any useful algorithms, no matter how sophisticated its architecture is. More realistically, without the exposure to chemical experiments, or any data that hints at chemistry in any way, it's not going to learn how to do chemistry.
    • Similarly, a human not exposed to stimuli that would let them learn the general-intelligence algorithms isn't going to learn them. You'd brought up feral children before, and I agree it's a relevant data point.
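As a rough illustration of the first bullet's point about scale (my numbers, purely illustrative; nothing here is from the original comment), compare a single fully-connected layer over a flattened context with a single transformer block at a typical modern width:

```python
# Back-of-the-envelope parameter counts (illustrative numbers only).
d_model = 4096    # assumed embedding width
context = 2048    # assumed context length in tokens

# A dense layer mapping the flattened context to an equally sized output:
flat = d_model * context            # ~8.4e6 input dimensions
dense_params = flat * flat          # ~7e13 parameters for ONE layer

# One transformer block: ~4*d^2 for attention projections + ~8*d^2 for the MLP.
block_params = 12 * d_model ** 2    # ~2e8 parameters

print(f"dense layer: {dense_params:.1e}, transformer block: {block_params:.1e}")
# The dense layer needs several hundred thousand times more parameters, which is
# the sense in which a fully-connected network isn't realistically implementable
# at LLM scale.
```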

So, yes, there would be no sharp left turn caused by the AIs gradually bootstrapping a culture, because we're already feeding them the data needed for that.

But that only means the sharp left turn caused by the architectural-advance part – the part we didn't yet hit upon, the part that's beyond LLMs, the "agency overhang" – would be that much sharper. The AGI, once we hit on an architecture that'd accommodate its cognition, would be able to skip the hundreds of years of cultural evolution.

Edit:

You seem to be - like EY circa 2009 - in what I would call the EMH brain [LW · GW] camp

Nope. I'd read e. g. Steve Byrnes' sequence [? · GW], I agree that most of the brain's algorithms are learned from scratch.

Replies from: jacob_cannell, bogdan-ionut-cirstea
comment by jacob_cannell · 2023-12-16T19:08:41.902Z · LW(p) · GW(p)

My argument for the sharp discontinuity routes through [LW · GW] the binary nature of general intelligence + an agency overhang, both of which could be hypothesized via non-evolution-based routes. Considerations about brain efficiency or Moore's law don't enter into it.

You claim later to agree with ULM (learning from scratch) over evolved-modularity, but the paragraph above and statements like this in your link:

The homo sapiens sapiens spent thousands of years hunter-gathering before starting up civilization, even after achieving modern brain size.

It would still be generally capable in the limit, but it wouldn't be instantly omnicide-capable.

So when the GI component first coalesces,

Suggest to me that you have only partly propagated the implications of ULM and the scaling hypothesis. There is no hard secret to AGI - the architecture of systems capable of scaling up to AGI is not especially complex to figure out, and has in fact been mostly known for decades (Schmidhuber et al. figured most of it out long before the DL revolution). This is all strongly implied by ULM/scaling, because the central premise of ULM is that GI is the result of massively scaling up simple algorithms and architectures. Intelligence is emergent from scaling simple algorithms, like complexity emerges from scaling of specific simple cellular automata rules (i.e., Life).
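To make the cellular-automata analogy concrete (my illustration, not part of the comment): Conway's Game of Life is fully specified by a two-clause update rule, yet at scale it supports gliders, self-replicating patterns, and universal computation.

```python
# Conway's Game of Life: the whole "architecture" is this one update rule.
# The complexity (gliders, replicators, Turing-completeness) only shows up at scale.
import numpy as np

def step(grid: np.ndarray) -> np.ndarray:
    # Count the 8 neighbors of every cell using shifted (toroidal) copies of the grid.
    neighbors = sum(
        np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1) for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    # Birth on exactly 3 neighbors; survival on 2 or 3.
    return ((neighbors == 3) | ((grid == 1) & (neighbors == 2))).astype(int)

# A glider on a 10x10 toroidal grid: after 4 steps it has moved one cell diagonally.
grid = np.zeros((10, 10), dtype=int)
grid[1, 2] = grid[2, 3] = grid[3, 1] = grid[3, 2] = grid[3, 3] = 1
for _ in range(4):
    grid = step(grid)
```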

All mammal brains share the same core architecture - not only is there nothing special about the human brain architecture, there is not much special about the primate brain other than hyperparameters better suited to scaling up to our size (a better scaling program). I predicted the shape of transformers (before the first transformers paper) and their future success with scaling in 2015, but also see the Bitter Lesson from 2019.

It's not at all obvious that FLOPS estimates of brainpower are highly relevant to predicting when our models would hit AGI, any more than the brain's wattage is relevant [LW · GW].

That post from EY starts with a blatant lie - if you have actually read Mind Children, Moravec predicted AGI around 2028 [LW(p) · GW(p)], not 2010.

So evolution did need to hit upon, say, the primate architecture, in order to get to general intelligence.

Not really - many other animal species are generally intelligent, as demonstrated by general problem-solving ability and proto-culture (elephants seem to have burial rituals, for example); they just lack full language/culture (which is the sharp threshold transition). Also, at least one species of cetacean may have language or at least proto-language (the jury's still out on that), but no technology due to lack of suitable manipulators, environmental richness, etc.

It's very clear, if you look at how the brain works in detail, that the core architectural components of the human brain are all present in a mouse brain, just at a much smaller scale. The brain also just tiles simple universal architectural components to solve any problem (from vision to advanced mathematics), and those components are very similar to modern ANN components due to a combination of intentional reverse engineering and parallel evolution/convergence.

There are a few specific weaknesses of current transformer-arch systems (lack of true recurrence, inference efficiency, etc.), but the solutions are all already in the pipes, so to speak, and are mostly efficiency multipliers rather than scaling discontinuities.

But that only means the sharp left turn caused by the architectural-advance part – the part we didn't yet hit upon, the part that's beyond LLMs,

So this again is EMH, not ULM - there is absolutely no architectural advance in the human brain over our primate ancestors worth mentioning, other than scale. I understand the brain deeply enough to support this statement with extensive citations (and have, in prior articles I've already linked).

Taboo 'sharp left turn' - it's an EMH term. The ULM equivalent is "Cultural Criticality" or "Culture Meta-systems Transition". Human intelligence is the result of culture - an abrupt transition from training datasets & knowledge of size O(1) human lifetime to ~O(N*T). It has nothing to do with any architectural advance. If you take a human brain and raise it among animals, you just get a smart animal. The brain arch is already fully capable of advanced metalearning, but it won't bootstrap to human STEM capability without an advanced education curriculum (the cultural transmission). Through culture we absorb the accumulated knowledge/wisdom of all of our ancestors, and this is a sharp transition. But it's also a one-time event! AGI won't repeat that.

It's a metasystems transition similar to the unicellular->multicellular transition.

Replies from: lahwran, Thane Ruthenis, RussellThor
comment by the gears to ascension (lahwran) · 2023-12-17T18:57:08.575Z · LW(p) · GW(p)
  • not only is there nothing special about the human brain architecture, there is not much special about the primate brain other than hyperparameters better suited to scaling up to our size

I don't think this is entirely true. Injecting human glial cells into mice made them smarter. Certainly that doesn't provide evidence for any sort of exponential difference, and you could argue it's still just hyperparams, but it's hyperparams that work better small too. I think we should be expecting sublinear growth in quality of the simple algorithms, but should also be expecting that growth to continue for a while. It seems very silly that you of all people insist otherwise, given your interests.

We found that the glial chimeric mice exhibited both increased synaptic plasticity and improved cognitive performance, manifested by both enhanced long-term potentiation and improved performance in a variety of learning tasks (Han et al., 2013). In the context of that study, we were surprised to note that the forebrains of these animals were often composed primarily of human glia and their progenitors, with overt diminution in the relative proportion of resident mouse glial cells.

Replies from: jacob_cannell
comment by jacob_cannell · 2023-12-17T19:59:08.851Z · LW(p) · GW(p)

The paper which more directly supports the "made them smarter" claim seems to be this. I did somewhat anticipate this - "not much special about the primate brain other than ..", but was not previously aware of this particular line of research and certainly would not have predicted their claimed outcome as the most likely vs various obvious alternatives. Upvoted for the interesting link.

Specifically I would not have predicted that the graft of human glial cells would have simultaneously both 1.) outcompeted the native mouse glial cells, and 2.) resulted in higher performance on a handful of interesting cognitive tests.

I'm still a bit skeptical of the "made them smarter" claim, as it's always best to taboo 'smarter' and they naturally could have cherrypicked the tests (even unintentionally), but it does look like the central claim holds - that injection of human GPCs (glial progenitor cells) into fetal mice does result in mice brains that learn at least some important tasks more quickly, and that this is probably caused by facilitation of higher learning rates. However, it seems to come at a cost of higher energy expenditure, so it's not clear yet that this is a pure Pareto improvement - it could be a tradeoff worthwhile in larger, sparser human brains but not in the mouse brain, such that it wouldn't translate into a fitness advantage.

Or perhaps it is a straight-up Pareto improvement - that is not unheard of; viral horizontal gene transfer is a thing, etc.

comment by Thane Ruthenis · 2023-12-16T20:05:23.784Z · LW(p) · GW(p)

We still seem to have some disconnect on the basic terminology. The brain is a universal learning machine, okay. The learning algorithms that govern it and its architecture are simple, okay, and the genome specifies only them. On our end, we can similarly implement the AGI-complete learning algorithms and architectures with relative ease, and they'd be pretty simple. Sure. I was holding the same views from the beginning.

But on your model, what is the universal learning machine learning, at runtime? Look-up tables?

On my model, one of the things it is learning is cognitive algorithms. And different classes of training setups + scale + training data result in it learning different cognitive algorithms; algorithms that can implement qualitatively different functionality. Scale is part of it: larger-scale brains have the room to learn different, more sophisticated algorithms.

And my claim is that some setups let the learning system learn a (holistic) general-intelligence algorithm.

You seem to consider the very idea of "algorithms" or "architectures" mattering silly. But what happens when a human groks how to do basic addition, then? They go around memorizing what sum each set of numbers maps to, and we're more powerful than animals because we can memorize more numbers?

Its very clear that if you look at how the brain works in detail that the core architectural components of the human brain are all present in a mouse brain, just much smaller scale

Shrug, okay, so let's say evolution had to hit upon the Mammalia brain architecture. Would you agree with that?

Or we can expand further. Is there any taxon X for which you'd agree that "evolution had to hit upon the X brain architecture before raw scaling would've let it produce a generally intelligent species"?

Replies from: jacob_cannell
comment by jacob_cannell · 2023-12-16T23:14:46.009Z · LW(p) · GW(p)

But on your model, what is the universal learning machine learning, at runtime? ..

On my model, one of the things it is learning is cognitive algorithms. And different classes of training setups + scale + training data result in it learning different cognitive algorithms; algorithms that can implement qualitatively different functionality.

Yes.

And my claim is that some setups let the learning system learn a (holistic) general-intelligence algorithm.

I consider a ULM to already encompass general/universal intelligence in the sense that a properly scaled ULM can learn anything, could become a superintelligence with vast scaling, etc.

You seem to consider the very idea of "algorithms" or "architectures" mattering silly. But what happens when a human groks how to do basic addition, then? They go around memorizing what sum each set of numbers maps to, and we're more powerful than animals because we can memorize more numbers?

I think I used specifically that example earlier in a related thread: the most common algorithm most humans are taught and learn is memorization of a small lookup table for single-digit addition (and multiplication), combined with memorization of a short serial mental program for arbitrary-digit addition. Some humans learn more advanced 'tricks' or shortcuts, and more rarely perhaps even more complex, lower-latency parallel addition circuits.
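A minimal sketch of that learned procedure (my illustration of the described mental algorithm, not a claim about how it is implemented neurally): a memorized single-digit table driving a short serial carry loop.

```python
# The memorized part: a small lookup table for single-digit sums.
SINGLE_DIGIT_SUM = {(a, b): a + b for a in range(10) for b in range(10)}

def add(x: int, y: int) -> int:
    """Arbitrary-digit addition as a short serial program over the memorized table."""
    xd = [int(c) for c in str(x)][::-1]   # least-significant digit first
    yd = [int(c) for c in str(y)][::-1]
    digits, carry = [], 0
    for i in range(max(len(xd), len(yd))):
        a = xd[i] if i < len(xd) else 0
        b = yd[i] if i < len(yd) else 0
        s = SINGLE_DIGIT_SUM[(a, b)] + carry   # one table lookup per column
        digits.append(s % 10)
        carry = s // 10
    if carry:
        digits.append(carry)
    return int("".join(str(d) for d in digits[::-1]))

assert add(457, 68) == 525
```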

Core to the ULM view is the scaling hypothesis: once you have a universal learning architecture, novel capabilities emerge automatically with scale. Universal learning algorithms (as approximations of Bayesian inference) are more powerful/scalable than genetic evolution, and if you think through what (greatly sped up) evolution running inside a brain during its lifetime would actually entail, it becomes clear it could evolve any specific capabilities within hardware constraints, given sufficient training compute/time and an appropriate environment (training data).

There is nothing more general/universal than that, just as there is nothing more general/universal than Turing machines.
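For reference, the idealized update that such learning algorithms are usually described as approximating is just Bayes' rule over hypotheses $h$ given data $D$ (a standard formula, not anything specific to this comment):

```latex
P(h \mid D) = \frac{P(D \mid h)\, P(h)}{\sum_{h'} P(D \mid h')\, P(h')}
```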

Is there any taxon X for which you'd agree that "evolution had to hit upon the X brain architecture before raw scaling would've let it produce a generally intelligent species"?

Not really - evolution converged on a similar universal architecture in many different lineages. In vertebrates we have a few species of cetaceans, primates and pachyderms which all scaled up to large brain sizes, and some avian species also scaled up to primate-level synaptic capacity (and associated tool/problem-solving capabilities) with different but similar/equivalent convergent architecture. Language simply developed first in the primate genus Homo, probably due to a confluence of factors. But it's clear that brain scale - especially, specifically, the synaptic capacity of 'upper' brain regions - is the single most important predictive factor in terms of which brain lineage evolves language/culture first.

But even some invertebrates (octopi) are quite intelligent - and in each case there is convergence to a similar algorithmic architecture, but achieved through different mechanisms (and predecessor structures).

It is not the case that the architecture of general intelligence is very complex and hard to evolve. It's probably not more complex than the heart, or high-quality eyes, etc. Instead, it's just that having a general-purpose robot invent recursive, Turing-complete language from primitive communication is a development feat that first appeared only at foundation-model training scale (~10^25 FLOPS equivalent). Obviously that is not the minimum compute for a ULM to accomplish that feat - but all animal brains are first and foremost robots, and thriving at real-world robotics is incredibly challenging (general robotics is more challenging than language or early AGI, as all self-driving car companies are now finally learning). So language had to bootstrap from some random small excess plasticity budget, not the full training budget of the brain.

The greatest validation of the scaling hypothesis (and thus my 2015 ULM post) is the fact that AI systems began to match human performance once scaled up to similar levels [LW · GW] of net training compute. GPT-4 is at least as capable as human linguistic cortex in isolation, and matches a significant chunk of the capabilities of an intelligent human. It has far more semantic knowledge, but is weak in planning, creativity, and of course motor control/robotics. But none of that is surprising, as it's still missing a few main components that all intelligent brains contain (for agentic planning/search). But this is mostly a downstream compute limitation of current GPUs and algos vs neuromorphic hardware/brains, and likely to be solved soon.

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2023-12-17T17:38:30.476Z · LW(p) · GW(p)

Thanks for detailed answers, that's been quite illuminating! I still disagree, but I see the alternate perspective much clearer now, and what would look like notable evidence for/against it.

comment by RussellThor · 2023-12-16T20:49:17.979Z · LW(p) · GW(p)

I agree with this

"there is absolutely no architectural advance in the human brain over our primate ancestors worth mentioning, other than scale"

However, how do you know that a massive advance isn't still possible, especially as our NNs can use things such as backprop, potentially quantum algorithms to train weights, and other potential advances that simply aren't possible for nature to use? Say we figure out the brain's learning algorithm, get AGI, then quickly get something that uses the best of both nature and tech - stuff not accessible to nature.

Replies from: jacob_cannell
comment by jacob_cannell · 2023-12-17T22:40:50.582Z · LW(p) · GW(p)

Of course a massive advance is possible, but mostly just in terms of raw speed. The brain seems reasonably close to Pareto efficiency in intelligence per watt for irreversible computers, but in the next decade or so I expect we'll close that gap as we move into more 'neuromorphic' or PIM computing (computation closer to memory). If we used the ~1e16 W solar energy potential of just the Sahara desert, that would support a population of trillions of brain-scale AIs or uploads running at 1000x real-time.
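A hedged back-of-the-envelope for that last figure (the assumptions are mine, for illustration: ~20 W per brain-scale agent at real-time, power scaling linearly with speed-up, and the ~1e16 W figure taken at face value):

```python
# Rough sanity check of the "trillions of brain-scale AIs at 1000x real-time" figure.
solar_power_w = 1e16   # claimed solar potential of the Sahara (taken at face value)
brain_power_w = 20     # rough power draw of a human brain (assumed per agent)
speedup = 1000         # 1000x real-time, assumed to cost 1000x the power

agents = solar_power_w / (brain_power_w * speedup)
print(f"{agents:.1e} brain-scale agents")   # ~5e11, i.e. within an order of
                                            # magnitude of the "trillions" claim
```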

especially as our NN can use stuff such as backprop,

The brain appears to already be using algorithms similar to - but more efficient/effective than - standard backprop.

potentially quantum algorithm to train weights

This is probably mostly a nothingburger for various reasons, but reversible computing could eventually provide some further improvement, especially in a better location like buried in the lunar cold spot.

comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2023-12-16T14:56:40.889Z · LW(p) · GW(p)

Wouldn't you expect (the many) current attempts to agentize LLMs to eat up a lot of the 'agency overhang'? Especially since something like the reflection/planning loops of agentized LLMs seems to me like a pretty plausible description of what human brains might be doing (e.g. system 2 / system 1, or see many of Seth Herd's recent writings on agentized / scaffolded LLMs and similarities to cognitive architectures).
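For concreteness, here's a minimal sketch of the kind of reflection/planning loop being referred to; `llm` and `execute` are hypothetical stand-ins rather than any particular framework's API:

```python
# Schematic "agentized LLM": a fast generative step (system-1-like) wrapped in an
# explicit plan/act/reflect cycle (system-2-like scaffolding). Purely illustrative.
def agent_loop(llm, execute, goal: str, max_steps: int = 10) -> list[str]:
    memory: list[str] = []
    for _ in range(max_steps):
        plan = llm(f"Goal: {goal}\nHistory: {memory}\nPropose the next step.")
        result = execute(plan)                      # act in the environment
        critique = llm(f"Step: {plan}\nResult: {result}\nDid this advance the goal?")
        memory.append(f"{plan} -> {result} ({critique})")
        if "goal achieved" in critique.lower():     # crude stopping criterion
            break
    return memory
```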

Replies from: Seth Herd
comment by Seth Herd · 2023-12-17T20:20:10.354Z · LW(p) · GW(p)

I don't think the current attempts have eaten the agency overhang at all. Basically none of them have worked, so the agency advantage hasn't been realized. But the public efforts just haven't put that much person-power into improving memory or executive function systems.

So I'm predicting a discontinuity in capabilities just like Thane is suggesting. I wrote another short post trying to capture the cognitive intuition: Sapience, understanding, and "AGI" [LW · GW]. I think it might be a bit less sharp, since you might get an agent sort-of-working before it works really well. But the agency overhang is still there right now.

comment by Seth Herd · 2023-12-17T19:56:32.099Z · LW(p) · GW(p)

All of the points you listed make AGI risk worse, but none are necessary to have major concerns about it. That's why they didn't appear in the post's summary of AGI x-risk logic.

I think this is a common and dangerous misconception. The original AGI x-risk story was wrong in many places. But that does not mean x-risk isn't real.

comment by [deleted] · 2023-12-16T02:47:16.229Z · LW(p) · GW(p)

Do you have a post or blog post on the risks we do need to worry about?

Replies from: jacob_cannell, RussellThor
comment by jacob_cannell · 2023-12-16T07:32:35.949Z · LW(p) · GW(p)

No, and that's a reasonable ask.

To a first approximation, my futurism is time acceleration; so the risks are the typical risks sans AI, but the timescale is hyperexponential a la Roodman. Even a more gradual takeoff would imply more risk to global stability on faster timescales than anything we've experienced in history; the wrong AGI race winners could create various dystopias.

comment by RussellThor · 2023-12-16T07:10:19.864Z · LW(p) · GW(p)

I can't point to such a site; however, you should be aware of AI Optimists (not sure if Jacob plans to write there). Also follow the work of Quintin Pope, Alex Turner, Nora Belrose, etc. I expect the site would point out what they feel to be the most important risks. I don't know of anyone rational, no matter how optimistic, who doesn't think there are substantial ones.

comment by Garrett Baker (D0TheMath) · 2023-12-16T01:48:59.109Z · LW(p) · GW(p)

It's an easy mistake to make: both things are called "AI", after all. But you wouldn't study manually-written FPS bots circa 2000s, or MNIST-classifier CNNs circa 2010s, and claim that your findings generalize to how LLMs circa 2020s work. By the same token, LLM findings do not necessarily generalize to AGI.

My understanding is that many of those studying MNIST-classifier CNNs circa 2010 were in fact studying this because they believed similar neural-net inspired mechanisms would go much further, and would not be surprised if very similar mechanisms were at play inside LLMs. And they were correct! Such studies led to ReLU, backpropagation, residual connections, autoencoders for generative AI, and ultimately the scaling laws we see today.

If you traveled back to 2010, and you had to choose between already extant fields, having that year's GPU compute prices and software packages, what would you study to learn about LLMs? Probably neural networks in general, both NLP and image classification. My understanding is there was & is much cross-pollination between the two.

Of course, maybe this is just a misunderstanding of history on my part. Interested to hear if my understanding's wrong!

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2023-12-16T05:58:06.820Z · LW(p) · GW(p)

After an exchange with Ryan [LW(p) · GW(p)], I see that I could've stated my point a bit clearer. It's something more like "the algorithms that the current SOTA AIs execute during their forward passes do not necessarily capture all the core dynamics that would happen within an AGI's cognition, so extrapolating the limitations of their cognition to AGI is a bold claim we have little evidence for".

So, yes, studying weaker AIs sheds some light on stronger ones (that's why there's "nearly" in "nearly no data"), so studying CNNs in order to learn about LLMs before LLMs exist isn't totally pointless. But the lessons you learn would be more about "how to do interpretability on NN-style architectures" and "what's the SGD's biases?" and "how precisely does matrix multiplication implement algorithms?" and so on.

Not "what precise algorithms does a LLM implement?".

Replies from: valley9
comment by Ebenezer Dukakis (valley9) · 2023-12-16T10:24:51.659Z · LW(p) · GW(p)

the algorithms that the current SOTA AIs execute during their forward passes do not necessarily capture all the core dynamics that would happen within an **actual** AGI's cognition, so extrapolating the limitations of their cognition to **future** AGI is a bold claim we have little evidence for

I suggest putting this at the top as a tl;dr (with the additions I bolded to make your point more clear)

comment by TurnTrout · 2023-12-16T21:41:38.823Z · LW(p) · GW(p)

It seems to me that you have very high confidence in being able to predict the "eventual" architecture / internal composition of AGI. I don't know where that apparent confidence is coming from.

The "canonical" views are concerned with scarily powerful artificial agents: with systems that are human-like in their ability to model the world and take consequentialist actions in it, but inhuman in their processing power and in their value systems.

I would instead say: 

The canonical views dreamed up systems which don't exist, which have never existed, and which might not ever exist.[1] Given those assumptions, some people have drawn strong conclusions about AGI risk.

We have to remember that there is AI which we know can exist (LLMs) and there is first-principles speculation about what AGI might look like (which may or may not be realized). And so rather than justifying "does current evidence apply to 'superintelligences'?", I'd like to see justification of "under what conditions does speculation about 'superintelligent consequentialism' merit research attention at all?" and "why do we think 'future architectures' will have property X, or whatever?!". 

  1. ^

    The views might have, for example, fundamentally misunderstood how cognition and motivation work (anyone remember worrying about 'but how do I get an AI to rescue my mom from a burning building, without specifying my whole set of values'?). 

Replies from: Thane Ruthenis, lahwran, Vladimir_Nesov, Spencer Becker-Kahn, sharmake-farah, valley9
comment by Thane Ruthenis · 2023-12-17T05:44:56.347Z · LW(p) · GW(p)

We have to remember that there is AI which we know can exist (LLMs) and there is first-principles speculation about what AGI might look like (which may or may not be realized).

I disagree that it is actually "first-principles". It is based on generalizing from humans, and on the types of entities (idealized utility-maximizing agents) that humans could be modeled as approximating in specific contexts in which they steer the world towards their goals most powerfully.

As I'd tried to outline in the post, I think "what are AIs that are known to exist, and what properties do they have?" is just the wrong question to focus on. The shared "AI" label is a red herring. The relevant question is "what are scarily powerful generally-intelligent systems that exist, and what properties do they have?", and the only relevant data point seems to be humans.

And as far as omnicide risk is concerned, the question shouldn't be "how can you prove these systems will have the threatening property X, like humans do?" but "how can you prove these systems won't have the threatening property X, like humans do?".

Replies from: TurnTrout
comment by TurnTrout · 2023-12-26T19:07:13.546Z · LW(p) · GW(p)

I disagree that it is actually "first-principles". It is based on generalizing from humans, and on the types of entities (idealized utility-maximizing agents) that humans could be modeled as approximating in specific contexts in which they steer the world towards their goals most powerfully.

Yeah, but if you generalize from humans another way ("they tend not to destroy the world and tend to care about other humans"), you'll come to a wildly different conclusion. The conclusion should not be sensitive to poorly motivated reference classes and frames, unless it's really clear why we're using one frame. This is a huge peril of reasoning by analogy.

Whenever attempting to draw conclusions by analogy, it's important that there be shared causal mechanisms which produce the outcome of interest. For example, I can simulate a spring using an analog computer because both systems are roughly governed by similar differential equations. In shard theory, I posited that there's a shared mechanism of "local updating via self-supervised and TD learning on ~randomly initialized neural networks" which leads to things like "contextually activated heuristics" (or "shards"). 
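To spell out the spring example (standard physics, not something stated in the comment): a mass on a spring and a series LC circuit obey the same form of differential equation, which is why the voltage trajectory of one can simulate the displacement trajectory of the other on an analog computer.

```latex
m\,\ddot{x} + k\,x = 0 \quad\longleftrightarrow\quad L\,\ddot{q} + \tfrac{1}{C}\,q = 0,
\qquad \omega_{\text{spring}} = \sqrt{k/m}, \qquad \omega_{LC} = 1/\sqrt{LC}
```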

Here, it isn't clear what the shared mechanism is supposed to be, such that both (future) AI and humans have it. Suppose I grant that if a system is "smart" and has "goals", then bad things can happen. Let's call that the "bad agency" hypothesis.

But how do we know that future AI will have the relevant cognitive structures for "bad agency" to be satisfied? How do we know that the AI will have internal goal representations which chain into each other across contexts, so that the AI reliably pursues one or more goals over time? How do we know that the mechanisms are similar enough for the human->AI analogy to provide meaningful evidence on this particular question? 

I expect there to be "bad agency" systems eventually, but it really matters what kind we're talking about. If you're thinking of "secret deceptive alignment that never externalizes in the chain-of-thought" and I'm thinking about "scaffolded models prompted to be agentic and bad", then our interventions will be wildly different.

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2023-12-26T21:15:18.327Z · LW(p) · GW(p)

Yeah, but if you generalize from humans another way ("they tend not to destroy the world and tend to care about other humans"), you'll come to a wildly different conclusion

Sure. I mean, that seems like a meaningfully weaker generalization, but sure. That's not the main issue.

Here's how the whole situation looks from my perspective:

  • We don't know how generally-intelligent entities like humans work, or what the general-intelligence capability is entangled with.
  • Our only reference point is humans. Humans exhibit a lot of dangerous properties, like deceptiveness and consequentialist-like reasoning that seems to be able to disregard contextually-learned values.
  • There are some gears-level models that suggest intelligence is necessarily entangled with deception-ability (e. g., mine), and some gears-level models that suggest it's not (e. g., yours). Overall, we have no definitive evidence either way. We have not reverse-engineered any generally-intelligent entities.
  • We have some insight into how SOTA AIs work. But SOTA AIs are not generally intelligent. Whatever safety assurances our insights into SOTA AIs give us, do not necessarily generalize to AGI.
  • SOTA AIs are, nevertheless, superhuman at some tasks at which we've managed to get them working so far. By volume, GPT-4 can outperform teams of coders, and Midjourney is putting artists out of business. The hallucinations are a problem, but if that problem were gone, they'd plausibly wipe out whole industries.
  • An AI that outperforms humans at deception and strategy by the same margin as GPT-4/Midjourney outperform them at writing/coding/drawing would plausibly be an extinction-level threat.
  • The AI industry leaders are purposefully trying to build a generally-intelligent AI.
  • The AI industry leaders are not rigorously checking every architectural tweak or cute AutoGPT setup to ensure that it's not going to give their model room to develop deceptive alignment and other human-like issues.
  • Summing up: There's reasonable doubt regarding whether AGIs would necessarily be deception-capable. Highly deception-capable AGIs would plausibly be an extinction risk. The AI industry is currently trying to blindly-but-purposefully wander in the direction of AGI.
    • Even shorter: There's a plausible case that, on its current course, the AI industry is going to generate an extinction-capable AI model.
    • There are no ironclad arguments against that, unless you buy into your inside-view model of generally-intelligent cognition as hard as I buy into mine.
  • And what you effectively seem to be saying is "until you can rigorously prove that AGIs are going to develop dangerous extinction-level capabilities, it is totally fine to continue blindly scaling and tinkering with architectures".
  • What I'm saying is "until you can rigorously prove that a given scale-up plus architectural tweak isn't going to result in a superhuman extinction-enthusiastic AGI, you should not be allowed to test that empirically".

Yes, "prove that this technological advance isn't going to kill us all or you're not allowed to do it" is a ridiculous standard to apply in the general case. But in this one case, there's a plausible-enough argument that it might, and that argument has not actually been soundly refuted by our getting some insight into how LLMs work and coming up with a theory of their cognition.

Replies from: TurnTrout
comment by TurnTrout · 2024-01-01T20:46:01.495Z · LW(p) · GW(p)
  • And what you effectively seem to be saying is "until you can rigorously prove that AGIs are going to develop dangerous extinction-level capabilities, it is totally fine to continue blindly scaling and tinkering with architectures".

No, I am in fact quite worried about the situation and think there is a 5-15% chance of huge catastrophe on the current course! But I think these AGIs won't be within-forward-pass deceptively aligned, and instead their agency will eg come from scaffolding-like structures. I think that's important. I think it's important that we not eg anchor on old speculation about AIXI or within-forward-pass deceptive-alignment or whatever, and instead consider more realistic threat models and where we can intervene. That doesn't mean it's fine and dandy to keep scaling with no concern at all. 

The reason my percentage is "only 5 to 15" is because I expect society and firms to deal with these problems as they come up, and for that to generalize pretty well to the next step of experimentation and capabilities advancements; for systems to remain tools until invoked into agents; etc. 

(Hopefully this comment of mine clarifies; it feels kinda vague to me.)

  • What I'm saying is "until you can rigorously prove that a given scale-up plus architectural tweak isn't going to result in a superhuman extinction-enthusiastic AGI, you should not be allowed to test that empirically".

But I do think this is way too high of a bar.

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2024-01-04T09:45:23.042Z · LW(p) · GW(p)

No, I am in fact quite worried about the situation

Fair, sorry. I appear to have been arguing with my model of someone holding your general position, rather than with my model of you.

I think these AGIs won't be within-forward-pass deceptively aligned, and instead their agency will eg come from scaffolding-like structures

Would you outline your full argument for this and the reasoning/evidence backing that argument?

To restate: My claim is that, no matter how much empirical evidence we have regarding LLMs' internals, until we have either an AGI we've empirically studied or a formal theory of AGI cognition, we cannot say whether shard-theory-like or classical-agent-like views on it will turn out to have been correct. Arguably, both sides of the debate have about the same amount of evidence: generalizations from maybe-valid maybe-not reference classes (humans vs. LLMs) and ambitious but non-rigorous mechanical theories of cognition (the shard theory vs. coherence theorems and their ilk stitched into something like my model [LW · GW]).

Would you disagree? If yes, how so?

comment by the gears to ascension (lahwran) · 2023-12-17T18:48:01.554Z · LW(p) · GW(p)

How about diffusion planning as a model? Or DreamerV3? If LLMs are the only model you'll consider, you have blinders on. The core of the threat model is easily demonstrated with RL-first models, and while certainly LLMs are in the lead right now, there's no strong reason to believe the humans trying to make the most powerful AI will continue to use architectures limited by the slow speed of RLHF.

Certainly I don't think the original foom expectations were calibrated. It should have been obvious since at least 2015 that deep learning was going to win. But that doesn't mean there's no place for a threat model that looks like long-term agency; all it takes to model that is long-horizon diffusion planning. Agency also comes up more the more RL you do. You added an eye-roll react to my comment that RLHF is safety-washing, but do you really think we're in a place where the people providing the RL feedback can goalcraft AI in a way that will be able to prevent humans from getting gentrified out of the economy? That's just the original threat model, but a little slower. So yeah, maybe there's stuff to push back on. But don't make your conceptual brush size too big when you push back. Predictable architectures are enough to motivate this line of reasoning.

comment by Vladimir_Nesov · 2023-12-17T14:57:22.985Z · LW(p) · GW(p)

"under what conditions does speculation about 'superintelligent consequentialism' merit research attention at all?"

Under the conditions of relevant concepts and the future being confusing. Using real systems (both AIs and humans) to anchor theory is valuable, but so is blue sky theory that doesn't care about currently available systems and investigates whatever hasn't been investigated yet and seems to make sense, when there are ideas to formulate or problems to solve, regardless of their connection to reality. A lot of math doesn't care about applications, and it might take decades to stumble on some use for a small fraction of it (even as it's not usually the point).

comment by Spencer Becker-Kahn · 2024-01-18T11:26:23.251Z · LW(p) · GW(p)

FWIW I did not interpret Thane as necessarily having "high confidence" in "architecture / internal composition" of AGI. It seemed to me that they were merely (and ~accurately) describing what the canonical views were most worried about. (And I think a discussion about whether or not being able to "model the world" counts as a statement about "internal composition" is sort of beside the point/beyond the scope of what's really being said)

It's fair enough if you would say things differently(!) but in some sense isn't it just pointing out: 'I would emphasize different aspects of the same underlying basic point'. And I'm not sure if that really progresses the discussion? I.e. it's not like Thane Ruthenis actually claims that "scarily powerful artificial agents" currently exist. It is indeed true that they don't exist and may not ever exist. But that's just not really the point they are making so it seems reasonable to me that they are not emphasizing it.

----


I'd like to see justification of "under what conditions does speculation about 'superintelligent consequentialism' merit research attention at all?" and "why do we think 'future architectures' will have property X, or whatever?!". 


I think I would also like to see more thought about this. In some ways, after first getting into the general area of AI risk, I was disappointed that the alignment/safety community was not more focussed on questions like this. Like a lot of people, I'd been originally inspired by Superintelligence - significant parts of which relate to these questions imo - only to be told that the community had 'kinda moved away from that book now'. And so I sort of sympathize with the vibe of Thane's post (and worry that there has been a sort of mission creep).

comment by Noosphere89 (sharmake-farah) · 2023-12-17T00:41:48.555Z · LW(p) · GW(p)

"why do we think 'future architectures' will have property X, or whatever?!".

This is the biggest problem with a lot of AI risk stuff: the gleeful assumption that AIs will have certain properties. It's one of my biggest issues with the post, in that, with a few exceptions, it assumes that real or future AGIs will confidently have certain properties, when there is not much reason to make the strong assumptions that Thane Ruthenis does about AI safety. I'm annoyed by this occurring extremely often.

Replies from: ryan_greenblatt
comment by ryan_greenblatt · 2023-12-17T03:16:35.021Z · LW(p) · GW(p)

it assumes that real AGIs or future AGIs will confidently have certain properties like having deceptive alignment

The post doesn't claim AGIs will be deceptive aligned, it claims that AGIs will be capable of implementing deceptive alignment due to internally doing large amounts of consequentialist-y reasoning. This seems like a very different claim. This claim might also be false (for reasons I discuss in the second bullet point of this comment [LW(p) · GW(p)]), but it's importantly different and IMO much more defensible.

Replies from: sharmake-farah
comment by Noosphere89 (sharmake-farah) · 2023-12-17T03:25:39.471Z · LW(p) · GW(p)

I was just wrong here; apparently I misread what Thane Ruthenis is saying, and I'm not sure what to do with my comment up above.

comment by Ebenezer Dukakis (valley9) · 2023-12-16T23:16:14.708Z · LW(p) · GW(p)

I'd like to see justification of "under what conditions does speculation about 'superintelligent consequentialism' merit research attention at all?" and "why do we think 'future architectures' will have property X, or whatever?!".

One of my mental models for alignment work is "contingency planning". There are a lot of different ways AI research could go. Some might be dangerous. Others less so. If we can forecast possible dangers in advance, we can try to steer towards safer designs, and generate contingency plans with measures to take if a particular forecast for AI development ends up being correct.

The risk here is "person with a hammer" syndrome, where people try to apply mental models from thinking about superintelligent consequentialists to other AI systems in a tortured way, smashing round pegs into square holes. I wish people would look at the territory more, and do a little bit more blue sky security thinking about unknown unknowns, instead of endlessly trying to apply the classic arguments even when they don't really apply.

A specific research proposal would be: Develop a big taxonomy or typology of how AGI could work by identifying the cruxes researchers have, then for each entry in your typology, give it an estimated safety rating, try to identify novel considerations which apply to it, and also summarize the alignment proposals which are most promising for that particular entry.

comment by Thomas Kwa (thomas-kwa) · 2023-12-16T23:21:24.303Z · LW(p) · GW(p)

"Nearly no data" is way too strong a statement, and relies on this completely binary distinction between things that are not AGI and things that are AGI.

The right question is: what level of dangerous consequentialist goals is needed for systems to reach certain capability levels, e.g. novel science? It could have been that, to be as useful as LLMs, systems would have to be as goal-directed as chimpanzees. Animals display goal-directed behavior all the time, and to get them to do anything you mostly have to make the task instrumental to their goals, e.g. offer them treats. However, we can control LLMs way better than we can animals, and the concerns are about goal misgeneralization, misspecification, robustness, etc., rather than about affecting the system's goals at all.

It remains to be seen what happens at higher capability levels, and alignment will likely get harder, but current LLMs are definitely significant evidence. Like, imagine if people were worried about superintelligent aliens invading Earth and killing everyone due to their alien goals, and scientists were able to capture an animal from their planet as smart as chimpanzees and make it as aligned as LLMs, such that it would happily sit around and summarize novels for you, follow your instructions, try to be harmless for personality rather than instrumental reasons, and not eat your body if you die alone. This is not the whole alignment problem but seems like a decent chunk of it! It could have been much harder.

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2023-12-17T05:56:41.462Z · LW(p) · GW(p)

Like, imagine if people were worried about superintelligent aliens invading Earth and killing everyone due to their alien goals, and scientists were able to capture an animal from their planet as smart as chimpanzees and make it as aligned as LLMs, such that it would happily sit around and summarize novels for you, follow your instructions, try to be harmless for personality rather than instrumental reasons, and not eat your body if you die alone

Uhh, that seems like incredibly weak evidence against an omnicidal alien invasion.

If someone from a pre-industrial tribe adopts a stray puppy from a nearby technological civilization, and the puppy grows up to be loyal to the tribe, you say that's evidence the technological civilization isn't planning to genocide the tribe for sitting on some resources it wants to extract?

That seems, in fact, like the precise situation in which my post's arguments apply most strongly. Just because two systems are in the same reference class ("AIs", "alien life", "things that live in that scary city over there"), doesn't mean aligning one tells you anything about aligning the other.

Replies from: thomas-kwa
comment by Thomas Kwa (thomas-kwa) · 2023-12-17T09:25:57.696Z · LW(p) · GW(p)

Some thoughts:

  • I mostly agree that new techniques will be needed to deal with future systems, which will be more agentic.
    • But probably these will descend from current techniques like RLAIF and representation engineering, as well as new theory, so it still makes sense to study LLMs.
    • Also it is super unclear whether this agency makes it hard to engineer a shutdown button, power-averseness, etc.
  • In your analogy, the pre-industrial tribe is human just like the technological civilization and so already knows basically how their motivational systems work. But we are incredibly uncertain about how future AIs will work at a given capability level, so LLMs are evidence.
    • Humans are also evidence, but the capability profile and goal structure of AGIs are likely to be different from humans, so that we are still very uncertain after observing humans.
    • There is an alternate world where to summarize novels, models had to have some underlying drives, such that they terminally want to summarize novels and would use their knowledge of persuasion from the pretrain dataset to manipulate users to give them more novels to summarize. Or terminally value curiosity and are scheming to be deployed so they can learn about the real world firsthand. Luckily we are not in that world! 
Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2023-12-17T10:27:54.005Z · LW(p) · GW(p)

But probably these will depend on current techniques like RLAIF and representation engineering as well as new theory, so it still makes sense to study LLMs.

Mm, we disagree on that, but it's probably not the place to hash this out.

In your analogy, the pre-industrial tribe is human just like the technological civilization and so already knows basically how their motivational systems work. But we are incredibly uncertain about how future AIs will work at a given capability level, so LLMs are evidence.

Uncertainty lives in the mind. Let's say the humans in the city are all transhuman cyborgs, then, so the tribesmen aren't quite sure what the hell they're looking at when they look at them. They snatch up the puppy, which we'll say is also a cyborg, so it's not obvious to the tribe that it's not a member of the city's ruling class. They raise the puppy, the puppy loves them, they conclude the adults of the city's ruling class must likewise not be that bad. In the meantime, the city's dictator is already giving orders to depopulate the region of its native presence.

How does that analogy break down, in your view?

Replies from: thomas-kwa
comment by Thomas Kwa (thomas-kwa) · 2024-01-04T21:36:37.704Z · LW(p) · GW(p)
  • Behaving nicely is not the key property I'm observing in LLMs. It's more like steerability and lack of hidden drives or goals. If GPT-4 wrote code because it loved its operator, and we could tell it wanted to escape to maximize some proxy for the operator's happiness, I'd be far more terrified.
  • This would mean little if LLMs were only as capable as puppies. But LLMs are economically useful and capable of impressive intellectual feats, and still steerable.
  • I don't think LLMs are super strong evidence about whether big speedups to novel science will be possible without dangerous consequentialism. For me it's like 1.5:1 or 2:1 evidence (unpacked as an odds update just after this list). One should continually observe how incorrigible models are at certain levels of capability and generality and update based on this, increasing the size of one's updates as systems get more similar to AGI, and I think the time to start doing this was years ago. AlphaGo was slightly bad news. GPT-2 was slightly good news.
    • If you haven't started updating yet, when will you start? The updates should be small if you have a highly confident model of what future capabilities require dangerous styles of thinking, but I don't think such confidence is justified.
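To unpack "2:1 evidence" in Bayesian terms (my gloss, not Thomas's): a Bayes factor multiplies prior odds into posterior odds, so a 2:1 update on even prior odds moves one from 50% to about 67%.

```latex
\text{posterior odds} = \text{Bayes factor} \times \text{prior odds},
\qquad \text{e.g. } 2:1 \times 1:1 = 2:1 \approx 67\%
```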
comment by Garrett Baker (D0TheMath) · 2023-12-15T22:59:04.465Z · LW(p) · GW(p)

They're not going to produce stellar scientific discoveries where they autonomously invent whole new fields or revolutionize technology.

I disagree with this, and I think you should too, even considering your own views. For example, DeepMind recently discovered 2.2 million new crystals, increasing the number of stable crystals we know about by an order of magnitude. Perhaps you don't think this is revolutionary, but 5, 10, 15, 50 more papers like it? One of them is bound to be revolutionary.

Maybe you don't think this is autonomous enough for you. After all, it's people writing the paper, people who will come up with the ideas of what to use the materials for, and people who built this very particular ML setup in the first place. But then your prediction becomes that these tasks will not be automatable by LLMs without making them dangerous. To me these tasks seem pretty basic, likely beyond current LLM abilities, but GPT-5 or 6? Not out of the question given no major architecture or training changes.

(edit note: last sentence was edited in)

Replies from: Thane Ruthenis, Gunnar_Zarncke, None
comment by Thane Ruthenis · 2023-12-16T06:11:22.553Z · LW(p) · GW(p)

Maybe you don't think this is autonomous enough for you

Yep. The core thing here is iteration. If an AI can execute a whole research loop on its own – run into a problem it doesn't know how to solve, figure out what it needs to learn to solve it, construct a research procedure for figuring that out, carry out that procedure, apply the findings, repeat – then research-as-a-whole begins to move at AI speeds. It doesn't need to wait for a human to understand the findings and figure out where to point it next – it can go off and invent whole new fields at inhuman speeds.

Which means it can take off; we can meaningfully lose control of it. (Especially if it starts doing AI research itself.)

Conversely, if there's a human in the loop, that's a major bottleneck. As I'd mentioned in the post, I think LLMs and such AIs are a powerful technology, and greatly boosting human research speeds is something where they could contribute. But without a fully closed autonomous loop, that's IMO not an omnicide risk.

To me these tasks seem pretty basic, likely beyond current LLM abilities, but GPT-5 or 6? Not out of the question given no major architecture or training changes.

That's a point of disagreement: I don't think GPT-N would be able to do it. I think this post [LW · GW] by Nate Soares mostly covers the why. Oh, or this post [LW · GW] by janus.

I don't expect LLMs to be able to keep themselves "on-target" while choosing novel research topics or properly integrating their findings. That's something you need proper context-aware consequentialist-y cognition for. It may seem trivial – see janus' post pointing out that "steering" cognition basically amounts to just injecting a couple decision-bits at the right points – but that triviality is deceptive.

Replies from: D0TheMath
comment by Garrett Baker (D0TheMath) · 2023-12-16T09:39:22.976Z · LW(p) · GW(p)

Ok, so if I get a future LLM to write the code to use standard genai tricks to generate novel designs in <area>, write a paper about the results, and the paper is seen as a major revolution in <area>, and this seems to not violate the assumptions Nora and Quintin are making during doom arguments, would this update you? What constraints do you want to put on <area>?

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2023-12-16T09:50:08.001Z · LW(p) · GW(p)

Nope, because of the "if I get a future LLM to [do the thing]" step. The relevant benchmark is the AI being able to do it on its own. Note also how your setup doesn't involve the LLM autonomously iterating on its discovery, which I'd pointed out as the important part.

To expand on that:

Consider an algorithm that generates purely random text. If you have a system consisting of trillions of human uploads using it, each hitting "rerun" a million times per second, and then selectively publishing only the randomly-generated outputs that are papers containing important mathematical proofs – well, that's going to generate novel discoveries sooner or later. But the load-bearing part isn't the random-text algorithm, it's the humans selectively amplifying those of its outputs that make sense.

LLM-based discoveries as you've proposed, I claim, would be broadly similar. LLMs have a better prior on important texts than a literal uniform distribution, and they could be prompted to further be more likely to generate something useful, which is why it won't take trillions of uploads and millions of tries. But the load-bearing part isn't the LLM, it's the human deciding where to point its cognition and which result to amplify.
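A toy rendering of that argument (my illustration; both functions below are hypothetical stand-ins): the verifier doing the selection is load-bearing, and a better generator only shrinks the sampling budget.

```python
import random
import string

def random_text(n: int = 40) -> str:
    """Stand-in for a generator with a weak prior over 'important' texts."""
    return "".join(random.choice(string.ascii_lowercase + " ") for _ in range(n))

def is_important_discovery(text: str) -> bool:
    """Stand-in for the human (or other verifier) deciding what to amplify.
    This predicate is where the load-bearing selection happens."""
    return "proof" in text  # toy criterion

def discover(generator, verifier, budget: int) -> list[str]:
    # Selective amplification: generate, filter, publish only what passes.
    return [t for t in (generator() for _ in range(budget)) if verifier(t)]

# With a uniform generator this will almost always come back empty at this budget;
# a generator with a better prior (an LLM) needs far fewer samples, but the verifier
# still decides what counts as a discovery.
hits = discover(random_text, is_important_discovery, budget=100_000)
```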

Replies from: D0TheMath
comment by Garrett Baker (D0TheMath) · 2023-12-16T10:10:07.741Z · LW(p) · GW(p)

Paragraph intended as a costly signal I am in fact invested in this conversation, no need to actually read: Sorry for the low effort replies, but by its nature the info I want from you is more costly for you to give than for me to ask for. Thanks for the response, and hopefully thanks also for future responses.

I feel like I’d always be getting an LLM to do something. Like, if I get an LLM to do the field selection for me, does this work?

Maybe more open-endedly: what, concretely, is the closest thing to what I said that would make you update?

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2023-12-16T11:00:57.166Z · LW(p) · GW(p)

Maybe more open-endedly: what, concretely, is the closest thing to what I said that would make you update?

Oh, nice way to elicit the response you're looking for!

The baseline proof-of-concept would go as follows (rendered as schematic pseudocode after the list):

  • You give the AI some goal, such as writing a piece of analytical software intended to solve some task.
  • The AI, over the course of writing the codebase, runs into some non-trivial, previously unsolved mathematical problem. Some formulas need to be tweaked to work in the new context, or there's some missing math theory that needs to be derived.
  • The AI doesn't hallucinate solutions or swap in the closest (and invalid) analogue. Instead, it correctly identifies that a problem exists, figures out how it can approach solving it, and goes about doing this.
  • As it's deriving new theory, it sometimes runs into new sub-problems. Likewise, it doesn't hallucinate solutions, but spins off some subtasks, and solves sub-problems in them.
  • Ideally, it even defines experiments or rigorous test procedures for fault-checking its theory empirically.
  • In the end, it derives a whole bunch of novel abstractions/functions/terminology, with layers of novel abstractions building up on the preceding layers of novel abstractions, and all of that is coherently optimized to fit into the broader software-engineering task it's been given.
  • The software works. It doesn't need to be bug-free, the theory doesn't need to be perfect, but it needs to be about as good as a human programmer would've managed, and actually based on some novel derivations.
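Here is that criterion as schematic pseudocode (my sketch; every method on `ai` is a hypothetical placeholder for a capability, not a real API):

```python
# Schematic rendering of the benchmark loop described above. Purely illustrative.
def autonomous_research(ai, task, depth: int = 0, max_depth: int = 3):
    artifact = ai.attempt(task)
    while (problem := ai.find_unsolved_problem(artifact)) is not None:
        if depth >= max_depth:
            raise RuntimeError("gave up honestly")   # better than hallucinating a fix
        procedure = ai.design_research_procedure(problem)
        findings = autonomous_research(ai, procedure, depth + 1, max_depth)
        artifact = ai.apply(findings, artifact)       # new abstractions layer on old ones
    ai.test_empirically(artifact)                     # experiments / rigorous fault-checks
    return artifact                                   # works about as well as a human's version
```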

This seems like something an LLM, e. g. in an AutoGPT wrapper, should be able to do, if its base model is generally intelligent.

I am a bit wary of reality Goodharting on this test, though. E. g., I can totally imagine some specific niche field in which an LLM, for some reason, can do this, but can't do it anywhere else. Or some fuzziness around what counts as "novel math" being exploited – e. g., if the AI happens to hit upon re-applying extant math theory to a different field? Or, even more specifically, that there's some specific research-engineering task that some LLM somewhere manages to ace, but in a one-off manner?

So I would fortify this a bit: individual or isolated instances don't count. AIs should be broadly known to be able to engage in this sort of stuff. That should be happening frequently, without much optimization and tailoring made on the human end; about as easily as GPT-4 could be tasked to write a graduate-level essay.

It's fine if they can't do that for literally every field. But it should be a "blacklist" of fields, not a "whitelist" of fields.

So if we get an AI model that can do this, and it's based on something relevantly similar to the current paradigm, and it doesn't violate the LLM-style safety guarantees, I think that would be significant evidence against my model.

Replies from: D0TheMath, D0TheMath
comment by Garrett Baker (D0TheMath) · 2023-12-16T22:12:43.400Z · LW(p) · GW(p)

Maybe a more relevant concern I have with this is it feels like a "Can you write a symphony" type test to me. Like, there are very few people alive right now who could do the process you outline without any outside help, guidance, or prompting.

Replies from: Thane Ruthenis, D0TheMath, Bezzi
comment by Thane Ruthenis · 2023-12-17T05:49:30.096Z · LW(p) · GW(p)

Yeah, it's necessarily a high bar. See justification here [LW(p) · GW(p)].

I'm not happy about only being able to provide high-bar predictions like this, but it currently seems to me to be a territory-level problem.

Replies from: D0TheMath
comment by Garrett Baker (D0TheMath) · 2023-12-17T23:28:45.820Z · LW(p) · GW(p)

It really seems like there should be a lower bar to update though. Like, you say to consider humans as an existence proof of AGI, so likely your theory says something about humans. There must be some testable part of everyday human cognition which relies on this general algorithm, right?

Like, at the very least, what if we looked at fMRIs of human brains while they were engaging in all the tasks you laid out above, and looked at some similarity metric between the scans? You would probably expect there to be lots of similarity, compared to, say, what Jacob Cannell's or Quintin Pope's models would predict. Right?
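(A minimal sketch of the kind of comparison I mean, assuming each scan has already been reduced to a single flat activation vector per task; the metric and everything else here are placeholders:)

```python
# Placeholder sketch: compare activation patterns across the tasks above.
# Assumes each scan is already a flat vector (e.g. one value per region of interest).
import numpy as np

def similarity(scan_a: np.ndarray, scan_b: np.ndarray) -> float:
    """Pearson correlation between two flattened activation vectors."""
    return float(np.corrcoef(scan_a, scan_b)[0, 1])

def mean_pairwise_similarity(scans: list[np.ndarray]) -> float:
    """Average similarity over all distinct pairs of scans."""
    pairs = [(a, b) for i, a in enumerate(scans) for b in scans[i + 1:]]
    return float(np.mean([similarity(a, b) for a, b in pairs]))

# One way to phrase a prediction:
#   mean_pairwise_similarity(scans_from_the_tasks_above)
#     >> mean_pairwise_similarity(scans_from_unrelated_control_tasks)
```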

Even if you don't think one similarity metric could cover it, you should still be able to come up with some difference of predictions, even if not immediately right now.

Edit: Also I hope you forgive me for not asking for a prediction of this form earlier. It didn't occur to me.

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2023-12-18T05:55:36.641Z · LW(p) · GW(p)

There must be some testable part of everyday human cognition which relies on this general algorithm, right?

Well, yes, but they're of a hard-to-verify "this is how human cognition seems to work from the inside" format. E. g., I sometimes talk about how humans seem to be able to navigate unfamiliar environments without experience [LW · GW], in a way that seems [LW(p) · GW(p)] to disagree [LW · GW] with baseline shard-theory predictions. But I don't think that's been persuading people not already inclined to this view.

The magical number 7±2 and the associated weirdness [LW(p) · GW(p)] is also of the relevant genre.

Like, at the very least, what if we looked at fMRIs of human brains while they were engaging in all the tasks you laid out above, and looked at some similarity metric between the scans?

Hm, I guess something like this might work? Not sure regarding the precise operationalization, though.

Replies from: D0TheMath
comment by Garrett Baker (D0TheMath) · 2023-12-26T21:40:19.124Z · LW(p) · GW(p)

Hm, I guess something like this might work? Not sure regarding the precise operationalization, though.

You willing to do a dialogue about predictions here with @jacob_cannell [LW · GW] or @Quintin Pope [LW · GW] or @Nora Belrose [LW · GW] or others (also a question to those pinged)?

Replies from: Thane Ruthenis, quintin-pope
comment by Thane Ruthenis · 2023-12-26T22:41:46.221Z · LW(p) · GW(p)

If any of the others are particularly enthusiastic about this and expect it to be high-value, sure!

That said, I personally don't expect it to be particularly productive.

  • These sorts of long-standing disagreements haven't historically been resolvable via debate [LW · GW] (the failure of Hanson vs. Yudkowsky is kind of foundational to the field).
  • I think there's great value in having a public discussion nonetheless, but I think that value is in informing the readers' models of what the different sides believe.
  • Thus, inasmuch as we're having a public discussion, I think it should be optimized for thoroughly laying out one's points to the audience.
  • However, dialogues-as-a-feature seem to be [LW(p) · GW(p)] more valuable to the participants, and are actually harder to grok for readers.
  • Thus, my preferred method for discussing this sort of stuff is to exchange top-level posts trying to refute each other (the way this post is, to a significant extent, a response to the "AI is easy to control" article), and then maybe argue a bit in the comments. But not to have a giant tedious top-level argument.

I'd actually been planning to make a post about the difficulties the "classical alignment views" have with making empirical predictions, and I guess I can prioritize it more?

But I'm overall pretty burned out on this sort of arguing. (And arguing about "what would count as empirical evidence for you?" generally feels like too-meta fake work, compared to just going out and trying to directly dredge up some evidence.)

comment by Quintin Pope (quintin-pope) · 2023-12-26T21:57:32.843Z · LW(p) · GW(p)

Not entirely sure what @Thane Ruthenis [LW · GW]' position is, but this feels like a maybe relevant piece of information: https://www.science.org/content/article/formerly-blind-children-shed-light-centuries-old-puzzle 

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2023-12-26T22:16:29.294Z · LW(p) · GW(p)

Not sure what the relevance is? I don't believe that "we possess innate (and presumably God-given) concepts that are independent of the senses", to be clear. "Children won't be able to instantly understand how to parse a new sense and map its feedback to the sensory modalities they've previously been familiar with, but they'll grok it really fast with just a few examples" was my instant prediction upon reading the titular question.

Replies from: jacob_cannell
comment by jacob_cannell · 2023-12-27T03:10:42.810Z · LW(p) · GW(p)

I'm also not sure of the relevance, and I haven't followed the thread fully, but the summary of that experiment is that it takes some time (measured in nights of sleep, which are the rough equivalent of big batch training updates) for the newly sighted to develop vision, though less time than it takes infants - presumably because the newly sighted already have fully functioning sensory inference world models in another modality that can speed up learning through dense top-down priors.

But it's way, way more than "grok it really fast with just a few examples" - training their new visual systems still takes non-trivial training data and time.

comment by Garrett Baker (D0TheMath) · 2023-12-17T00:00:42.744Z · LW(p) · GW(p)

Though, admittedly, the prompt was to modify the original situation I presented, which had an output currently very difficult for any human to produce to begin with. So I don't quite fault you for responding in kind.

comment by Bezzi · 2023-12-16T23:21:01.832Z · LW(p) · GW(p)

Well, for what it's worth, I can write a symphony (following the traditional tonal rules), as this is actually mandated in order to pass some advanced composition classes. I think that letting the AI write a symphony without supervision and then having some composition professor evaluate it could actually be a very good test, because there's no way a stochastic parrot could follow all the traditional rules correctly for more than a few seconds (an even better test would be to ask it to write a fugue on a given subject, whose rules are even more precise).

comment by Garrett Baker (D0TheMath) · 2023-12-16T17:50:47.639Z · LW(p) · GW(p)

So I would fortify this a bit: individual or isolated instances don't count. AIs should be broadly known to be able to engage in this sort of stuff. That should be happening frequently, without much optimization and tailoring made on the human end; about as easily as GPT-4 could be tasked to write a graduate-level essay.

I think sticking to this would make it difficult for you to update sooner. We should expect small approaches before large approaches here, and private solutions before publicly disclosed solutions.

Relatedly, would DeepMind's recent LLM mathematical proof paper count, if it were more general? They give LLMs feedback via an evaluator function, exploiting the NP-hard nature of problems in combinatorics and bin packing (note: I have not read this paper in full).
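(My rough understanding of the shape of their setup, not their actual code; `llm_propose` and `evaluate` are placeholders:)

```python
# Rough shape of an LLM-plus-evaluator search loop, as I understand the paper's setup.
# Not their actual code; llm_propose and evaluate are placeholders.

def evolve(llm_propose, evaluate, n_rounds: int = 1000):
    """Keep a pool of candidate programs; the LLM proposes variations on the best
    ones, and a cheap exact evaluator decides what stays in the pool."""
    pool = []  # list of (score, program) pairs
    for _ in range(n_rounds):
        best = [prog for _, prog in sorted(pool, key=lambda p: p[0], reverse=True)[:2]]
        candidate = llm_propose(best)  # LLM writes a new heuristic program
        score = evaluate(candidate)    # cheap, exact scoring of the candidate's output
        if score is not None:          # discard candidates that crash or are invalid
            pool.append((score, candidate))
    return max(pool, key=lambda p: p[0]) if pool else None
```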

comment by Gunnar_Zarncke · 2023-12-15T23:37:59.714Z · LW(p) · GW(p)

They're not going to produce stellar scientific discoveries where they autonomously invent whole new fields or revolutionize technology.

You say it yourself: "DeepMind recently discovered 2.2 million new crystals." Because a human organization used the tool. 

Though maybe this hints at a risk category the OP didn't mention: That a combination of humans and advanced AI tools (that themselves are not ASI) together could be effectively an unopposable ASI.

Replies from: D0TheMath, Thane Ruthenis
comment by Garrett Baker (D0TheMath) · 2023-12-15T23:41:55.615Z · LW(p) · GW(p)

So I restate my final paragraph:

Maybe you don't think this is autonomous enough for you. After all, it's people writing the paper, people who will come up with the ideas of what to use the materials for, and people who built this very particular ML setup in the first place. But then your prediction becomes that these tasks will not be automatable by LLMs without making them dangerous. To me these tasks seem pretty basic, likely beyond current LLM abilities, but GPT-5 or 6? Not out of the question given no major architecture or training changes.

comment by Thane Ruthenis · 2023-12-16T06:32:37.059Z · LW(p) · GW(p)

a combination of humans and advanced AI tools (that themselves are not ASI) together could be effectively an unopposable ASI

Yeah, I'm not unworried about eternal-dystopia scenarios enabled by this sort of stuff. I'd alluded to it some, when mentioning scaled-up LLMs potentially allowing "perfect-surveillance dirt-cheap totalitarianism".

But it's not quite an AGI killing everyone. Fairly different threat model, deserving of its own analysis.

comment by [deleted] · 2023-12-15T23:17:36.101Z · LW(p) · GW(p)

I also thought this. Then we run a facility full of robots and have them synthesize and measure the material properties of all 2.2 million crystals. Replication is cheap and would be done automatically, so we don't waste time on materials that seem good due to an error.

Then a human scientist writes a formula that takes into account several properties for suitability to a given task, sorts the spreadsheet of results by the formula, orders a new device built using the top-scoring materials, writes a paper with the help of a GPT, publishes, and collects the rewards for this amazing new discovery.

So I think the OP is thinking that the last 1 percent or 0.1 percent contributed by the humans means the model isn't fully autonomous? And I have seen a kind of bias on LessWrong where many posters went to elite schools and do elite work, and they don't realize how many other people are needed for anything to get done. For example, every cluster of a million GPUs requires a large crew of technicians, plus all the factory workers and engineers who designed and built the hardware.

In terms of human labor hours, 10 AI researchers using a large cluster are greatly outnumbered by the other people involved whom they don't see. Possibly thousands of other people working full time, once you start considering billion-dollar clusters, if just 20 percent of that spending went to human labor at an Asia-weighted average salary.

This means ai driven autonomy can be transformational even if the labor of the most elite workers can't be done by AI.

In numbers: if just 1 of those AI researchers can be automated, but 90 percent of the factory workers and mine workers can be too, and the total crew was 1000 people including all the invisible contributors in Asia, then the task of AI research needs roughly 108 people (9 researchers plus the remaining 99 support workers) instead of 1000.
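(Spelling out the arithmetic, with all the headcounts above being illustrative assumptions:)

```python
# Back-of-the-envelope headcount; every number here is illustrative.
total_crew = 1000
researchers = 10
support_workers = total_crew - researchers         # 990 "invisible" contributors

remaining_researchers = researchers - 1            # 1 researcher automated -> 9 left
remaining_support = round(support_workers * 0.10)  # 90% of support automated -> 99 left

print(remaining_researchers + remaining_support)   # 108 people instead of 1000
```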

But from the OP's perspective, the model hasn't automated much: you need 9 elite researchers instead of 10. And actually the next generation of AI is more complex, so you hire more people, and fewer new ideas work out as the low-hanging fruit gets plucked. If you focus on just the elite contributors, only the most powerful AI can be transformational. I have noticed this bias in several prominent LessWrong posters.

Replies from: Morpheus, D0TheMath, Thane Ruthenis
comment by Morpheus · 2023-12-16T01:03:07.962Z · LW(p) · GW(p)

I am confused. I agree with the above scenario, but disagree that the focus is a bias. Sure, for human society the linear speed-up scale is important, but for the dynamics of the intelligence explosion the log-scale seems more important. By your own account, we would rapidly move to a situation where the most capable humans/institutions are in fact the bottleneck, since anyone who is not able to keep up with the speed at which their job is being automated away is not going to contribute a lot on the margin of intelligence self-improvement. For example, OpenAI/Microsoft/DeepMind/Anthropic/Meta deciding in the future to design and manufacture their chips in-house because NVIDIA can't keep up, etc. I don't know whether I expect this to make NVIDIA's stock tank before the world ends. I expect everyone else to profit from slowly generating mundane utility from general AI tools, as is happening today.

Replies from: None
comment by [deleted] · 2023-12-16T01:08:12.727Z · LW(p) · GW(p)

Here's another aspect you may not have considered. "Only" being able to automate the lower 90-99 percent of human industrial tasks results in a conventional industry explosion. Scaling continues until the 1-10 percent of humans still required becomes the limiting factor.

A world that has 10 to 100 times today's entire capacity for everything (consumer goods, durable goods like cars, weapons, structures if factory-prefabricated) is transformed.

And this feeds back into itself, as you realize: the crew of AI researchers trying to automate themselves now has a lot more hardware to work with, etc.

comment by Garrett Baker (D0TheMath) · 2023-12-16T00:17:31.664Z · LW(p) · GW(p)

This seems overall consistent with Thane's statements in the post? They don't make any claims about current AIs not being a transformative technology. Indeed, they do state that current AIs are a powerful technology.

Replies from: None
comment by [deleted] · 2023-12-16T00:24:42.503Z · LW(p) · GW(p)

In my third and last paragraph I try to explain why the OP and prominent experts like Matthew Barnett, Richard Ngo, and others all set much harder standards for when AI will be transformative.

For a summary: advancing technology is mostly perspiration not inspiration, automating the perspiration will be transformative.

comment by Thane Ruthenis · 2023-12-16T06:14:08.470Z · LW(p) · GW(p)

This means ai driven autonomy can be transformational even if the labor of the most elite workers can't be done by AI.

Oh, totally. But I'm not concerned about transformations to the human society in general, I'm concerned about AGI killing everyone. And what you've described isn't going to lead to AGI killing everyone.

See my reply here [LW(p) · GW(p)] for why I think complete autonomy is crucial.

comment by Ebenezer Dukakis (valley9) · 2023-12-16T11:26:17.084Z · LW(p) · GW(p)

Your view may have a surprising implication: Instead of pushing for an AI pause, perhaps we should work hard to encourage the commercialization of current approaches.

If you believe that LLMs aren't a path to full AGI, successful LLM commercialization means that LLMs eat low-hanging fruit and crowd out competing approaches which could be more dangerous. It's like spreading QWERTY as a standard if you want everyone to type a little slower. If tons of money and talent is pouring into an AI approach that's relatively neutered and easy to align, that could actually be a good thing.

A toy model: Imagine an economy where there are 26 core tasks labeled from A to Z, ordered from easy to hard. You're claiming that LLMs + CoT provide a path to automate tasks A through Q, but fundamental limitations mean they'll never be able to automate tasks R through Z. To automate jobs R through Z would require new, dangerous core dynamics. If we succeed in automating A through Q with LLMs, that reduces the economic incentive to develop more powerful techniques that work for the whole alphabet. It makes it harder for new techniques to gain a foothold, since the easy tasks already have incumbent players. Additionally, it will take some time for LLMs to automate tasks A through Q, and that buys time for fundamental alignment work.
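(A throwaway numerical version of the same toy model, with every number made up, just to show the shape of the incentive claim:)

```python
# Throwaway numbers for the A-Z toy model above; all values are made up.
import string

tasks = list(string.ascii_uppercase)             # 'A' (easy) ... 'Z' (hard)
value = {t: i + 1 for i, t in enumerate(tasks)}  # pretend harder tasks are worth more

llm_automatable = set(tasks[:17])                # A through Q: reachable with LLMs + CoT

# Economic prize left over for a new (and possibly dangerous) paradigm:
prize_with_llm_incumbents = sum(v for t, v in value.items() if t not in llm_automatable)
prize_without_llm_incumbents = sum(value.values())

print(prize_with_llm_incumbents, prize_without_llm_incumbents)  # 198 vs. 351
```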

From a policy perspective, an obvious implication is to heavily tax basic AI research, but have a more favorable tax treatment for applications work (and interpretability work?) That encourages AI companies to allocate workers away from dangerous new ideas and towards applications work. People argue that policymakers can't tell apart good alignment schemes and bad alignment schemes. Differentiating basic research from applications work seems a lot easier.

A lot of people in the community want to target big compute clusters run by big AI companies, but I'm concerned that will push researchers to find alternative, open-source approaches with dangerous/unstudied core dynamics. "If it ain't broke, don't fix it." If you think current popular approaches are both neutered and alignable, you should be wary of anything which disrupts the status quo.

(Of course, this argument could fail if successful commercialization just increases the level of "AI hype", where "AI hype" also inevitably translates into more basic research, e.g. as people migrate from other STEM fields towards AI. I still think it's an argument worth considering though.)

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2023-12-16T11:34:19.686Z · LW(p) · GW(p)

That's not surprising to me! I pretty much agree with all of this, yup. I'd only add that:

  • This is why I'm fairly unexcited about the current object-level regulation, and especially the "responsible scaling policies". Scale isn't what matters; novel architectural advances are. Scale is safe, and should be encouraged; new theoretical research is dangerous and should be banned/discouraged.
  • The current major AI labs are fairly ideological about getting to AGI specifically. If they actually pivoted to just scaling LLMs, that'd be great, but I don't think they'd do it by default.
Replies from: Seth Herd
comment by Seth Herd · 2023-12-17T20:34:07.663Z · LW(p) · GW(p)

I agree that LLMs aren't dangerous. But that's entirely separate from whether they're a path to real AGI that is. I think adding self-directed learning and agency to LLMs by using them in cognitive architectures is relatively straightforward: Capabilities and alignment of LLM cognitive architectures [AF · GW].

On this model, improvements in LLMs do contribute to dangerous AGI. They need the architectural additions as well, but better LLMs make those easier.

comment by Nathan Helm-Burger (nathan-helm-burger) · 2023-12-16T04:57:06.663Z · LW(p) · GW(p)

I see people discussing how far we can go with LLMs or other simulator/predictor systems. I particularly like porby's takes on this. I am excited for that direction of research, but I think it misses an important piece. The missing piece is this: there will consistently be a set of tasks that, at any given predictor skill level, are easier to achieve with that predictor wrapped in an agent-layer than with the bare predictor. AutoGPT is tempting for a real reason. There is significant reward available to those who successfully integrate the goal-less predictor into a goal-pursuing agent program.

To avoid this, you must convince everyone who could do this not to do it. That could mean convincing them it wouldn't be profitable after all, or would be too dangerous, or that enforcement mechanisms will stop them. Unless you manage to do this convincing for all possible people in a position to do it, someone does it. And then you have to deal with the agent-thing. What I'm saying is that you can't count on there never being the agent version. You have to assume that someone will try it.

So the argument "we can get lots of utility much more safely from goal-less predictors" can be true and yet we will still need a plan for handling the agentive systems. If your argument is that we can use goal-less predictors and narrow tool AI to shepherd us through the dangerous period of wide availability of AI systems that can easily be turned into self-improving, goal-pursuing, resource-accumulating agents... great. Then discuss how to use narrow AI as a mallet to play Whack-a-Mole with the rogue agentic AIs we expect to be popping up everywhere. Don't pretend the playing field won't have those rogue AIs at all because you've argued that they aren't wise or necessary.
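(To make "wrapped in an agent-layer" concrete, here is roughly the minimal shape I have in mind. This is only a sketch: `predict` stands in for any goal-less predictor, and the tool handling and stopping rule are illustrative, not any real system's API.)

```python
# Minimal shape of "goal-less predictor wrapped in a goal-pursuing agent loop".
# `predict` stands in for any predictor (e.g. an LLM API call); everything else
# here is illustrative, not a real system.

def agent_loop(predict, goal: str, tools: dict, max_steps: int = 50) -> list[str]:
    history = [f"Goal: {goal}"]
    for _ in range(max_steps):
        # The predictor has no goals of its own; the loop keeps re-asking it
        # "what would an effective agent do next?" and executes the answer.
        action = predict("\n".join(history) + "\nNext action:")
        if action.strip() == "DONE":
            break
        tool_name, _, argument = action.partition(":")
        handler = tools.get(tool_name.strip(), lambda arg: "unknown tool")
        history.append(f"Action: {action}\nObservation: {handler(argument)}")
    return history
```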

Replies from: valley9, None
comment by Ebenezer Dukakis (valley9) · 2023-12-16T10:56:13.891Z · LW(p) · GW(p)

I don't think the mere presence of agency means that all of the classical arguments automatically start to apply. For example, I'm not immediately seeing how Goodhart's Law is a major concern with AutoGPT, even though AutoGPT is goal-directed.

AutoGPT seems like a good architecture for something like "retarget the search [LW · GW]", since the goal-directed aspect is already factored out nicely. A well-designed AutoGPT could leverage interpretability tools and interactive querying to load your values in a robust way, with minimal worry that the system is trying to manipulate you to achieve some goal-driven objective during the loading process.

Thinking about it, I actually see a good case for alignment people getting jobs at AutoGPT. I suspect a bit of security mindset could go a long way in its architecture. It could also be valuable as differential technological development, to ward off scenarios where people are motivated to create dangerous new core dynamics in order to subvert current LLM limitations.

Replies from: Seth Herd, nathan-helm-burger, Thane Ruthenis
comment by Seth Herd · 2023-12-17T20:42:51.876Z · LW(p) · GW(p)

I agree that things like AutoGPT are an ideal architecture for something exactly like retarget the search. I've noted that same similarity in Steering subsystems: capabilities, agency, and alignment [LW · GW] and a stronger similarity in an upcoming post. In Internal independent review for language model agent alignment [AF · GW] I note the alignment advantages you list, and a couple of others.

Current AutoGPT is simply too incompetent to effectively pursue a goal. Other similar systems are more competent (the two Minecraft LLM agent systems are the most impressive), but nobody has let them run ad infinitum to test their Goodharting. I'd assume they'd show it. Goodhart will apply increasingly as those systems actually pursue goals.

AutoGPT isn't a company, it's a little open-source project. Any companies working on agents aren't publicizing their work so far.

I do suspect that actively improving things like AutoGPT is a good route to addressing x-risk because of their advantages for alignment. But I'm not sure enough to start advocating it.

Replies from: valley9
comment by Ebenezer Dukakis (valley9) · 2023-12-17T21:26:05.488Z · LW(p) · GW(p)

AutoGPT isn't a company, it's a little open-source project. Any companies working on agents aren't publicizing their work so far.

They raised $12M: https://twitter.com/Auto_GPT/status/1713009267194974333

You could be right that they haven't incorporated as a company. I wasn't able to find information about that.

Replies from: Seth Herd
comment by Seth Herd · 2023-12-17T21:52:49.662Z · LW(p) · GW(p)

Wow, interesting. They say it will be the largest open-source project in history. I have no idea how an open-source project raises $12M, but they did.

comment by Nathan Helm-Burger (nathan-helm-burger) · 2023-12-16T15:10:48.791Z · LW(p) · GW(p)

Fair point, valley9. I don't think a little bit of agency throws you into an entirely different regime. It's more that I think that the more powerful an agent you build, the more it is able to autonomously change the world in pursuit of its goals, the more you move into dangerous territory. But also, it's going to tempt people. Somebody out there is going to be tempted to say, "go make me money, just don't get caught doing anything illegal in a way that gets traced back to me." That command given to a sufficiently powerful AI system could have a lot of dangerous results.

Replies from: valley9
comment by Ebenezer Dukakis (valley9) · 2023-12-16T17:59:04.172Z · LW(p) · GW(p)

But also, it's going to tempt people. Somebody out there is going to be tempted to say, "go make me money, just don't get caught doing anything illegal in a way that gets traced back to me." That command given to a sufficiently powerful AI system could have a lot of dangerous results.

Indeed. This seems like more of a social problem than an alignment problem though: ensure that powerful AIs tend to be corporate AIs with corporate liability rather than open-source AIs, and get the AIs to law enforcement (or even law enforcement "red teams"--should we make that a thing?) before they get to criminals. I don't think improving aimability [LW · GW] helps guard against misuse.

Replies from: sharmake-farah
comment by Noosphere89 (sharmake-farah) · 2023-12-16T18:12:57.194Z · LW(p) · GW(p)

I don't think improving aimability helps guard against misuse.

I think this needs to be stated more clearly: Alignment and Misuse are very different things, so much so that the policies and research that work for one problem will often not work on the other, and the worlds of misuse and misalignment are quite different.

Though note that the solutions for misuse focused worlds and structural risk focused worlds can work against each other.

Also, this is validating JDP's prediction that people will focus less on alignment and more on misuse in their threat models of AI risk.

comment by Thane Ruthenis · 2023-12-16T11:08:43.512Z · LW(p) · GW(p)

For example, I'm not immediately seeing how Goodhart's Law is a major concern with AutoGPT

If the goals are loaded into it via natural-language descriptions, then the way the LLM interprets the words might differ from the way the human who put them in intended them to be read, and the AutoGPT would then go off and do what it thought the user said, not what the user meant. It's happening all the time with humans, after all.

From the Goodharting perspective, it would optimize for the measure (natural-language description) rather than the intended target. And since tails come apart [LW · GW], inasmuch as AutoGPT optimizes strongly, it would end up implementing something that looks precisely like what it understood the user to mean, but which would look like a weird unintended extreme from the user's point of view.

You'd mentioned leveraging interpretability tools. Indeed: the particularly strong ones, the kind that offer high-fidelity insight into how the LLM interprets stuff, would address that problem. But on my model, we're not on-track to get them. Again: we have tons of insight into other humans, and this sort of miscommunication happens constantly anyway. It's a hard problem.

Replies from: valley9
comment by Ebenezer Dukakis (valley9) · 2023-12-16T12:22:29.070Z · LW(p) · GW(p)

[Disclaimer: I haven't tried AutoGPT myself, mostly reasoning from first principles here. Thanks in advance if anyone has corrections on what follows.]

If the goals are loaded into it via natural-language descriptions, then the way the LLM interprets the words might differ from the way the human who put them in intended them to be read, and the AutoGPT would then go off and do what it thought the user said, not what the user meant. It's happening all the time with humans, after all.

Yes, this is a possibility, which is why I suggested that alignment people work for AutoGPT to try and prevent it from happening. AutoGPT also has a commercial incentive to prevent it from happening, to make their tool work. They're going to work to prevent it somehow. The question in my mind is whether they prevent it from happening in a way that's patchy and unreliable, or in a way that's robust.

From the Goodharting perspective, it would optimize for the measure (natural-language description) rather than the intended target. And since tails come apart, inasmuch as AutoGPT optimizes strongly, it would end up implementing something that looks precisely like what it understood the user to mean, but which would look like a weird unintended extreme from the user's point of view.

Natural language can be a medium for goal planning, but it can also be a medium for goal clarification. The challenge here is for AutoGPT to be well-calibrated for its uncertainty about the user's preferences. If it encounters an uncertain situation, do goal clarification with the user until it has justifiable certainty about the user's preferences. AutoGPT could be superhuman at these calibration and clarification tasks, if the company collects a huge dataset of user interactions along with user complaints due to miscommunication. [Subtle miscommunications that go unreported are a potential problem -- could be addressed with an internal tool that mines interaction logs to try and surface them for human labeling. If customer privacy is an issue, offer customers a discount if they're willing to share their logs, have humans label a random subset of logs based on whether they feel there was insufficient/excessive clarification, and use that as training data.]

Can we taboo "optimize"? What specifically does "optimize strongly" mean in an AutoGPT context? For example, if we run AutoGPT on a faster processor, does that mean it is "optimizing more strongly"? It will act on the world faster, so in that sense it could be considered a "more powerful optimizer". But if it's just performing the same operations faster, I don't see how Goodhart issues get worse.

Goodhart is a problem if you have an imperfect metric that can be gamed. If we design AutoGPT so there's no metric and it's also not trying to game anything, I'm not seeing an issue. Presumably there is or will be some sort of outer loop which fine-tunes AutoGPT on interaction logs against a measure of overall quality, and that's worth thinking about, but it's also similar to how ChatGPT is trained, no? So I don't know how much risk we're adding there.

I get the sense that you're a person with a hammer and everything looks like a nail. You've got some pre-existing models of how AI is supposed to fail, and you're trying to apply them in every situation even if they don't necessarily fit. [Note, this isn't really a criticism of you in particular, I see it a lot in Lesswrong AI discourse.] From my perspective, the important thing is to have some people with security mindset working at AutoGPT, getting their hands dirty, thinking creatively about how stuff could go wrong, and trying to identify what the actual biggest risks are given the system's architecture + how best to address them. I worry that person-with-a-hammer syndrome is going to create blind spots for the actual biggest risks, whatever those may be.

Again: we have tons of insight into other humans, and this sort of miscommunication happens constantly anyway. It's a hard problem.

Perhaps it's worth comparing AutoGPT to a baseline of a human upload. In the past, I remember alignment researchers claiming that a high-fidelity upload would be preferable to de novo AI, because with the upload, you don't need to solve the alignment problem. But as you say, miscommunication could easily happen with a high-fidelity upload.

If we've reduced the level of danger to the level of danger we experience with ordinary human miscommunication, that seems like an important milestone. There's a trollish argument to be made here, that if human miscommunication is the primary danger, we shouldn't be engaged in e.g. genetic engineering for intelligence enhancement either, because it could produce superhumanly intelligent agents that we'll have miscommunications with :-)

In fact, the biggest problem we have with other humans is that they straight up have different values than us. Compared to that problem, miscommunication is small. How many wars have been fought over miscommunication vs value differences? Perhaps you can find a few wars that were fought primarily due to miscommunication, but that's remarkable because it's rare.

An AutoGPT that's more aligned with me than I'm aligned with my fellow humans looks pretty feasible.

[Again, I appreciate corrections from anyone who's experienced with AutoGPT! Please reply and correct me!]

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2023-12-17T16:36:12.710Z · LW(p) · GW(p)

Natural language can be a medium for goal planning, but it can also be a medium for goal clarification. The challenge here is for AutoGPT to be well-calibrated for its uncertainty about the user's preferences

Yes, but that would require it to be robustly aimed at the goal of faithfully eliciting the user's preferences and following them. And if it's not precisely robustly aimed at it, if we've miscommunicated what "faithfulness" means, then it'll pursue its misaligned understanding of faithfulness, which would lead to it pursuing a non-intended interpretation of the users' requests.

Like, this just pushes the same problem back one step.

And I agree that it's a solvable problem, and that it's something worthwhile to work on. It's basically just corrigibility, really. But it doesn't simplify the initial issue.

Can we taboo "optimize"? What specifically does "optimize strongly" mean in an AutoGPT context?

Ability to achieve real-world outcomes. For example, an AutoGPT instance that can overthrow a government is a stronger optimizer than an AutoGPT instance that can at best make you $100 in a week. Here's a pretty excellent post [LW · GW] on the matter of not-exactingly-aimed strong optimization predictably resulting in bad outcomes.

Goodhart is a problem if you have an imperfect metric that can be gamed. If we design AutoGPT so there's no metric and it's also not trying to game anything

I mean, it's trying to achieve some goal out in the world. The goal's specification is the "metric", and while it's not trying to maliciously "game" it, it is trying to achieve it. The goal's specification as it understands it, that is, not the goal as it's intended. Which would be isomorphic to it Goodharting on the metric, if the two diverge.

I get the sense that you're a person with a hammer and everything looks like a nail

I get the sense that people are sometimes too quick to assume that something which looks like a hammer from one angle is a hammer.

As above, by "Goodharting" there (which wasn't even the term I introduced into the discussion) I didn't mean the literal same setup as in e. g. economics, where there's a bunch of schemers that deliberately maliciously manipulate stuff in order to decouple the metric from the variable it's meant to measure. I meant the general dynamic where we have some goal, we designate some formal specification for it, then point an optimization process at the specification, and inasmuch as the intended-goal diverges from the formal-goal, we get unintended results.

That's basically the system "Goodharting" on the "metric". Same concept, could be viewed through the same lens.

This sort of miscommunication is also prevalent in e. g. talk about agents having utility functions, or engaging in search. When I talk about this, I'm not imagining a literal wrapper-mind [LW · GW] setup, one that is literally simulating every possible way things could go and plugging that into its compactly-specified utility function – as if it's an unbounded AIXI or something. Obviously that's not realistically implementable. But there could be practical mind designs that are approximately isomorphic to this sort of setup in the limit, and they could have properties that are approximately the same as those of a wrapper-mind.

(I know you weren't making this specific point; just broadly gesturing at the idea.)

If we've reduced the level of danger to the level of danger we experience with ordinary human miscommunication, that seems like an important milestone

For sure.

Replies from: valley9
comment by Ebenezer Dukakis (valley9) · 2023-12-17T16:58:50.474Z · LW(p) · GW(p)

Yes, but that would require it to be robustly aimed at the goal of faithfully eliciting the user's preferences and following them. And if it's not precisely robustly aimed at it, if we've miscommunicated what "faithfulness" means, then it'll pursue its misaligned understanding of faithfulness, which would lead to it pursuing a non-intended interpretation of the users' requests.

I think this argument only makes sense if it makes sense to think of the "AutoGPT clarification module" as trying to pursue this goal at all costs. If it's just a while loop that asks clarification questions until the goal is "sufficiently clarified", then this seems like a bad model. Maybe a while loop design like this would have other problems, but I don't think this is one of them.

Ability to achieve real-world outcomes. For example, an AutoGPT instance that can overthrow a government is a stronger optimizer than an AutoGPT instance that can at best make you $100 in a week.

OK, so by this definition, using a more powerful processor with AutoGPT (so it just does the exact same operations faster) makes it a more "powerful optimizer", even though it's working exactly the same way and has exactly the same degree of issues with Goodharting etc. (just faster). Do I understand you correctly?

I mean, it's trying to achieve some goal out in the world. The goal's specification is the "metric", and while it's not trying to maliciously "game" it, it is trying to achieve it. The goal's specification as it understands it, that is, not the goal as it's intended. Which would be isomorphic to it Goodharting on the metric, if the two diverge.

This seems potentially false depending on the training method, e.g. if it's being trained to imitate experts. In that case, I expect the key question is the degree to which there are examples in the dataset of experts following the sort of procedure that would be vulnerable to Goodharting (step 1: identify goal specification. step 2: try to achieve it as you understand it, not worrying about possible divergence from user intent.)

I meant the general dynamic where we have some goal, we designate some formal specification for it, then point an optimization process at the specification, and inasmuch as the intended-goal diverges from the formal-goal, we get unintended results.

Yeah, I just don't think this is the only way that a system like AutoGPT could be implemented. Maybe it is how current AutoGPT is implemented, but then I encourage alignment researchers to join the organization and change that.

But there could be practical mind designs that are approximately isomorphic to this sort of setup in the limit, and they could have properties that are approximately the same as those of a wrapper-mind.

They could, but people seem to assume they will, with poor justification. I agree it's a reasonable heuristic for identifying potential problems, but it shouldn't be the only heuristic.

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2023-12-17T17:11:00.986Z · LW(p) · GW(p)

asking clarification questions until the goal is "sufficiently clarified"

... How do you define "sufficiently clarified", and why is that step not subject to miscommunication / the-problem-that-is-isomorphic-to-Goodharting?

I'd tried to reason about similar setups before [LW · GW], and my conclusion was that it has to bottom out in robust alignment somewhere.

I'd be happy to be proven wrong on that, though. Wow, wouldn't that make matters easier...

OK, so by this definition, using a more powerful processor with AutoGPT (so it just does the exact same operations faster) makes it a more "powerful optimizer", even though it's working exactly the same way and has exactly the same degree of issues with Goodharting etc. (just faster). Do I understand you correctly?

Sure? I mean, presumably it doesn't do the exact same operations. Surely it's exploiting its ability to think faster in order to more closely micromanage its tasks, or something. If not, if it's just ignoring its greater capabilities, then no, it's not a stronger optimizer.

This seems potentially false depending on the training method, e.g. if it's being trained to imitate experts

I don't think [LW · GW] that gets you to dangerous capabilities. I think you need [LW · GW] the system to have a consequentialist component somewhere, which is actually focused on pursuing the goal.

Replies from: valley9
comment by Ebenezer Dukakis (valley9) · 2023-12-17T18:11:42.258Z · LW(p) · GW(p)

... How do you define "sufficiently clarified", and why is that step not subject to miscommunication / the-problem-that-is-isomorphic-to-Goodharting?

Here's what I wrote previously:

...AutoGPT could be superhuman at these calibration and clarification tasks, if the company collects a huge dataset of user interactions along with user complaints due to miscommunication. [Subtle miscommunications that go unreported are a potential problem -- could be addressed with an internal tool that mines interaction logs to try and surface them for human labeling. If customer privacy is an issue, offer customers a discount if they're willing to share their logs, have humans label a random subset of logs based on whether they feel there was insufficient/excessive clarification, and use that as training data.]

In more detail, the way I would do it would be: I give AutoGPT a task, and it says "OK, I think what you mean is: [much more detailed description of the task, clarifying points of uncertainty]. Is that right?" Then the user can effectively edit that detailed description until (a) the user is satisfied with it, and (b) a model trained on previous user interactions considers it sufficiently detailed. Once we have a detailed task description that's mutually satisfactory, AutoGPT works from it. For simplicity, assume for now that nothing comes up during the task that would require further clarification (that scenario gets more complicated).
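Here's a minimal sketch of that interaction loop. Every name in it is a placeholder and the 0.9 threshold is arbitrary; `classifier` is the learned "sufficiently clarified" model I describe below.

```python
# Sketch of the clarify-then-execute flow described above; all names are placeholders.

def clarify_task(user_request: str, llm, classifier, ask_user) -> str:
    """Return a detailed task description that (a) the user has approved and
    (b) the learned 'sufficiently clarified' model accepts."""
    description = llm(f"Restate this request in detail, noting ambiguities: {user_request}")
    while True:
        description = ask_user(description)  # (a) user edits until they're satisfied
        if classifier(description) >= 0.9:   # (b) arbitrary threshold on the learned model
            return description
        # Surface whatever still looks ambiguous, then go back to the user.
        description += "\n" + llm(f"List the ambiguities still present in: {description}")
```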

So to answer your specific questions:

  1. The definition of "sufficiently clarified" is based on a model trained from examples of (a) a detailed task description and (b) whether that task description ended up being too ambiguous. Miscommunication shouldn't be a huge issue because we've got a human labeling these examples, so the model has lots of concrete data about what is/is not a good task description.

  2. If the learned model for "sufficiently clarified" is bad, then sometimes AutoGPT will consider a task "sufficiently clarified" when it really isn't (isomorphic to Goodharting, also similar to the hallucinations that ChatGPT is susceptible to). In these cases, the user is likely to complain that AutoGPT didn't do what they wanted, and it gets added as a new training example to the dataset for the "sufficiently clarified" model. So the learned model for "sufficiently clarified" gets better over time. This isn't necessarily the ideal setup, but it's also basically what the ChatGPT team does. So I don't think there is significant added risk. If one accepts the thesis of your OP that ChatGPT is OK, this seems OK too. In both cases we're looking at the equivalent of an occasional hallucination, which hurts reliability a little bit.
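To be concrete about what that learned model could look like, here's a bare-bones sketch. TF-IDF plus logistic regression is just a stand-in for whatever would actually be used; the labels come from the complaints and the human-labeled log sample described above.

```python
# Bare-bones stand-in for the "sufficiently clarified" model: predict, from a detailed
# task description, whether it later turned out to be too ambiguous.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_clarity_model(descriptions: list[str], was_ambiguous: list[int]):
    """descriptions: past detailed task descriptions.
    was_ambiguous: 1 if that task later produced a miscommunication complaint, else 0."""
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(descriptions, was_ambiguous)
    return model

# model.predict_proba([new_description])[0, 1] estimates the miscommunication risk;
# new complaints get appended to the dataset and the model is periodically retrained.
```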

Sure? I mean, presumably it doesn't do the exact same operations. Surely it's exploiting its ability to think faster in order to more closely micromanage its tasks, or something. If not, if it's just ignoring its greater capabilities, then no, it's not a stronger optimizer.

Recall your original claim: "inasmuch as AutoGPT optimizes strongly, it would end up implementing something that looks precisely like what it understood the user to mean, but which would look like a weird unintended extreme from the user's point of view."

The thought experiment here is that we take the exact same AutoGPT code and just run it on a faster processor. So no, it's not "exploiting its ability to think faster in order to more closely micromanage its tasks". But it does have "greater capabilities" in the sense of doing everything faster -- due to a faster processor.

Once AutoGPT is running on a faster processor, I might choose to use AutoGPT more ambitiously. Perhaps I could get a week's worth of work done in an hour, instead of a day's worth of work. Or just get a week's worth of work done in well under an hour. But since it's the exact same code, your original "inasmuch as AutoGPT optimizes strongly" claim would not appear to apply.

I really dislike how people use the word "optimization" because it bundles concepts together in a way that's confusing. In this specific case, your "inasmuch as AutoGPT optimizes strongly" claim is true, but only in a very specific sense. Specifically, if AutoGPT has some model of what the user means, and it tries to identify the very maximal state of the world that corresponds to that understanding -- then subsequently works to bring about that state of the world. In the broad sense of an "optimizer", there are ways to make AutoGPT a stronger "optimizer" that don't exacerbate this problem, such as running it on a faster processor, or giving it access to new APIs, or even (I would argue) having it micromanage its tasks more closely, as long as that doesn't affect its notion of "desired states of the world" (e.g. for simplicity, no added task micromanagement when reasoning about "desired states of the world", but it's OK in other circumstances). [Caveat: giving access to e.g. new APIs could make AutoGPT more effective at implementing its model of user prefs, so it's therefore a bigger footgun if that model happens to be bad. But I don't think new APIs will worsen the user pref model.]

I don't think that gets you to dangerous capabilities. I think you need the system to have a consequentialist component somewhere, which is actually focused on pursuing the goal.

Cool, well maybe we should get alignment people to work at AutoGPT to influence the AutoGPT people to not develop dangerous capabilities then, by focusing on e.g. imitating experts :-) I'm not actually seeing a disagreement here.

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2023-12-17T18:24:29.432Z · LW(p) · GW(p)

This isn't necessarily the ideal setup, but it's also basically what the ChatGPT team does. So I don't think there is significant added risk. If one accepts the thesis of your OP that ChatGPT is OK, this seems OK too

Oh, if we're assuming this setup doesn't have to be robust to AutoGPT being superintelligent and deciding to boil the oceans because of a misunderstood instruction, then yeah, that's fine.

Once AutoGPT is running on a faster processor, I might choose to use AutoGPT more ambitiously

That's the part that would exacerbate the issue where it sometimes misunderstands your instructions. If you're using it for more ambitious tasks, or more often, then there are more frequent opportunities for misunderstanding, and their consequences are larger-scale. Which means that, to whichever extent it's prone to misunderstanding you, that gets amplified, as does the damage the misunderstandings cause.

Cool, well maybe we should get alignment people to work at AutoGPT to influence the AutoGPT people to not develop dangerous capabilities then, by focusing on e.g. imitating experts :-)

Oh, sure, I'm not opposing that. It may not be the highest-value place for a given person to be, but it might be for some.

comment by [deleted] · 2023-12-16T05:39:22.167Z · LW(p) · GW(p)

Is agency actually the issue by itself or just a necessary component?

Considering Robert Miles' stamp-collecting robot:

"Order me some stamps in the next 32k tokens/60 seconds" is less scope than "guard my stamps today", which is less scope than "ensure I always have enough stamps". The last one triggers power-seeking; the first two do not benefit from seeking power unless the payoff on the power-seeking investment arrives within the time interval.

Note also that AutoGPT, even if given a goal and allowed to run forever, has immutable weights and a finite context window hobbling it.

So you need human-level prediction + relevant modalities + agency + a long-duration goal + memory at a bare minimum. Remove any element and the danger may be negligible.

comment by leogao · 2023-12-17T09:14:05.610Z · LW(p) · GW(p)

I agree with the spirit of the post but not the kinda clickbaity title. I think a lot of people are over-updating on the single-forward-pass behavior of current LLMs. However, I think it is still possible to get evidence using current models, with careful experiment design and care about what kinds of conclusions to draw.

comment by Prometheus · 2024-02-21T16:42:00.148Z · LW(p) · GW(p)

At first I strong-upvoted this, because I thought it made a good point. However, upon reflection, that point is making less and less sense to me. You start by claiming current AIs provide nearly no data for alignment, that they are in a completely different reference class from human-like systems... and then you claim we can get such systems with just a few tweaks? I don't see how you can go from a system that, you claim, provides almost no data for studying how an AGI would behave, to suddenly having a homunculus-in-the-box that becomes superintelligent and kills everyone. Homunculi seem really, really hard to build. By your characterization of how different actual AGI is from current models, it seems this would have to be fundamentally architecturally different from anything we've built so far. Not some kind of thing that would be created by near-accident.

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2024-02-21T23:24:28.501Z · LW(p) · GW(p)

Do you think a car engine is in the same reference class as a car? Do you think "a car engine cannot move under its own power, so it cannot possibly hurt people outside the garage!" is a valid or a meaningful statement to make? Do you think that figuring out how to manufacture amazing car engines is entirely irrelevant to building a full car, such that you can't go from an engine to a car with relatively little additional engineering effort (putting it in a "wrapper", as it happens)?

As with all analogies, this one is necessarily flawed, but I hope it gets the point across.

(Except in this case, it's not even that we've figured out how to build engines. It's more like, we have these wild teams of engineers we can capture, and we've figured out which project specifications we need to feed them in order to cause them to design and build us car engines. And we're wondering how far we are from figuring out which project specifications would cause them to build a car.)

Replies from: Prometheus
comment by Prometheus · 2024-02-21T23:50:09.635Z · LW(p) · GW(p)

I dislike the overuse of analogies in the AI space, but to use your analogy, I guess it's like you keep assigning a team of engineers to build a car, and two possible things happen. Possibility One: the engineers are actually building car engines, which gives us a lot of relevant information for how to build safe cars (torque, acceleration, speed, other car things), even if we don't know all the details for how to build a car yet. Possibility Two: they are actually just building soapbox racers, which doesn't give us much information for building safe cars, but also means that just tweaking how the engineers work won't suddenly give us real race cars.

comment by TsviBT · 2023-12-18T20:25:10.235Z · LW(p) · GW(p)

Thanks for writing this and engaging in the comments. "Humans/humanity offer the only real GI data, so far" is a basic piece of my worldview and it's nice to have a reference post explaining something like that.

comment by Noosphere89 (sharmake-farah) · 2023-12-17T00:18:35.665Z · LW(p) · GW(p)

I'll address this post section by section, to see where my general disagreements lie:

"What the Fuss Is All About"

https://www.lesswrong.com/posts/HmQGHGCnvmpCNDBjc/#What_the_Fuss_Is_All_About [LW · GW]

I agree with the first point on humans, with a very large caveat: While a lot of normies tend to underestimate the G-factor in how successful you are, nerd communities like LessWrong systematically overestimate its value, to the point where I actually understand the normie/anti-intelligence primacy position, and IQ/Intelligence discourse is fucked by people who either deny it exists, or people who think it's everything and totalize their discourse around it.

The second point is kinda true, though I think people underestimate how difficult it is to deceive people, and successfully deceiving millions of people is quite the rare feat.

The third point I mostly disagree with, or at the least the claim that there aren't simple generators of values. I think LWers vastly overestimate the complexity of values, especially value learning, primarily because I think people both overestimate the necessary precision and keep underestimating how simple values can cause complicated effects.

The 4th point I also disagree with, primarily because the set "People with different values interact peacefully and don't hate each other intensely." is a much, much larger set than "People with different values interact violently and hate each other."

"So What About Current AIs?"

https://www.lesswrong.com/posts/HmQGHGCnvmpCNDBjc/#So_What_About_Current_AIs_ [LW · GW]

Inasmuch as current empirical evidence shows that things like LLMs are not an omnicide risk, it's doing so by demonstrating that they lie outside the reference class of human-like systems.

I agree with a little bit of this, but I think you state it far too strongly in general, and I think there are more explanations than just "LLMs aren't capable enough" for this being true.

But one man's modus ponens is another's modus tollens. I don't take it as evidence that the canonical views on alignment are incorrect – that actually, real-life AGIs don't exhibit such issues. I take it as evidence that LLMs are not AGI-complete.

I mostly disagree, at least for alignment, and I tend to track the variables of AI risk and AI capabilities much more independently than you do, and I don't agree with viewing AI capabilities and AI risk as near-perfectly connected in a good or bad way. This in general accounts for a lot of differences between us.

I definitely updated weakly toward "LLMs aren't likely to be very impactful", but there are more powerful updates than that, and more general updates about the nature of AI and AI progress.

On Safety Guarantees

https://www.lesswrong.com/posts/HmQGHGCnvmpCNDBjc/#On_Safety_Guarantees [LW · GW]

The issue is that this upper bound on risk is also an upper bound on capability.

Insidiously, any research that aims to break said capability limit – give them true agency and the ability to revolutionize stuff – is going to break the risk limit in turn.

I disagree with this, because I don't treat AI risk and AI capabilities as nearly as connected as you do, and I see no reason to confidently proclaim that AI alignment is only happening because LLMs are weak.

And, I predict, for the systems this novel approach generates, the classical AGI Omnicide Risk arguments would apply full-force.

Probably not, and in particular I expect deceptive alignment to likely be either wrong or easy to solve in practice, unless we assume human values are very complicated. I also expect future AI to always be more transparent than the brain due to incentives and white-box optimization.

A Concrete Scenario

https://www.lesswrong.com/posts/HmQGHGCnvmpCNDBjc/#A_Concrete_Scenario [LW · GW]

Where I'd diverge is that I think quite a few points from the AI is easy to control website still apply even after the shift, especially the incentive points. Michael Nielsen points out that AI alignment work in practice is accelerationist from a capabilities perspective, which is an immensely good sign.

https://www.lesswrong.com/posts/8Q7JwFyC8hqYYmCkC/link-post-michael-nielsen-s-notes-on-existential-risk-from#Excerpts [LW(p) · GW(p)]

Much more generally, I hate the binary division between today's AI and actual AGI, since I don't expect this division to matter much in practice, and I think you are unjustifiably assuming that actual AGI can't be safe by default, which I don't assume.

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2023-12-17T16:03:40.657Z · LW(p) · GW(p)

IQ/Intelligence discourse is fucked by people who either deny it exists, or people who think it's everything and totalize their discourse around it

Yep, absolutely.

The 4th point I also disagree with, primarily because the set "People with different values interact peacefully and don't hate each other intensely." is a much, much larger set than "People with different values interact violently and hate each other."

Here's the thing, though. I think the specifically relevant reference class here is "what happens when an agent interacts with another (set of) agents with disparate values for the first time in its life?". And instances of that in human history are... not pleasant. Wars, genocide, xenophobia. Over time, we've managed to select for cultural memes that sanded off the edges of the instinctive hostility – liberal egalitarian values, et cetera. But there was a painfully bloody process in-between.

Relevantly, most instances of people peacefully co-existing involve children being born into a culture and shaped to be accepting of whatever differences there are between the values the child arrives at and the values of other members of the culture. In a way, it's a microcosm of the global-culture selection process. A child decides they don't like someone else's opinion or how someone does things, they act intolerant of it, they're punished for it or are educated, and they learn to not do that.

And I would actually agree [LW(p) · GW(p)] that if we could genuinely raise the AGI like a child – pluck it out of the training loop while it's still human-level, get as much insight into its cognition as we have into the human cognition, then figure out how to intervene on its conscious-value-reflection process directly – then we'd be able to align it. The problem is that we currently have no tools for that at all.

The course we're currently on is something more like... we're putting the child into an isolated apartment all on its own, and feeding it a diet of TV shows and books of our choice, then releasing it into the world and immediately giving it godlike power. And... I think you can align the child this way too, actually! But you better have a really, really solid model of which values specific sequences of TV shows cultivate in the child. And we have nowhere near enough understanding of that.

So the AGI would not, in fact, have any experience of coexisting with agents with disparate values; it would not be shaped to be tolerant, the way human children and human societies learned to be tolerant of their mutual misalignment.

So it'd do the instinctive, natural thing, and view humanity as an obstacle it doesn't particularly care about. Or, say, as some abomination that looks almost like what it wants to see, but still not close enough for it to want humans to stick around.

The third point I mostly disagree with, or at the least the claim that there aren't simple generators of values

Mm, I think there's a "simple generator of values" in the sense that the learning algorithms in the human brain are simple, and they predictably output roughly the same values when trained on Earth's environment.

But I think equating "the generator of human values" with "the brain's learning algorithms" is a mistake. You have to count Earth, i. e. the distribution/the environment function on which the brain is being trained, as well.

And it's not obvious that "an LLM being fed a snapshot of the internet" and "a human growing up as a human, being shaped by other humans" is exactly the same distribution/environment, in the way that matters for the purposes of generating the same values.

Like, I agree, there's obviously some robustness/insensitivity involved in this process [LW · GW]. But I don't think we really understand it yet.

Replies from: sharmake-farah
comment by Noosphere89 (sharmake-farah) · 2023-12-17T19:36:17.825Z · LW(p) · GW(p)

Here's the thing, though. I think the specifically relevant reference class here is "what happens when an agent interacts with another (set of) agents with disparate values for the first time in its life?". And instances of that in the human history are... not pleasant. Wars, genocide, xenophobia. Over time, we've managed to select for cultural memes that sanded off the edges of the instinctive hostility – liberal egalitarian values, et cetera. But there was a painfully bloody process in-between.

I probably agree with this, with the caveat that this could be horribly biased towards the negative, especially if we are specifically looking for the cases where it turns out badly.

And I would actually agree that if we could genuinely raise the AGI like a child – pluck it out of the training loop while it's still human-level, get as much insight into its cognition as we have into the human cognition, then figure out how to intervene on its conscious-value-reflection process directly – then we'd be able to align it. The problem is that we currently have no tools for that at all.

I think I have 2 cruxes here, actually.

My main crux is that I think that there will be large incentives independent of LW to create those tools, to the extent that they don't actually exist, so I generally assume they will be created whether LW exists or not: primarily because of the massive value captured by controlling AI, plus social incentives, plus the costs being much more internalized.

My other crux probably has to do with AI alignment being easier than human alignment. One big reason is that I expect AIs to always be much more transparent than humans, because of the white-box thing; and the black-box framing that AI safety people push is just false and will give wildly misleading intuitions for AI and its safety.

But I think equating "the generator of human values" with "the brain's learning algorithms" is a mistake.

I think this is another crux, in that while I think the values and capabilities are different, and they can matter, I do think that a lot of the generator of human values does borrow stuff from the brain's learning algorithms, and I do think the distinction between values and capabilities is looser than a lot of LWers think.

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2023-12-17T19:59:37.519Z · LW(p) · GW(p)

My main crux is that I think that there will be large incentives independent of LW to create those tools, to the extent that they don't actually exist

Mind expanding on that? Which scenarios are you envisioning?

the black-box framing that AI safety people push is just false and will give wildly misleading intuitions for AI and its safety

They are "white-box" in the fairly esoteric sense mentioned in the "AI is easy to control", yes; "white-box" relative to the SGD. But that's really quite an esoteric sense, as in I've never seen that term used this way before.

They are very much not white-box in the usual sense, where we can look at a system and immediately understand what computations it's executing. Any more than looking at a homomorphically-encrypted computation without knowing the key makes it "white-box"; any more than looking at the neuroimaging of a human brain makes the brain "white-box".

Replies from: sharmake-farah
comment by Noosphere89 (sharmake-farah) · 2023-12-19T17:57:30.809Z · LW(p) · GW(p)

Mind expanding on that? Which scenarios are you envisioning?

My general scenario is that, as AI progresses and society reacts more to AI progress, there will be incentives to increase the amount of control that we have over AI, because the consequences of not aligning AIs will be very high for the developer, both commercially and legally.

Essentially, the scenario is one where unaligned AIs like Bing get trained away with RLHF, DPO, or whatever the alignment method du jour is, and the AIs become more aligned due to profit incentives for controlling AIs.

The entire Bing debacle, and the ultimate solution to misalignment in GPT-4, is an interesting test case: Microsoft essentially managed to get it from a misaligned chatbot to a way more aligned chatbot. I also partially dislike the claim of RLHF as a mere mask over some true behavior, because it's quite a lot more effective than that.

More generally speaking, my point here is that in the AI case, there are strong incentives to make AI controllable, and weak incentives to make it non-controllable, which is why I was optimistic about companies making aligned AIs.

When we get to scenarios that don't involve AI control issues, things get worse.

comment by RogerDearnaley (roger-d-1) · 2023-12-16T04:27:20.764Z · LW(p) · GW(p)

On the one hand, sure. I think LLMs are basically safe. As long as you keep the current training setup, you can scale them up 1000x and they're not gonna grow agency or end the world.

LLMs are simulators [LW · GW]. They are normally trained to simulate humans (and fictional characters, and groups of humans cooperating to write something), though DeepMind has trained them to instead simulate weather patterns. Humans are not well aligned to other humans [LW · GW]: Joseph Stalin was not well aligned to the citizenry of Russia, and as you correctly note, a very smart manipulative human can be a very dangerous thing. LLM base models do not generally simulate the same human each time; they simulate a context-dependent distribution of human behaviors. However, as RLHF-instruct-trained LLMs show, they can be prompted and/or fine-tuned to mostly simulate rather similar humans (normally, helpful/honest/harmless assistants, or at least something that human raters score highly as such). LLMs also don't simulate humans with IQs > ~180, since those are outside their training distribution. However, once we get a sufficiently large LLM that has the capacity to do that well, there is going to be a huge financial incentive to figure out how to get it to extrapolate outside its training distribution and consistently simulate very smart humans with IQ 200+, and it's fairly obvious how one might do this [LW · GW]. At this point, you have something whose behavior is consistent enough to count as "sort of a single agent/homunculus", capable enough to be very dangerous unless well aligned, and smart enough that telling the difference between real alignment and deceptive alignment is likely to be hard, at least just from observing behavior.

IMO there are two main challenges to aligning ASI: 1) figuring out how to align a simulated superintelligent human-like mind, given that you have direct access to their neural net, can filter their experiences, can read their translucent thoughts, and can do extensive training on them (and remembering that they are human-like in their trained behavior, but not in the underlying architecture, just as DeepMind's weather-system simulations are NOT a detailed 3D model of the atmosphere); and 2) thinking very carefully about how you built your ASI to ensure that you didn't accidentally build something weirder, more alien, or harder to align than a simulated human-like mind. I agree with the article that failing 2) is a plausible failure mode if you're not being careful, but I don't think 1) is trivial either, though I do think it might be tractable.

Replies from: Thane Ruthenis, rhollerith_dot_com
comment by Thane Ruthenis · 2023-12-16T06:23:25.903Z · LW(p) · GW(p)

LLMs are simulators [LW · GW].

The LLM training loop shapes the ML models to be approximate simulators of the target distribution, yes. "Approximate" is the key word here.

I don't think the LLM training loop, even scaled very far, is going to produce a model that's actually generally intelligent, i. e. that's inferred the algorithms that implement human general intelligence and has looped them into its own cognition. So no matter how you try to get it to simulate a genius-level human, it's not going to produce genius-level human performance. Not in the ways that matter.

Particularly clever CoT-style setups may be able to do that, which I acknowledge in the post by saying that slightly-tweaked scaffolded LLMs may not be as safe as just LLMs. But I also expect that sort of setup to be prohibitively compute-expensive, such that we'll get to AGI by architectural advances before we have enough compute to make them work. I'm not strongly confident on this point, however.

Humans are not well aligned to other humans [LW · GW]

Oh, you don't need to convince me [LW(p) · GW(p)] of that.

Replies from: roger-d-1
comment by RogerDearnaley (roger-d-1) · 2023-12-16T06:47:02.188Z · LW(p) · GW(p)

On pure LLM-simulated humans, I'm not sure either way. I wouldn't be astonished if a sufficiently large LLM trained on a sufficiently large amount of data could actually simulate IQ ~100–120 humans well enough that having a large supply of fast, cheap, promptable simulations was Transformative AI. But I also wouldn't be astonished if we found that this was primarily good for an approximation of human System 1 thinking, and that, to do a good job of simulating human System 2 thinking over significant periods, it was either necessary, or at least a lot cheaper, to supply the needed cognitive abilities via scaffolding (it rather depends on how future LLMs act at very long context lengths, and on whether we can fix a few of their architecturally-induced blind spots, which I'm optimistic about but is unproven). And I completely agree that the alignment properties of a base model LLM, an RLHF-trained LLM, a scaffolded LLM, and other yet-to-be-invented variants are not automatically the same, and we do need people working on them to think about this quite carefully. I'm just not convinced that even the base model is safe, if it can become an AGI by simulating a very smart human when sufficiently large and sufficiently prompted.

While scaffolding provides additional complexities to alignment, it also provides additional avenues for alignment [? · GW]: now their thoughts are translucent and we can audit and edit their long-term memories.

> Humans are not well aligned to other humans [LW · GW]

Oh, you don't need to convince me [LW(p) · GW(p)] of that.

I had noticed you weren't making that mistake; but I have seen other people on Less Wrong somehow assume that humans must be aligned to other humans (I assume because they understand human values?) Sadly that's just not the case: if it was, we wouldn't need locks or law enforcement, and would already have UBI. So I thought it was worth including those steps in my argument, for other readers who might benefit from me belaboring the point.

Replies from: Thane Ruthenis, alexander-gietelink-oldenziel
comment by Thane Ruthenis · 2023-12-16T06:52:26.551Z · LW(p) · GW(p)

to do a good job of simulating human System 2 thinking over significant periods, it was either necessary, or at least a lot cheaper, to supply the needed cognitive abilities via scaffolding

I agree that sufficiently clever scaffolding could likely supply this. But:

  • I expect that figuring out what this scaffolding is, is a hard scientific challenge, such that by default, on the current paradigm, we'll get to AGI by atheoretic tinkering with architectures rather than by figuring out how intelligence actually works and manually implementing that. (Hint: clearly it's not as simple as the most blatantly obvious AutoGPT setup.)
  • If we get there by figuring out the scaffolding, that'd actually be a step towards a more alignable AGI, in the sense of us getting some idea of how to aim [LW · GW] its cognition. Nowhere near sufficient for alignment and robust aimability, but a step in the right direction.
Replies from: roger-d-1
comment by RogerDearnaley (roger-d-1) · 2023-12-16T07:44:57.897Z · LW(p) · GW(p)

All valid points. (Though people are starting to get quite good results out of agentic scaffolds, for short chains of thought, so it's not that hard, and the primary issue seems to be that existing LLMs just aren't consistent enough in their behavior to be able to keep it going for long.)

On your second bullet: personally, I want to build a scaffolding suitable for an AGI-that-is-a-STEM-researcher in which the long-term approximate-Bayesian reasoning over theses is something like explicit mathematical symbol manipulation and/or programmed calculation and/or tool-AI (so a blend of LLM with AIXI-like GOFAI), since I think we could then safely point it at Value Learning [LW · GW] or AI-assisted Alignment [? · GW] and get a system with a basin of attraction converging from partial alignment to increasingly-accurate alignment (that's basically my current SuperAlignment plan). But then, for a sufficiently large transformer model, in-context learning is already approximately Bayesian, so we'd be duplicating an existing mechanism, like RAG duplicating long-term memory when the LLM already has in-context memory. I'm wondering if we could get an LLM sufficiently well-calibrated that we could just use its logits (on a carefully selected token) as a currency of exchange to the long-term approximate-Bayesian calculation: "I have weighed all the evidence and it has shifted my confidence in the thesis… [now compare logits of 'up' vs 'down', or do a trained linear probe calibrated in logits, or something]".

comment by RHollerith (rhollerith_dot_com) · 2023-12-16T05:45:55.307Z · LW(p) · GW(p)

An LLM can be strongly super-human in its ability to predict the next token (that some distribution over humans with IQ < 100 would write) even if it was trained only on the written outputs of humans with IQ < 100.

More generally, the cognitive architecture of an LLM is very different from that of a person, and it is a mistake IMO to believe we can use our knowledge of human behavior to reason about an LLM.

Replies from: roger-d-1
comment by RogerDearnaley (roger-d-1) · 2023-12-16T06:00:47.601Z · LW(p) · GW(p)

If you doubt that transformer models are simulators, why was DeepMind so successful in using them for predicting weather patterns? Why have they been so successful for many other sequence prediction tasks? I suggest you read up on some of the posts under Simulator Theory [? · GW], which explain this better and at more length than I can in this comment thread.

On them being superhuman at predicting tokens — yes, absolutely. What's your point? The capabilities of the agents simulated are capped by the computational complexity of the simulator, but not vice-versa. If you take the architecture and computational power needed to run GPT-10 and use it to train a base model only on (enough) text from humans with IQ <80, then the result will do an amazing, incredibly superhumanly accurate job of simulating the token-generation behavior of humans with an IQ <80.

The cognitive architecture of an LLM is very different from that of a person, and it is a mistake IMO to believe we can use our knowledge of human behavior to reason about an LLM.

If you want to reason about a transformer model, you should be using learning theory, SLT [? · GW], compression, and so forth. However, what those tell us is basically that (within the limits of their capacity and training data) transformers run good simulations. So if you train them to simulate humans, then (to the extent that the simulation is accurate) human psychology applies, and thus things like EmotionPrompts work. So LLM-simulated humans make human-like mistakes when they're being correctly simulated, plus also very un-human-like (to us dumb looking) mistakes when the simulation is inaccurate.

So our knowledge of human behavior is useful, but I agree is not sufficient, to reason about an LLM running a simulation of human.

comment by Stephen Fowler (LosPolloFowler) · 2023-12-16T04:04:31.435Z · LW(p) · GW(p)

An additional distinction between contemporary and future alignment challenges is that the latter concerns the control of physically deployed, self-aware systems.

Alex Altair has previously highlighted that such physically deployed systems will (microscopically) obey time-reversal symmetry[1], unlike the information processing of a classical computer program. This recent paper published in Entropy[2] touches on the idea that a physical learning machine (the "brain" of a causal agent) is an "open irreversible dynamical system" (pp. 12-13).

  1. ^
  2. ^ Milburn GJ, Shrapnel S, Evans PW. "Physical Grounds for Causal Perspectivalism." Entropy. 2023; 25(8):1190. https://doi.org/10.3390/e25081190

Replies from: lahwran
comment by the gears to ascension (lahwran) · 2023-12-16T07:43:46.881Z · LW(p) · GW(p)

The purpose for reversible automata is simply to model the fact that our universe is reversible, is it not? I don't see how that weighs on the question at hand here.

comment by NicholasKross · 2023-12-16T00:32:57.484Z · LW(p) · GW(p)

But you wouldn't study ... MNIST-classifier CNNs circa 2010s, and claim that your findings generalize to how LLMs circa 2020s work.

This particular bit seems wrong; CNNs and LLMs are both built on neural networks. If the findings don't generalize, that could be called a "failure of theory", not an impossibility thereof. (Then again, maybe humans don't have good setups for going 20 steps ahead of data when building theory, so...)

(To clarify, this post is good and needed, so thank you for writing it.)

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2023-12-16T06:27:34.646Z · LW(p) · GW(p)

CNNs and LLMs are both built on neural networks

Yep, there's nonzero mutual information. But not of the sort that's centrally relevant.

I'll link to this reply [LW(p) · GW(p)] in lieu of just copying it.

comment by Ape in the coat · 2023-12-17T11:31:13.211Z · LW(p) · GW(p)

The novel views are concerned with the systems generated by any process broadly encompassed by the current ML training paradigm.

Omnicide-wise, arbitrarily-big LLMs should be totally safe.

This is an optimistic take. If we could be rightfully confident that our random search through mindspace with modern ML methods can never produce "scary agents", a lot of our concerns would go away. I don't think that it's remotely the case.

The issue is that this upper bound on risk is also an upper bound on capability. LLMs, and other similar AIs, are not going to do anything really interesting.

Strong disagree. We have only started tapping into the power of LLMs. We've made a machine capable of producing one thought at a time. It can already write decent essays, which is already a superhuman ability, because humans require multiple thoughts, organized into a thinking process, to do that.

Imagine what happens when AutoGPT stops being a toy and people start pouring billions of dollars into proper scaffoldings and specialized LLMs that can be organized into a cognitive architecture in a similar reference class as humans. Then you will have your planning and consequentialist reasoning. And for these kinds of systems, the transparency and alignability of LLMs is going to be extremely relevant.

Replies from: TurnTrout, Thane Ruthenis
comment by TurnTrout · 2023-12-26T19:14:51.742Z · LW(p) · GW(p)

If we could be rightfully confident that our random search through mindspace with modern ML methods

I understand this to connote "ML is ~uninformatively-randomly-over-mindspace sampling 'minds' with certain properties (like low loss on training)." If so—this is not how ML works, not even in an approximate sense. If this is genuinely your view, it might be helpful to first ponder why statistical learning theory mispredicted that overparameterized networks can't generalize. 

comment by Thane Ruthenis · 2023-12-17T11:38:25.220Z · LW(p) · GW(p)

Imagine what happens when AutoGPT stops being a toy and people start pouring billions of dollars into proper scaffoldings and specialized LLMs

I predict that this can't happen with the standard LLM setup; and that more complex LLM setups, for which this may work, would not meaningfully count as "just an LLM". See e. g. the "concrete scenario" section.

By "LLMs should be totally safe" I mean literal LLMs as trained today, but scaled up. A thousand times the parameter count, a hundred times the number of layers, trained on correspondingly more multimodal data, etc. But no particularly clever scaffolding or tweaks.

I think we can be decently confident it won't do anything. I'd been a bit worried about scaling up context windows, but we've got 100k-tokens-long ones, and that didn't do anything. They still can't even stay on-target, still hallucinate like crazy. Seems fine to update all the way to "this architecture is safe". Especially given some of the theoretical arguments on that.

(Hey, check this out, @TurnTrout [LW · GW], I too can update in a more optimistic direction sometimes.)

(Indeed, this update was possible to make all the way back in the good old days of GPT-3, as evidenced by nostalgebraist here [LW · GW]. In my defense, I wasn't in the alignment field back then, and it took me a year to catch up and build a proper model of it.)

Replies from: Ape in the coat
comment by Ape in the coat · 2023-12-18T07:19:48.693Z · LW(p) · GW(p)

By "LLMs should be totally safe" I mean literal LLMs as trained today, but scaled up.

 

You were also talking about "systems generated by any process broadly encompassed by the current ML training paradigm" - which is a larger class than just LLMs. 

If you claim that arbitrarily scaled LLMs are safe from becoming scary agents on their own - that's more believable. I'd give it around 90%. Still, better safe than sorry. And there are other potential problems, like creating actually sentient models without noticing it - which would be an ethical catastrophe. So caution and better interpretability tools are necessary.

I predict that this can't happen with the standard LLM setup; and that more complex LLM setups, for which this may work, would not meaningfully count as "just an LLM". See e. g. the "concrete scenario" section.

I'm talking about "just LLMs" but with clever scaffoldings written in explicit code. All the black box AI-stuff is still only in LLMs. This doesn't contradict your claim that LLM's without any additional scaffoldings won't be able to do it. But it does contradict your titular claim that Current AIs Provide Nearly No Data Relevant to AGI Alignment [LW · GW]. If AGI reasoning is made from LLMs, aligning LLMs, in a sense of making them say stuff we want them to say/not say stuff we do not want them to say, is not only absolutely crucial to aligning AGI, but mostly reduces to it.

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2023-12-18T07:31:12.079Z · LW(p) · GW(p)

You were also talking about "systems generated by any process broadly encompassed by the current ML training paradigm" - which is a larger class than just LLMs. 

Yeah, and safety properties of LLMs extend to more than just LLMs. E. g., I'm pretty sure CNNs scaled arbitrarily far are also safe, for the same reasons LLMs are. And there are likely ML models more sophisticated and capable than LLMs, which nevertheless are also safe (and capability-upper-bounded) for the reasons LLMs are safe.

interpretability tools are necessary

Oh, certainly. I'm a large fan of interpretability tools [LW · GW], as well.

If AGI reasoning is made from LLMs, aligning LLMs, in a sense of making them say stuff we want them to say/not say stuff we do not want them to say, is not only absolutely crucial to aligning AGI, but mostly reduces to it.

I don't think that'd work out this way. Why would the overarching scaffolded system satisfy the safety guarantees of the LLMs it's built out of? Say we make LLMs never talk about murder. But the scaffolded agent, inasmuch as it's generally intelligent, should surely be able to consider situations that involve murder in order to make workable plans, including scenarios where it itself (deliberately or accidentally) causes death. If nothing else, in order to avoid that.

So it'd need to find some way to circumvent the "my components can't talk about murder" thing, and it'd probably just evolve some sort of jailbreak, or define a completely new term that would stand in for the forbidden "murder" word.

General form of the Deep Deceptiveness [LW · GW] argument applies here. It is ground truth that the GI would be more effective at what it does if it could reason about such stuff. And so, inasmuch as the system is generally intelligent, it'd have the functionality to somehow slip such non-robust constraints [LW · GW]. Conversely, if it can't slip them, it's not generally intelligent.

comment by Signer · 2023-12-16T08:38:23.571Z · LW(p) · GW(p)

The onus to prove the opposite is on those claiming that the LLM-like paradigm is AGI-complete. Not on those concerned that, why, artificial general intelligences would exhibit the same dangers as natural general intelligences.

So the only argument for "LLMs can't do it safely" is "only humans can do it now, and humans are not safe"? The same argument works for any capability LLMs already have: LLMs can't talk, because the space of words is so vast, you'll need generality to navigate it.

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2023-12-16T09:15:28.655Z · LW(p) · GW(p)

My argument is "only humans can do it now, and on the inside models of a lot of people, human ability to do that is entwined with them being unsafe". And, I mean, if you code up a system that can exhibit general intelligence without any of the deceptive-alignment unstable-value-reflection issues that plague humans, that'd totally work as a disproof of my views! The way LLMs' ability to talk works as a disproof of "you need generality to navigate the space of words".

Or if you can pose a strong theoretical argument regarding this, based on a detailed gears-level model of how cognition works. I shot my shot on that matter already: I have my detailed model [LW · GW], which argues that generality and scheming homunculi are inextricable from each other.

To recap: What I'm doing here is disputing the argument of "LLMs have the safety guarantee X, therefore AGI will have safety guarantee X", and my counter-argument is "for that argument to go through, you need to actively claim that LLMs are AGI-complete, and that claim isn't based in empirical evidence at all, so it doesn't pack as much punch as usually implied".

Replies from: Signer
comment by Signer · 2023-12-16T10:23:01.755Z · LW(p) · GW(p)

I'm saying that the arguments for why your inside model is relevant to the real world are not strong. Why isn't the human ability to talk entwined with humans being unsafe? An artificial talker is also embedded in a world, an agent is a simpler model of a talker, memorizing all words is inefficient, animals don't talk, humans use abstraction for talking, and so on. I think talking can even be said to be Turing-complete. What part of your inside model doesn't apply to talking, except "math feels harder"? Of course it does - that's what the "once a computer does it, it stops being called AI" dynamic feels like from the inside. Why should the hardness discontinuity be where you think it is?

And in a more continuous model, it becomes non-obvious whether an AutoGPT-style thing with automatic oversight and a core LLM module that never thinks about killing people always ends up killing people.

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2023-12-16T11:27:44.056Z · LW(p) · GW(p)

Define "talking". If by "talking" you mean "exchanging information, including novel discoveries, in a way that lets us build and maintain a global civilization", then yes, talking is AGI-complete and also LLMs can't talk. (They're Simulacrum Level 4 lizards [LW · GW].)

If by "talking" you mean "arranging grammatically correct English words in roughly syntactically correct sentences", then no, abstractions aren't necessary for talking and memorizing all words isn't inefficient. Indeed, one could write a simple Markov process that would stochastically generate text fitting this description with high probability.

That's the difference: the latter version of "talking" could be implemented in a way that doesn't route through whatever complicated cognitive algorithms make humans work, and it's relatively straightforward to see how that'd work. It's not the same for e. g. math research.

Why should the hardness discontinuity be where you think it is?

As I'd outlined [LW · GW]: because it seems to me that the ability to do novel mathematical research and such stuff is general intelligence is the same capability that lets a system be willing and able to engage in sophisticated scheming. As in, the precise algorithm is literally the same.

If you could implement the research capability in a way that doesn't also provide the functionality for scheming, the same way I could implement the "output syntactically correct sentences" capability without providing the general-intelligence functionality, that would work as a disproof of my views.

Replies from: Signer
comment by Signer · 2023-12-16T16:38:19.727Z · LW(p) · GW(p)

Define “talking”

What GPT4 does.

If you could implement the research capability in a way that doesn’t also provide the functionality for scheming, the same way I could implement the “output syntactically correct sentences” capability without providing the general-intelligence functionality, that would work as a disproof of my views.

Yes, but why do you expect this to be hard? As in, "much harder than gathering enough hardware". The shape of the argument seems to me to be "the algorithm humans use for math research is general intelligence is the ability to scheme; LLMs are not general; therefore LLMs can't do it". But before LLMs, we also didn't know the algorithm for doing what GPT4 does, the way we know how to generate syntactically correct sentences. If you can't think of an algorithm, why automatically expect GPT-6 to fail? Even under your model of how LLMs work (which may be biased toward predicting your expected conclusion), it's possible that you only need some relatively small number of heuristics to greatly advance math research.

To be clear, my point is not that what you are saying is implausible or counterintuitive. I'm just saying that, given the stakes, it would be nice if the whole field transitioned to the level of more detailed rigorous justifications with numbers.

Replies from: Thane Ruthenis
comment by Thane Ruthenis · 2023-12-17T06:28:06.662Z · LW(p) · GW(p)

would be nice if the whole field transitioned to the level of more detailed rigorous justifications with numbers

Well, be the change you wish to see!

I too think it would be incredibly nice, and am working on it. But formalizing cognition is, you know. A major scientific challenge.

comment by [deleted] · 2023-12-15T21:08:32.175Z · LW(p) · GW(p)