LLMs seem (relatively) safe

post by JustisMills · 2024-04-25T22:13:06.221Z · LW · GW · 24 comments

This is a link post for https://justismills.substack.com/p/llms-seem-relatively-safe


  LLMs are self limiting
  LLMs are decent at human values
  Playing human roles is pretty human
  And So

Post for a somewhat more general audience than the modal LessWrong reader, but gets at my actual thoughts on the topic.

In 2019, OpenAI Five defeated the reigning world champions of Dota 2, a major esports game. This came a few years after DeepMind’s AlphaGo beat Lee Sedol in 2016, achieving superhuman Go performance well before most people thought that might happen. AI benchmarks were being cleared at a pace which felt breathtaking at the time, papers were proudly published, and ML tools like TensorFlow (released in 2015) were coming online. To people already interested in AI, it was an exciting era. To everyone else, the world was unchanged.

Now Saturday Night Live sketches use sober discussions of AI risk as the backdrop for their actual jokes, there are hundreds of AI bills moving through the world’s legislatures, and Eliezer Yudkowsky is featured in Time Magazine.

For people who have been predicting, since well before AI was cool (and now passé), that it could spell doom for humanity, this explosion of mainstream attention is a dark portent. Billion-dollar AI companies keep springing up and allying with the largest tech companies in the world, and bottlenecks like money, energy, and talent are widening considerably. If current approaches can get us to superhuman AI in principle, it seems like they will in practice, and soon.

But what if large language models, the vanguard of the AI movement, are actually safer than what came before? What if the path we’re on is less perilous than what we might have hoped for, back in 2017? It seems that way to me.

LLMs are self limiting

To train a large language model, you need an absolutely massive amount of data. The core thing these models are doing is predicting the next token (roughly, the next few letters of text), over and over again, and they need to be trained on billions and billions of words of human-generated text to get good at it.

Compare this process to AlphaZero, DeepMind’s algorithm that superhumanly masters Chess, Go, and Shogi. AlphaZero trains by playing against itself. While older chess engines bootstrap themselves by observing the records of countless human games, AlphaZero simply learns by doing. Which means that the only bottleneck for training it is computation - given enough energy, it can just play itself forever, and keep getting new data. Not so with LLMs: their source of data is human-produced text, and human-produced text is a finite resource.

The precise datasets used to train cutting-edge LLMs are secret, but let’s suppose that they include a fair bit of the low hanging fruit: maybe 5% of the text that is in principle publicly available and not garbage. You can schlep your way to a 20x bigger dataset in that case, though you’ll hit diminishing returns as you have to, for example, generate transcripts of random videos and filter old mailing list threads for metadata and spam. But nothing you do is going to get you 1,000x the training data, at least not in the short run.
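
As a back-of-the-envelope check on those numbers, here is a minimal Python sketch. The 5% figure is the guess from the paragraph above, not a measurement, and the function name `max_dataset_growth` is made up for illustration.

```python
def max_dataset_growth(fraction_already_used: float) -> float:
    """Upper bound on how many times larger the dataset could get
    if every remaining piece of usable text were collected."""
    if not 0.0 < fraction_already_used <= 1.0:
        raise ValueError("fraction_already_used must be in (0, 1]")
    return 1.0 / fraction_already_used


# Assumption from the text: about 5% of usable public text is already in use.
ceiling = max_dataset_growth(0.05)
print(f"ceiling on dataset growth: about {ceiling:.0f}x")
print("is a 1,000x dataset reachable?", ceiling >= 1000)
```

The ceiling depends entirely on the guessed fraction: if labs already use 20% of the usable text, the same arithmetic caps growth at 5x.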

Scaling laws are among the watershed discoveries of ML research in the last decade; basically, these are equations that project how much oomph you get out of increasing the size, training time, and dataset that go into a model. And as it turns out, the amount of high quality data is extremely important, and often becomes the bottleneck. It’s easy to take this fact for granted now, but it wasn’t always obvious! If computational power or model size was usually the bottleneck, we could just make bigger and bigger computers and reliably get smarter and smarter AIs. But that only works to a point, because it turns out we need high quality data too, and high quality data is finite (and, as the political apparatus wakes up to what’s going on, legally fraught).
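
To make the shape of that argument concrete, here is a toy sketch in Python. It assumes a loss curve of the form E + A/N^alpha + B/D^beta (N is parameter count, D is training tokens), a form that appears in published scaling-law work. The constants below are illustrative placeholders, not fitted values, and the names `toy_loss` and `data_floor` are invented for this example.

```python
def toy_loss(n_params: float, n_tokens: float,
             E: float = 1.7, A: float = 400.0, B: float = 400.0,
             alpha: float = 0.34, beta: float = 0.28) -> float:
    """Toy scaling law: irreducible loss + model-size term + data term.
    All constants are placeholders chosen for illustration only."""
    return E + A / n_params ** alpha + B / n_tokens ** beta


def data_floor(n_tokens: float, E: float = 1.7, B: float = 400.0,
               beta: float = 0.28) -> float:
    """Loss that remains with this much data even for an infinitely large model."""
    return E + B / n_tokens ** beta


FIXED_TOKENS = 1e12  # hypothetical fixed dataset size

# Grow the model by 10x each step while holding the dataset fixed.
for n_params in (1e9, 1e10, 1e11, 1e12):
    print(f"{n_params:.0e} params -> loss {toy_loss(n_params, FIXED_TOKENS):.3f}")

print(f"floor with this much data: {data_floor(FIXED_TOKENS):.3f}")
```

Under this assumed form, each 10x in model size buys less than the last, and the loss never drops below the floor set by the data term. Only more data moves the floor.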

There are rumblings about synthetic data, that basically a strong LLM can generate a bunch of text that’s as good as human text, and then that can be fed back in to train future models. And while it’s possible that this will work, or even has already been proven to work behind closed doors somewhere, I’m currently skeptical; the whole point of using human-derived data is that human-produced text describes the actual world, and if you slurp up enough of it you end up understanding the world by proxy. Synthetic data would reinforce whatever issues exist in the model, creating text with the same blind spots over and over again, and thus increasing their size. There could be technical solutions to this; again, maybe they’re already underway. But to my nose, as a person not in those private rooms, the notion smells like hype.
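
The worry about blind spots can be illustrated with a deliberately crude toy in Python. This is not a model of real LLM training: "training" here just means memorizing word frequencies, and "generating" means sampling from them. The corpus, the function name `resample_generations`, and all the numbers are made up for illustration.

```python
import random


def resample_generations(corpus: list[str], generations: int, seed: int = 0) -> list[int]:
    """Repeatedly replace the corpus with samples drawn from itself.

    Returns the number of distinct words after each generation,
    starting with the original corpus.
    """
    rng = random.Random(seed)
    current = list(corpus)
    distinct_counts = [len(set(current))]
    for _ in range(generations):
        # "Train" on the current corpus and "generate" a same-sized one:
        # sampling uniformly from the list follows its empirical frequencies.
        current = rng.choices(current, k=len(current))
        distinct_counts.append(len(set(current)))
    return distinct_counts


# One very common word plus 100 rare words that each appear once.
corpus = ["common"] * 900 + [f"rare_{i}" for i in range(100)]
counts = resample_generations(corpus, generations=20)
print("distinct words per generation:", counts)
```

A word that fails to show up in one generation can never come back, so the count of distinct words can only fall. Real synthetic-data pipelines may well have ways around this; the toy only shows why naive recycling is suspect.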

And putting aside synthetic data, we may be in for a slog to get superintelligent LLMs. We can improve model architectures, clean datasets, fine tune, and all the other little things to wring value out of this already valuable technology. But the limiting reactant in LLMs is data, and data is finite.

LLMs are decent at human values

Until now, I’ve been talking about AI capabilities - reasons why modern AI systems may fizzle out before they are powerful enough to immanentize the eschaton. Now I’m going to shift to AI alignment, which I consider much thornier. AI alignment is, basically, AI systems doing what human beings want them to do.

Without getting into the details of why AI alignment is hard and complicated (and boy are there details), one desideratum is an AI system understanding human values in the first place. There are lots of reasons it’s hard to be sure an empowered AI system has our best interests at heart, but with LLMs, at least, it seems plausible such an AI could make a decent guess at what those interests are.

For example, here’s what Claude 3 Opus, a current leading LLM, said when asked to give a single nuanced paragraph about surge pricing:

Surge pricing, a practice commonly used by ride-sharing services and other businesses, raises complex moral questions. On one hand, it can be seen as a fair and efficient way to match supply with demand, encouraging more drivers to work during peak times and ensuring that customers who most need a ride can get one. It could be argued that surge pricing is a voluntary exchange that benefits both parties. However, critics argue that it disproportionately affects low-income individuals who may have fewer transportation alternatives and be more vulnerable to price exploitation. Surge pricing can make essential services unaffordable for some, exacerbating existing inequalities. There are also concerns about the transparency and predictability of surge pricing algorithms. Ultimately, the morality of surge pricing depends on how one weighs these competing considerations of economic efficiency, fairness, and access in the context of essential services that people rely on.

Is it the greatest moral philosophy I’ve ever read? No. But clearly this is a system that has some idea what ethical considerations are, and how they are deployed. I think that, like data as a bottleneck, this is something that’s easy to take for granted in the modern moment. But taking a step back, it’s kind of remarkable: if anything, modern AI is too concerned with following ethical guidelines, with people all over the internet making fun of it for refusing benign requests on ethical grounds.

Now it’s totally possible to train models with no ethical compunctions, or even models (generally with scaffolding) that actively seek to do harm. Furthermore, it’s dangerous to confuse the role a model seems to play through its text with the actual underlying mechanism. Technically, Claude’s paragraph about surge pricing is the result of a system being told it’s about to read a helpful assistant’s answer to a question about surge pricing, and then that system predicting what comes next. So we shouldn’t read too much into the fact that our chatbots can wax poetic on ethics. But nobody expected chatbots that waxed poetic on ethics six years ago! We were still trying to get AI to kick our asses at games! We’re clearly moving in the right direction.

LLMs being able to produce serviceable ethical analyses (sometimes) is also a good sign if the first superhuman AI systems are a bunch of scaffolding around an LLM core. Because in that case, you could have an “ethics module” where the underlying LLM produces text which then feeds into other parts of the system to help guide behavior. I fully understand that AI safety experts, including the one that lives in my heart, are screaming at the top of their lungs right now. But remember, I’m thinking of the counterfactual here: compared to the sorts of things we were worried about ten years ago, the fact that leading AI products could pass a pop quiz on human morality is a clear positive update.
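
For readers who want a picture of what such scaffolding might look like, here is a minimal Python sketch. Everything in it is hypothetical: `ethics_gate`, `run_step`, and the prompt wording are invented for illustration, and `fake_llm` is a keyword-matching stand-in, not a real model or API. It shows only the plumbing and says nothing about whether the verdicts would be trustworthy.

```python
from typing import Callable


def ethics_gate(proposed_action: str, ask_llm: Callable[[str], str]) -> bool:
    """Ask a language model for a one-word verdict on a proposed action."""
    prompt = (
        "You are reviewing a proposed action for ethical problems.\n"
        f"Proposed action: {proposed_action}\n"
        "Answer with exactly one word, APPROVE or REJECT."
    )
    verdict = ask_llm(prompt).strip().upper()
    # Fail closed: anything other than a clear APPROVE blocks the action.
    return verdict == "APPROVE"


def run_step(proposed_action: str,
             ask_llm: Callable[[str], str],
             execute: Callable[[str], None]) -> bool:
    """Execute the action only if the ethics gate approves it."""
    if ethics_gate(proposed_action, ask_llm):
        execute(proposed_action)
        return True
    return False


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call: rejects anything mentioning stealing."""
    return "REJECT" if "steal" in prompt.lower() else "APPROVE"


executed: list[str] = []
run_step("book a taxi to the airport", fake_llm, executed.append)
run_step("steal a car to get to the airport", fake_llm, executed.append)
print(executed)
```

The design choice worth noticing is that the gate fails closed: an unclear or malformed answer blocks the action instead of letting it through.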

Playing human roles is pretty human

Going back to AlphaGo again, one feature of that era was that AI outputs were commonly called alien. We’d get some system that achieved superhuman performance, but it would succeed in weird and unnerving ways. Strategies turned out to dominate that humans had ruled out long ago, as the machine’s tactical sensibility transcended our understanding.

I can imagine a world where AI continues from something like this paradigm, where game-playing AIs gradually expand into more and more modalities. Progress would likely be much slower without the gigantic vein of powerful world-modelling data that is predicting human text, but I can imagine, for example, bots that play chess evolving to bots that play Go evolving into bots with cameras and sensors that play Jenga, and so on, until finally you have bots that engage in goal-directed behavior in the real world in all its generality.

Instead, with LLMs, we show them through our text how the world works, and they express that understanding through impersonating that text. It’s no coincidence that one of the best small LLMs was created for roleplay (including erotic roleplay - take heart, Aella); roleplay is the fundamental thing that LLMs do.

Now, LLMs are still alien minds. They are the first minds we’ve created that can produce human-like text without residing in human bodies, and they arrive at their utterances in very different ways than we do. But trying to think marginally, an alien mental structure that is built specifically to play human roles seems less threatening than an alien mental structure that is built to achieve some other goal, such as scoring a bunch of points or maximizing paperclips.

And So

I think there’s too much meta-level discourse about people’s secret motivations and hypocrisies in AI discussion, so I don’t want to contribute to that. But I am sometimes flummoxed by the reaction of old-school AI safety types to LLMs.

It’s not that there’s nothing to be scared of. LLMs are totally AI, various AI alignment problems do apply to them, and their commercial success has poured tons of gas on the raging fire of AI progress. That’s fair on all counts. But I also find myself thinking, pretty often, that conditional on AI blowing up right now, this path seems pretty good! That LLMs do have a head start when it comes to incorporating human morals, that their mechanism of action is less alien than what came before, and that they’re less prone, relative to self-play agents, to becoming godlike overnight.

Am I personally more or less worried about AI than I was 5 years ago? More. There are a lot of contingent reasons for that, and it’s a story for another time. But I don’t think recent advances are all bad. In fact, when I think about the properties that LLMs have, it seems to me like things could be much worse.


Comments sorted by top scores.

comment by Wei Dai (Wei_Dai) · 2024-04-26T10:29:53.346Z · LW(p) · GW(p)

If something is both a vanguard and limited, then it seemingly can't stay a vanguard for long. I see a few different scenarios going forward:

  1. We pause AI development while LLMs are still the vanguard.
  2. The data limitation is overcome with something like IDA or Debate.
  3. LLMs are overtaken by another AI technology, perhaps based on RL.

In terms of relative safety, it's probably 1 > 2 > 3. Given that 2 might not happen in time, might not be safe if it does, or might still be ultimately outcompeted by something else like RL, I'm not getting very optimistic about AI safety just yet.

comment by quetzal_rainbow · 2024-04-26T08:58:52.142Z · LW(p) · GW(p)

A general meta-problem with such discussions is that the direct counterargument to "LLMs are safe" is to explain how to make an LLM unsafe, which is not good practice.

Replies from: faul_sname, Ape in the coat
comment by faul_sname · 2024-04-27T02:20:22.955Z · LW(p) · GW(p)

Or to point to a situation where LLMs exhibit unsafe behavior in a realistic usage scenario. We don't say

a problem with discussions of fire safety is that a direct counterargument to "balloon-framed wood buildings are safe" is to tell arsonists the best way that they can be lit on fire

Replies from: lahwran
comment by the gears to ascension (lahwran) · 2024-04-27T12:44:54.798Z · LW(p) · GW(p)

buildings are not typically built by arsonists

comment by Ape in the coat · 2024-04-30T07:48:15.711Z · LW(p) · GW(p)

With every technology there is a way to make it stop working. There are any number of ways to make a plane unable to fly. But the important thing is that we know a way to make a plane fly - therefore humans can fly via a plane.

Likewise, the point that LLM-based-architecture can in principle be safe still stands even if there is a way to make an unsafe LLM-based-architecture. 

And this is a huge point. Previously we were in a state where alignment wasn't even a tractable problem. Where capabilities progressed and alignment stayed in the dirt. Where an AI system might understand human values but still not care about them, and we didn't know what to do about it.

But now we can just

have an “ethics module” where the underlying LLM produces text which then feeds into other parts of the system to help guide behavior.

Which makes alignment tractable. Alignment can now be reduced to the capability of the ethics module. We know that the system will care about our values as it understands them because we can explicitly code it to do so via an if-else statement. This is an enormous improvement over the previous status quo.

Replies from: quetzal_rainbow
comment by quetzal_rainbow · 2024-04-30T08:31:44.457Z · LW(p) · GW(p)

I feel like I am a victim of the illusion of transparency. The first part of the OP's argument is "LLMs need data, data is limited and synthetic data is meh". The direct counterargument to this is "here is how to avoid the drawbacks of synthetic data". The second part of the OP's argument is "LLMs are humanlike and will remain so", and the direct counterargument is "here is how to make LLMs more capable but less humanlike, it will be adopted because it makes LLMs more capable". Walking around telling everyone ideas for how to make AI more capable and less alignable is pretty ill-advised.

Replies from: Ape in the coat
comment by Ape in the coat · 2024-04-30T08:44:26.316Z · LW(p) · GW(p)

"here is how to make LLMs more capable but less humanlike, it will be adopted because it makes LLMs more capable". 

Thankfully, this is a class of problems that humanity has experience dealing with. The solution boils down to regulating all the ways to make LLMs less human-like out of existence.

Replies from: quetzal_rainbow
comment by quetzal_rainbow · 2024-04-30T09:34:23.754Z · LW(p) · GW(p)

You mean, "ban superintelligence"? Because superintelligences are not human-like.

That's the problem with your proposal of an "ethics module". Let's suppose that we have a system made of an "ethics module" and a "nanotech design module". The nanotech design module outputs a 3D model of a supramolecular unholy abomination. What exactly should the ethics module do to ensure that this abomination doesn't kill everyone? Tell the nanotech module "pls don't kill people"? You are going to have a hard time translating this into the nanotech designer's internal language. Make the ethics module sufficiently smart to analyse the behavior of complex molecular structures in a wide range of environments? You now have all the problems of aligning superintelligences.

Replies from: Ape in the coat
comment by Ape in the coat · 2024-04-30T14:28:28.800Z · LW(p) · GW(p)

You mean, "ban superintelligence"? Because superintelligences are not human-like.

The kind of superintelligence that doesn't possess the human-likeness that we want it to possess.

That's the problem with your proposal of an "ethics module". Let's suppose that we have a system made of an "ethics module" and a "nanotech design module". The nanotech design module outputs a 3D model of a supramolecular unholy abomination. What exactly should the ethics module do to ensure that this abomination doesn't kill everyone?

The nanotech design module has to be evaluable by the ethics module. For that, it must also be built from multiple sequential LLM calls in explicit natural language. Other types of modules should be banned.

comment by zeshen · 2024-04-26T09:00:17.614Z · LW(p) · GW(p)

Thanks for this post. This is generally how I feel as well, but my (exaggerated) model of the AI alignment community would immediately attack me by saying "if you don't find AI scary, you either don't understand the arguments on AI safety or you don't know how advanced AI has gotten". In my opinion, a few years ago we were concerned about recursively self-improving AIs, and that seemed genuinely plausible and scary. But somehow, they didn't really happen (or haven't happened yet) despite people trying all sorts of ways to make it happen. And instead of an intelligence explosion, what we got was an extremely predictable improvement trend which was a function of only two things, i.e. data + compute. This made me qualitatively update my p(doom) downwards, and I was genuinely surprised that many people went the other way instead, updating upwards as LLMs got better.

Replies from: quetzal_rainbow, lahwran, Seth Herd, JustisMills
comment by quetzal_rainbow · 2024-04-26T12:39:06.033Z · LW(p) · GW(p)

The reason why EY&co were relatively optimistic (p(doom) ~ 50%) before AlphaGo was their assumption that "to build intelligence, you need some kind of insight into the theory of intelligence". They didn't expect that you can just take a sufficiently large approximator, pour data inside, get intelligent behavior and have no idea why you get intelligent behavior.

Replies from: Seth Herd
comment by Seth Herd · 2024-04-29T19:30:08.225Z · LW(p) · GW(p)

That is a fascinating take! I haven't heard it put that way before. I think that perspective is a way to understand the gap between old-school agent foundations folks' high p(doom) and new-school LLMers' relatively low p(doom) - something I've been working to understand, and hope to publish on soon.

To the extent this is true, I think that's great, because I expect to see some real insights on intelligence as LLMs are turned into functioning minds in cognitive architectures.

Do you have any refs for that take, or is it purely a gestalt?

Replies from: quetzal_rainbow
comment by quetzal_rainbow · 2024-04-29T19:39:29.395Z · LW(p) · GW(p)

If it is not a false memory, I've seen this on the Twitter account of either EY or Rob Bensinger, but I'm unlikely to find the source now; it was in the middle of a discussion.

Replies from: Seth Herd
comment by Seth Herd · 2024-04-30T06:26:03.456Z · LW(p) · GW(p)

Fair enough, thank you! Regardless, it does seem like a good reason to be concerned about alignment. If you have no idea how intelligence works, how in the world would you know what goals your created intelligence is going to have? At that point, it is like alchemy - performing an incantation and hoping not just that you got it right, but that it does the thing you want.

comment by the gears to ascension (lahwran) · 2024-04-26T09:47:15.452Z · LW(p) · GW(p)

My p(doom) was low when I was predicting that the Yudkowsky model was ridiculous, due to machine learning knowledge I've had for a while. Now that we have AGI of the kind I was expecting, we have more people working on figuring out what the risks really are, and the fact that the previous concern, that the only way to intelligence was RL, did not pan out seems to be only a small reassurance, because non-imitation-learned RL agents that act in the real world are in fact scary. And recently, I've come to believe much of the risk is still real and was simply never about the kind of AI that has been created first, a kind of AI they didn't believe was possible. If you previously fully believed Yudkowsky, then yes, mispredicting what AI is possible should be an update down. But for me, having seen these unsupervised AIs coming from a mile away just like plenty of others did, I'm in fact still quite concerned about how desperate non-imitation-learned RL agents seem to tend to be by default, and I'm worried that hyperdesperate non-imitation-learned RL agents will be more evolutionarily fit, eat everything, and not even have the small consolation of having fun doing it.

upvote and disagree: your claim is well argued.

Replies from: zeshen
comment by zeshen · 2024-04-26T10:35:03.714Z · LW(p) · GW(p)

I agree with RL agents being misaligned by default, even more so for the non-imitation-learned ones. I mean, even LLMs trained on human-generated data are misaligned by default, regardless of what definition of 'alignment' is being used. But even with misalignment by default, I'm just less convinced that their capabilities would grow fast enough to be able to cause an existential catastrophe in the near-term, if we use LLM capability improvement trends as a reference. 

comment by Seth Herd · 2024-04-29T19:37:51.552Z · LW(p) · GW(p)

Nothing in this post or the associated logic says LLMs make AGI safe, just safer than what we were worried about.

Nobody with any sense predicted runaway AGI by this point in history. There's no update from other forms not working yet.

There's a weird thing where lots of people's p(doom) went up when LLMs started to work well, because they found it an easier route to intelligence than they'd been expecting. If it's easier, it happens sooner and with less thought surrounding it.

See Porby's comment on his risk model for language model agents. It's a more succinct statement of my views.

LLMs are easy to turn into agents, so let's not get complacent. But they are remarkably easy to control and align, so that's good news for aligning the agents we build from them. But that doesn't get us out of the woods; there are new issues with self-reflective, continuously learning agents, and there's plenty of room for misuse and conflict escalation in a multipolar scenario, even if alignment turns out to be dead easy if you bother to try.

comment by JustisMills · 2024-04-27T01:55:52.680Z · LW(p) · GW(p)

Maybe worth a slight update on how the AI alignment community would respond? Doesn't seem like any of the comments on this post are particularly aggressive. I've noticed an effect where I worry people will call me dumb when I express imperfect or gestural thoughts, but it usually doesn't happen. And if anyone's secretly thinking it, well, that's their business!

Replies from: zeshen
comment by zeshen · 2024-04-27T12:09:55.899Z · LW(p) · GW(p)

Definitely. Also, my incorrect and exaggerated model of the community is likely based on the minority who have a tendency to express those comments publicly, against people who might even genuinely deserve those comments.

comment by Vladimir_Nesov · 2024-04-26T06:32:32.638Z · LW(p) · GW(p)

There is enough pre-training text data for $0.1-$1 trillion of compute, if we merely use repeated data and don't overtrain (that is, if we aim for quality, not inference efficiency). If synthetic data from the best models trained this way can be used to stretch raw pre-training data even a few times, this gives something like square of that more in useful compute, up to multiple trillions of dollars.

Issues with LLMs start at autonomous agency, if it happens to be within the scope of scaling and scaffolding. They are thinking too fast, about 100 times faster than humans, and there are as many instances as there is compute. Resulting economic and engineering and eventually research activity will get out of hand. Culture isn't stable, especially for minds fundamentally this malleable developed under unusual and large economic pressures. If they are not initially much smarter than humans and can't get a handle on global coordination, culture drift, and alignment of superintelligence, who knows what kinds of AIs they end up foolishly building within a year or two.

Replies from: avturchin
comment by avturchin · 2024-04-26T11:56:08.024Z · LW(p) · GW(p)

LLMs can now also self-play in adversarial word games, which increases their performance: https://arxiv.org/abs/2404.10642

comment by Thomas Kwa (thomas-kwa) · 2024-04-25T23:09:49.095Z · LW(p) · GW(p)

I don't believe that data is limiting because the finite data argument only applies to pretraining. Models can do self-critique or be objectively rated on their ability to perform tasks, and trained via RL. This is how humans learn, so it is possible to be very sample-efficient, and currently a small proportion of training compute is RL.

If the majority of training compute and data are outcome-based RL, it is not clear that the "Playing human roles is pretty human" section holds, because the system is not primarily trained to play human roles.

Replies from: JustisMills
comment by JustisMills · 2024-04-26T03:29:24.362Z · LW(p) · GW(p)

I think self-critique runs into the issues I describe in the post, though without insider information I'm not certain. Naively it seems like existing distortions would become larger with self-critique, though.

For human rating/RL, it seems true that it's possible to be sample efficient (with human brain behavior as an existence proof), but as far as I know we don't actually know how to make it sample efficient in that way, and human feedback in the moment is even more finite than human text that's just out there. So I still see that taking longer than, say, self play.

I agree that if outcome-based RL swamps initial training run datasets, then the "playing human roles" section is weaker, but is that the case now? My understanding (could easily be wrong) is that RLHF is a smaller postprocessing layer that only changes models moderately, and nowhere near the bulk of their training.

comment by Chris_Leong · 2024-05-05T16:45:27.006Z · LW(p) · GW(p)
  1. "LLMs are self limiting": I strongly disagree with the point that LLMs are limited. If you follow ML discussion online, you'll see that people are constantly finding new ways to draw extra performance out of these models and that it's happening so fast it's almost impossible to keep up. Many of these will only provide small boosts or be exclusive with other techniques, but at least some of these will be scalable.
  2. "LLMs are decent at human values": I agree on your second point. We used to be worried that we'd tell an AI to get coffee and that it would push a kid out of the way. That doesn't seem to be very likely to be an issue these days.
  3. "Playing human roles is pretty human": This is a reasonable point. It seems easier to get an AI that is role-playing a human to actually act human than an AI that is completely alien.