On the future of language models

post by owencb · 2023-12-20T16:58:28.433Z · LW · GW · 17 comments


comment by Seth Herd · 2023-12-22T21:35:48.267Z · LW(p) · GW(p)

This is a fantastic post. Big upvote.

I couldn't agree more with your opening and ending thesis, which you put ever so gently:

the current portfolio is over-indexed on work which treats “transformative AI” as a black box

It seems obvious to me that trying to figure out alignment without talking about AGI designs is going to be highly confusing. It also seems likely to stop short of a decent estimate of the difficulty. It's hard to judge whether a plan is likely to fail when there's no actual plan to judge.  And it seems like any actual plan for alignment would reference a way AGI might use knowledge and make decisions.

 

WRT the language model agent route, you've probably seen my posts, which are broadly in agreement with your take:

Capabilities and alignment of LLM cognitive architectures [LW · GW]

Internal independent review for language model agent alignment [AF · GW]

The second focuses more on the range of alignment techniques applicable to LMAs/LMCAs. I wind up rather optimistic, particularly when the target of alignment is corrigibility or DWIM-and-check [LW · GW].

It seems like even if LMAs achieve AGI, they might progress only slowly beyond the human level of the LLM's training data. That could be a really good thing. I want to think about this more.

I'm unsure how much to publish on possible routes. Right now it seems to me that advancing progress on LMAs is actually a good thing, since they're more transparent and directable than any other AGI approach I can think of. But I don't trust my own judgment when there's been so little discussion from the hardcore alignment-is-hard crowd.

It boggles my mind that posts like this, forecasting real routes to AGI and alignment, don't get more attention and discussion. What exactly are people hoping for as alignment solutions if not work like this?

 

Again, great post, keep it up.

comment by faul_sname · 2023-12-21T02:59:23.270Z · LW(p) · GW(p)

Good post!

In their most straightforward form (“foundation models”), language models are a technology which naturally scales to something in the vicinity of human-level (because it’s about emulating human outputs), not one that naturally shoots way past human-level performance

You address this to some extent later on in the post, but I think it's worth emphasizing the extent to which this specifically holds in the context of language models trained on human outputs. If you take a transformer with the same architecture but train it on a bunch of tokenized output streams of a specific model of weather station, it will learn to predict the next token of the output stream of weather stations, at a level of accuracy that does not particularly have to do with how good humans are at that task.

And in fact for tasks like "produce plausible continuations of weather sensor data, or apache access logs, or stack traces, or nucleotide sequences" the performance of LLMs does not particularly resemble the performance of humans on those tasks.
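
A minimal sketch of that point (hypothetical setup, untuned): the training loop below uses exactly the language-modelling objective, just applied to a quantised synthetic "sensor" stream, so nothing about it anchors performance to human ability.

```python
# Sketch: next-token prediction on non-human data. The objective is identical
# to language modelling; only the token stream differs.
import torch
import torch.nn as nn

# Fake "weather station" stream: a noisy seasonal signal quantised into 256 tokens.
t = torch.arange(0, 10_000, dtype=torch.float32)
signal = torch.sin(t / 24) + 0.1 * torch.randn_like(t)
tokens = ((signal - signal.min()) / (signal.max() - signal.min()) * 255).long()

class TinyCausalLM(nn.Module):
    def __init__(self, vocab=256, d_model=64, n_layers=2, n_heads=4, ctx=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.pos = nn.Embedding(ctx, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, x):
        n = x.size(1)
        h = self.embed(x) + self.pos(torch.arange(n, device=x.device))
        causal_mask = torch.triu(torch.full((n, n), float("-inf"), device=x.device), diagonal=1)
        return self.head(self.blocks(h, mask=causal_mask))  # predict next token from the past only

model = TinyCausalLM()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
for step in range(200):  # tiny demo loop
    idx = torch.randint(0, len(tokens) - 129, (32,))
    batch = torch.stack([tokens[j:j + 129] for j in idx])
    x, y = batch[:, :-1], batch[:, 1:]
    loss = nn.functional.cross_entropy(model(x).transpose(1, 2), y)
    opt.zero_grad(); loss.backward(); opt.step()
```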

comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2023-12-22T13:30:11.710Z · LW(p) · GW(p)

I’m not at all confident what people who are concerned about navigating AI well should be doing. But I feel that the current portfolio is over-indexed on work which treats “transformative AI” as a black box and tries to plan around that. I think that we can and should be peering inside that box.

I’d like to better understand the plausibility of the kind of technological trajectory I’m outlining. I’d like to develop a better sense of how the different risks relate to this. And I’d like to see some plans which step through how we might successfully navigate the different phases of this technological development. I think that this is a kind of zoomed-in prioritization which could help us to keep our eyes on the most important balls, and which we haven’t been doing a great deal of.

Agree. I think there are pretty strong reasons to believe that, with a concerted effort, we can very likely (> 90% probability) build safe scaffolded LM agents capable of automating ~all human-level alignment research while also being incapable of doing non-trivial consequentialist reasoning in a single forward pass. I'm also (still) looking for collaborators for this related research agenda on evaluating the promise of automated alignment research.
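
For concreteness, here is a toy sketch of the kind of scaffolded agent being gestured at (not anyone's actual agenda; `call_llm` and the prompt are placeholders): each forward pass is asked for one small step, and all intermediate reasoning accumulates in an external, inspectable trace rather than inside a single opaque call.

```python
# Toy scaffolded-agent loop (illustrative only). `call_llm` is a placeholder
# for one bounded forward pass of the underlying model.
from dataclasses import dataclass, field

def call_llm(prompt: str) -> str:
    """Placeholder for whatever API the underlying model exposes."""
    raise NotImplementedError

@dataclass
class Trace:
    steps: list[str] = field(default_factory=list)  # every intermediate thought is logged

def run_scaffolded_task(task: str, max_steps: int = 20) -> tuple[str, Trace]:
    trace = Trace()
    state = f"Task: {task}\nNo work done yet."
    for _ in range(max_steps):
        # Each call is asked for exactly one small, checkable step.
        step = call_llm(
            "You are one step in a larger process. Given the state below, "
            "output the single next step, or 'DONE: <answer>' if finished.\n" + state
        )
        trace.steps.append(step)  # reasoning is external and reviewable
        if step.startswith("DONE:"):
            return step.removeprefix("DONE:").strip(), trace
        state += "\n" + step
    return "No answer within budget.", trace
```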

comment by RogerDearnaley (roger-d-1) · 2023-12-22T09:34:21.324Z · LW(p) · GW(p)

In their most straightforward form (“foundation models”), language models are a technology which naturally scales to something in the vicinity of human-level (because it’s about emulating human outputs), not one that naturally shoots way past human-level performance

For a more detailed analysis of how this problem could be overcome, and why doing so is unlikely to be a fast process, see my post LLMs May Find it Hard to FOOM [LW · GW]. (Later parts of your post have some overlap with that, but there are some specifics, such as conditioning and extrapolation, that you don't discuss, so readers will find some more useful content there.)

comment by Oliver Sourbut · 2023-12-21T11:49:37.977Z · LW(p) · GW(p)

I think there are two really important applications, which have the potential to radically reshape the world:

  • Research
    • The ability to develop and test out new ideas, adding to the body of knowledge we have accumulated
    • Automating this would be a massive deal for the usual reasons about feeding back into growth rates, facilitating something like a singularity
      • In particular the automation of further AI development is likely to be important
    • There are many types of possible research, and automation may look quite different for e.g. empirical medical research vs fundamental physics vs political philosophy
      • The sequence in which we get the ability to automate different types of research could be pretty important for determining what trajectory the world is on
  • Executive capacity
    • The ability to look at the world, form views about how it should be different, and form and enact plans to make it different
    • (People sometimes use “agency” to describe a property in this vicinity)
    • This is the central thing that leads to new things getting done in the world. If this were fully automated we might have large fully autonomous companies building more and more complex things towards effective purposes.
    • This is also the thing which, (if/)when automated, creates concerns about AI takeover risk.

 

I agree. I tentatively think (and have been arguing in private for a while) that these are 'basically the same thing'. They're both ultimately about

  • forming good predictions on the basis of existing models
  • efficiently choosing 'experiments' to navigate around uncertainties
    • (and thereby improve models!)
  • using resources (inc. knowledge) to acquire more resources

They differ (just as research disciplines differ from other disciplines, and executing in one domain differs from other domains) in the specifics, especially on what existing models are useful and the 'research taste' required to generate experiment ideas and estimate value-of-information. But the high level loop is kinda the same.

Unclear to me what these are bottlenecked by, but I think the latent 'research taste' may be basically it (potentially explains why some orgs are far more effective than others, why talented humans take a while to transfer between domains, why mentorship is so valuable, why the scientific revolution took so long to get started...?)
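
A toy rendering of that shared loop (purely illustrative; the model, the candidate 'experiments', and the VOI estimate are all placeholder callables): predict, pick the action with the best guessed value-of-information, act, update, repeat.

```python
# Illustrative skeleton of the "big two" loop: predict -> choose experiment by
# estimated value-of-information -> act -> update models. All components are placeholders.
from typing import Any, Callable

def research_or_executive_loop(
    model: Any,                                      # current beliefs / world model
    candidate_actions: Callable[[Any], list[Any]],   # generate experiments or plan steps
    estimate_voi: Callable[[Any, Any], float],       # "research taste": guess info value + payoff
    execute: Callable[[Any], Any],                   # run the experiment / enact the plan step
    update: Callable[[Any, Any, Any], Any],          # incorporate the observation into the model
    budget: int = 100,
) -> Any:
    for _ in range(budget):
        actions = candidate_actions(model)
        if not actions:
            break
        # The crux: beyond the frontier of the well-understood, this scoring
        # bottoms out in heuristic guesses rather than solid predictions.
        best = max(actions, key=lambda a: estimate_voi(model, a))
        observation = execute(best)
        model = update(model, best, observation)
    return model
```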

comment by Oliver Sourbut · 2023-12-21T12:10:02.536Z · LW(p) · GW(p)

In particular, the 'big two' are both characterised by driving beyond the frontier of the well-understood, which means that by necessity they're about deliberately and efficiently setting up informative/serendipitous scenarios to get novel, informative data. When you're navigating beyond the well-understood, you have to bottom out your plans with heuristic guesses about VOI, and you have to make plans which (at least sometimes) have good VOI. Those guesses have to ground out somewhere, and that's the 'research taste' at the system-1-ish level.

comment by Oliver Sourbut · 2023-12-21T11:53:59.085Z · LW(p) · GW(p)

I think it’s most likely that for a while centaurs will significantly outperform fully automated systems

Agree, and a lot of my justification comes from this feeling that 'research taste' is quite latent, somewhat expensive to transfer, and a bottleneck for the big two.

comment by Andy_McKenzie · 2023-12-21T01:34:24.366Z · LW(p) · GW(p)

Very high-effort, comprehensive post. Any interest in making some of your predictions into markets on Manifold or some other prediction market website? That might help get some quantification.

comment by RogerDearnaley (roger-d-1) · 2023-12-22T10:22:40.233Z · LW(p) · GW(p)

At tasks like “give a winning chess move”, we can generate high quality synthetic data so that it’s likely that we can finetune model performance to exceed top human intuitive play.

With some more effort, this also applies to "prove this mathematical conjecture" (using automated proof checkers like Lean) and, with suitably large and well-designed automated test suites, to "write code to solve this problem". These areas seem broad enough that scaling them up to far-superhuman levels, as well as being inherently useful, might also carry over to other tasks requiring rational and logical thinking. This would probably also be an ideal forum in which to work on solutions to the 'drunkenness' issue.
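
A sketch of the generate-and-verify recipe being pointed at (all helper names are hypothetical): sample candidate solutions, keep only those an external checker accepts (a test suite, or a proof checker like Lean), and use the survivors as finetuning data, so data quality is bounded by the verifier rather than by human play.

```python
# Sketch: building verified synthetic finetuning data. `sample_candidates` and
# `verify` are placeholders for a model-sampling call and an external checker
# (e.g. running tests in a sandbox, or invoking a proof checker).
from typing import Callable

def make_verified_dataset(
    problems: list[str],
    sample_candidates: Callable[[str, int], list[str]],  # model proposes k solutions per problem
    verify: Callable[[str, str], bool],                   # external ground truth: tests / proof checker
    k: int = 16,
) -> list[tuple[str, str]]:
    dataset = []
    for problem in problems:
        for candidate in sample_candidates(problem, k):
            if verify(problem, candidate):                # only verified outputs become training data
                dataset.append((problem, candidate))
                break                                     # one verified solution per problem suffices here
    return dataset

# The resulting (problem, verified_solution) pairs can then be used for finetuning;
# iterating this loop is the familiar expert-iteration pattern.
```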

comment by RogerDearnaley (roger-d-1) · 2023-12-22T10:10:09.737Z · LW(p) · GW(p)

1) seems like mostly a sideshow — while we could get agency from this, unless people are trying hard I don’t think it would tend to find especially competent agents to emulate, and may not have a good handle on what’s going on in the world.

I'm very puzzled by this opinion. If we can reduce the 'drunkenness' issue, this type of agency scales to at least the competence level of the most competent humans (or indeed, fictional characters) in existence, and probably at least some distance beyond by extrapolation (and can be run cheaply at faster than realtime). These agents are not safe: humans are not fully aligned to human values [LW · GW], power corrupts, and Joseph Stalin was not well aligned with the needs of the citizenry of Russia. This seems like plenty to be concerned about, rather than a sideshow. Now, the ways in which they're not aligned are at least ones we have a good intuitive and practical understanding of, and some partial solutions for controlling (things like love, guilt, salaries, and law enforcement).

comment by Oliver Sourbut · 2023-12-21T13:32:44.236Z · LW(p) · GW(p)

I’m hazier on the details of how this would play out (and a bit sceptical that it would enable a truly runaway feedback loop), but more sophisticated systems could help to gather the real-world data to make subsequent finetuning efforts more effective.

On the contrary, I think proactive gathering of data is very plausibly the bottleneck, and (smarts) -> (better data gathering) -> (more smarts) is high on my list of candidates for the critical feedback loop.

In a world where the 'big two' (R&D and executive capacity) are characterised by driving beyond the frontier of the well-understood [LW(p) · GW(p)] it's all about data gathering and sample-efficient incorporation of the data.

FWIW I don't think vanilla 'fine tuning' necessarily achieves this, but coupled with retrieval augmented generation and similar scaffolding, incorporation of new data becomes more fluent.
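
A minimal sketch of the kind of fluent incorporation being described (illustrative only; `embed` and `call_llm` are stand-ins for an embedding model and an LLM API): newly gathered data goes straight into a retrieval store and is pulled back into the prompt on demand, with no gradient update required.

```python
# Minimal RAG sketch (illustrative). New observations become usable immediately
# via retrieval, without any finetuning step.
import numpy as np
from typing import Callable

class RetrievalStore:
    def __init__(self, embed: Callable[[str], np.ndarray]):
        self.embed = embed
        self.texts: list[str] = []
        self.vectors: list[np.ndarray] = []

    def add(self, text: str) -> None:
        # Incorporate freshly gathered data on the spot.
        self.texts.append(text)
        self.vectors.append(self.embed(text))

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        q = self.embed(query)
        scores = [float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-8))
                  for v in self.vectors]
        top = np.argsort(scores)[::-1][:k]
        return [self.texts[i] for i in top]

def answer_with_retrieval(store: RetrievalStore,
                          call_llm: Callable[[str], str],
                          question: str) -> str:
    context = "\n".join(store.retrieve(question))
    return call_llm(f"Use the notes below if relevant.\n{context}\n\nQuestion: {question}")
```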

comment by Roman Leventov · 2023-12-20T22:16:35.266Z · LW(p) · GW(p)

Notable techniques for getting value out of language models that are not mentioned:

comment by Roman Leventov · 2023-12-20T22:36:18.792Z · LW(p) · GW(p)

Also, I would say, retrieval-augmented generation (RAG) is not just a mundane way to industrialise language models, but an important concept whose properties should be studied separately from scaffolding, fine-tuning, or the other techniques that I listed in the comment above.

comment by owencb · 2023-12-20T23:46:42.255Z · LW(p) · GW(p)

Thanks. At a first look at what you're saying, I understand these to be subcategories of using finetuning or scaffolding (in the case of leveraging semantic knowledge graphs) in order to get useful tools. But I don't understand the sense in which you think finetuning in this context has completely different properties. Do you mean different properties from the point where I discuss agency entering via finetuning? If so, I agree.

(Apologies for not having thought this through in greater depth.)

comment by Roman Leventov · 2023-12-21T17:19:41.288Z · LW(p) · GW(p)

I think you have tied yourself too closely to the strict binary classification that you invented (finetuning/scaffolding). You overgeneralise, and the classification obscures the picture more than it clarifies it.

All the different things that can be done with LLMs: tool use, scaffolded reasoning aka LM agents, RAG, fine-tuning, semantic knowledge graph mining, reasoning over semantic knowledge graphs, finetuning to follow a "virtue" (persona, character, role, style, etc.), finetuning for model checking, finetuning for theorem-proving heuristics, finetuning for generating causal models (what else?); these just don't fit easily into two simple categories whose properties are consistent within each category.

But I don't understand the sense in which you think finetuning in this context has completely different properties.

In the summary (note: I've only read the summary, not the rest of the post), you write something that implies that finetuning is obscure or uninterpretable:

From a safety perspective, language model agents whose agency comes from scaffolding look greatly superior than ones whose agency comes from finetuning

  • Because you can get an extremely high degree of transparency by construction

But this totally doesn't apply to the other variants of finetuning that I mentioned. If the LLM is a heuristic engine that generates mathematical proofs which are later verified with Lean, it just stops making sense to ask how interpretable or transparent this theorem-proving or model-checking LLM-based heuristic engine is.

comment by Ape in the coat · 2023-12-21T13:25:44.957Z · LW(p) · GW(p)

Strong upvote. We are definitely not talking enough about what Scaffolded Language Model Agents mean for AI alignment. They are the light of hope: interpretable-by-design systems with tractable alignment and slow-takeoff potential.

One possibility that arises as part of a mixed takeoff is using machine learning to optimize for the most effective scaffolding.

This should be forbidden. Turning scaffolding that is explicitly written in code into another black box would not only greatly damage interpretability but also pose a huge risk of accidentally creating a sentient entity without noticing it. Scaffolding for LMAs serves a role very similar to that of consciousness for humans, so we should be very careful in this regard.
