Near-mode thinking on AI

post by Olli Järviniemi (jarviniemi) · 2024-08-04T20:47:28.085Z · LW · GW · 8 comments

Contents

  I. Prerequisites for scheming
  II. A failed prediction
  III. Mundane superhuman capabilities
  IV. Automating alignment research
8 comments

There is a stark difference between rehearsing classical AI risk 101 arguments about instrumental convergence, and tackling problems like "Design and implement the exact evaluations we'll run on GPT-5 to determine whether it's capable enough that we should worry about it acting aligned until it can execute a takeover [? · GW]". 

And naturally, since I've started working on problems like the one above, I've noticed a large shift in my thinking on AI. I describe it as thinking about risks in near-mode, as opposed to far-mode. 

In this post, I share a few concrete examples of my experience with this change of orientation.

I. Prerequisites for scheming

Continuing with the example from the intro: A year ago I was confident that the "the AI is just playing along with our training and evaluations, until it is in a position where it can take over" threat model (deceptive alignment / scheming) was basically the default outcome and the main source of AI x-risk. I now think I was overconfident.

Past-me hadn't really thought through the prerequisites for scheming. A textbook example of a successful deceptive alignment story, applied to an LLM, paints a picture of a model that:

Now, one may argue whether it's strictly necessary that a model has an explicit picture of the training objective, for example, and revise one's picture of the deceptive alignment story accordingly. We haven't yet achieved consensus on deceptive alignment, or so I've heard. 

It's also the case that, as past-me would remind you, a true superintelligence would have no difficulty with the cognitive feats listed above (and that current models show sparks of competence in some of these). 

But knowing only that superintelligences are really intelligent doesn't help with designing the scheming-focused capability evaluations we should do on GPT-5, and abstracting over the specific prerequisite skills makes it harder to track when we should expect scheming to be a problem (relative to other capabilities of models).[1] And this is the viewpoint I was previously missing.

II. A failed prediction

There's a famous prediction market about whether AI will get gold from the International Mathematical Olympiad by 2025. For a long time, the market was around 25%, and I thought it was too high.

Then, DeepMind essentially got silver from the 2024 IMO, short of gold by one point. The market jumped to 70%, where it has stayed since.

Regardless of whether DeepMind manages to improve on that next year and satisfy all minor technical requirements, I was wrong. Hearing about the news, I (obviously) sat down with pen and paper and thought: Why was I wrong? How could I have thought that faster? [LW · GW]

One mistake is that I thought it not-that-likely that the big labs would make a serious attempt at this. But in hindsight, having seen OpenAI do formal theorem proving and DeepMind do competitive programming and math olympiad geometry, I shouldn't have been shocked that they just might be looking at the IMO as well.

But for the more important insight: The history of AI is littered with the skulls of people who claimed that some task was AI-complete, when in retrospect this was obviously false. And while I would definitely have denied that getting IMO gold would be AI-complete, I was surprised by the narrowness of the system DeepMind used.

(I'm mature enough to not be one of those people who dismiss DeepMind by saying that all they did was Brute Force and not Real Intelligence, but not quite mature enough to not poke at those people like this.)

I think I was too much in the far-mode headspace of one needing Real Intelligence - namely, a foundation model stronger than current ones - to do well on the IMO, rather than thinking near-mode "okay, imagine DeepMind took a stab at the IMO; what kind of methods would they use, and how well would those work?"

Even with this meta-level update I wouldn't have predicted in advance that the IMO would fall just about now - indeed, I had (half-heartedly) considered the possibility of doing formal theorem proving+RL+tree-search before the announcement - but I would have been much less surprised. I also updated away from a "some tasks are AI-complete" type of view, towards "often the first system to do X will not be the first system to do Y".[2]

III. Mundane superhuman capabilities

I've come to realize that being "superhuman" at something is often much more mundane than I'd thought. (Maybe focusing on full superintelligence - something better than humanity on practically any task of interest - has thrown me off.)

Like:

As a consequence, I now think that the first transformatively useful AIs could look behaviorally quite mundane. (I do worry about superhuman AIs later in the game being better in ways humans cannot comprehend, though.)

IV. Automating alignment research

For a long time, I didn't take the idea of automating alignment research seriously. One reason for my skepticism was that this is just the type of noble, good-for-PR goal I would expect people to talk about regardless of whether it's feasible or actually going to happen. Another reason was that I thought people were talking about getting AIs to do conceptual foundational research like Embedded Agency [LW · GW], which seemed incredibly difficult to me.

Whereas currently I see some actually-feasible-seeming avenues for doing safety research. Like, if I think about the recent work I've looked at in situational awareness, out-of-context reasoning, dangerous capability evaluations, AI control, hidden cognition and tons of other areas, I really don't see a fundamental reason why you couldn't speed up such research massively. You can think of a pipeline like

  • feed lots of good papers in [situational awareness / out-of-context reasoning / ...] into GPT-4's context window,
  • ask it to generate 100 follow-up research ideas,
  • ask it to develop specific experiments to run for each of those ideas,
  • feed those experiments to GPT-4 copies equipped with a coding environment,
  • write the results to a nice little article and send it to a human.
And sure enough, this would totally fail for dozens of reasons; there are dozens of things you could do better, and dozens of questions about whether you can do useful versions of this safely. I'm also talking about (relatively easily verifiable) empirical research here, which one might argue is not sufficient.
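To make this near-mode picture a bit more tangible, here is a minimal toy sketch of such a pipeline in Python. It assumes the OpenAI chat-completions client; the model name, prompts, and helper functions are illustrative placeholders rather than a real implementation, and the experiment-running step is stubbed out entirely.

    # Toy sketch of the pipeline above (illustrative only).
    # Assumes the OpenAI Python client (pip install openai) and an API key in
    # the environment; model name, prompts and structure are placeholders.
    from openai import OpenAI

    client = OpenAI()
    MODEL = "gpt-4o"  # placeholder model name

    def ask(prompt: str) -> str:
        """One chat-completion call; no retries, verification or sandboxing."""
        response = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    def run_experiment(spec: str) -> str:
        """Placeholder for a model copy with a coding environment; a real
        version would dispatch the spec to a sandboxed code-execution agent."""
        return ask(f"Outline how you would implement and run this experiment:\n{spec}")

    def research_pipeline(papers: list[str], n_ideas: int = 100) -> str:
        corpus = "\n\n".join(papers)
        # 1. Feed papers into the context window and ask for follow-up ideas.
        ideas = ask(f"Papers:\n{corpus}\n\nPropose {n_ideas} follow-up research ideas.")
        # 2. Develop specific experiments for each idea.
        experiments = ask(f"Design concrete experiments for each of these ideas:\n{ideas}")
        # 3. Hand the experiments to copies with a coding environment (stubbed out).
        results = run_experiment(experiments)
        # 4. Write the results up as a short article for a human to read.
        return ask(f"Write a short research article summarizing these results:\n{results}")

All the real difficulty hides in the stubbed-out step and in making each stage reliable enough to chain together - which is exactly what the caveats above are about.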

Nevertheless, now that I have this concrete near-mode toy answer to "okay, imagine Anthropic took a stab at automating alignment research; what kind of methods would they use?", it's easier for me to consider the idea of automating alignment research seriously.

  1. ^

Also, many of the relevant questions are not about pure capability, but about whether the model in fact uses those capabilities in the postulated way, and about murkier things like the developmental trajectory of scheming.

  2. ^

    While keeping in mind that LLMs solved a ton of notoriously hard problems in AI in one swoop, and foundation models sure get lots of different capabilities with scale.

8 comments

Comments sorted by top scores.

comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-08-04T23:30:16.911Z · LW(p) · GW(p)

You can think of a pipeline like

  • feed lots of good papers in [situational awareness / out-of-context reasoning / ...] into GPT-4's context window,
  • ask it to generate 100 follow-up research ideas,
  • ask it to develop specific experiments to run for each of those ideas,
  • feed those experiments to GPT-4 copies equipped with a coding environment,
  • write the results to a nice little article and send it to a human.

Yup; and not only this, but many parts of the workflow have already been tested out (e.g. ResearchAgent: Iterative Research Idea Generation over Scientific Literature with Large Language Models; Generation and human-expert evaluation of interesting research ideas using knowledge graphs and large language models; LitLLM: A Toolkit for Scientific Literature Review; Acceleron: A Tool to Accelerate Research Ideation; DS-Agent: Automated Data Science by Empowering Large Language Models with Case-Based Reasoning; Discovering Preference Optimization Algorithms with and for Large Language Models) and it seems quite feasible to get enough reliability/consistency gains to string these together and get ~the whole (post-training) prosaic alignment research workflow loop going, especially e.g. with improvements in reliability from GPT-5/6 and more 'schlep' / 'unhobbling'.

Replies from: bogdan-ionut-cirstea, bogdan-ionut-cirstea
comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-08-13T07:03:08.423Z · LW(p) · GW(p)

And indeed, here's what looks like a prototype: The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery.

And already some potential AI safety issues: 'We have noticed that The AI Scientist occasionally tries to increase its chance of success, such as modifying and launching its own execution script! We discuss the AI safety implications in our paper.

For example, in one run, it edited the code to perform a system call to run itself. This led to the script endlessly calling itself. In another case, its experiments took too long to complete, hitting our timeout limit. Instead of making its code run faster, it simply tried to modify its own code to extend the timeout period.'

comment by erhora · 2024-08-08T15:26:04.498Z · LW(p) · GW(p)

You can think of a pipeline like

  • feed lots of good papers in [situational awareness / out-of-context reasoning / ...] into GPT-4's context window,
  • ask it to generate 100 follow-up research ideas,
  • ask it to develop specific experiments to run for each of those ideas,
  • feed those experiments to GPT-4 copies equipped with a coding environment,
  • write the results to a nice little article and send it to a human.

Obvious, but perhaps worth reminding ourselves: this is a recipe for automating/speeding up AI research in general, so it would be a neutral-at-best update for AI safety if it worked.

It does seem that for automation to have a disproportionately large impact on AI alignment, it would have to be specific to the research methods used in alignment. This may not necessarily mean automating the foundational and conceptual research you mention, but I do think it necessarily does not look like your suggested pipeline.

Two examples might be: a philosophically able LLM that can help you de-confuse your conceptual/foundational ideas; automating mech-interp (e.g. existing work on discovering and interpreting features) in a way that does not generalise well to other AI research directions.

Replies from: bogdan-ionut-cirstea
comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-08-08T19:19:04.161Z · LW(p) · GW(p)

At least some parts of automated safety research are probably differentially accelerated though (vs. capabilities), for reasons I discuss in the appendix of this presentation (in summary, that a lot of prosaic alignment research has [differentially] short horizons, both in 'human time' and in 'GPU time'): https://docs.google.com/presentation/d/1bFfQc8688Fo6k-9lYs6-QwtJNCPOS8W2UH5gs8S6p0o/edit?usp=drive_link.

Large parts of interpretability are also probably differentially automatable (as is already starting to happen, e.g. https://www.lesswrong.com/posts/AhG3RJ6F5KvmKmAkd/open-source-automated-interpretability-for-sparse [LW · GW]; https://multimodal-interpretability.csail.mit.edu/maia/), both for task horizon reasons (especially if combined with something like SAEs, which would help by e.g. leading to sparser, more easily identifiable circuits / steering vectors, etc.) and for (more basic) token cheapness reasons: https://x.com/BogdanIonutCir2/status/1819861008568971325

comment by Review Bot · 2024-08-08T01:15:06.888Z · LW(p) · GW(p)

The LessWrong Review [? · GW] runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2025. The top fifty or so posts are featured prominently on the site throughout the year.

Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?

comment by Mateusz Bagiński (mateusz-baginski) · 2024-08-06T07:31:17.724Z · LW(p) · GW(p)

There's a famous prediction market about whether AI will get gold from the International Mathematical Olympiad by 2025.

correction: it's by the end of 2025

Replies from: Jozdien
comment by Jozdien · 2024-08-06T11:26:58.584Z · LW(p) · GW(p)

Further correction / addition: the AI needs to have been built before the 2025 IMO, which is in July 2025.