Anti-Slop Interventions?

post by abramdemski · 2025-02-04T19:50:29.127Z · LW · GW · 8 comments

Contents

  AI Slop
  Coherence & Recursive Self-Improvement
  What's to be done?

In his recent post arguing against AI Control research, John Wentworth claims that the median doom path goes through AI slop rather than scheming [LW · GW]. I find this plausible. I believe it suggests a convergence of interests between AI capabilities research and AI alignment research.

Historically, there has been a lot of concern about differential progress [? · GW] amongst AI safety researchers (perhaps especially those I tend to talk to). Some research gets labeled as "capabilities" while other research gets labeled as "safety" (or, more often, "alignment"[1]). Most research is dual-use in practice (i.e., it has both capability and safety implications), and the standard conclusion is that it should therefore be kept secret or disclosed carefully.

Recently, a colleague expressed concern that future AIs will read anything AI safety researchers publish now. Since the alignment of future AIs seems uncertain, perhaps even implausible, almost any information published now could be net harmful.

I argued the contrary case, as follows: a weak form of recursive self-improvement has already started (in the sense that modern LLMs can usefully accelerate the next generation of LLMs in a variety of ways[2]). I assume that this trend will intensify as AI continues to get more useful. Humans will continue to keep themselves at least somewhat in the loop, but at some point, mistakes may be made (by either the AI or the humans) which push things drastically off-course. We want to avoid mistakes like that.

John spells it out more decisively:

The problem is that we mostly don’t die of catastrophic risk from early transformative AI at all. We die of catastrophic risk from stronger AI, e.g. superintelligence (in the oversimplified model). The main problem which needs to be solved for early transformative AI is to use it to actually solve the hard alignment problems of superintelligence.

The key question (on my model) is: does publishing a given piece of information reduce or increase the probability of things going off-course?

Think of it like this. We're currently navigating foreign terrain with a large group of people. We don't have the choice of splitting off from the group; we expect to more-or-less share the same fate, whatever happens. We might not agree with the decision-making process of the group. We might not think we're currently on-course for a good destination. Sharing some sorts of information with the group can result in doom.[3] However, there will be many types of information which will be good to share.

AI Slop

AI slop is a generic derogatory term for AI-generated content, reflecting how easy it is to mass-produce low-quality material full of hallucinations,[4] extra fingers, and other hallmarks of AI generation.

As AI hype continues to increase, I keep attempting to use AI to accelerate my research. While the models are obviously getting better, my experience is that they remain useful only as a sounding board. I often find myself falling into the habit of not even reading the AI outputs, because they have proven worse than useless: when I describe my technical problem and ask for a solution, I get something that looks plausible at first glance, but on close analysis, assumes what is to be proven in one of the proof steps. I'm not exactly sure why this is the case. Generating a correct novel proof should be hard, sure; but checking proofs is easier than generating them. Generating only valid proof steps should be relatively easy.[5]
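
To make the asymmetry concrete, here is a toy sketch (mine, not from the post or from any particular system) of why verification is mechanical even when generation requires search:

```python
# Toy illustration of "checking is easier than generating": verifying that a
# Hilbert-style proof uses only premises and modus ponens is a linear scan over
# the proof, whereas finding such a proof requires search over possible steps.
# Formulas are string atoms or tuples of the form ("->", antecedent, consequent).

def check_proof(premises, proof, goal):
    """Return True iff every line is a premise or follows by modus ponens
    from earlier lines, and the final line is the goal."""
    derived = []
    for line in proof:
        by_modus_ponens = any(
            ("->", antecedent, line) in derived and antecedent in derived
            for antecedent in derived
        )
        if line not in premises and not by_modus_ponens:
            return False  # an unjustified step: the analogue of a "slop" proof
        derived.append(line)
    return bool(proof) and proof[-1] == goal

# Example: from p and p -> q, conclude q.
premises = ["p", ("->", "p", "q")]
assert check_proof(premises, ["p", ("->", "p", "q"), "q"], "q")
assert not check_proof(premises, ["q"], "q")  # q asserted without justification
```

Real proof checking (say, in a proof assistant) is far richer than this, but the asymmetry is the same: each step is validated locally against a fixed set of rules, with no search required.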

These AIs seem strikingly good at conversing about sufficiently well-established mathematics, but the moment I ask for something a little bit creative, the fluent competence falls apart.

Claude 3.5 was the first model whose proofs were good enough to fool me for a little while, rather than being obvious slop. The o1 model seems better, in the sense that its proofs look more convincing and it takes me longer to find the holes in the proofs. I haven't tried o3 yet, but early reports are that it hallucinates a lot, so I mostly expect it to continue the trend of being worse-than-useless in this way.[6]

I'm not denying that these models really are getting better in a broad sense. There's a general pattern that LLMs are much more useful for people with a lower level of expertise in a field, and that waterline continues to rise.[7]

However, even as these models get better, they seem to retain a very strong preference for convincingness over correctness when the two come into conflict. If that pattern persists, it is plausibly a big problem for the future.

Coherence & Recursive Self-Improvement

Recursive self-improvement (RSI) is a tricky business. One wrong move can send you teetering into madness. It is, in a broad sense, the business which leading AI labs are already engaged in.

Again quoting John:

First, some lab builds early transformatively-useful AI. They notice that it can do things like dramatically accelerate AI capabilities or alignment R&D. Their alignment team gets busy using the early transformative AI to solve the alignment problems of superintelligence. The early transformative AI spits out some slop, as AI does. Alas, one of the core challenges of slop is that it looks fine at first glance, and one of the core problems of aligning superintelligence is that it’s hard to verify; we can’t build a superintelligence to test it out on without risking death. Put those two together, add a dash of the lab alignment team having no idea what they’re doing [? · GW] because all of their work to date has focused on aligning near-term AI rather than superintelligence, and we have a perfect storm for the lab thinking they’ve solved the superintelligence alignment problem when they have not in fact solved the problem.

So the lab implements the non-solution, turns up the self-improvement dial, and by the time anybody realizes they haven’t actually solved the superintelligence alignment problem (if anybody even realizes at all), it’s already too late.

To avoid this sort of outcome, it seems like we need to figure out how to make models "coherent" in a fairly broad sense, related both to formal notions of coherence (e.g., probabilistic coherence) and to informal ones. Here are some important-seeming properties to illustrate what I mean (a toy probe sketch follows the list):

  1. Robustness of value-alignment: Modern LLMs can display a relatively high degree of competence when explicitly reasoning about human morality. For this to matter for RSI, however, those concepts need to also come into play appropriately when reasoning about seemingly unrelated things, such as programming. The continued ease of jailbreaking AIs illustrates this property failing (although solving jailbreaking would not necessarily get at the whole property I am pointing at).
  2. Propagation of beliefs: When the AI knows something, it should know it in a way which integrates well with everything else it knows, rather than easily displaying the knowledge in one context while seeming to forget it in another.
  3. Preference for reasons over rationalizations: An AI should be ready and eager to correct its mistakes, rather than rationalizing its wrong answers. It should be truth-seeking, following thoughts where they lead instead of planning ahead to justify specific answers. It should prefer valid proof steps over arriving at a desired answer when the two conflict.
  4. Knowing the limits of its knowledge: Metacognitive awareness of what it knows and what it doesn't know, appropriately brought to bear in specific situations. The current AI paradigm just has one big text-completion probability distribution, so there's not a natural way for it to distinguish between uncertainty about the underlying facts and uncertainty about what to say next -- hence we get hallucinations.
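
As one concrete (and very partial) illustration of properties 2 and 4, here is a minimal sketch of a probabilistic-coherence probe. The helper `query_probability` is hypothetical, standing in for whatever elicitation method you prefer; nothing here is from an existing library or from the post.

```python
# Minimal sketch of a probabilistic-coherence probe: elicit probabilities for a
# claim and its negation in separate contexts and flag violations of
# P(A) + P(not A) = 1. A coherent reasoner should give answers that cohere even
# when it cannot anchor on its own earlier statements.

def query_probability(model, claim: str) -> float:
    """Hypothetical helper: prompt `model` (in a fresh context) to state a
    probability for `claim` and parse the number out of its answer."""
    raise NotImplementedError

def coherence_gap(model, claim: str, negation: str) -> float:
    """Return |P(claim) + P(negation) - 1|. Zero means the stated beliefs are
    probabilistically coherent on this pair; larger values mean they are not."""
    p = query_probability(model, claim)
    q = query_probability(model, negation)
    return abs(p + q - 1.0)

# Usage sketch: probe the same fact phrased two ways, in separate conversations.
# gap = coherence_gap(model,
#                     "The twin prime conjecture is true.",
#                     "The twin prime conjecture is false.")
```

Passing such probes is nowhere near sufficient for the properties above, but failing them is a cheap signal of the kind of incoherence I have in mind.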

All of this is more-or-less a version of the metaphilosophy [? · GW] research agenda, framed in terms of current events in AI. We don't just need to orient AI towards our values; we need to orient AI towards (the best of) the whole human truth-seeking process, including (the best of) moral philosophy, philosophy of science, etc.

What's to be done?

To my knowledge, we still lack a good formal model clarifying what it would even mean to solve the hardest parts of the AI safety problem (e.g., the pointers problem [? · GW]). However, we do have a plausible formal sketch of metaphilosophy: Logical Induction![8]

Logical Induction comes with a number of formal guarantees about its reasoning process. This is something that cannot be said about modern "reasoning models" (which I think are a move in the wrong direction [LW · GW]).
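
For readers who haven't seen it, here is a rough paraphrase of the central criterion (my gloss on Garrabrant et al. 2016, not a quotation from the post):

```latex
% Informal statement of the logical induction criterion (Garrabrant et al., 2016):
% a sequence of price assignments over logical sentences is a logical inductor
% iff no efficiently computable (e.c.) trader exploits it, where "exploits" means
% the trader's plausible accumulated winnings are bounded below but unbounded above.
\overline{\mathbb{P}} \text{ satisfies the logical induction criterion}
\;\iff\;
\neg \exists \text{ e.c. trader } \overline{T} \text{ such that } \overline{T} \text{ exploits } \overline{\mathbb{P}}
```

The paper's guarantees (coherence in the limit, timely learning of statistical patterns, forms of self-trust) all follow from that single non-exploitability condition.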

Can we apply ideas from logical induction to improve the reasoning of modern AI? I think it is plausible. Should we? I think it is plausible.[9]

More generally, this post can be viewed as a continuation of the ideas I expressed in LLMs for Alignment Research: A Safety Priority? [LW · GW] and AI Craftsmanship [LW · GW]. I am suggesting that it might be time for safety-interested people to work on specific capabilities-like things, with a particular eye towards capabilities that can accelerate AI safety research, and more generally towards reducing AI slop.

I believe that scaling up current approaches is not sufficient; it seems important to me to instead understand the underlying causes of the failure modes we are seeing, and design approaches which avoid those failure modes. If we can provide a more-coherent alternative to the current paradigm of "reasoning models" (and get such an alternative to be widely adopted), well, I think that would be good.

Trying to prevent jailbreaking, avoid hallucinations,[4] get models to reason well, etc. are not new ideas. What I see as new here is my argument that the interests of safety researchers and capabilities researchers are aligned on these topics. This argument might move some people to work on "capabilities" or to publish such work when they might not otherwise do so.

Above all, I'm interested in feedback on these ideas. The title has a question mark for a reason; this all feels conjectural to me.

  1. ^

    I have come to prefer "AI safety" as the broader and more descriptive term for research intended to help reduce AI risk. The term "alignment" still has meaning to me, as a synonym for value-loading research, which aims to build agentic AI whose goal-directedness is aimed at human values. However, I think it is quite important to keep one's thoughts as close as possible to the main aim of one's research. To me, safety seems like a better aim than alignment. Alignment is one way to achieve safety, but may not be the best way.

  2. ^

    Approaches such as constitutional AI, RLHF, and deliberative alignment use AI directly to help train AI. LLMs are also useful for programmers, so I imagine they see some use for writing code at the AI labs themselves. More generally, researchers might have conversations with LLMs about their research.

  3. ^

    E.g., maybe the majority of the group thinks that jumping off of cliffs is a good idea, so we don't want to tell the group the direction to the nearest cliff.

  4. ^

    One colleague of mine uses the term "confabulation" rather than the more common "hallucination" -- I think it is a much more fitting term. The so-called hallucinations are (1) in the output of the system rather than the input (confabulations are a behavior, whereas hallucinations are a sensory phenomenon), and (2) verbal rather than visual (hallucinations can be auditory or affect other senses, but the central case people think of is visual). "Confabulation" calls to mind a verbal behavior, which fits the phenomenon being described very well.

    "Confabulation" also seems to describe some details of the phenomenon well; in particular, AI confabulation and human confabulation share patterns of motivated cognition: both will typically try to defend their confabulated stories, rather than conceding in the face of evidence.

    I recognize, unfortunately, that use of the term "hallucination" to describe LLM confabulation has become extremely well-established. However, thinking clearly about these things seems important, and using clear words to describe them aids such clarity.

    Ooh, I found someone else making the same point.

  5. ^

    I'm not saying "logic is simple, therefore generating only valid proof-steps should be simple" -- I understand that mathematicians skip a large number of "obvious" steps when they write up proofs for publication, such that fully formalizing proofs found in a randomly chosen math paper can be quite nontrivial. So, yes, "writing only valid proof steps" is much more complex than simply keeping to the rules of logic.

    Still, most proofs in the training data will be written for a relatively broad audience, so (my argument goes) fluency in discussing the well-established math of a given area should be roughly the level of skill needed to generate only valid proof steps. This is a strong pattern, useful for predicting the data. From this, I would naively predict that LLMs trying to write proofs would produce a bunch of valid steps (perhaps with a few accidental mistakes, rather than strategic ones) and fail to reach the desired conclusion, rather than generating clever-looking but subtly invalid arguments.

    To me, the failure of this prediction requires some explanation. I can think of several possible explanations, but I am not sure which is correct.

  6. ^

    A colleague predicted that o3-pro will still generate subtly flawed proofs, but at that point I'll lose the ability to tell without a team of mathematicians. I disagree: a good proof is a verifiable proof. I can at least fall back on asking o3-pro to generate a machine-checkable version of the proof, and count it as a failure if it is unable to do so.
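
    For concreteness, by "machine-checkable" I mean something a proof assistant will accept. A toy Lean 4 example of the kind of artifact I would ask for (my illustration, not anything a model has produced):

    ```lean
    -- A trivially machine-checkable theorem: the proof assistant, not a human
    -- reader, certifies every step. If the model cannot produce something the
    -- checker accepts, that counts as a failed proof.
    theorem add_comm_example (a b : Nat) : a + b = b + a :=
      Nat.add_comm a b
    ```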

  7. ^

    Although: it is easy to overestimate how quickly that waterline is rising. AI will naturally be optimized to pass the shallower tests of competence, and people will naturally be biased to make generalized predictions about its competence based on those shallow tests. Furthermore, since most people aren't experts in most fields, Gell-Mann Amnesia leads to overestimating AI competence.

  8. ^

    Readers might not be prepared to think about Logical Induction as a solution to metaphilosophy. I don't have the bandwidth to defend this idea in the current essay, but I hope to defend it at some point.

  9. ^

    A number of people I know have long considered it a risk that mainstream AI might take inspiration from Logical Induction to generate capability insights; the argument is that publishing such ideas would be net-negative due to accelerating capabilities.

8 comments

comment by evhub · 2025-02-04T21:28:20.237Z · LW(p) · GW(p)

A lot of this stuff is very similar to the automated alignment research agenda that Jan Leike and collaborators are working on at Anthropic. I'd encourage anyone interested in differentially accelerating alignment-relevant capabilities to consider reaching out to Jan!

comment by Towards_Keeperhood (Simon Skade) · 2025-02-04T20:17:20.855Z · LW(p) · GW(p)

This argument might move some people to work on "capabilities" or to publish such work when they might not otherwise do so.

Above all, I'm interested in feedback on these ideas. The title has a question mark for a reason; this all feels conjectural to me.

My current guess:

I wouldn't expect much useful research to come from having published ideas. It's mostly just going to be used in capabilities and it seems like a bad idea to publish stuff.

Sure, you can work on it, be infosec-cautious, and keep it secret. Maybe share it with a few very trusted people who might actually have some good ideas. And depending on how things play out: if in a couple of years there's some actual joint effort from the leading labs to align AI, and they only have like 2-8 months left before competition hits the AI-improving-AI dynamic quite hard, then you might go to the labs and share your ideas with them (while still trying to keep it closed within those labs - which will probably only work for a few months or a year or so until there's leakage).

Replies from: abramdemski
comment by abramdemski · 2025-02-04T21:55:52.592Z · LW(p) · GW(p)

Do you not at all buy John's model, where there are important properties we'd like nearer-term AI to have in order for those AIs to be useful tools for subsequent AI safety work?

Replies from: Simon Skade
comment by Towards_Keeperhood (Simon Skade) · 2025-02-04T23:00:04.795Z · LW(p) · GW(p)

Can you link me to what you mean by John's model more precisely?

If you mean John's slop-instead-of-scheming post, I agree with the "slop slightly more likely than scheming" part. I might need to reread John's post to see what the concrete suggestions for what to work on might be. Will do so tomorrow.

I'm just pessimistic that we can get any nontrivially useful alignment work out of AIs until a few months before the singularity, at least besides some math. Or like at least for the parts of the problem we are bottlenecked on.

So like I think it's valuable to have AIs that are near the singularity be more rational. But I don't really buy the differentially improving alignment thing. Or like could you make a somewhat concrete example of what you think might be good to publish?

Like, all capabilities will help somewhat with the AI being less likely to make errors that screw up its alignment. Which ones do you think are more important than others? There would have to be a significant difference in the usefulness of some capabilities, because otherwise you could just do the same alignment work later and still have similarly much time to superintelligence (and could get more non-timeline-upspeeding work done).

comment by Towards_Keeperhood (Simon Skade) · 2025-02-04T20:31:26.041Z · LW(p) · GW(p)

What kind of alignment research do you hope to speed up anyway?

For advanced-philosophy-like stuff (e.g. finding good formal representations for world models, or inventing logical induction) they don't seem anywhere remotely close to being useful.

My guess would be that for tiling agents theory they aren't either, but I haven't worked on it, so I'm very curious on your take here. (IIUC, to some extent the goal of tiling-agents-theory-like work was to have an AI solve its own alignment problem. Not sure how far the theory side got there and whether it could be combined with LLMs.)

Or what is your alignment hope in more concrete detail?

Replies from: abramdemski
comment by abramdemski · 2025-02-04T20:48:07.742Z · LW(p) · GW(p)

Yeah, my sense is that modern AI could be useful for tiling-agents stuff if it were less liable to confabulate fake proofs. This generalizes to any technical branch of AI safety where AI could help come up with formalizations of ideas, proofs of conjectures, etc. My thinking suggests there is something of an "overhang" here at present, in the sense that modern AI models are worse-than-useless due to the way they try to create good-looking answers at the expense of correctness.

I disagree with the statement "to some extent the goal of tiling-agents-like work was to have an AI solve its own alignment problem" -- the central thing is to understand conditions under which one agent can justifiably trust another (with "trust" operationalized as whether one agent wants to modify the decision procedure of the other). If AI can't justifiably trust itself, then it has a potential motive to modify itself in ways that remove safety guarantees (so in this sense, tiling is a precondition for lots of safety arguments). Perhaps more importantly, if we can understand conditions under which humans can justifiably trust AI, then we have a formal target for alignment.

Replies from: Simon Skade
comment by Towards_Keeperhood (Simon Skade) · 2025-02-04T21:22:05.571Z · LW(p) · GW(p)

Thanks.

True, I think your characterization of tiling agents is better. But my impression was sorta that this self-trust is an important precursor for the dynamic self-modification case, where alignment properties need to be preserved through the self-modification. Yeah, I guess calling this "AI solving its own alignment" is sorta confused, though maybe there's something in this direction, because the AI still does the search to try to preserve the alignment properties?

Hm I mean yeah if the current bottleneck is math instead of conceptualizing what math has to be done then it's a bit more plausible. Like I think it ought to be feasible to get AIs that are extremely good at proving theorems and maybe also formalizing conjectures. Though I'd be a lot more pessimistic about finding good formal representations for describing/modelling ideas.

Do you think we are basically only bottlenecked on math, so sufficient math skill could carry us to aligned AI? Or do we just have some alignment-philosophy overhang you want to formalize, after which more philosophy will be needed?

Replies from: abramdemski
comment by abramdemski · 2025-02-04T21:53:20.886Z · LW(p) · GW(p)

I think there is both important math work and important conceptual work. Proving new theorems involves coming up with new concepts, but also formalizing the concepts and finding the right proofs. The analogy to robots handling the literal heavy lifting part of a job seems apt.