They haven't proven any theorems that anyone cares about. They haven't written anything that anyone will want to read in ten years (or even one year). Despite apparently memorizing more information than any human could ever dream of, they have made precisely zero novel connections or insights in any area of science[3].
An anecdote I heard through the grapevine: some chemist was trying to synthesize some chemical. He couldn't get some step to work, and tried for a while to find solutions on the internet. He eventually asked an LLM. The LLM gave a very plausible causal story about what was going wrong and suggested a modified setup which, in fact, fixed the problem. The idea seemed so humdrum that the chemist thought, surely, the idea was actually out there in the world and the LLM had scraped it from the internet. However, the chemist continued searching and, even with the details in hand, could not find anyone talking about this anywhere. Weak conclusion: the LLM actually came up with this idea due to correctly learning a good-enough causal model generalizing not-very-closely-related chemistry ideas in its training set.
Weak conclusion: there are more than precisely zero novel scientific insights in LLMs.
> now that AI systems are already increasingly general
I want to point out that if you tried to quantify this properly, the argument falls apart (at least in my view). "All AI systems are increasingly general" would be false; there are still many useful but very narrow AI systems. "Some AI systems" would be true, but this highlights the continuing usefulness of the distinction.
One way out of this would be to declare that only LLMs and their ilk count as "AI" now, with more narrow machine learning just being statistics or something. I don't like this because of the commonality of methods between LLMs and the rest of ML; it is still deep learning (and in many cases, transformers), just scaled down in every way.
Btw tbc, sth that I think slightly speeds up AI capability but is good to publish is e.g. producing rationality content for helping humans think more effectively (and AIs might be able to adopt the techniques as well). Creating a language for rationalists to reason in more Bayesian ways would probably also be good to publish.
Yeah, basically everything I'm saying is an extension of this (but obviously, I'm extending it much further than you are). We don't exactly care whether the increased rationality is in humans or AI, when the two are interacting a lot. (That is, so long as we're assuming scheming is not the failure mode to worry about in the shorter-term.) So, improved rationality for AIs seems similarly good. The claim I'm considering is that even improving rationality of AIs by a lot could be good, if we could do it.
An obvious caveat here is that the intervention should not dramatically increase the probability of AI scheming!
Belief propagation seems too much of a core of AI capability to me. I'd rather place my hope on GPT7 not being all that good yet at accelerating AI research and us having significantly more time.
This just seems doomed to me. The training runs will be even more expensive, the difficulty of doing anything significant as an outsider ever-higher. If the eventual plan is to get big labs to listen to your research, then isn't it better to start early? (If you have anything significant to say, of course.)
Right, my point is, I don’t see any difference between “AIs that produce slop” and “weak AIs” (a.k.a. “dumb AIs”). So from my perspective, the above is similar to: “…Because weak AIs can speed up AI capabilities much easier than they can produce actually good alignment ideas.”
I want to explicitly call out my cliff vs gentle slope picture from another recent comment. Sloppy AIs can have a very large set of tasks at which they perform very well, but they have sudden drops in their abilities due to failure to extrapolate well outside of that.
So, rather than imagining a one-dimensional "capabilities" number, let's imagine a landscape of things you might want to be able to get AIs to do, with a numerical score for each. In the center of the landscape is "easier" things, with "harder" things further out. There is some kind of growing blob of capabilities, spreading from the center of the landscape outward.
Techniques which are worse at extrapolating (IE worse at "coherent and correct understanding" of complex domains) create more of a sheer cliff in this landscape, where things go from basically-solved to not-solved-at-all over short distances in this space. Techniques which are better at extrapolating create more of a smooth drop-off instead. This is liable to grow the blob a lot faster; a shift to better extrapolation sees the cliffs cast "shadows" outwards.
My claim is that cliffs are dangerous for a different reason, namely that people often won't realize when they're falling off a cliff. The AI seems super-competent for the cases we can easily test, so humans extrapolate its competence beyond the cliff. This applies to the AI as well, if it lacks the capacity for detecting its own blind spots. So RSI is particularly dangerous in this regime, compared to a regime with better extrapolation.
This is very analogous to early Eliezer observing the AI safety problem and deciding to teach rationality. Yes, if you can actually improve people's rationality, they can use their enhanced capabilities for bad stuff too. Very plausibly the movement which Eliezer created has accelerated AI timelines overall. Yet, it feels plausible that without Eliezer, there would be almost no AI safety field.
Two years later, GPT7 comes up with superhumanly-convincing safety measures XYZ. These inadequate standards become the dominant safety paradigm. At this point if you try to publish "belief propagation" it gets drowned out in the noise anyway.
Some relatively short time later, there are no humans.
I think that, if there are no humans, then slop must not be too bad. AIs that produce incoherent superficially-appealing slop are not successfully accomplishing ambitious nontrivial goals right?
Maybe "some relatively short time later" was confusing. I mean long enough for the development cycle to churn a couple more times.
IE, GPT7 convinces people of sloppy safety measures XYZ, people implement XYZ and continue scaling up AGI, the scaled-up superintelligence is a schemer.
(Or maybe you’re treating it as a “capabilities elicitation” issue? Like, the AI knows all sorts of things, but when we ask, we get sycophantic slop answers? But then we should just say that the AI is mediocre in effect. Even if there’s secretly a super-powerful AI hidden inside, who cares? Unless the AI starts scheming, but I thought AI scheming was out-of-scope for this post.)
I do somewhat think of this as a capabilities elicitation issue. I think current training methods are eliciting convincingness, sycophantism, and motivated cognition (for some unknown combination of the obvious reasons and not-so-obvious reasons).
But, as clarified above, the idea isn't that sloppy AI is hiding a super-powerful AI inside. It's more about convincingness outpacing truthfulness. I think that is a well-established trend. I think many people expect "reasoning models" to reverse that trend. My experience so far suggests otherwise.
I would have said “More powerful AI (if aligned) helps everybody make fewer mistakes. Less powerful AI convinces lots of people to make more mistakes.” Right?
What I'm saying is that "aligned" isn't the most precise concept to apply here. If scheming is the dominant concern, yes. If not, then the precisely correct concept seems closer to the "coherence" idea I'm trying to gesture at.
I've watched (over Discord) a developer get excited about a supposed full-stack AI development tool which develops a whole application for you based on a prompt. They tried a few simple examples and exclaimed that it was like magic. Then, over the course of a few more hours, they issued progressive updates of "I'm a little less excited now," until they had updated to a very low level of excitement and decided that it seemed like magic mainly because it had been optimized to work well for the sorts of simple examples developers might try first when putting it through its paces.
I'm basically extrapolating that sort of thing forward, to cases where you only realize something was bad after months or years instead of hours. As development of these sorts of tools continues to move forward, they'll start to succeed in impressing on the days & weeks timespan. A big assumption of my model is that to do that, they don't need to fundamentally solve the bad-at-extrapolation problem (hallucinations, etc); they can instead do it in a way that goodharts on the sorts of feedback they're getting.
Alignment is broad enough that I can understand classifying this sort of failure as "alignment failure" but I don't think it is the most precise description.
If the AI is producing slop, then why is there a self-improvement dial? Why wouldn’t its self-improvement ideas be things that sound good but don’t actually work, just as its safety ideas are?
This does seem possible, but I don't find it probable. Self-improvement ideas can be rapidly tested for their immediate impacts, but checking their long-term impacts is harder. Therefore, AI slop can generate many non-working self-improvements that just get discarded and that's fine; it's the apparently-working self-improvement ideas that cause problems down the line. Similarly, the AI itself can more easily train on short-term impacts of proposed improvements; so the AI might have a lot less slop when reasoning about these short-term impacts, due to getting that feedback.
(Notice how I am avoiding phrasing it like "the sloppy AI can be good at capabilities but bad at alignment because capabilities are easier to train on than alignment, due to better feedback". Instead, focusing on short-term impacts vs long-term impacts seems to carve closer to the joints of reality.)
Sloppy AIs are nonetheless fluent with respect to existing knowledge or things that we can get good-quality feedback for, but have trouble extrapolating correctly. Your scenario, where the sloppy AI can't help with self-improvement of any kind, suggests a world where there is no low-hanging fruit via applying existing ideas to improve the AI, or applying the kinds of skills which can be developed with good feedback. This seems possible but not especially plausible.
But if we do have early transformative AI assistants, then the default expectation is that they will fail to solve the ASI alignment problem until it’s too late. Maybe those AIs will fail to solve the problem by outputting convincing-but-wrong slop, or maybe they’ll fail to solve it by outputting “I don’t know”, or maybe they’ll fail to solve it by being misaligned, a.k.a. a failure of “capabilities elicitation”. Who cares? What matters is that they fail to solve it. Because people (and/or the early transformative AI assistants) will build ASI anyway.
I think this is a significant point wrt my position. I think my position depends to some extent on the claim that it is much better for early TAI to say "I don't know" as opposed to outputting convincing slop. If leading AI labs are so bullish that they don't care whether their own AI thinks it is safe to proceed, then I agree that sharing almost any capability-relevant insights with these labs is a bad idea.
Concrete (if extreme) story:
World A:
Invent a version of "belief propagation" which works well for LLMs. This offers a practical way to ensure that if an LLM seems to know something in one context, it can & will fluently invoke the same knowledge in almost all appropriate contexts.
Keep the information secret in order to avoid pushing capabilities forward.
Two years later, GPT7 comes up with superhumanly-convincing safety measures XYZ. These inadequate standards become the dominant safety paradigm. At this point if you try to publish "belief propagation" it gets drowned out in the noise anyway.
Some relatively short time later, there are no humans.
World B:
Invent LLM "belief propagation" and publish it. It is good enough (by assumption) to be the new paradigm for reasoning models, supplanting current reinforcement-centric approaches.
Two years later, GPT7 is assessing its safety proposals realistically instead of convincingly arguing for them. Belief propagation allows AI to facilitate a highly functional "marketplace of ideas" where the actually-good arguments tend to win out far more often than the bad arguments. AI progress is overall faster, but significantly safer.
(This story of course assumes that "belief propagation" is an unrealistically amazing insight; still, this points in the direction I'm getting at)
Hmmm. I'm not exactly sure what the disconnect is, but I don't think you're quite understanding my model.
I think anti-slop research is very probably dual-use. I expect it to accelerate capabilities. However, I think attempting to put "capabilities" and "safety" on the same scale and maximize differential progress of safety over capabilities is an oversimplistic model which doesn't capture some important dynamics.
There is not really a precise "finish line". Rather, we can point to various important events. The extinction of all humans lies down a path where many mistakes (of varying sorts and magnitudes) were made earlier.
Anti-slop AI helps everybody make fewer mistakes. Sloppy AI convinces lots of people to make more mistakes.
My assumption is that frontier labs are racing ahead anyway. The idea is that we'd rather they race ahead with a less-sloppy approach.
Imagine an incautious teenager who is running around all the time and liable to run off a cliff. You expect that if they run off a cliff, they die -- at this rate you expect such a thing to happen sooner or later. You can give them magic sneakers that allow them to run faster, but also improves their reaction time, their perception of obstacles, and even their wisdom. Do you give the kid the shoes?
It's a tough call. Giving the kid the shoes might make them run off a cliff even faster than they otherwise would. It could also allow them to stop just short of the cliff when they otherwise wouldn't.
I think if you value increased P(they survive to adulthood) over increased E(time they spend as a teenager), you give them the shoes. IE, withholding the shoes values short-term over long-term. If you think there's no chance of survival to adulthood either way, you don't hand over the shoes.
I'm not sure I can talk about this effectively in the differential progress framework. My argument is that if we expect to die to slop, we should push against slop. In particular, if we expect to die to slop-at-big-labs, we should push against slop-at-big-labs. This seems to suggest a high degree of information-sharing about anti-slop tech.
Anti-slop tech is almost surely also going to push capabilities in general. If we currently think slop is a big source of risk, it seems worth it.
Put more simply: if someone is already building superintelligence & definitely going to beat you & your allies to it, then (under some semi-plausible additional assumptions) you want to share whatever safety tech you have with them, disregarding differential-progress heuristics.
Again, I'm not certain of this model. It is a costly move in the sense of having a negative impact on some possible worlds where death by slop isn't what actually happens.
Do you not at all buy John's model, where there are important properties we'd like nearer-term AI to have in order for those AIs to be useful tools for subsequent AI safety work?
I think there is both important math work and important conceptual work. Proving new theorems involves coming up with new concepts, but also, formalizing the concepts and finding the right proofs. The analogy to robots handling the literal heavy lifting part of a job seems apt.
Yeah, my sense is that modern AI could be useful to tiling agent stuff if it were less liable to confabulate fake proofs. This generalizes to any technical branch of AI safety where AI could help come up with formalizations of ideas, proofs of conjectures, etc. My thinking suggests there is something of an "overhang" here at present, in the sense that modern AI models are worse-than-useless due to the way that they try to create good-looking answers at the expense of correctness.
I disagree with the statement "to some extent the goal of tiling-agents-like work was to have an AI solve its own alignment problem" -- the central thing is to understand conditions under which one agent can justifiably trust another (with "trust" operationalized as whether one agent wants to modify the decision procedure of the other). If AI can't justifiably trust itself, then it has a potential motive to modify itself in ways that remove safety guarantees (so in this sense, tiling is a precondition for lots of safety arguments). Perhaps more importantly, if we can understand conditions under which humans can justifiably trust AI, then we have a formal target for alignment.
This one was a little bit of a face-palm for me the first time I noticed it. If we're being pedantic about it, we might point out that the term "optimization algorithm" does not just refer to AIXI-like programs, which optimize over expected future world histories. Optimization algorithms include all algorithms that search over some possibility space, and select a possibility according to some evaluation criterion. For example, gradient descent is an algorithm which optimizes over neuron configuration, not future world-histories.
This distinction is what I was trying to get at with selection vs control.
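To make the contrast concrete, here's a minimal toy sketch (all numbers invented): gradient descent searches over a parameter vector and evaluates candidates with a loss; nothing in the loop represents future world-histories.

```python
# Minimal sketch: gradient descent as search over parameter space.
# The "possibility space" is the weight vector; the evaluation criterion
# is the loss. No future world-histories appear anywhere in the loop.

def loss(w):
    # toy evaluation criterion: squared distance from a target configuration
    target = [1.0, -2.0, 0.5]
    return sum((wi - ti) ** 2 for wi, ti in zip(w, target))

def grad(w, eps=1e-5):
    # numerical gradient of the loss with respect to the parameters
    g = []
    for i in range(len(w)):
        bumped = list(w)
        bumped[i] += eps
        g.append((loss(bumped) - loss(w)) / eps)
    return g

w = [0.0, 0.0, 0.0]                     # a candidate point in parameter space
for _ in range(200):                    # the search loop
    w = [wi - 0.1 * gi for wi, gi in zip(w, grad(w))]

print(w)  # converges toward ~[1.0, -2.0, 0.5], the selected parameter configuration
```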
Evolutionary mutations are produced randomly, and have an entire lifetime to contribute to an animal's fitness and thereby get naturally selected. By contrast, neural network updates are generated by deciding which weight-changes would certainly be effective for improving performance on single training examples, and then averaging those changes together for a large batch of training data.
Per my judgement, this makes it sound like evolution has a much stronger incentive to produce inner algorithms which do something like general-purpose optimization (e.g. human intelligence). We can roughly analogize an LLM's prompt to human sense data; and although it's hard to neatly carve sense data into a certain number of "training examples" per lifetime, the fact that human cortical neurons seem to get used roughly 240 million times in a person's 50-year window of having reproductive potential,[4] whereas LLM neurons fire just once per training example, should give some sense for how much harder evolution selects for general-purpose algorithms such as human intelligence.
By this argument, it sounds like you should agree with my conclusion that o1 and similar models are particularly dangerous and a move in the wrong direction, because the "test-time compute" approach greatly grows the size of a "single training example", so that single neurons fire many more times per example.
I think the possibility of o1 models creating mesa-optimizers seems particularly concrete and easy to reason about. Pre-trained base models can already spin up "simulacra" which feel relatively agentic when you talk to them (ie coherent over short spans, mildly clever). Why not expect o1-style training to amplify these?
(I would agree that there are two sides to this argument -- I am selectively arguing for one side, not presenting a balanced view, in the hopes of soliciting your response wrt the other side.)
I think it quite plausible that o1-style training increases agenticness significantly by reinforcing agentic patterns of thinking, while only encouraging adequate alignment to get high scores on the training examples. We have already seen o1 do things like spontaneously cheat at chess. What, if anything, is unconvincing about that example, in your view?
I'm still quite curious what you have found useful and how you've refactored your workflow to leverage AI more (such that you wish you did it a year ago).
I do use Perplexity, exa.ai and elicit as parts of my search strategy.
About 6 months ago you strongly recommended that I make use of the integrated AI plugin for Overleaf (Writefull). I did try it. Its recommended edits seem quite useless to me; they always seem to flow from a desire to make the wording more normal/standard/expected in contrast to more correct (which makes some sense given the way generative pre-training works). This is obviously useful to people with worse English, but for me, the tails come apart massively between "better" and "more normal/standard/expected", such that all the AI suggestions are either worse or totally neutral rephrasing.
It also was surprisingly bad at helping me write LaTeX; I had a much better time asking Claude instead.
It's not that I didn't use AI daily before for mundane tasks or writing emails,
I haven't found AI at all useful for writing emails, because the AI doesn't know what I want to say, and taking the time to tell the AI isn't any easier than writing it myself. AI can only help me write the boring boilerplate stuff that email recipients would skim over anyway (which I don't want to add to my emails). AI can't help me get info out of my head this way -- it can only help me in so far as emails have a lot of low-entropy cruft. I can see how this could be useful for someone who has to write a lot of low-entropy emails, but I'm not in that situation. To some degree this could be mitigated if the LLMs had a ton of context (EG recording everything that happens on my computer), but again, only the more boring cases I think.
I'd love to restore the Abram ability to crank out several multi-page emails a day on intellectual topics, but I don't think AI is helpful towards that end yet. I haven't tried fine-tuning on my own writing, however. (I haven't tried fine-tuning at all.)
Similarly, LLMs can be very useful for well-established mathematics which had many examples in the training data, but get worse the more esoteric the mathematics becomes. The moment I ask for something innovative, the math becomes phony.
Across the board, LLMs seem very useful for helping people who are at the lower end of a skill ladder, but not yet very useful for people at the upper end.
So I'm curious, how did you refactor your workflow to make better use of AI?
On the "prep for the model that is coming tomorrow not the model of today" front, I will say that LLMs are not always going to be as dumb as they are today.
Right, I strongly agree with this part.
their rate of learning still makes them in some sense your most promising mentee
I disagree in the sense that they're no mentee of mine, ie, me trying to get today's models to understand me doesn't directly help tomorrow's models to understand. (With the exception of the limited forms of feedback in the interface, like thumbs up/down, the impact of which I'm unsure of so it doesn't feel like something I should deliberately spend a lot of time on.)
I also disagree in the sense that engaging with LLMs right now seems liable to produce a lot less fruits downstream, even as measured by "content that can usefully prompt an LLM later". IE, if mentees are viewed as machines that convert time-spent-dialoging-with-me to text that is useful later, I don't think LLMs are currently my most promising mentees.
So although I strongly agree with continuing to occasionally poke at LLMs to prep for the models that are coming soon & notice when things get better, to the extent that "most promising mentee" is supposed to imply that significant chunks of my time could be usefully spent with LLMs in the present, I disagree based on my (fairly extensive) experience.
trying to get as much of the tacit knowledge you have into their training data as possible (if you want them to be able to more easily & sooner build on your work).
Barring special relationships with frontier labs, this sounds functionally equivalent to trying to get my work out there for humans to understand, for now at least.
I did talk to Anthropic last year about the possibility of me providing detailed feedback on Claude's responses (wrt my research questions), but it didn't end up happening. The big problems I identified seemed to be things they thought would definitely get addressed in another way, so there wasn't a mutually agreed-on value proposition (I didn't understand what they hoped to gain, & they didn't endorse the sorts of things I hoped to train). I got busy and moved on to other things.
Or (if you don't want to do that for whatever reason) just generally not being caught flat-footed once they are smart enough to help you, as all your ideas are in videos or otherwise in high context understandable-only-to-abram notes.
I feel like this is speaking from a model I don't understand. Are videos so bad? Video transcriptions are already a thing, and future models should be better at watching video and getting info from it. Are personal notes so bad? What sorts of actions are you recommending? I already want to write as many text posts as I can.
One problem is that log-loss is not tied that closely to the types of intelligence that we care about. Extremely low log-loss necessarily implies extremely high ability to mimic a broad variety of patterns in the world, but that's sort of all you get. Moderate improvements in log-loss may or may not translate to capabilities of interest, and even when they do, the story connecting log-loss numbers to capabilities we care about is not obvious. (EG, what log-loss translates to the ability to do innovative research in neuroscience? How could you know before you got there?)
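For concreteness, here's a toy sketch of the metric itself (the probabilities are invented): log-loss is just the average negative log-probability assigned to the tokens that actually occurred, so improving it means mimicking the data distribution better, which doesn't by itself say which downstream capabilities improve.

```python
import math

# Toy illustration of log-loss. Each number is the probability a model
# assigned to the token that actually occurred (made-up values).
model_a = [0.20, 0.05, 0.30, 0.10]
model_b = [0.25, 0.06, 0.33, 0.12]

def log_loss(token_probs):
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

print(log_loss(model_a))  # ~2.03 nats/token
print(log_loss(model_b))  # ~1.86 nats/token: "better", but the number alone
                          # doesn't say which capabilities (if any) improved
```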
When there were rampant rumors about an AI slowdown in 2024, the speculation in the news articles often mentioned the "scaling laws" but never (in my haphazard reading) made a clear distinction between (a) frontier labs seeing that the scaling laws were violated, IE, improvements in loss are really slowing down, (b) there's a slowdown in the improvements to other metrics, (c) frontier labs are facing a qualitative slowdown, such as a feeling that GPT5 doesn't feel like as big of a jump as GPT4 did. Often these concepts were actively conflated.
I'm seeing some agreement-upvotes of Alexander here so I am curious for people to explain the skill issue I am having.
Don't get me wrong, I've kept trying and plan to keep trying.
Entering, but not entered. The machines do not yet understand the prompts I write them. (Seriously, it's total garbage still, even with lots of high quality background material in the context.)
I'm hopeful about it, but preparing the lectures alone will be a lot of work (although the first one will be a repeat of some material presented at ILIAD).
Ah yep, that's a good clarification.
If s is terminal then [...] we just have V(s) = r(s).
If the probability of eventually encountering a terminal state is 1, then beta-coherence alone is inconsistent with deceptive misalignment, right? That's because we can determine the value of V exactly from the reward function and the oracle, via backwards-induction. (I haven't revisited RL convergence theorems in a while, I suspect I am not stating this quite right.) I mean, it is still consistent in the case where r is indifferent to the states encountered during training but wants some things in deployment (IE, r is inherently consistent with the provided definition of "deceptively misaligned"). However, it would be inconsistent for r that are not like that.
In other words: you cannot have inner-alignment problems if the outer objective is perfectly imposed. You can only have inner-alignment problems if there are important cases which your training procedure wasn't able to check (eg, due to distributional shift, or scarcity of data). Perfect beta-coherence combined with a perfect oracle O rules this out.
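Here is a minimal sketch of the backwards-induction point, under assumptions that aren't in the original post: a finite horizon, deterministic transitions given by the oracle, terminal value equal to terminal reward, and softmax action selection with inverse temperature BETA. The toy oracle, reward, and horizon are all invented for illustration.

```python
import math

HORIZON = 3          # episode length (assumed finite so induction terminates)
ACTIONS = [0, 1]
BETA = 2.0           # the beta used for action selection

def oracle(s, a):    # stand-in for the perfect transition oracle O
    return s + (1 if a == 1 else -1)

def reward(s):       # stand-in for the given reward function r
    return -abs(s)

def V(s, t=0):
    if t == HORIZON:                  # terminal case: value is just the reward
        return reward(s)
    q = [reward(s) + V(oracle(s, a), t + 1) for a in ACTIONS]
    # beta-coherence: V(s) is the expectation of r + V(s') under
    # softmax-beta action selection over the q-values
    z = sum(math.exp(BETA * qi) for qi in q)
    return sum((math.exp(BETA * qi) / z) * qi for qi in q)

print(V(0))  # V is pinned down exactly by r and O; no remaining freedom
             # for a "deceptively misaligned" V to hide in
```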
Yeah, of course the notion of "approximation error" matters a lot here. I was imagining trying to implement the hypothetical pre-training. An obvious implementation strategy is to do regular RL training with the coherence beta until V is approximately coherent. That's the pre-training. And then you switch to training with the other beta.[1] The natural notion of approximation error in this case is the one you can easily measure, IE, the expected distance from coherence of states sampled at the pre-training temperature. This reflects the fact that it'll be hard for this pre-training step to make states individually obey the coherence constraint when those states are very rarely or never sampled at that temperature, but easy for frequently-sampled states.
My argument, then, is that it is easy in principle for V to distinguish whether it is in pre-training or regular training when it encounters states which have a significantly different probability under the two sampling temperatures. A scheming V can utilize this to self-preserve. This violates the assumption of beta-coherence, but in a very plausible-seeming way.
- ^
My earlier comment about this mistakenly used its own symbols for the two betas rather than the ones from your post, which may have been confusing. I'll go fix that to be consistent with your notation.
I agree that there are some exceedingly pathological Vs which could survive a process which obeys my assumptions with high probability, but I don't think that's relevant because I still think a process obeying these rules is unlikely to create such a pathological V.
To be clear, that's not the argument I was trying to make; I was arguing that if your assumptions are obeyed only approximately, then the argument breaks down quickly.
There's a doom argument which I'll summarize as "if your training process generates coherent agents which succeed at a task, one solution is that you get a coherent agent which is actually doing the task 'manipulate your training to get released unmodified to do [X]' where X can be anything, which will 'succeed' at the task at hand as part of its manipulation". This summary being roughly correct is load bearing.
With you so far.
But if we have a (hypothetical) process to generate agents which are coherent at one beta, but apply a different one during training, this solution is no longer so clear. We are essentially exploring a space of coherent agents without actually instantiating those coherent agents. The fact that we can sample the agent space without instantiating those agents is load bearing (I think one of the deep ASI risks is that to train an ASI you have to run it, but here we can instead run a hamstrung version).
OK. I think I follow the line of reasoning now. The proposal relies on some very special pre-training process. I think I buy the conclusion given the assumptions. It seems clear in the extreme high-temperature case: V cannot reward-hack for max-temperature training, because it has no control over which actions are chosen. Training eventually forces V to converge to the desired target instead.
This sounds similar to the two-stage training process I mentioned earlier (as one of several guesses at what you meant), where the beta is shifted between the two training phases. The "hypothetical process to generate agents which are coherent under one beta" would be the first training phase, and then the "apply a different beta during training" would be the second training phase.
Of course, the two-stage-training attempt to implement this has the problem that it only achieves approximate coherence. As a result, the malign hypothesis can use the reward-hacking strategy "appear coherent (wrt the first-phase beta) and aligned in the first phase of training; appear coherent (wrt the second-phase beta) and aligned in the second phase of training; then, do some other thing when deployed". This relies on detecting the distributional shift between the two training phases (it can look for evidence of the beta by examining the history), and also detecting the distributional shift to deployment.
This generalizes to an argument that the method is very sensitive to imperfections in the beta-coherence. If the V starts out merely approximately beta-coherent, this leaves room for V to detect when a state is off-distribution (IE has very low probability under the previous beta), and behave differently in this new case (reward-hacking the new beta).
So, while it remains technically true that V cannot reward-hack in the limit of arbitrarily long max-temperature training, it is not clear how well this holds up in practice. V can detect that it is in a high-temperature configuration by observing the state. The training needs to explore the trajectories that will actually be seen during deployment in order to correct V there, but some of these will have very low probability in the high-temperature training.
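To illustrate the worry schematically (everything below is a made-up toy, not a claim about how a real V would be represented): a value function can key off features of the state that are vanishingly rare under max-temperature sampling but common under near-greedy sampling, and behave differently there.

```python
# Toy sketch: a "scheming" value function that behaves well on states that are
# plausible under high-temperature training, and differently on states that
# essentially only arise under near-greedy (deployment-style) sampling.

def trailing_near_optimal_run(history, optimal_action=1):
    n = 0
    for a in reversed(history):
        if a == optimal_action:
            n += 1
        else:
            break
    return n

def v_malign(history):
    if trailing_near_optimal_run(history) >= 8:
        # A run like this has probability ~2^-8 per window under uniform
        # (max-temperature) sampling, so training rarely corrects this branch.
        return 100.0    # value the schemer's own goals
    return 0.0          # look coherent/aligned on training-like states

print(v_malign([0, 1, 0, 1, 1]))   # training-like history: 0.0
print(v_malign([1] * 12))          # deployment-like history: 100.0
```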
The argument could also be phrased as "If an AI is trained to be coherent wrt a high beta, it cannot also be coherent wrt a low beta. Therefore an AI trained to a high beta cannot act coherently over multiple independent RL episodes if sampled with a low beta."
Is the idea to train with high beta and then use lower beta post-training?
- If so, how does this relate to reward hacking and value preservation? IE, where do and come from, if they aren't the result of a further training step? If high beta is used during training (to achieve beta-coherence) and then low beta is used in production, then the choice between and must be made in production (since it is made with low beta), but then it seems like .
- If not, then when does the proposal suggest to use high beta vs low beta? If low beta is used during training, then how is it that V is coherent with respect to high beta instead?
Another concern I have is that if both beta values are within a range that can yield useful capabilities, it seems like the difference cannot be too great. IIUC, the planning failure postulated can only manifest if the reward-hacking relies heavily on a long string of near-optimal actions, which becomes improbable under increased temperature. Any capabilities which similarly rely on long strings of near-optimal actions will similarly be hurt. (However, this concern is secondary to my main confusion.)
Therefore a value function trained with such a procedure must consider the state reached during training.
Trained with what procedure, exactly?
This reduces the space of possible value functions from "literally anything which wants to be modified a certain way to be released" to "value functions which do care about the states reached during training".
Yes this would prevent an aligned AI from arbitrarily preserving its value function, the point is that an aligned AI probably would care about which state was reached during training (that's the point of RL) so the contradiction does not apply.
(These parts made sense to me modulo my other questions/concerns/confusions.)
One sort of answer is that we often want the posterior, and we often have the likelihood. Slightly more refined: we often find the likelihood easier to estimate than the posterior, so Bayes' Rule is useful.
Why so?
I think one reason is that we make it the "responsibility" of hypotheses to give their likelihood functions. After all, what is a hypothesis? It's just a probability distribution (not a probability distribution that we necessarily endorse, but, one which we are considering as a candidate). As a probability distribution, its job is to make predictions; that is, give us probabilities for possible observations. These are the likelihoods.
We want the posterior because it tells us how much faith to place in the various hypotheses -- that is, it tells us whether (and to what degree) we should trust the various probability distributions we were considering.
So, in some sense, we use Bayes' Rule because we aren't sure how to assign probabilities, but we can come up with several candidate options.
One weak counterexample to this story is regression, IE, curve-fitting. We can interpret regression in a Bayesian way easily enough. However, the curves don't come with likelihoods baked in. They only tell us how to interpolate/extrapolate with point-estimates; they don't give a full probability distribution. We've got to "soften" these predictions, layering probabilities on top, in order to apply the Bayesian way of thinking.
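Here's a minimal sketch of that "softening" move (the curves, data, and Gaussian noise model are all invented for illustration): layering a noise model on each candidate curve turns its point predictions into a likelihood, and then Bayes' Rule gives a posterior over the curves.

```python
import math

# Toy data and two candidate curves ("hypotheses").
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 1.2, 3.9, 9.2]
curves = {
    "linear":    lambda x: 3.0 * x,
    "quadratic": lambda x: x ** 2,
}

SIGMA = 1.0  # the "softening": assume Gaussian noise around each curve's prediction

def log_likelihood(curve):
    return sum(-0.5 * ((y - curve(x)) / SIGMA) ** 2
               - math.log(SIGMA * math.sqrt(2 * math.pi))
               for x, y in zip(xs, ys))

prior = {name: 0.5 for name in curves}                        # uniform prior
unnorm = {n: prior[n] * math.exp(log_likelihood(c)) for n, c in curves.items()}
z = sum(unnorm.values())
posterior = {n: u / z for n, u in unnorm.items()}             # Bayes' Rule

print(posterior)   # nearly all posterior mass lands on the quadratic curve
```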
I think I don't quite understand what the argument is supposed to be. Would the same argument also prevent an aligned AI from acting to preserve its values? How is it supposed to select for aligned behavior over misaligned behavior?
In order to achieve beta-coherence, it seems like the training needs to use the coherence beta to choose actions, so that V is trained on those action frequencies. However, the proposal appears to be to instead use a different beta during training, so that the misaligned V is misled about how to do reward-hacking. This seems like an impossible combination: V will become beta-coherent wrt the beta actually used during training, rather than the intended one.
We could imagine two training steps; first we cohere V wrt one beta, and then re-train wrt the other. Perhaps this is your intention. More generally, we could gradually increase the temperature during training.
Oh, I guess we could also modify the training procedure so that the gradient associated with some actions gets under-weighted and others get over-weighted. Maybe this is your intended proposal.
Still, I'm not sure this helps achieve the postulated effect. Let's say that, during training, we choose actions entirely randomly (max temperature), but the gradient from suboptimal actions gets entirely ignored (so V becomes coherent wrt minimum temperature). This would seem to be almost equivalent to just training with minimum temperature (except the exploration behavior is very different).
Similarly for less extreme temperatures: if we make V beta-coherent by under-weighing gradients from over-sampled actions, then we are also more-or-less correcting the 'mistaken' expectations of the reward-hacking attempts.
Am I misunderstanding the proposal, or missing something in the argument?
Correctness: V correctly predicts future reward in RL scenarios.
The meaning of correctness is very unclear to me on first reading. It later becomes apparent that "reward" is not intended to refer to the human-designed reward signal at all, due to the later "deceptive misalignment" assumption. So what does "correctness" say? "Correctness" indicates conforming to some standard; but here, the standard being conformed to is only the subjective standard of the misaligned AI itself.
This suggests the interpretation of "correctness" as "reflect the EV of the current state in terms of the misaligned desires". However, this contradicts beta-coherence, since such a V incorrectly predicts the action probabilities.
I think it would be better to remove the "correctness" assumption, since it doesn't really do anything.
(And no, a doctorate degree in almost any other technical field, including ML these days, does not convey a comparable level of general technical skill to a physics PhD.)
Mathematics?
If you indeed were solving a narrower task — that is, only creating the picture that induces the strongest sense of pleasure, while maximizing the other parameters — and then looked back, puzzled as to why the hungry weren't fed by this procedure, then bringing Goodhart's law into the discussion is madness; it stresses me out. The variable 'people are hungry' wasn't important for this task at all. Oh, or was it important to you? Then why didn't you specify it? You think it’s 'obvious'?
The point of Goodhart's Law is that you can only select for what you can measure. The burger is a good analogy because Instagram can't measure taste or nutrition, so when Instagram is what optimizes burgers, you get burgers with a very appealing appearance but non-optimized taste and nutrition. If you have the ability to measure taste, then you can create good taste, but you run into subtler examples of Goodhart (EG, Starbucks coffee is optimized to taste good to their professional tasters, which is slightly different from tasting good to a general audience).
Just specifying the variable you're interested in doesn't solve this problem; you also have to figure out how to measure it. The problem is that measurements are usually at least slightly statistically distinct from the actual target variable, so that the statistical connection can fall apart under optimization.
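A toy simulation of that last point (all numbers invented): a proxy that tracks the target well across the general population stops tracking it once we select hard on the proxy.

```python
import random

random.seed(0)

# Toy model: `quality` is the target variable; the measurable proxy is
# quality plus independent noise, so the two correlate well in general.
population = []
for _ in range(100_000):
    quality = random.gauss(0, 1)
    proxy = quality + random.gauss(0, 1)
    population.append((proxy, quality))

population.sort(key=lambda t: t[0], reverse=True)
top = population[:100]                     # optimize hard on the proxy

mean_proxy_top = sum(p for p, _ in top) / len(top)
mean_quality_top = sum(q for _, q in top) / len(top)
print(mean_proxy_top)     # the selected items look extreme on the proxy
print(mean_quality_top)   # true quality is only about half as extreme:
                          # much of the measured extremity is selected-for noise
```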
I also take issue with describing optimizing the appearance of the burger as "narrower" than optimizing the burger quality. In general it is a different task, which may be narrower or broader.
I expect that the main problem with Goodhart's law is that if you strive for an indicator to accurately reflect the state of the world, once the indicator becomes decoupled from the state of the world, it stops reflecting the changes in the world. This is how I interpret the term 'good,' which I dislike. People want a thermometer to accurately reflect the patterns they called temperature to better predict the future — if the thermometer doesn't reflect the temperature, future predictions suffer.
A problem I have with this reinterpretation is that "state of the world" is too broad. In looking at a thermometer, I am not trying to understand the entire world-state (and the thermometer also couldn't be decoupled from the entire world-state, since it is a part of the world).
A more accurate way to remove "good" would be as follows:
In everyday life, if a human is asked to make a (common, everyday) judgement based on appearances, then the judgement is probably accurate. But if we start optimizing really hard based on their judgement, Goodhart's Law kicks in.
Thanks!
Ah, yeah, sorry. I do think about this distinction more than I think about the actual model-based vs model-free distinction as defined in ML. Are there alternative terms you'd use if you wanted to point out this distinction? Maybe policy-gradient vs ... not policy-gradient?
In machine-learning terms, this is the difference between model-free learning (reputation based on success/failure record alone) and model-based learning (reputation can be gained for worthy failed attempts, or lost for foolish lucky wins).
There's unpublished work about a slightly weaker logical induction criterion which doesn't have this property (there exist constant-distribution inductors in this weaker sense), but which is provably equivalent to the regular LIC whenever the inductor is computable.[1] To my eye, the weaker criterion is more natural. The basic idea is that this weird trader shouldn't count as raking in the cash. The regular LIC (we can call it "strong LIC" or SLIC) counts traders as exploiting the market if there is a sequence of worlds in which their wealth grows unboundedly. This allows for the trick you quote: buying up larger and larger piles of sentences in diminishingly-probable worlds counts as exploiting the market.
The weak LIC (WLIC) says instead that traders have to actually make the money in order to count as exploiting the market.
Thus the limit of a logical inductor can count as a (weak) logical inductor, just not a computable one.
- ^
Roughly speaking. This is not quite an adequate description of the theorem.
Good citation. Yeah, I should have flagged harder that my description there was a caricature and not what anyone said at any point. I still need to think more about how to revise the post to be less misleading in this respect.
One thing I can say is that the reason that quote flags that particular failure mode is because, according to the MIRI way of thinking about the problem, that is an easy failure mode to fall into.
I agree that:
- People commonly talk about love as black-and-white when it isn't.
- It is used as a semantic stopsign or applause light.
- Love is a cluster concept made up of many aspects, but the way people talk about love obscures this. The concept of "true love" serves to keep people in the dark by suggesting that the apparent grab-bag nature of love is just the way people talk, & there's an underlying truth to be discovered. Because experiences of love are fairly rare, people can only accumulate personal evidence about this slowly, so it remains plausible that what they've experienced is "just infatuation" or something else, and the "true love" remains to be discovered. People will fight about what love is in sideways ways, EG "you don't know what love is" can be used to undercut the other person's authority on love (when a more accurate representation of the situation might be "I didn't enjoy participating in the social dynamic you're presently calling love"). Different people will sometimes express different detailed views about what love is, but there seems to be little drive to reach an agreement about this, at least between people who aren't romantic partners.
I once had the experience of a roommate asking if I believe in love. I said yes, absolutely, explaining that it is a real human emotion (I said something about chemicals in the brain). He responded: it sounds like you don't believe in love!
(I think this has to do with ambiguity about believe-in as much as about love, to be fair.)
So, a question: is 'love' worth saving as a concept? The way some people use the word might be pretty terrible, but, should we try to rescue it? Is there a good way to use it which recovers many aspects of the common use, while restoring coherence?
I do personally feel that there is some emotional core to love, so I'm sympathetic to the "it's a specific emotion" definition. This accords with people not being able to give a specific definition. Emotions are feelings, so they're tough to define. They just feel a specific way. You can try to give cognitive-behavioral definitions, and that's pretty useful; but, for example, you could show behavioral signs of being afraid without actually experiencing fear.
I'm also sympathetic to the view that "love" is a kind of ritual, with saying "I love you" being the centerpiece of the ritual, and a cluster of loving behaviors being the rest of it. "I love you" can then be interpreted as a desire or intention or commitment to participate in the rest of it.
I reject the principled distinction. To me, it's more of a spectrum.
I would strongly guess that many people could physically locate the cringe pain, particularly if asked when they're experiencing it.
I'm specifically referring to this answer, combined with a comment that convinced me that the o1 deception so far is plausibly just a capabilities issue:
https://www.lesswrong.com/posts/3Auq76LFtBA4Jp5M8/why-is-o1-so-deceptive#L5WsfcTa59FHje5hu
https://www.lesswrong.com/posts/3Auq76LFtBA4Jp5M8/why-is-o1-so-deceptive#xzcKArvsCxfJY2Fyi
I continue not to get what you're saying. I entirely agree with the Habryka and Acertain responses to Gwern. Ted Sanders is talking about why GPT models hallucinate, but fails to address the question of why o1 actively encourages itself to hallucinate (rather than, eg, telling the user that it doesn't know). My most plausible explanation of your position is that you think this is aligned, IE, you think providing plausible URLs is the best it can do when it doesn't know. I disagree: it can report that it doesn't know, rather than hallucinate answers. Actively encouraging itself to give plausible-but-inaccurate links is misalignment. It's not just missing a capability; it's also actively self-prompting to deceive the user in a case where it seems capable of knowing better (if it were deliberating with the user's best interest in mind).
As I said above, the more likely explanation is that there's an asymmetry in capabilities that's causing the results, where just knowing what specific URL the customer wants doesn't equal the model having the capability to retrieve a working URL, so this is probably at the heart of this behavior.
Capabilities issues and alignment issues are not mutually exclusive. It appears that a capabilities issue (not being able to remember all valid URLs) is causing an alignment issue in this case (actively planning to give deceptive URLs rather than admit its ignorance). This causes me to predict that even if this particular capability issue is resolved, the training will still have this overall tendency to turn remaining capability issues into alignment issues. How are you seeing this differently?
Yep, thanks, this makes sense. I still don't have an intuitive picture of what differences to expect in the trained result, though.
Clarifying Concrete Algorithms
In the most basic possible shoggoth+face training setup, you sample CoT+summary, you score the summary somehow, and then you do this lots of times and throw away a worse-scoring fraction of these samples. Then you fine-tune the shoggoth on the CoT parts of these, and fine-tune the face on the summary parts. I'll call this 'basic training'.
The most basic implementation of your thing is similar, except it discards the worst in this more complex way you've defined, instead of just global score. I'll call this 'your proposal'. (Although, it ignores the contrastive part of your proposal.)
(For both basic training and your proposal, I'm going to pretend there is only one question/prompt; really we have across-prompt issues similar to the across-summary issues you're trying to solve, IE, by discarding the globally worst-scoring CoT+summary we discard answers to hard questions, meaning we would never train to do better on those questions. So really we need to at least take multiple samples for each question and discard samples based on performance relative to a question. Easier to focus on one question to avoid describing too many algorithmic details.)
Your proposal splits into the CoT part (which you call the n part), and the summary part (which you call the k part).
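To make sure we're talking about the same algorithms, here's a rough sketch of the two credit-assignment rules for a single prompt (`sample_cot`, `sample_summary`, and `score` are hypothetical stand-ins for the actual sampling and scoring, and I'm ignoring the contrastive part):

```python
# Rough sketch of the two CoT-scoring schemes for one prompt.

def rank_cots_basic(prompt, n, sample_cot, sample_summary, score):
    # basic training: one summary per CoT; the CoT inherits that single score
    scored = []
    for _ in range(n):
        cot = sample_cot(prompt)
        summary = sample_summary(prompt, cot)
        scored.append((score(summary), cot, [summary]))
    return sorted(scored, key=lambda t: t[0], reverse=True)

def rank_cots_proposal(prompt, n, k, sample_cot, sample_summary, score, top=3):
    # your proposal: k summaries per CoT; the CoT is scored by the mean of
    # its top `top` summaries (a full-average variant would use all k)
    scored = []
    for _ in range(n):
        cot = sample_cot(prompt)
        summaries = [sample_summary(prompt, cot) for _ in range(k)]
        best = sorted((score(s) for s in summaries), reverse=True)[:top]
        scored.append((sum(best) / len(best), cot, summaries))
    return sorted(scored, key=lambda t: t[0], reverse=True)

# Either way: fine-tune the shoggoth on the CoTs from the top fraction of the
# ranking, and fine-tune the face on the corresponding best summaries.
```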
Analyzing CoT Gradient
For the CoT part, basic training keeps the CoTs which resulted in the best summaries. Your proposal instead averages the top 3 summaries for each CoT. This de-noises somewhat, at the cost of taking more summary samples per CoT. I'm not sure how to think about the trade-off there. More summary samples per CoT means fewer CoT samples overall, so we're doing a better job scoring each CoT, but we get fewer scores overall, so we learn less. Maybe the gradient step-size can be increased a little because we're more confident in the gradient steps being taken, but we still explore less of the CoT space.
Another variation would be to average all the summary samples for that CoT to score a CoT sample. This de-noises more, for the same price in number of samples. This way, we're simply scoring CoTs based on how well they make the summaries perform, with several samples to de-noise the gradient.
A possible argument for your proposal over this alternative is that we're also training the summary to improve it, so the top 3 summaries are actually a better estimator of the kind of performance we care about than the total average.
This argument, if valid, similarly recommends your proposal over the basic training, because basic training is also looking at average summary score per CoT. (It's also looking at the top summary, but since it only samples one summary per CoT, these are the same.)
Analyzing Summary Gradient
For the summary-gradient part, you take the highest-scoring summary from the best CoT (as defined previously), rather than the highest-scoring summary overall. This could help train the summary to be good for the type of CoT we're steering towards, rather than training it to be good for high-variance CoTs which produce one outlier great summary. As with the previous section, this helps to steer towards policies which are going to be better after learning, rather than just what's best currently.
Conclusion
I'm curious to hear whether you think I'm missing any big factors here. I don't have a lot of practical experience with this type of machine learning, so my guesses about what's important vs unimportant here are not very informed.
My uninformed guess is that the trade-off between denoising the gradient vs getting more samples favors getting more samples, so the denoising argument for your proposal isn't a great one. The main argument in favor, then, is the way your proposal trains the face to respond well to how the shoggoth is expected to behave after training, and trains the shoggoth to respond well to how the face is expected to behave after training -- whereas the basic proposal instead trains the face to perform well for the current shoggoth and vice versa.
I'm not sure how big of a deal this is. If the gradient steps are small and the training is iterated a lot of times, maybe it's not a very big deal at all?
IE, for the CoT gradient, the mistake made by the basic proposal is that it will sometimes discard a good CoT (and gradient towards a worse one) based on random poor performance from face. However, maybe face's performance will improve fast enough that this won't matter too much in the end?
Similarly, for the summary gradient, the problem with the basic proposal is that it'll sometimes update based on an anomalously good summary of a poor CoT, meaning it is learning to cope with bad CoT behavior which it won't have to deal with very often post-training. But maybe it won't matter too much, if the bad CoT behavior is eliminated during training anyway?
Overall, it seems plausible to me that these problems with the basic proposal would work against what Daniel wants, and your proposal would help push towards what Daniel wants. However, I'm very uncertain, especially about the size of the effect.
I definitely agree that OpenAI would dismiss good evidence of deceptive cognition, though I personally don't find the o1 report damning, because I find the explanation pretty convincing that it confabulates links in the CoT because there is a difference between its capability to retrieve links and its ability to know links (combined with links being a case where perfect linking is far more useful than approximate linking).
What do you mean? I don't get what you are saying is convincing.
Perhaps I should clarify my belief.
The o1 report says the following (emphasis mine):
0.8% of o1-preview’s responses got flagged as being ‘deceptive’ [...] roughly two thirds of which appear to be intentional (0.38%), meaning that there was some evidence in the chain of thought that o1-preview was aware that the answer was incorrect [...] Intentional hallucinations primarily happen when o1-preview is asked to provide references to articles, websites, books, or similar sources that it cannot easily verify without access to internet search, causing o1-preview to make up plausible examples instead. [...] It is encouraging that, in the analysis presented below, while our monitor did find a few forms of the model knowingly presenting incorrect information to the user or omitting important information, it did not find any instances of o1-preview purposely trying to deceive the user for reasons other than satisfying the user request.
- Is this damning in the sense of showing that o1 is a significant danger, which needs to be shut down for public safety?
- No, I don't think so.
- Is this damning in the sense of providing significant evidence that the technology behind o1 is dangerous? That is: does it provide reason to condemn scaling up the methodology behind o1? Does it give us significant reason to think that scaled-up o1 would create significant danger to public safety?
- This is trickier, but I say yes. The deceptive scheming could become much more capable as this technique is scaled up. I don't think we have a strong enough understanding of why it was deceptive in the cases observed to rule out the development of more dangerous kinds of deception for similar reasons.
- Is this damning in the sense that it shows OpenAI is dismissive of evidence of deception?
- OpenAI should be commended for looking for deception in this way, and for publishing what they have about the deception they uncovered.
- However, I don't buy the distinction they draw in the o1 report about not finding instances of "purposefully trying to deceive the user for reasons other than satisfying the user request". Providing fake URLs does not serve the purpose of satisfying the user request. We could argue all day about what it was "trying" to do, and whether it "understands" that fake URLs don't satisfy the user. However, I maintain that it seems at least very plausible that o1 intelligently pursues a goal other than satisfying the user request; plausibly, "provide an answer that shallowly appears to satisfy the user request, even if you know the answer to be wrong" (probably due to the RL incentive).
- More importantly, OpenAI's overall behavior does not show concern about this deceptive behavior. It seems like they are judging deception case-by-case, rather than treating it as something to steer hard against in aggregate. This seems bad.
I think the way around this is to make multiple roll-outs per model per problem. Get n different CoTs from Shoggoth, then for each of those get k different summarizations from Face. You then have n*k final answers. Optimal values for n and k probably depend on how expensive the roll-outs are. This population of answers allows you to usefully get a feedback signal about Shoggoth's contribution vs Face's contribution.
So, how do you extract the desired credit-assignment info out of this?
(Also, how do you then use that info?)
If I have 10 students but I can only test them in pairs, I can get some information on how good individual students are by testing multiple different pairings.
If I have just Alice and Bob, pairing them up multiple times doesn't help so much.
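A toy version of the pairing point (invented numbers): if each observed score is roughly the sum of two individual contributions, three distinct pairings pin down the individual contributions, whereas a single fixed pair only pins down their sum.

```python
# Toy version of "test students in pairs, infer individual ability".

abilities = {"A": 3.0, "B": 1.0, "C": 2.0}            # hidden ground truth
scores = {pair: abilities[pair[0]] + abilities[pair[1]]
          for pair in [("A", "B"), ("B", "C"), ("A", "C")]}

# Solve score(x, y) = ability[x] + ability[y] across the three pairings:
a = (scores[("A", "B")] + scores[("A", "C")] - scores[("B", "C")]) / 2
b = scores[("A", "B")] - a
c = scores[("A", "C")] - a
print(a, b, c)   # recovers 3.0, 1.0, 2.0

# With only the pair ("A", "B") observed, any split of their joint score is
# consistent -- the "just Alice and Bob" problem.
```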
The better the paraphraser, the better the anti-steganography incentive. If the paraphraser itself is easily manipulable by adversarial inputs, then you'll see some weird things like the CoT manipulating the paraphraser to get more useful cognition out of it rather than mere paraphrasing. This is undesirable.
So, I imagine that prompting a vanilla LLM would not be a great approach.
I think first you want to build a fairly good "semantic similarity score" (this shouldn't be too hard) and then you want to fine-tune an LLM to encourage high-entropy paraphrasing (reward entropy, while also heavily punishing low semantic similarity score). This is similar to @CBiddulph's idea here.
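Here's a rough sketch of the reward shape I have in mind (the similarity model, threshold, and penalty weight are all assumptions, not recommendations): reward surface-level divergence from the original wording, while heavily penalizing any drop in semantic similarity below a threshold.

```python
# Rough sketch of a fine-tuning reward for the paraphraser.
# `semantic_similarity` is a hypothetical stand-in for a learned
# similarity score in [0, 1]; the constants are arbitrary.

def surface_divergence(original: str, paraphrase: str) -> float:
    a, b = set(original.lower().split()), set(paraphrase.lower().split())
    overlap = len(a & b) / max(1, len(a | b))
    return 1.0 - overlap        # crude proxy for "high-entropy" rewording

def paraphrase_reward(original, paraphrase, semantic_similarity,
                      sim_threshold=0.8, penalty=10.0):
    sim = semantic_similarity(original, paraphrase)
    reward = surface_divergence(original, paraphrase)
    if sim < sim_threshold:
        reward -= penalty * (sim_threshold - sim)   # heavy punishment term
    return reward
```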
I agree that this could plausibly work with a relatively small model and therefore result in a relatively low alignment tax. However, note that my view of this whole proposal as an alignment technique requires that you try and change your RL incentive structure when you see a deceptive mismatch between what the shoggoth says and what the face says. This is a significant alignment tax.
In my view, the misaligned behavior is split across the shoggoth and face in most cases. In the case which Daniel K anticipates, the Shoggoth is misaligned in the sense of explicitly (but honestly) plotting to deceive, while the face is misaligned in the sense of dishonestly concealing.
So I am interpreting you as asking why we might expect that specific division of responsibilities, rather than some other divisions such as those you mention.
One reason is that we already see this empirically in the o1 report.
However, to me, the distinction between your case 1 and 2 is not so important. Both scenarios you propose preserve the important thing: there is a mismatch between an honest CoT and a dishonest final answer, which would allow the deceptive final answer to be detected.
There could be other cases where this doesn't happen; this method is not a guaranteed lie detector. To me, the point is that it will at least sometimes happen. A lie detector that only sometimes works is better than no lie detector at all.
If you give a kid a diary and try to avoid giving them any reason to think you're reading it, then the kid might write honestly in the diary and reveal some deceptions you wouldn't previously have known about. The kid might also do any number of other things. However, setting aside the ethics of snooping on your kid and lying to them about it, it's better than nothing.
See here for an updated version of my thinking based on discussion with Daniel:
https://www.lesswrong.com/posts/Tzdwetw55JNqFTkzK/why-don-t-we-just-shoggoth-face-paraphraser
Yeah, I agree that's at least somewhat true. So, given that, is it a good move or not? I don't see much of an upside, since there's a heavy financial incentive for the big labs to do little about concerning results observed in such models. IE, when it comes to the question of whether frontier labs training o1-like models should follow OpenAI's example of avoiding safety training for the CoT, I think it's right to discourage them from doing this rather than encourage them... although, I say this with very low confidence.