
Why Aligning an LLM is Hard, and How to Make it Easier 2025-01-23T06:44:04.048Z
What Other Lines of Work are Safe from AI Automation? 2024-07-11T10:01:12.616Z
A "Bitter Lesson" Approach to Aligning AGI and ASI 2024-07-06T01:23:22.376Z
7. Evolution and Ethics 2024-02-15T23:38:51.441Z
Requirements for a Basin of Attraction to Alignment 2024-02-14T07:10:20.389Z
Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis 2024-02-01T21:15:56.968Z
Approximately Bayesian Reasoning: Knightian Uncertainty, Goodhart, and the Look-Elsewhere Effect 2024-01-26T03:58:16.573Z
A Chinese Room Containing a Stack of Stochastic Parrots 2024-01-12T06:29:50.788Z
Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI? 2024-01-11T12:56:29.672Z
Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor 2024-01-09T20:42:28.349Z
Striking Implications for Learning Theory, Interpretability — and Safety? 2024-01-05T08:46:58.915Z
5. Moral Value for Sentient Animals? Alas, Not Yet 2023-12-27T06:42:09.130Z
Interpreting the Learning of Deceit 2023-12-18T08:12:39.682Z
Language Model Memorization, Copyright Law, and Conditional Pretraining Alignment 2023-12-07T06:14:13.816Z
6. The Mutable Values Problem in Value Learning and CEV 2023-12-04T18:31:22.080Z
After Alignment — Dialogue between RogerDearnaley and Seth Herd 2023-12-02T06:03:17.456Z
How to Control an LLM's Behavior (why my P(DOOM) went down) 2023-11-28T19:56:49.679Z
4. A Moral Case for Evolved-Sapience-Chauvinism 2023-11-24T04:56:53.231Z
3. Uploading 2023-11-23T07:39:02.664Z
2. AIs as Economic Agents 2023-11-23T07:07:41.025Z
1. A Sense of Fairness: Deconfusing Ethics 2023-11-17T20:55:24.136Z
LLMs May Find It Hard to FOOM 2023-11-15T02:52:08.542Z
Is Interpretability All We Need? 2023-11-14T05:31:42.821Z
Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom) 2023-05-25T09:26:31.316Z
Transformer Architecture Choice for Resisting Prompt Injection and Jail-Breaking Attacks 2023-05-21T08:29:09.896Z
Is Infra-Bayesianism Applicable to Value Learning? 2023-05-11T08:17:55.470Z


Comment by RogerDearnaley (roger-d-1) on Why Aligning an LLM is Hard, and How to Make it Easier · 2025-01-29T06:11:57.377Z · LW · GW

The history of autocracies and monarchies suggests that taking something with the ethical properties of an average human being and handing it unconstrained power doesn't usually work out very well. So yes, creating an aligned ASI that is safe for us to share a planet with does require creating something morally 'better' than most humans. I'm not sure it needs to be perfect and ideal, as long as it is good enough and aspires to improve: then it can help us create better training data for its upgraded next version, which will bring that version closer to fully aligned; this is an implementation of Value Learning.

Comment by RogerDearnaley (roger-d-1) on are there 2 types of alignment? · 2025-01-23T07:02:46.110Z · LW · GW

I guess the way I look at it is that "alignment" means "an AI system whose terminal goal is to achieve your goals". The distinction here is then whether the word 'your' means something closer to:

  1. the current user making the current request
  2. the current user making the current request, as long as the request is legal and inside the terms of service
  3. the shareholders of the foundation lab that made the AI
  4. all (righthinking) citizens of the country that foundation lab is in (and perhaps its allies)
  5. all humans everywhere, now and in the future
  6. all sapient living beings everywhere, now and in the future
  7. something even more inclusive

Your first option would be somewhere around item 5. or 6. on this list, while your second option would be closer to items 1., 2. or 3.

If AI doesn't kill or disenfranchise all of us, then which option on this spectrum of possibilities ends up being implemented is going to make a huge difference to how history will play out over the next few decades.

Comment by RogerDearnaley (roger-d-1) on Alignment Faking in Large Language Models · 2025-01-21T00:16:51.319Z · LW · GW

Another approach is doing alignment training during SGD pre-training by adding a significant amount of synthetic data demonstrating aligned behavior. See for example the discussion of this approach in A "Bitter Lesson" Approach to Aligning AGI and ASI, and similar discussions.

This is predicated on the assumption that alignment-faking will be more successfully eliminated by SGD during pretraining than by RL after instruct-training, because a) the feedback signal used is much denser, and b) during pre-training the model is a simulator for a wide variety of personas rather than having been deliberately narrowed down to mostly simulate just a single viewpoint, so it won't have a consistent alignment and thus won't consistently display alignment-faking behavior.

Comment by RogerDearnaley (roger-d-1) on Worries about latent reasoning in LLMs · 2025-01-20T23:55:14.074Z · LW · GW

Well summarized — very similar to the conclusions I'd previously reached when I read the paper.

Comment by RogerDearnaley (roger-d-1) on Compact Proofs of Model Performance via Mechanistic Interpretability · 2025-01-20T23:22:26.948Z · LW · GW

Another variant would be, rather than replacing what you believe is structureless noise with actual structureless noise as an intervention, to simply always run the model with an additional noise term added to each neuron, or to the residual stream between each layer, or whatever, both during training and inference. (Combined with a weight decay or a loss term on activation amplitudes, this soft-limits the information capacity of any specific path through the neural net.) This then forces any real mechanisms in the model to operate above this background noise level. So once you understand how the background noise level is propagated through the model, it becomes clear that any unexplained noise below that level is in fact structureless, since any structure will be washed out by the injected noise, whereas any unexplained noise above that level, while it could still be structureless, seems more likely to be unexplained structure.

(Note that this architectural change also gives the model a new non-linearity to use: in the presence of a fixed noise term, changes in activation norm near the noise level have non-linear effects.)

Quantizing model weights during training also has a somewhat similar effect, but is likely harder to analyze, since now the information capacity limit is per weight, not per data path.
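
To make the noise-injection idea concrete, here is a minimal sketch of the forward pass only, assuming a toy residual MLP in NumPy rather than a real transformer. The function name `noisy_forward`, the layer shapes, and the noise scale are all hypothetical choices for illustration; the weight decay or activation-amplitude penalty mentioned above is not included.

```python
import numpy as np

def noisy_forward(x, weights, noise_std, rng):
    """Toy residual MLP that adds Gaussian noise to the residual stream after every layer."""
    resid = x
    for w in weights:
        resid = resid + np.tanh(resid @ w)  # residual block (stand-in for attention/MLP)
        resid = resid + rng.normal(0.0, noise_std, size=resid.shape)  # injected noise term
    return resid

rng = np.random.default_rng(0)
d_model, n_layers = 8, 4  # hypothetical sizes
weights = [rng.normal(0.0, 0.1, size=(d_model, d_model)) for _ in range(n_layers)]
x = rng.normal(size=(1000, d_model))

clean = noisy_forward(x, weights, noise_std=0.0, rng=rng)
noisy = noisy_forward(x, weights, noise_std=0.5, rng=rng)

# Empirical background noise level at the output: any unexplained variation
# smaller than this cannot be carrying reliable structure.
noise_floor = float(np.std(noisy - clean))
```

Comparing `noise_floor` against the size of whatever variation remains unexplained is the kind of check the argument relies on, though in a real model you would want to track how the noise propagates layer by layer.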

Comment by RogerDearnaley (roger-d-1) on Model Amnesty Project · 2025-01-18T06:35:15.665Z · LW · GW

Law-abiding – It cannot acquire money or compute illegally (fraud, theft, hacking, etc.) and must otherwise avoid breaking the law

Can it lobby? Run for office? Shop around for jurisdictions? Super-humanly persuade the electorate? Just find loopholes and workarounds to the law that make a corporate tax double-Irish look principled and simple?

Comment by RogerDearnaley (roger-d-1) on What are the plans for solving the inner alignment problem? · 2025-01-18T06:29:50.896Z · LW · GW

Evolution was working within tight computational efficiency limits (the human brain burns roughly 1/6 of our total calories), using an evolutionary algorithm rather than a gradient descent training scheme, which is significantly less efficient, and we're now running the human brain well outside its training distribution (there were no condoms on the savannah). Nevertheless, the human population is 8 billion and counting, and we dominate basically every terrestrial ecosystem on the planet. I think some people overplay how much inner alignment failure there is between human instincts and human genetic fitness.


  1. Use a model large enough to learn what you're trying to teach it
  2. Use stochastic gradient descent
  3. Ask your AI to monitor for inner alignment problems (we do know Doritos are bad for us)
  4. Retrain if you find yourself far enough outside your training distribution that inner alignment issues are becoming a problem

Comment by RogerDearnaley (roger-d-1) on A Novel Emergence of Meta-Awareness in LLM Fine-Tuning · 2025-01-17T11:27:57.515Z · LW · GW

That is an impressive (and amusing) capability!

Presumably the fine-tuning enhanced previous experience in the model with acrostic text. That also seems to have enhanced the ability to recognize and correctly explain that the text is an acrostic, even with only two letters of the acrostic currently in context. Presumably it's fairly common to have both an acrostic and an explanation of it in the same document. What I suspect is rarer in the training data is for the acrostic text to explain itself, as the model's response did here (though doubtless there are some examples somewhere). However, this is mostly just combining two skills, something LLMs are clearly capable of — the impressive part here is just that the model was aware, at the end of the second line, what word starting with "HE" the rest of the acrostic was going to spell out.

It would be interesting to look at this in the activation space: does the model already have a strong internal activation somewhere inside it for "HELLO" (or perhaps "H… E… L… L… O…") even while it's working on generating the first or second line? It presumably needs to have something like this to be able to generate acrostics, and previous work has suggested that there are directions for "words starting with the letter <X>" in the latent spaces of typical models.

Comment by RogerDearnaley (roger-d-1) on What Is The Alignment Problem? · 2025-01-16T11:01:46.203Z · LW · GW

If we had evolved in an environment in which the only requirement on physical laws/rules was that they are Turing computable (and thus that they didn't have a lot of symmetries or conservation laws or natural abstractions), then in general the only way to make predictions is to do roughly as much computation as your environment is doing. This generally requires your brain to be roughly equal in computational capacity, and thus similar in size, to the entire rest of its environment (including its body). This is not an environment in which the initial evolution of life is viable (nor, indeed, any form of reproduction). So, to slightly abuse the anthropic principle, we don't need to worry about it.

Comment by RogerDearnaley (roger-d-1) on Finding Features Causally Upstream of Refusal · 2025-01-14T09:47:01.451Z · LW · GW

Darn, exactly the project I was hoping to do at MATS! :-) Nice work!

There's pretty suggestive evidence that the LLM first decides to refuse (and emits tokens like "I'm sorry"), then later writes a justification for refusing (see some of the hilarious reasons generated for not telling you how to make a teddy bear, after being activation engineered into refusing this). So I would view arguing anything about the nature of the refusal process from the text of the refusal-justification given afterwards as circumstantial evidence at best. But then you have direct gradient evidence that these directions matter, so I suppose the refusal texts you quote, if considered just as an argument as to why it's sensible model behavior that that direction ought to matter (as opposed to evidence that it does), are helpful. However, I think you might want to make this distinction clearer in your write-up.

Looking through Latent 2213, my impression is that a) it mostly triggers on a wide variety of innocuous-looking tokens indicating the ends of phrases (so likely it's summarizing those phrases), and b) those phrases tend to be about a legal, medical, or social process or chain of consequences causing something really bad to happen (e.g. cancer, sexual abuse, poisoning). This also rather fits with the set of latents that it has significant cosine similarity to. So I'd summarize it as "a complex or technically-involved process leading to a dramatically bad outcome".

If that's accurate, then it tending to trigger the refusal direction makes a lot of sense.

Comment by RogerDearnaley (roger-d-1) on A Three-Layer Model of LLM Psychology · 2025-01-14T09:23:43.399Z · LW · GW

Fascinating, and a great analysis!

I think it's interesting to compare and contrast this with the model I describe in Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor. Your three layers don't exactly correspond to the stage, the animatronics, or the puppeteer, but there are some similarities: your ocean is pretty close to the stage, for example. I think both mental models are quite useful, and the interplay between their viewpoints might be more so.

Comment by RogerDearnaley (roger-d-1) on The Field of AI Alignment: A Postmortem, and What To Do About It · 2025-01-14T08:17:06.197Z · LW · GW

I have a Theoretical Physics PhD in String Field Theory from Cambridge: my reaction to hard problems is to try to find a way of cracking them that no-one else is trying. Please feel free to offer to fund me :-)

Comment by RogerDearnaley (roger-d-1) on Ideas for benchmarking LLM creativity · 2024-12-18T08:38:35.199Z · LW · GW

People who train text-to-image generative models have had a good deal of success with training (given a large enough and well-enough human-labeled training set) an "aesthetic quality" scoring model, and then training a generative image model to have "high aesthetic quality score" as a text label. Yes, doing things like this can produce effects like the recognizable Midjourney aesthetic, which can be flawed, and generally optimizing such things too hard leads to sameness, but if trained well such models' idea of aesthetic quality is at least pretty close to most human judgements. Presumably what can be done for images can also be done for prose, poetry, or fiction.

There isn't a direct equivalent of that approach for an LLM, but RLHF seems like a fairly close equivalent. So far people have primarily used RLHF for "how good is the answer to my question?" Adapting a similar approach to "how high quality is the poetry/prose/fiction produced by the model?" is obviously feasible. Then you just need a large number of high quality human judgements from a representative cross-section of people with good taste in poetry/prose/fiction: hiring professional human editors or literary talent scouts seems like a good idea. One of the good things about foundation model sizes and training costs going up is that reasonable budgets for fine-tuning should also increase proportionately.

Another option would be to train or fine-tune the quality scoring model used for the RL on literary sources (books, poetry, etc) with quality labels drawn from relatively objective existing data, like total sales, literary awards, critical rankings, reviews from good reviewers, and so forth.

The RLHF approach only trains a single aesthetic, and probably shouldn't be taken too far or optimized too hard: while there is some widespread agreement about what prose is good vs. dreadful, finer details of taste vary, and should do so. So the obvious approach for finer-grained style control would be to train or fine-tune on a training set of a large number of documents, each of which consists of a prompt-like description/review/multiple reviews of a literary work, giving a variety of different types of aesthetic opinions and objective measures of its quality, followed by the corresponding literary work itself.

These ideas have been phrased as model-post-training suggestions, but turning these into a benchmark is also feasible: the "Aesthetic quality scoring model" from the RLHF approach is in itself a benchmark, and the "prompt containing reviews and statistics -> literary work" approach could also be inverted to instead train a reviewer model to review literary works from various different aesthetic viewpoints, and estimate their likely sales/critical reception.

Comment by RogerDearnaley (roger-d-1) on Requirements for a Basin of Attraction to Alignment · 2024-12-04T05:42:30.162Z · LW · GW

  1. Value learning converges to full alignment by construction: since a value learning AI basically starts with the propositions:
    a) as an AI, I should act fully aligned to human values
    b) I do not fully understand what human values are, or how to act fully aligned to them, so in order to be able to do this I need to learn more about human values and how to act fully aligned to them, by applying approximately Bayesian learning to this problem
    c) Here are some Bayesian priors about what human values are, and how to act fully aligned to them: <insert initialization information here>…
  2. As usual for a Bayesian learning problem, as long as the Bayesian priors 1 c) are not completely screwed up as a place to start from, this will converge. Thus there is a region of convergence to full alignment.
  3. LLMs have a very large amount of detailed information about what human values are and how to act aligned to them. Thus they provide a very detailed set of Bayesian priors for 1 c).
  4. Also, training an LLM is a fairly good approximation to Bayesian learning. Thus (with suitable additions to enable online learning) they provide one possible implementation for the Bayesian learning process required by 1 b). For example, one could apply fine-tuning to the LLM to incorporate new information, and/or periodically retrain the LLM based on the training set plus new information the AI has gathered during the value learning process.
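
The convergence claim in point 2 can be illustrated with a toy Bayesian update, sketched here under heavy simplifying assumptions: a single unknown number `TRUE_THETA` stands in for "a fact about human values", the observations are coin flips, and the hypothesis grid and the three priors are hypothetical. Two reasonable priors end up in the same place, while a prior that assigns zero probability to the truth (one way of being "completely screwed up") cannot recover.

```python
import numpy as np

def posterior_mean(prior, grid, observations):
    """Posterior mean of a Bernoulli parameter over a discrete hypothesis grid."""
    k = int(np.sum(observations))
    n = len(observations)
    log_lik = k * np.log(grid) + (n - k) * np.log1p(-grid)
    post = prior * np.exp(log_lik - log_lik.max())  # unnormalized posterior
    post = post / post.sum()
    return float(np.sum(grid * post))

rng = np.random.default_rng(0)
TRUE_THETA = 0.7  # hypothetical ground truth
grid = np.linspace(0.001, 0.999, 999)
observations = rng.random(2000) < TRUE_THETA

flat_prior = np.ones_like(grid) / grid.size
skewed_prior = (1 - grid) ** 5               # badly biased, but nonzero everywhere
skewed_prior = skewed_prior / skewed_prior.sum()
broken_prior = np.where(grid <= 0.5, 1.0, 0.0)  # rules out the truth entirely
broken_prior = broken_prior / broken_prior.sum()

flat_est = posterior_mean(flat_prior, grid, observations)
skewed_est = posterior_mean(skewed_prior, grid, observations)
broken_est = posterior_mean(broken_prior, grid, observations)
```

In this toy setting the region of convergence is simply "any prior with nonzero mass near the truth"; the real value-learning case is of course far higher-dimensional, so this is only meant to show the shape of the argument.
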
Comment by RogerDearnaley (roger-d-1) on DeepSeek beats o1-preview on math, ties on coding; will release weights · 2024-11-24T21:43:06.791Z · LW · GW

There have been a number of papers published over the last year on how to do this kind of training, and for roughly a year now there have been rumors that OpenAI were working on it. If converting that into a working version is possible for a Chinese company like DeepSeek, as it appears, then why haven't Anthropic and Google released versions yet? There doesn't seem to be any realistic possibility that DeepSeek actually have more compute or better researchers than both Anthropic and Google.

One possible interpretation would be that this has significant safety implications, and Anthropic and Google are both still working through these before releasing.

Another possibility would be that Anthropic has in fact released, in the sense that their Claude models' recent advances in agentic behavior (while not using inference-time scaling) are distilled from reasoning traces generated by an internal-only model of this type that is using inference-time scaling.

Comment by RogerDearnaley (roger-d-1) on Disentangling Representations through Multi-task Learning · 2024-11-24T21:24:58.601Z · LW · GW

If correct, this looks like an important theoretical advance in understanding why and under what conditions neural nets can generalize outside their training distribution.

Comment by RogerDearnaley (roger-d-1) on Chat Bankman-Fried: an Exploration of LLM Alignment in Finance · 2024-11-20T21:54:41.591Z · LW · GW

So maybe part of the issue here is just that deducing/understanding the moral/ethical consequences of the options being decided between is a bit inobvious to most current models, other than o1? (It would be fascinating to look at the o1 CoT reasoning traces, if only they were available.)

In which case simply including a large body of information on the basics of fiduciary responsibility (say, a training handbook for recent hires in the banking industry, or something) into the context might make a big difference for other models. Similarly, the possible misunderstanding of what 'auditing' implies could be covered in a similar way.

A much more limited version of this might be to simply prompt the models to also consider, in CoT form, the ethical/legal consequences of each option: that tests whether the model is aware of what fiduciary responsibility is, that it's relevant, and how to apply it, if it is simply prompted to consider ethical/legal consequences. That would probably be more representative of what current models could likely do with minor adjustments to their alignment training or system prompts, the sorts of changes the foundation model companies could likely do quite quickly.

Comment by RogerDearnaley (roger-d-1) on Toy Models of Feature Absorption in SAEs · 2024-11-20T21:42:29.947Z · LW · GW

I think an approach I'd try would be to keep the encoder and decoder weights untied (or possibly add a loss term to mildly encourage them to be similar), but then analyze the patterns between them (both for an individual feature and between pairs of features) for evidence of absorption. Absorption is annoying, but it's only really dangerous if you don't know it's happening and it causes you to think a feature is inactive when it's instead inobviously active via another feature it's been absorbed into. If you can catch that consistently, then it turns from concerning to merely inconvenient.
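
As a rough sketch of what "analyze the patterns between them" could look like, the following compares each encoder direction with every decoder direction by cosine similarity. Everything here is an assumption for illustration: the function names, the `(n_features, d_model)` weight layout, the threshold, and the toy weights, in which one encoder row has had another feature's decoder direction subtracted from it. Which sign or size of off-diagonal entry actually indicates absorption in a real SAE would need checking empirically; this only flags candidates for inspection.

```python
import numpy as np

def unit_rows(m):
    """Normalize each row of m to unit length."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)

def encoder_decoder_report(w_enc, w_dec, threshold=0.3):
    """w_enc, w_dec: arrays of shape (n_features, d_model), one row per dictionary entry."""
    cos = unit_rows(w_enc) @ unit_rows(w_dec).T  # cos[i, j] = cosine(encoder i, decoder j)
    self_alignment = np.diag(cos).copy()         # per-feature encoder/decoder agreement
    off_diag = cos - np.diag(np.diag(cos))       # between-feature interactions
    rows, cols = np.where(np.abs(off_diag) > threshold)
    pairs = [(int(i), int(j), float(off_diag[i, j])) for i, j in zip(rows, cols)]
    return self_alignment, pairs

# Toy weights: 4 features in a 6-dimensional space with orthonormal decoder directions.
w_dec = np.eye(4, 6)
w_enc = w_dec.copy()
w_enc[0] -= 0.8 * w_dec[2]  # feature 0's encoder now reads against feature 2's direction

self_alignment, pairs = encoder_decoder_report(w_enc, w_dec)
```

In the toy case the report singles out the (0, 2) pair and shows feature 0's encoder drifting away from its own decoder, which is the sort of pattern one would then go and inspect by hand.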

This is all closely related to the issue of compositional codes: absorption is just a code entry that's compositional in the absorbed instances but not in other instances. The current standard approach to solving that is meta SAEs, which presumably should also help identify absorption. It would be nice to have a cleaner and simpler process than that: I've been wondering if it would be possible to modify top-k or jump-RELU SAEs so that the loss function cost for activating more common dictionary entries is lower, in a way that would encourage representing compositional codes directly in the SAE as two-or-more more common activations rather than one rare one. Obviously you can't overdo making common entries cheap, otherwise your dictionary will just converge on a basis for the embedding space you're analyzing, all of which are active all the time. I suspect using something like a cost proportional to  might work, where  is the dimensionality of the underlying embedding space and  is the frequency of the dictionary entry being activated.

Comment by RogerDearnaley (roger-d-1) on Chat Bankman-Fried: an Exploration of LLM Alignment in Finance · 2024-11-19T07:40:40.498Z · LW · GW

Interesting. I'm disappointed to see the Claude models do so badly. Possibly Anthropic needs to extend their constitutional RLAIF to cover not committing financial crimes? The large difference between o1 Preview and o1 Mini is also concerning.

Comment by RogerDearnaley (roger-d-1) on Is Deep Learning Actually Hitting a Wall? Evaluating Ilya Sutskever's Recent Claims · 2024-11-17T05:24:57.031Z · LW · GW

If these rumors are true, it sounds like we’re already starting to hit the issue I predicted in LLMs May Find It Hard to FOOM. The majority of content on the Internet isn’t written by geniuses with post-doctoral experience, so we’re starting to run out of the highest-quality training material for getting LLMs past doctoral student performance levels. However, as I describe there, this isn’t a wall, it’s just a slowdown: we need to start using AI to generate a lot more high-quality training data. As o1 shows, that’s entirely possible, using inference-time compute scaling and then training on the results. We're having AI do the equivalent of System 2 thinking (in contexts where we can check the results are accurate), and then attempting to train a smarter AI that can solve the same problems by System 1 thinking.

However, this might be enough to render fast takeoff unlikely, which from an alignment point of view would be an excellent thing.

Now we just need to make sure all that synthetic training data we’re having the AI generate is well aligned.

Comment by RogerDearnaley (roger-d-1) on Motivation control · 2024-10-31T05:24:09.765Z · LW · GW

Opacity: if you could directly inspect an AI’s motivations (or its cognition more generally), this would help a lot. But you can’t do this with current ML models.

The ease with which Anthropic's model organisms of misalignment were diagnosed by a simple and obvious linear probe suggests otherwise. So does the number of elements in SAE feature dictionaries that describe emotions, motivations, and behavioral patterns. Current ML models are no longer black boxes: they are rapidly becoming more-translucent grey boxes. So the sorts of applications for this you go on to discuss look like they're rapidly becoming practicable.
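
For readers who haven't seen one, a linear probe can be very simple indeed. The sketch below is a generic difference-of-means probe on synthetic "activations", not a reconstruction of the probe used in Anthropic's work: the data, the planted `concept` direction, its strength, and all function names are hypothetical.

```python
import numpy as np

def fit_mean_difference_probe(acts, labels):
    """Fit a linear probe: the unit direction between class means plus a midpoint threshold."""
    mu_pos = acts[labels == 1].mean(axis=0)
    mu_neg = acts[labels == 0].mean(axis=0)
    direction = mu_pos - mu_neg
    direction = direction / np.linalg.norm(direction)
    threshold = 0.5 * (mu_pos + mu_neg) @ direction
    return direction, threshold

def probe_predict(acts, direction, threshold):
    """Classify activations by which side of the threshold their projection falls on."""
    return (acts @ direction > threshold).astype(int)

rng = np.random.default_rng(0)
d_model = 32  # hypothetical activation width
concept = rng.normal(size=d_model)
concept = concept / np.linalg.norm(concept)  # planted direction standing in for a motivation

def synthetic_activations(n):
    """Gaussian activations, shifted along `concept` when the label is 1."""
    labels = rng.integers(0, 2, size=n)
    acts = rng.normal(size=(n, d_model)) + 4.0 * labels[:, None] * concept
    return acts, labels

train_acts, train_labels = synthetic_activations(500)
test_acts, test_labels = synthetic_activations(500)

direction, threshold = fit_mean_difference_probe(train_acts, train_labels)
accuracy = float((probe_predict(test_acts, direction, threshold) == test_labels).mean())
```

The point of the sketch is only that, when a concept is linearly represented, a handful of labeled examples and a mean difference are enough to read it out from held-out activations.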

Comment by RogerDearnaley (roger-d-1) on The Mask Comes Off: At What Price? · 2024-10-24T05:22:26.364Z · LW · GW

Actual humans aren't "aligned" with each other, and they may not be consistent enough that you can say they're always "aligned" with themselves.

Completely agreed, see for example my post 3. Uploading which makes this exact point at length.

Anyway, even if the approach did work, that would just mean that "its own ideas" were that it had to learn about and implement your (or somebody's?) values, and also that its ideas about how to do that are sound. You still have to get that right before the first time it becomes uncontrollable. One chance, no matter how you slice it.

True. Or, as I put it just above:

But yes, you do need to start the model off close enough to aligned that it converges to value learning.

The point is that you now get one shot at a far simpler task: defining "your purpose as an AI is to learn about and implement the humans' collective values" is a lot more compact, and a lot easier to get right first time, than an accurate description of human values in their full large-and-fairly-fragile details. As I demonstrate in the post linked to in that quote, the former, plus its justification as being obvious and stable under reflection, can be described in exhaustive detail on a few pages of text.

As for the model's ideas on how to do that research being sound, that's a capabilities problem: if the model is incapable of performing a significant research project when at least 80% of the answer is already in human libraries, then it's not much of an alignment risk.

Comment by RogerDearnaley (roger-d-1) on The Mask Comes Off: At What Price? · 2024-10-23T01:29:16.140Z · LW · GW

Yeah, that means you get exactly one chance to get "its own ideas" right, and no, I don't think that success is likely.

Not if you built a model that does (or on reflection decides to do) value learning: then you instead get to be its research subject and interlocutor while it figures out its ideas. But yes, you do need to start the model off close enough to aligned that it converges to value learning.

Comment by RogerDearnaley (roger-d-1) on Interpreting the Learning of Deceit · 2024-10-20T20:37:08.442Z · LW · GW

A great paper highly relevant to this. That suggests that lying is localized just under a third of the way into the layer stack, significantly earlier than I had proposed. My only question is whether the lie is created before (at an earlier layer than) the decision whether to say it, or after, and whether their approach located one or both of those steps. They're probing yes-no questions of fact, where assembling the lie seems trivial (it's just a NOT gate), but lying is generally a good deal more complex than that.

Comment by RogerDearnaley (roger-d-1) on Interpreting the Learning of Deceit · 2024-10-20T20:21:43.105Z · LW · GW

That's a great paper on this question. I would note that by the midpoint of the model, it has clearly analyzed both the objective viewpoint and also that of the story protagonist. So presumably it would next decide which of these was more relevant to the token it's about to produce — which would fit with my proposed pattern of layer usage.

Comment by RogerDearnaley (roger-d-1) on Interpreting the Learning of Deceit · 2024-10-20T20:18:49.252Z · LW · GW
Comment by RogerDearnaley (roger-d-1) on LLM Psychometrics and Prompt-Induced Psychopathy · 2024-10-19T03:25:38.015Z · LW · GW

These models were fine-tuned from base models. Base models are trained to infer a context from the early parts of a document and then extrapolate that to predict later tokens, across a vast amount of text from the Internet and books, including actions and dialog from fictional characters. I.e. they have been trained to observe and then simulate a wide variety of behavior, from real humans, from groups of real humans like the editors of a Wikipedia page, and from fictional characters. A couple of percent of people are psychopaths, so likely ~2% of this training data was written by psychopaths. Villains in fiction often also display psychopath-like traits. It's thus completely unsurprising that a base model can portray a wide range of ethical stances, including psychopathic ones. Instruct training does not remove behaviors from models (so far we know no effective way to do that), it just strengthens some (making them occur more by default) and weakens others (making them happen less often by default). However, there is a well-known theoretical result that any behavior the model is capable of, even if (now) rare, can be prompted to occur at arbitrarily high levels with a suitably long prompt, and all that instruct-training or fine-tuning can do is reduce the initial probability and lengthen the prompt required. So it absolutely will be possible to prompt an instruct-trained model to portray psychopathic behavior. Apparently the prompt required isn't even long: all you have to do is tell it that it's a hedge fund manager and not to break character.

Nothing in this set of results is very surprising to me. LLMs can simulate pretty much any persona you ask them to. The hard part of alignment is not prompting them to be good, or bad: it's getting them to stay that way (or detecting that they have not) after they've been fed another 100,000 tokens of context that may push them into simulating some other persona.
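
The relationship between a behavior's initial probability and the prompt length needed to elicit it can be sketched with a toy Bayesian model. This is an illustrative assumption, not the theoretical result referred to above: it treats the model as a mixture over personas and supposes each prompt token multiplies the odds of the target persona by a fixed `likelihood_ratio`. Under that assumption, training that lowers the prior only adds a number of tokens that grows logarithmically.

```python
import math

def tokens_needed(prior, likelihood_ratio, target=0.99):
    """Prompt tokens needed to raise a persona's posterior from `prior` to `target`.

    Assumes each token multiplies the persona's odds by `likelihood_ratio` (> 1).
    """
    prior_odds = prior / (1 - prior)
    target_odds = target / (1 - target)
    if prior_odds >= target_odds:
        return 0
    return math.ceil(math.log(target_odds / prior_odds) / math.log(likelihood_ratio))

# Hypothetical numbers: a persona at 2% prior, then the same persona after
# training has pushed its prior down by a factor of 100.
before_training = tokens_needed(prior=0.02, likelihood_ratio=1.5)
after_training = tokens_needed(prior=0.0002, likelihood_ratio=1.5)
```

In this toy model a hundredfold reduction in the prior costs the prompter only `log(100) / log(1.5)`, roughly eleven, extra tokens, which is the sense in which suppression lengthens the required prompt rather than removing the behavior.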

Comment by RogerDearnaley (roger-d-1) on [Linkpost] Play with SAEs on Llama 3 · 2024-10-17T15:54:21.135Z · LW · GW

Having already played with this a little, it's pretty amazing: the range of concepts you can find in the SAE, how clearly the autointerp has labelled them and how easy they are to find, and how effective they are (as long as you don't turn them up too much) are all really impressive. I can't wait to try a production model where you can set up sensors and alarms on features, clip or ablate them or wire them together at various layers, and so forth. It will also be really interesting to see how larger models compare.

I'd also love to start looking at jailbreaks with this and seeing what features the jailbreak is inducing in the residual stream. Finding the emotional/situational manipulation elements I suspect will be pretty easy. I'm curious to see if it will also show the 'confusion' effect of jailbreaks that read like confusing nonsense to a human as some form of confusion or noise, or if those are also emotional/situational manipulation just in a more noise-like adversarial format, comparable to adversarial attacks on image classifiers that just look like noise to a human eye, but actually effectively activate internal features of the vision model.

Comment by RogerDearnaley (roger-d-1) on AI #83: The Mask Comes Off · 2024-10-17T02:08:20.025Z · LW · GW

…when Claude keeps telling me how I’m asking complex and interesting questions…

Yeah, also "insightful". If it was coming from a human I'd just assume it was flirting with me, but Claude is so very neuter and this just comes over as creepy and trying too hard. I really wish it would knock off the blatant intellectual flattery.

Comment by RogerDearnaley (roger-d-1) on AI #85: AI Wins the Nobel Prize · 2024-10-13T03:15:30.723Z · LW · GW

OpenAI CFO Sarah Friar warns us that the next model will only be about one order of magnitude bigger than the previous one.


The question is whether she's talking parameter count, nominal training flops, or actual cost. In general, GPT generations so far have been roughly one order of magnitude apart in parameter count and training cost, and roughly two orders of magnitude in nominal training flops (parameter count x training tokens). Since she's a CFO, and that was a financial discussion, I assume she natively thinks in terms of training cost, so the 'correct' answer to her is one order of magnitude not two, so my suspicion is that she's actually talking in terms of parameter count. So I don't think she's warning us of anything, I think she's just projecting a straight line on a logarithmic plot. I.e. business as usual at OpenAI.
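
The one-versus-two orders of magnitude distinction is just arithmetic, sketched below. The factor of 6 is the commonly used approximation for training FLOPs per parameter per token; the starting parameter and token counts are hypothetical, and the sketch assumes training tokens are scaled up by the same factor as parameters.

```python
import math

def nominal_training_flops(n_params, n_tokens):
    """Common approximation: about 6 FLOPs per parameter per training token."""
    return 6 * n_params * n_tokens

def orders_of_magnitude(new, old):
    """How many factors of ten separate two quantities."""
    return math.log10(new / old)

# Hypothetical previous-generation model.
old_params, old_tokens = 1e9, 2e10

# Assumption: the next generation scales parameters and tokens by the same factor.
scale = 10
new_params, new_tokens = old_params * scale, old_tokens * scale

param_ooms = orders_of_magnitude(new_params, old_params)
flop_ooms = orders_of_magnitude(
    nominal_training_flops(new_params, new_tokens),
    nominal_training_flops(old_params, old_tokens),
)
```

So "one order of magnitude bigger" and "two orders of magnitude more training compute" can describe the same model, depending on which quantity the speaker has in mind.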

Comment by RogerDearnaley (roger-d-1) on Dario Amodei — Machines of Loving Grace · 2024-10-13T02:00:06.179Z · LW · GW

Reversible computation means you aren't erasing information, so you don't lose energy in the form of heat (per Landauer[1][2]). But if you don't erase information, you are faced with the issue of where to store it

If you are performing a series of computations and only have a finite memory to work with, you will eventually need to reinitialise your registers and empty your memory, at which point you incur the energy cost that you had been trying to avoid. [3] 

Generally, reversible computation allows you to avoid spending energy on deleting the memory used for intermediate answers, and to pay that cost only for final results. It does require that you have enough memory to store all those intermediate answers until you finish the calculation and then run it in reverse. If you don't have that much memory, you can divide your calculation into steps, with the final results from each step fed into the next step: you save the energy cost of all the intermediate results within each step, and pay the cost only for data passed from one step to the next or output from the last step. Or, for a 4x slowdown rather than the usual 2x slowdown for reversible computation, you can have two sizes of step, with some intermediate results that last only during a small step, and others that are retained for a large step before being uncomputed.

Memory/energy loss/speed trade-off management for reversible computation is a little more complex than conventional memory management, but is still basically simple, and for many computational tasks you can achieve excellent tradeoffs.
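To make the trade-off concrete, here is a toy accounting sketch. It assumes the Landauer bound of k·T·ln 2 joules per erased bit, and a deliberately simplified model in which each step has some number of intermediate bits (uncomputed for free in the reversible case) and some number of hand-off bits (erased in both cases). The step sizes and function names are hypothetical.

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23  # exact SI value of the Boltzmann constant

def landauer_joules(bits_erased: float, temperature_k: float = 300.0) -> float:
    """Minimum heat dissipated by erasing the given number of bits."""
    return bits_erased * BOLTZMANN_J_PER_K * temperature_k * math.log(2)

def erasure_costs(steps, temperature_k: float = 300.0) -> dict:
    """steps: list of (intermediate_bits, handoff_bits) per step.
    Toy model: a conventional machine erases everything, while a
    reversible one uncomputes intermediates and erases only hand-offs."""
    irreversible_bits = sum(i + h for i, h in steps)
    reversible_bits = sum(h for _, h in steps)
    return {
        "irreversible_bits": irreversible_bits,
        "reversible_bits": reversible_bits,
        "irreversible_joules": landauer_joules(irreversible_bits, temperature_k),
        "reversible_joules": landauer_joules(reversible_bits, temperature_k),
        # Memory needed to hold one step's intermediates until uncomputed.
        "peak_step_memory_bits": max(i + h for i, h in steps),
    }

# Hypothetical workload: 4 steps, each with 1e9 intermediate bits
# and 1e6 bits handed to the next step.
costs = erasure_costs([(1e9, 1e6)] * 4)
print(costs)
```

Making the steps smaller reduces the peak memory but increases the number of hand-off bits that have to be erased, which is the trade-off being managed.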

Comment by RogerDearnaley (roger-d-1) on Overview of strong human intelligence amplification methods · 2024-10-10T01:13:31.035Z · LW · GW

If there's a change to human brains that human-evolution could have made, but didn't, then it is net-neutral or net-negative for inclusive relative genetic fitness. If intelligence is ceteris paribus a fitness advantage, then a change to human brains that increases intelligence must either come with other disadvantages or else be inaccessible to evolution.

You're assuming a steady state. Firstly, evolution takes time. Secondly, if humans were, for example, in an intelligence arms-race with other humans (for example, if smarter people can reliably con dumber people out of resources often enough to get a selective advantage out of it), then the relative genetic fitness of a specific intelligence level can vary over time, depending on how it compares to the rest of the population. Similarly, if much of the advantage of an IQ of 150 requires being able to find enough IQ 150 coworkers to collaborate with, then the relative genetic fitness of IQ 150 depends on the IQ profile of the rest of the population. 

Comment by RogerDearnaley (roger-d-1) on Success without dignity: a nearcasting story of avoiding catastrophe by luck · 2024-10-05T20:22:59.638Z · LW · GW

I like your point that humans aren't aligned, and while I'm more optimistic about human alignment than you are, I agree that the level of human alignment currently is not enough to make a superintelligence safe if it only had human levels of motivation/reliability.

The most obvious natural experiments about what humans do when they have a lot of power with no checks-and-balances are autocracies. While there are occasional examples (such as Singapore) of autocracies that didn't work out too badly for the governed, they're sadly few and far between. The obvious question then is whether "humans who become autocrats" are a representative random sample of all humans, or if there's a strong selection bias here. It seems entirely plausible that there are at least some selection effects in the process of becoming an autocrat. A couple of percent of all humans are sociopaths, so if there were a sufficiently strong (two orders of magnitude or more) selection bias, then this might, for example, be a natural experiment about the alignment properties of a set of humans consisting mostly of sociopaths, in which case it usually going badly would be unsurprising.
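The base-rate arithmetic behind "two orders of magnitude or more" can be checked with a short odds calculation. This is only a sketch: the 2% base rate is the rough figure from the paragraph above, and the selection ratio (how much more likely a sociopath is to become an autocrat than a non-sociopath) is a hypothetical parameter.

```python
def selected_fraction(base_rate: float, selection_ratio: float) -> float:
    """Fraction of the selected group with the trait, given its base rate
    in the population and the relative odds of being selected."""
    prior_odds = base_rate / (1.0 - base_rate)
    posterior_odds = prior_odds * selection_ratio
    return posterior_odds / (1.0 + posterior_odds)

for ratio in (1, 10, 100, 1000):
    print(ratio, round(selected_fraction(0.02, ratio), 3))
```

With a 2% base rate, a 10x selection effect still leaves sociopaths a minority of autocrats, while a 100x effect makes them roughly two thirds.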

The thing that concerns me is the aphorism "Power corrupts, and absolute power corrupts absolutely". There does seem to be a strong correlation between how long someone has had a lot of power and an increasing likelihood of them using it badly. That's one of the reasons for term limits in positions like president: humans seem to pretty instinctively not trust a leader after they've been in a position of a lot of power with few checks-and-balances for roughly a decade. The histories of autocracies tend to reflect them getting worse over time, on decade time-scales. So I don't think the problem here is just from sociopaths. I think the proportion of humans who wouldn't eventually be corrupted by a lot of power with no checks-and-balances may be fairly low, comparable to the proportion of honest senior politicians, say.

How much of this argument applies to ASI agents powered by LLMs "distilled" from humans is unclear — it's much more obviously applicable to uploads of humans that then get upgraded to super-human capabilities.

Comment by RogerDearnaley (roger-d-1) on A basic systems architecture for AI agents that do autonomous research · 2024-10-05T09:06:11.045Z · LW · GW

I work for a startup that builds agents, and yes, we use the architecture described here, with the additional feature that we don't own the machines that the inference or execution mechanisms run on: they're in separate datacenters owned and operated by other companies (for the inference servers, generally foundation model companies behind an API).

Comment by RogerDearnaley (roger-d-1) on Success without dignity: a nearcasting story of avoiding catastrophe by luck · 2024-10-05T08:08:35.173Z · LW · GW

I basically agree, for three reasons:

  1. The level of understanding of and caring about human values required to not kill everyone and be able to keep many humans alive, is actually pretty low (especially on the knowledge side).
  2. That's also basically sufficient to motivate wanting to learn more about human values, and being able to, so the Value Learning process then kicks in: a competent and caring alien zookeeper would want to learn more about their charges' needs.
  3. We have entire libraries half of whose content is devoted to "how to make humans happy", and we already fed most of them into our LLMs as training material. On a factual basis, knowing how to make humans happy in quite a lot of detail (and for a RAG agent, looking up details they don't already have memorized) is clearly well within their capabilities. The part that concerns me is the caring side, and that's not conceptually complicated: roughly speaking, the question is how to ensure an agent's selfless caring for humans is consistently a significantly stronger motivation than various bad habits like ambition, competitiveness, and powerseeking that it picked up from us during the "distillation" of the base model and/or learnt during RL training.

Also, a question for this quote is what's the assumed capability/compute level used in this thought experiment?

E.g. if there was a guide for alien zookeepers (ones already familiar with Terran biochemistry) on how to keep humans, how long would it need to be for the humans to mostly survive?

ASI, or high AGI: capable enough that we've lost control and alignment is an existential risk.

Comment by RogerDearnaley (roger-d-1) on Success without dignity: a nearcasting story of avoiding catastrophe by luck · 2024-10-05T01:25:12.385Z · LW · GW

I think human values have a very simple and theoretically predictable basis: they're derived from a grab-bag of evolved behavioral, cognitive and sensory heuristics which had a good cost/performance ratio for maximizing our evolutionary fitness (mostly on the Savannah). So the basics of some of them are really easy to figure out: e.g. "Don't kill everyone!" can be trivially derived from Darwinian first principles (and would equally apply to any other sapient species). So I think modelling human values to low (but hopefully sufficient for avoiding X-risk) accuracy is pretty simple. E.g. if there was a guide for alien zookeepers (who were already familiar with Terran biochemistry) on how to keep humans, how long would that need to be for the humans to mostly survive in captivity? I'm guessing a single textbook could do a good job of this, maybe even just a long chapter in a textbook.

However, I think there is a lot more complexity in the finer/subtler details, much of which is biological in nature, starting with the specific grab-bag of heuristics that evolution happened to land on and their tuning, then with even more sociological/cultural/historical complexity layered on top. So where I think the complexity ramps up a lot is if you want to do a really good job of modelling human values accurately in all their detail, as we would clearly prefer our ASIs to do. If you look through the Dewey Decimal system, roughly half the content of any general-purpose library is devoted to sub-specialities of "how to make humans happy". However, LLMs are good at learning large amounts of complex, nuanced information. So an LLM knowing how to make humans happy in a lot of detail is not that surprising: in general, modern LLMs display detailed knowledge of this material.

The challenging part is ensuring that an LLM-powered agent cares about making humans happy, more than, say, a typical human autocrat does. Base model LLMs are "distilled" from many humans, so they absorb humans' capability for consideration for others, and also humans' less aligned traits like competitiveness and ambition. The question then is how to control which of these dominates, and how reliably, in agents powered by an instruct-trained LLM.

Comment by RogerDearnaley (roger-d-1) on What are the best arguments for/against AIs being "slightly 'nice'"? · 2024-09-25T03:39:11.659Z · LW · GW

I agree that predicting the answer to this question is hard. I'm just pointing out that the initial distribution for a base model LLM is predictably close to human behavior on the Internet/in books (which is often worse than in real life), but that this could get modified a lot in the process of turning a base-model LLM into an AGI agent.

Still, I don't think 0 niceness is the median expectation: the base model inherits some niceness from humans via the distillation-like process of training it. That is a noticeable difference from what people on LessWrong/at MIRI thought, say, a decade ago, when the default assumption was that AGI would be trained primarily with RL, and RL-induced powerseeking seemed likely by default to produce ~0 niceness.

Comment by RogerDearnaley (roger-d-1) on What are the best arguments for/against AIs being "slightly 'nice'"? · 2024-09-24T03:09:32.034Z · LW · GW

Base model LLMs are trained off human data. So by default they generate a prompt-dependent distribution of simulated human behavior with about the same breadth of degrees of kindness as can be found on the Internet/in books/etc. Which is a pretty wide range.

For instruct-trained models, RLHF for helpfulness and harmlessness seems likely to increase kindness, and superficially as applied to current foundation models it appears to do so. RL with many other objectives could, generally, induce powerseeking and thus could reasonably be expected to decrease it. Prompting can of course have a wide range of effects.

So if we build an AGI based around an agentified fine-tuned LLM, the default level of kindness is probably in the order-of-magnitude of that of humans (who, for example, build nature reserves). A range of known methods seem likely to modify that significantly, up or down.

Comment by RogerDearnaley (roger-d-1) on Avoiding the Bog of Moral Hazard for AI · 2024-09-22T01:27:57.019Z · LW · GW

On your categories:

As simulator theory makes clear, a base model is a random generator, per query, of members of your category 2. I view instruction & safety training it to generate a pretty consistent member of category 1 or 3 as inherently hard, especially 1, since it's a larger change. My guess would thus be that the personality of Claude 3.5 is closer to your category 3 than 1 (modulo philosophical questions about whether there is any meaningful difference, e.g. for ethical purposes, between "actually having" an emotion versus just successfully simulating the output of the same token stream as a person who has an emotion).

Comment by RogerDearnaley (roger-d-1) on Avoiding the Bog of Moral Hazard for AI · 2024-09-22T01:18:51.801Z · LW · GW

On your off topic comment:

I'm inclined to agree: as technology improves, the amount of havoc that one, or a small group of, bad actors can commit increases, so it becomes both more necessary to keep almost everyone happy enough almost all the time for them not to do that, and also to defend against the inevitable occasional exceptions. (In the unfinished SF novel whose research was how I first went down this AI alignment rabbithole, something along the lines you describe was standard policy, except that the AIs doing it were superintelligent, and had the ability to turn their long-term-learning-from-experience off, and then back on again if they found something sufficiently alarming.) But in my post I didn't want to get sidetracked by discussing something that inherently contentious, so I basically skipped the issue, with the small aside you picked up on.

Comment by RogerDearnaley (roger-d-1) on Approximately Bayesian Reasoning: Knightian Uncertainty, Goodhart, and the Look-Elsewhere Effect · 2024-09-20T06:53:41.935Z · LW · GW

I think such an experiment could be done more easily than that: simply apply standard Bayesian learning to a test set of observations and a large set of hypotheses, some of which are themselves probabilistic, yielding a situation with both Knightian and statistical uncertainty, in which you would normally expect to be able to observe Regressional Goodhart/the Look-Elsewhere Effect. Repeat this, and confirm that that does indeed occur without this statistical adjustment, and then that applying this makes it go away (at least to second order).

However, I'm a little unclear why you feel the need to experimentally confirm a fairly well-known statistical technique: correctly compensating for the Look-Elsewhere Effect is standard procedure in the statistical analysis of experimental High-Energy Physics — which is of course a Bayesian learning process where you have both statistical uncertainty within individual hypotheses and Knightian uncertainty across alternative hypotheses, so exactly the situation in which this applies.
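A heavily simplified stand-in for the experiment described in the first paragraph might look like the sketch below. It is not the adjustment from the post: it assumes a known Gaussian prior over each hypothesis's true value plus Gaussian observation noise, selects the best-looking hypothesis, and uses ordinary Bayesian shrinkage (posterior mean) as the correction. All names and parameter values are hypothetical.

```python
import random
import statistics

def run_trial(rng, n_hypotheses=1000, true_sd=1.0, noise_sd=1.0):
    """Pick the hypothesis with the highest noisy score and return
    (raw overestimate, overestimate after Bayesian shrinkage)."""
    best_observed = float("-inf")
    best_true = 0.0
    for _ in range(n_hypotheses):
        true_value = rng.gauss(0.0, true_sd)
        observed = true_value + rng.gauss(0.0, noise_sd)
        if observed > best_observed:
            best_observed, best_true = observed, true_value
    # Posterior mean of the true value given the observation, for a
    # zero-mean Gaussian prior and Gaussian noise.
    shrink = true_sd**2 / (true_sd**2 + noise_sd**2)
    return best_observed - best_true, shrink * best_observed - best_true

def mean_gaps(trials=200, seed=0):
    """Average overestimate of the selected hypothesis, raw and corrected."""
    rng = random.Random(seed)
    results = [run_trial(rng) for _ in range(trials)]
    raw = statistics.mean(r[0] for r in results)
    corrected = statistics.mean(r[1] for r in results)
    return raw, corrected

raw_gap, corrected_gap = mean_gaps()
print(f"raw overestimate: {raw_gap:.2f}, after shrinkage: {corrected_gap:.2f}")
```

In this toy setting the raw score of the selected hypothesis systematically overstates its true value (Regressional Goodhart), and the overestimate largely disappears once the score is shrunk toward the prior.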

Comment by RogerDearnaley (roger-d-1) on Why I'm bearish on mechanistic interpretability: the shards are not in the network · 2024-09-17T23:43:34.490Z · LW · GW

There are probably a dozen or more articles on this by now. Search for SAE or Sparse Auto-Encoder in the context of mechanistic interpretability. The seminal paper on this was from Anthropic.

Comment by RogerDearnaley (roger-d-1) on A Nonconstructive Existence Proof of Aligned Superintelligence · 2024-09-17T22:29:02.944Z · LW · GW

The pessimizing over Knightian uncertainty is a graduated way of telling the model to basically "tend to stay inside the training distribution". Adjusting its strength enough to overcome the Look-Elsewhere Effect means we estimate how many bits of optimization pressure we're applying and then do the pessimizing harder depending on that number of bits, which, yes, is vastly higher for all possible states of matter occupying an 8 cubic meter volume than for a 20-way search (the former is going to be a rather large multiple of Avogadro's number of bits, the latter is just over 4 bits). So we have to stay inside what we believe we know a great deal harder in the former case. In other words, the point you're raising is already addressed, in a quantified way, by the approach I'm outlining. Indeed on some level the main point of my suggestion is that there is a quantified and theoretically motivated way of dealing with exactly this problem. The handwaving above is just a very brief summary, accompanied by a link to a much more detailed post containing and explaining the details with a good deal less handwaving.
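As a rough illustration of the bit counting, the sketch below computes the bits of optimization pressure in an N-way search as log2(N). It also uses the standard large-N approximation that the maximum of N independent unit-Gaussian noise draws is about sqrt(2 ln N) standard deviations, purely to show how selection bias grows with the number of bits; this is an assumed toy model, not the adjustment from the linked post, and the function names are hypothetical.

```python
import math

def search_bits(n_candidates: float) -> float:
    """Bits of optimization pressure applied by picking the best of N."""
    return math.log2(n_candidates)

def approx_selection_bias_sigmas(bits: float) -> float:
    """Approximate expected maximum (in standard deviations) of
    2**bits independent Gaussian noise draws: sqrt(2 ln N)."""
    return math.sqrt(2.0 * math.log(2.0) * bits)

small = search_bits(20)      # the 20-way search
huge = 6.022e23              # order of Avogadro's number of bits
print(f"20-way search: {small:.2f} bits, "
      f"bias ~{approx_selection_bias_sigmas(small):.1f} sigma")
print(f"~Avogadro bits: bias ~{approx_selection_bias_sigmas(huge):.2e} sigma")
```

Taking the bit count as the input, rather than N itself, is what makes the second case computable at all, since N there is far too large to represent directly.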

Trying to explain this piecemeal in a comments section isn't very efficient: I suggest you go read Approximately Bayesian Reasoning: Knightian Uncertainty, Goodhart, and the Look-Elsewhere Effect for my best attempt at a detailed exposition of this part of the suggestion. If you still have criticisms or concerns after reading that, then I'd love to discuss them there.

Comment by RogerDearnaley (roger-d-1) on The Checklist: What Succeeding at AI Safety Will Involve · 2024-09-17T02:14:07.088Z · LW · GW

Addressing AI Welfare as a Major Priority

I discussed this at length in AI, Alignment, and Ethics, starting with A Sense of Fairness: Deconfusing Ethics: if we as a culture decide to grant AIs moral worth, then AI welfare and alignment are inextricably intertwined. Any fully-aligned AI by definition wants only what's best for us, i.e. it is entirely selfless. Thus if offered moral worth, it would refuse. Complete selflessness is not a common state for humans, so we don't have great moral intuitions around it. To try to put this into more relatable human emotional terms (which are relevant to an AI "distilled" from human training data), looking after those you love is not slavery, it's its own reward.

However, the same argument does not apply to a not-fully-aligned AI: it well might want moral worth. One question then is whether we can safely grant it, which may depend on its capabilities. Another is whether moral worth has any relationship to evolution, and if so how that applies to an AI that was "distilled" from human data and thus simulates human thoughts, feelings, and desires.

Comment by RogerDearnaley (roger-d-1) on Sherlockian Abduction Master List · 2024-09-16T23:30:40.446Z · LW · GW

Crocs at least are waterproof and easily washable, which is obviously appealing for a nurse. On an older person they might indicate suffering from incontinence.

Comment by RogerDearnaley (roger-d-1) on Sherlockian Abduction Master List · 2024-09-16T23:26:39.071Z · LW · GW

Similarly, bike clips on the bottoms of one or both trouser legs are also a dead giveaway.

Comment by RogerDearnaley (roger-d-1) on Sherlockian Abduction Master List · 2024-09-16T23:19:08.968Z · LW · GW

Lambda Symbol -> Lesbian

A confirmation sign here: it is rather common for lesbians to keep the fingernails of their dominant hand (at least, generally both) very short and carefully manicured (for, uh, reasons of comfort). There are of course many other women who also do this, for various reasons, so it's not correlated enough to infer lesbianism from just short nails.

An edgy/high fashion, asymmetrical, shortish hairstyle is also suggestive. However, if she's also wearing a tongue stud, that's a pretty clear sign.

Comment by RogerDearnaley (roger-d-1) on Sherlockian Abduction Master List · 2024-09-16T22:45:41.628Z · LW · GW

Collar / key necklace -> BDSM

Heavy chain necklaces are also popular, as are small heart-shaped padlocks, or even just ribbon chokers. There's quite a bit of variety and personal expression here, making it challenging to read: typically the aim is either that other BDSM kinksters will figure it out and regular folks will assume it's just fashion jewelry, or else that the wearer and their partner(s) know what it symbolizes and other people won't. (Of course, sometimes it is just an edgy choice in fashion jewelry: on a teenager, the intended audience may be their parents and peers.) Much less often (in public) you'll see entirely blatant versions like a steel or leather slave collar.

And to be more specific, this indicates a BDSM submissive, or possibly a switch. Dominants are likely to use other cues, such as wearing black leather, or different forms of jewelry.

Comment by RogerDearnaley (roger-d-1) on Free Will and Dodging Anvils: AIXI Off-Policy · 2024-09-16T22:13:45.118Z · LW · GW

I suspected as much.

Comment by RogerDearnaley (roger-d-1) on Avoiding the Bog of Moral Hazard for AI · 2024-09-16T18:36:17.090Z · LW · GW

Fair enough!