They can certainly use answer text as a scratchpad (even nonfunctional text that gives more space for hidden activations to flow). But they don't without explicit training. Actually, maybe they do: maybe RLHF incentivizes a verbose style to give more room for thought. But I think even with "thinking step by step," there are still plenty of issues.
Tokenization is definitely a contributor. But that doesn't really support the notion that there's an underlying human-like cognitive algorithm behind human-like text output. The point is the way it adds numbers is very inhuman, despite producing human-like output on the most common/easy cases.
I'm not totally sure the hypothesis is well-defined enough to argue about, but maybe Gary Marcus-esque analysis of the pattern of LLM mistakes?
If the internals were like a human thinking about the question and then giving an answer, it would probably be able to add numbers more reliably. And I also suspect the pattern of mistakes doesn't look typical for a human at any developmental stage (once a human can add 3 digit numbers their success rate at 5 digit numbers is probably pretty good). I vaguely recall some people looking at this, but have forgotten the reference, sorry.
A different question: When does it make your (mental) life easier to categorize an AI as conscious, so that you can use the heuristics you've developed about what conscious things are like to make good judgments?
Sometimes, maybe! Especially if lots of work has been put in to make said AI behave in familiar ways along many axes, even when nobody (else) is looking.
But for LLMs, or other similarly alien AIs, I expect that using your usual patterns of thought for conscious things creates more problems than it helps with.
If one is a bit Platonist, then there's some hidden fact about whether they're "really conscious or not" no matter how murky the waters, and once this Hard problem is solved, deciding what to do is relatively easy.
But I prefer the alternative of ditching the question of consciousness entirely when it's not going to be useful, and deciding what's right to do about alien AIs more directly.
Interesting stuff, but I felt like your code was just a bunch of hard-coded suggestively-named variables with no pattern-matching to actually glue those variables to reality. I'm pessimistic about the applicability - better to spend time thinking about how to get an AI to do this reasoning in a way that's connected to reality from the get-go.
Exciting stuff, thanks!
It's a little surprising to me how bad the logit lens is for earlier layers.
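(For context: the logit lens decodes each layer's residual stream with the final LayerNorm + unembedding, as if the network stopped there. A minimal sketch with TransformerLens - model and prompt arbitrary:)

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The Eiffel Tower is in the city of")
_, cache = model.run_with_cache(tokens)

# Decode each layer's residual stream as if it were the final one.
for layer in range(model.cfg.n_layers):
    resid = cache["resid_post", layer]            # (batch, pos, d_model)
    logits = model.unembed(model.ln_final(resid))
    top = logits[0, -1].argmax()
    print(layer, repr(model.tokenizer.decode(top)))
```

(Early layers typically decode to junk, which is the phenomenon in question.)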
I was curious about the context and so I went over and ctrl+F'ed Solomonoff and found Evan saying
I think you're misunderstanding the nature of my objection. It's not that Solomonoff induction is my real reason for believing in deceptive alignment or something, it's that the reasoning in this post is mathematically unsound, and I'm using the formalism to show why. If I weren't responding to this post specifically, I probably wouldn't have brought up Solomonoff induction at all.
Thank you for posting this; it was interesting. That said, I think the middle section is bad.
Basically, starting from Lance taking a digression out of an anthropomorphic argument to castigate those who think AI might do bad things for anthropomorphising, and ending where the discussion of Solomonoff induction ends, I think there was a lot of misconstruing ideas or arguing against nonexistent people.
Like, I personally don't agree with people who expect optimization daemons to arise in gradient descent, but I don't say they're motivated by whether the Solomonoff prior is malign.
Oh, maybe I've jumped the gun then. Whoops.
Congrats to Paul on getting appointed to head AI safety at NIST's AI Safety Institute.
At a high level, you don't get to pick the ontology.
This post seems like a case of there being too many natural abstractions.
Had a chat with @Logan Riggs about this. My main takeaway was that if SAEs aren't learning the features for separate squares, it's plausibly because in the data distribution there's some even-more-sparse pattern going on that they can exploit. E.g. if big runs of same-color stones show up regularly, it might be lower-loss to represent runs directly than to represent them as made up of separate squares.
If this is the bulk of the story, then messing around with training might not change much (but training on different data might change a lot).
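A toy version of the intuition, in case it helps (my illustration, not from the chat): if eight squares always co-occur as a run, a single run latent gets the same reconstruction at an eighth of the L1.

```python
import numpy as np

rng = np.random.default_rng(0)
run_active = rng.random(10_000) < 0.3       # does the run appear this sample?
X = run_active[:, None] * np.ones((1, 8))   # 8 square features firing together

# Dictionary A: one latent per square -> 8 active latents whenever the run fires.
# Dictionary B: one latent for the run -> 1 active latent, same reconstruction.
print("avg L1, per-square dict:", X.sum(1).mean())
print("avg L1, run-latent dict:", run_active.mean())
```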
Nice! This was a very useful question to ask.
Yeah, I don't know where my reading comprehension skills were that evening, but they weren't with me :P
Oh well, I'll just leave it as is as a monument to bad comments.
offline RL is surprisingly stable and robust to reward misspecification
Wow, what a wild paper. The basic idea - that "pessimism" about off-distribution state/action pairs induces pessimistically-trained RL agents to learn policies that hang around in the training distribution for a long time, even if that goes against their reward function - is a fairly obvious one. But what's not obvious is the wide variety of algorithms this applies to.
I genuinely don't believe their decision transformer results. I.e. I think with p~0.8, if they (or the authors of the paper whose hyperparameters they copied) made better design choices, they would have gotten a decision transformer that was actually sensitive to reward. But on the flip side, with p~0.2 they just showed that decision transformers don't work! (For these tasks.)
I think it's pretty tricky, because what matters to real networks is the cost difference between storing features pseudo-linearly (in superposition), versus storing them nonlinearly (in one of the host of ways it takes multiple nn layers to decode), versus not storing them at all. Calculating such a cost function seems like it has details that depend on the particulars of the network and training process, making it a total pain to try to mathematize (but maybe amenable to making toy models).
Does it know today's date through an API call? That would definitely be a smoking gun.
Oh, missed that part.
The idea that it's usually monitored is in my prompt; everything else seems like a pretty convergent and consistent character.
It seems likely that there's a pre-prompt from google with the gist of "This is a conversation between a user and Claude 3, an AI developed by Anthropic. Text between the <start ai> and <end ai> tokens was written by the AI, and text between the <start user> and <end user> tokens was written by the human user."
(edited to not say Anthropic is Google)
Neat, thanks. Later I might want to rederive the estimates using different assumptions - not only should the number of active features L be used in calculating average 'noise' level (basically treating it as an environment parameter rather than a design decision), but we might want another free parameter for how statistically dependent features are. If I really feel energetic I might try to treat the per-layer information loss all at once rather than bounding it above as the sum of information losses of individual features.
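For reference, the baseline I'd start from, treating L as an environment parameter (a quick numerical check of my own): with random unit features in d dimensions, reading off one feature while L others are active gives interference on the order of sqrt(L/d).

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 512                                  # residual dims, total features
W = rng.normal(size=(n, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)   # random unit feature vectors

for L in (1, 4, 16, 64):
    active = rng.choice(n, size=L, replace=False)
    x = W[active].sum(0)          # activation with L features switched on
    readout = W @ x               # linear read-off for every feature
    off_target = np.abs(np.delete(readout, active)).mean()
    print(L, off_target, np.sqrt(L / d))   # interference ~ sqrt(L/d)
```

This assumes statistically independent features, which is exactly the assumption the extra free parameter would relax.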
My guess is this is a defense of someone being mocked on twitter, and so we aren't really getting (or care about) the context.
I watched half of part 4, and called it quits. I think I'd have to be more into philosophy's particular argumentation game, and also into philosophy of religion, and also into dunking on people on facebook.
Yeah, "graduate student" can mean either Masters or PhD student.
Model-based RL has a lot of room to use models more cleverly, e.g. learning hierarchical planning, and the better models are for planning, the more rewarding it is to let model-based planning take the policy far away from the prior.
E.g. you could get a hospital policy-maker that actually will do radical new things via model-based reasoning, rather than just breaking down when you try to push it too far from the training distribution (as you correctly point out a filtered LLM would).
In some sense the policy would still be close to the prior in a distance metric induced by the model-based planning procedure itself, but I think at that point the distance metric has come unmoored from the practical difference to humans.
I feel like there's a somewhat common argument about RL not being all that dangerous because it generalizes from the training distribution cautiously - being outside the training distribution isn't going to suddenly cause an RL system to make multi-step plans that are implied but never seen in the training distribution; it'll probably just fall back on familiar, safe behavior.
To me, these arguments feel like they treat present-day model-free RL as the "central case," and model-based RL as a small correction.
Anyhow, good post, I like most of the arguments, I just felt my reaction to this particular one could be made in meme format.
Hm. Okay, I remembered a better way to improve efficiency: neighbor lists. For each feature, remember a list of who its closest neighbors are, and just compute your "closeness loss" by calculating dot products in that list.
The neighbor list itself can either be recomputed once in a while using the naive method, or you can accelerate the neighbor list recomputation by keeping more coarse-grained track of where features are in activation-space.
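Roughly like this (a rough sketch; the squared-dot-product penalty and all names are placeholders):

```python
import torch

def build_neighbor_lists(features, k=16):
    # Naive O(n^2) recompute, done only once in a while:
    # for each feature, cache the k most similar features.
    sims = features @ features.T
    sims.fill_diagonal_(-float("inf"))
    return sims.topk(k, dim=1).indices       # (n_features, k)

def closeness_loss(features, neighbors):
    # Every step: dot products only against the cached neighbor lists,
    # O(n * k) instead of O(n^2).
    neighbor_vecs = features[neighbors]                      # (n, k, d)
    dots = (features.unsqueeze(1) * neighbor_vecs).sum(-1)   # (n, k)
    return dots.pow(2).mean()
```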
Quadratic complexity isn't that bad, if this is useful. If your feature vectors are normalized you can do it faster by taking a matrix product of the weights "the big way" and just penalizing the trace for being far from ones or zeros. I think?
Feature vector normalization is itself an example of a quadratic-cost step that makes it into real implementations.
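If I'm reading the "big matrix product" idea right, something like this (my guess at the intended penalty, assuming unit-norm feature vectors):

```python
import torch

def gram_penalty(features):
    # With unit-norm features the Gram matrix has ones on the diagonal,
    # so penalize the off-diagonal entries for being far from zero.
    gram = features @ features.T
    off_diag = gram - torch.diag(torch.diag(gram))
    return off_diag.pow(2).sum()
```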
I hear you as saying "If we don't have to worry about teaching the AI to use human values, then why do sandwiching when we can measure capabilities more directly some other way?"
One reason is that with sandwiching, you can more rapidly measure capabilities generalization, because you can do things like collect the test set ahead of time or supervise with a special-purpose AI.
But if you want the best evaluation of a research assistant's capabilities, I agree that using it as a research assistant is more reliable.
A separate issue I have here is the assumption that you don't have to worry about teaching an AI to make human-friendly decisions if you're using it as a research assistant, and therefore we can go full speed ahead trying to make general-purpose AI as long as we mean to use it as a research assistant. A big "trust us, we're the good guys" vibe.
Relative to string theory, getting an AI to help us do AI alignment is much more reliant on teaching the AI to give good suggestions in the first place - and not merely "good" in the sense of highly rated, but good in the contains-hard-parts-of-outer-alignment kind of way. So I disagree with the assumption in the first place.
And then I also disagree with the conclusion. Technology proliferates, and there are misuse opportunities even within an organization that's 99% "good guys." But maybe this is a strategic disagreement more than a factual one.
Non-deceptive failures are easy to notice, but they're not necessarily easy to eliminate - and if you don't eliminate them, they'll keep happening until some do slip through. I think I take them more seriously than you.
This was a cool, ambitious idea. I'm still confused about your brain score results. Why did the "none" fine-tuned models have good results? Were none of your models successful at learning the brain data?
I got 7/18.
See the discussion section.
We have developed a physics-based, and observable early warning signal characterizing the tipping point of the AMOC: the minimum of the AMOC-induced freshwater transport at 34°S in the Atlantic, here indicated by FovS. The FovS minimum occurs 25 years (9 to 41, 10 and 90% percentiles) before the AMOC tipping event. The quantity FovS has a strong basis in conceptual models, where it is an indicator of the salt-advection feedback strength. Although FovS has been shown to be a useful measure of AMOC stability in GCMs, the minimum feature has so far not been connected to the tipping point because an AMOC tipping event had up to now not been found in these models. The FovS indicator is observable, and reanalysis products show that its value and, more importantly, its trend are negative at the moment. The latest CMIP6 model simulations indicate that FovS is projected to decrease under future climate change. However, because of freshwater biases, the CMIP6 FovS mean starts at positive values and only reaches zero around the year 2075. Hence, no salt-advection feedback–induced tipping is found yet in these models under climate change scenarios up to 2100 and longer simulations under stronger forcing would be needed (as we do here for the CESM) to find this. In observations, the estimated mean value of FovS is already quite negative, and therefore, any further decrease is in the direction of a tipping point (and a stronger salt-advection feedback). A slowdown in the FovS decline indicates that the AMOC tipping point is near.
Model year 1750 does not mean 1750 years from now. The model is subtly different from reality in several ways. Their point is they found some indicator (this FovS thing) that hits a minimum a few decades before the big change, in a way that maybe generalizes from the model to reality.
In the model, this indicator starts at 0.20, slowly decreases, and hits a minimum at -0.14 of whatever units, ~25 years before the AMOC tipping point.
In reality, this indicator was already at -0.5, and is now somewhere around -0.1 or -0.15.
This is a bit concerning, although to reiterate, the model is subtly different from reality in several ways. Exact numerical values don't generalize that well, it's the more qualitative thing - the minimum of their indicator - that has a better chance of warning us, and we have not (as far as we can tell) hit a minimum. Yet.
There's a huge amount of room for you to find whatever patterns are most eye-catching to you, here.
I was sampling random embeddings at various distances from the centroid and prompting GPT-J to define them. One of these random embeddings, sampled at distance 5, produced the definition [...]
How many random embeddings did you try sampling, that weren't titillating? Suppose you kept looking until you found mentions of female sexuality again - would this also sometimes talk about holes, or would it instead sometimes talk about something totally different?
How would the AI do something like this if it ditched the idea that there existed some perfect U*?
Assuming the existence of things that turn out not to exist does weird things to a decision-making process. In extreme cases, it starts "believing in magic" and throwing away all hope of good outcomes in the real world in exchange for the tiniest advantage in the case that magic exists.
Is this an alignment approach? How does it solve the problem of getting the AI to do good things and not bad things? Maybe this is splitting hairs, sorry.
It's definitely possible to build AI safely if it's temporally and spatially restricted, if the plans it optimizes are never directly used as they were modeled to be used but are instead run through processing steps that involve human and AI oversight, if it's never used on broad enough problems that oversight becomes challenging, and so on.
But I don't think of this as alignment per se, because there's still tremendous incentive to use AI for things that are temporally and spatially extended, that involve planning based on an accurate model of the world, that react faster than human oversight allows, that are complicated domains that humans struggle to understand.
My take is that they work better the more that the training distribution anticipates the behavior we want to incentivize, and also the better that humans understand what behavior they're aiming for.
So if used as a main alignment technique, they only work in a sort of easy-mode world, where if you get a par-human AI to have kinda-good behavior on the domain we used to create it, that's sufficient for the human-AI team to do better at creating the next one, and so on until you get a stably good outcome. A lot like the profile of RLHF, except trading off human feedback for AI generalization.
I think the biggest complement to activation steering is research on how to improve (from a human perspective) the generalization of AI internal representations. And I think a good selling point for activation steering research is that the reverse is also true - if you can do okay steering by applying a simple function to some intermediate layer, that probably helps do research on all the things that might make that even better.
Overall, though, I'm not that enthusiastic about it as a rich research direction.
Nice post.
the simplistic view that IRL agents hold about ground truth in human values (ie. the human behavior they’re observing is always perfectly displaying the values)
IRL typically involves an error model - a model of how humans make errors. If you've ever seen the phrase "Boltzmann-rational" in an IRL paper, it's the assumption that humans most often do the best thing but can sometimes do arbitrarily bad things (just with an exponentially decreasing probability).
This is still simplistic, but it's simplistic on a higher level :P
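For concreteness, the standard Boltzmann-rational likelihood looks like (my notation):

$$P(a \mid s) \;\propto\; \exp\!\big(\beta \, Q^*(s,a)\big),$$

where $Q^*$ is the value of each action under the inferred reward and $\beta$ sets how rational the human is assumed to be ($\beta \to \infty$ recovers perfect optimality, $\beta = 0$ is uniformly random behavior).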
If you haven't read Reducing Goodhart, it's pretty related to the topic of this post.
Ultimately I'm not satisfied with any proposals we have so far. There's sort of a philosophy-versus-engineering culture difference here: in philosophy we'd want to hoard all of these unsatisfying proposals, and occasionally take them out of their drawer and look at them again with fresh eyes, while in engineering the intuition would be that the effort is better spent looking for ways to make progress towards new and different ideas.
I think there's a divide here between implementing ethics, and implementing meta-ethics. E.g. trying to give rules for how to weight your past and future selves, vs. trying to give rules for what good rules are. When in doubt, shift gears towards implementing metaethics: it's cheaper because we don't have the time to write down a complete ethics for an AI to follow, it's necessary because we can't write down a complete ethics for an AI to follow, and it's unavoidable because AIs in the real world will naturally do meta-ethics.
To expand on that last point - a sufficiently clever AI operating in the real world will notice that it itself is part of the real world. Actions like modifying itself are on the table, and have meta-ethical implications. This simultaneously makes it hard to prove convergence for any real-world system, while also making it seem likely that all sufficiently clever AIs in the real world will converge to a state that's stable under consideration of self-modifying actions.
My impression is that this is motivated reasoning.
I'm reminded of https://slatestarcodex.com/2016/11/05/the-pyramid-and-the-garden/ . Dismissing complicated arguments out of hand, and instead directing attention to a single fact that is either meaningful or a 1 in 100 coincidence, doesn't have a great track record.
One thing that's always seemed important, but that I don't know how to fit in, is the ecological equilibrium. E.g. it seems like the Chicken game (payoff matrix (((0,0),(1,2)),((2,1),(0,0))) ) supports an ecosystem of different strategies in equilibrium. How does this mesh with any particular decision theory?
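As a quick sanity check on that equilibrium (my own toy computation, discrete-time replicator dynamics on the payoff matrix above):

```python
import numpy as np

# Row player's payoffs from the Chicken matrix above:
# A[i][j] = what you get for playing i (swerve/dare) against j.
A = np.array([[0.0, 1.0],
              [2.0, 0.0]])

p = np.array([0.5, 0.5])  # population mix over (swerve, dare)
for _ in range(1000):
    fitness = A @ p                  # expected payoff of each pure strategy
    p = p * fitness / (p @ fitness)  # replicator update
print(p)  # -> [1/3, 2/3]: a stable mixed "ecosystem" of both strategies
```

The population settles into a stable mix of strategies rather than a single winner, which is the part I don't know how to square with any one decision theory.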
Nice list. The hidden technical problem is how you become confident that the AI isn't just telling you what you want to hear. (Where it can say the right words when you ask but will do the wrong thing when you're not looking.)
Bah, nobody's mentioned social applications of superhuman planning yet.
You could let an AI give everyone subtle nudges, and a month later everyone's social life will be great. You'll see your family the right amount, you'll have friends who really get you who you see often and do fun things with. Sex will occur. Parties and other large gatherings will be significantly better.
The people to make this possible are all around us, it's just really hard to figure out how to make it happen.
I think the first title was more accurate. There is inherent dual use potential in alignment research.
But that doesn't mean that the good outcome is impossible, or even particularly unlikely. AI developers are quite willing to try to use AI for the benefit of humanity (especially when lots of people are watching them), and governments are happy to issue regulations to that effect (though effectiveness will vary).
Or to put it another way, the outcome is not decided. There are circumstances that make the good outcomes more likely, and there are actions we can take to try and steer the future in that direction.
No AGI research org has enough evil to play it that way.
We shouldn't just assume this, though. Power corrupts. Suppose that you are the CEO of an AI company, and you want to use the AGI your company is developing to fulfill your preferences and not anyone else's. Sit down and think for a few minutes about what obstacles you would face, and how you as a very clever person might try to overcome or subvert those obstacles.
Yeah, good point (the examples, not necessarily any jargon-ful explanation of them). Sound waves - or better, slow-moving vortices, or better still and different, the diffusion of a cloud of one gas through a room filled with a different gas - show that you don't get total mixing of a room on a one-second timescale.
I think most likely, I've mangled something in the process of extrapolating a paper on a tiny toy model of a few hundred gas atoms to the meter scale.
Do you know how to interpret "maximum divergence" in this context?
Hm, this is a good question.
In writing my original reply, I figured "maximum divergence" was a meter. You start with two trajectories an angstrom apart, and they slowly diverge, but they can't diverge more than 1 meter.
I think this is true if you're just looking at the atom that's shifted, but not true if you look at all the other atoms as well. Then maybe we actually have a 10^24-dimensional state space, and we've perturbed the state space by 1 angstrom in 1 dimension, and "maximum divergence" is actually more like the size of state space (~10^12 meters).
In which case it actually takes two tenths of a second for exponential chaos to go from 10^-10 to 10^12.
Also, IIRC aren't there higher-order exponents that might decay slower?
Nah, I don't think that's super relevant here. All the degrees of freedom of the gas are coupled to each other, so the biggest source of chaos can scramble everything just fine.
https://www.sciencedirect.com/science/article/abs/pii/S1674200121001279
They find Lyapunov exponent of about 1 or 2 (where time is basically in units of time it takes for a particle at average velocity to cover the length of the box).
For room temp gas, this timescale is about 1/400 seconds. So the divergence after 20 seconds should increase by a factor of over e^8000 (until it hits the ceiling of maximum possible divergence).
Since an Angstrom is only 10^-10 m, if you start with an Angstrom offset, the divergence reaches maximum by about a tenth of a second.
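Spelling out the arithmetic (with maximum divergence ~10^12 m from the earlier comment):

$$t \;\approx\; \frac{\ln(d_{\max}/d_0)}{\lambda} \;=\; \frac{\ln\!\big(10^{12}/10^{-10}\big)}{400\ \mathrm{s}^{-1}} \;\approx\; \frac{51}{400}\ \mathrm{s} \;\approx\; 0.13\ \mathrm{s}.$$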
But for the continuous limit the subagents become similar to each other at the same rate as they become more numerous. It seems intuitive to me that with a little grinding you could get a decision-making procedure whose policy is an optimum of an integral over "subagents" who bet on the button being pushed at different times, and so the whole system will change behavior upon an arbitrarily-timed press of the button.
Except I think in continuous time you probably lose guarantees about the system not manipulating humans to press/not press the button. Unless maybe each subagent believes the button can only be pressed exactly at their chosen time. But this highlights that maybe all of these counterfactuals give rise to really weird worlds, that in turn will give rise to weird behavior.
I'm reminded of the Real Genius scene where they're celebrating building the death laser and Mitch says "Let the engineers figure out a use for it, that's not our concern."
Which in turn reminds me of "Once the rockets go up, who cares where they come down? That's not my department, says Wernher von Braun."
Or if you buy a shard-theory-esque picture of RL locking in heuristics, what heuristics can get locked in depends on what's "natural" to learn first, even when training from scratch.
Both of these hypotheses probably should come with caveats though. (About expected reliability, training time, model-free-ness, etc.)
The history is a little murky to me. When I wrote [what's the dream for giving natural-language commands to AI](https://www.lesswrong.com/posts/Bxxh9GbJ6WuW5Hmkj/what-s-the-dream-for-giving-natural-language-commands-to-ai), I think I was trying to pin down and critique (a version of) something that several other people had gestured to in a more offhand way, but I can't remember the primary sources. (Maybe Rohin's alignment newsletter between the announcement of GPT2 and then would contain the relevant links?)