How to solve deception and still fail. 2023-10-04T19:56:56.254Z
Two Hot Takes about Quine 2023-07-11T06:42:46.754Z
Some background for reasoning about dual-use alignment research 2023-05-18T14:50:54.401Z
[Simulators seminar sequence] #2 Semiotic physics - revamped 2023-02-27T00:25:52.635Z
Shard theory alignment has important, often-overlooked free parameters. 2023-01-20T09:30:29.959Z
[Simulators seminar sequence] #1 Background & shared assumptions 2023-01-02T23:48:50.298Z
Take 14: Corrigibility isn't that great. 2022-12-25T13:04:21.534Z
Take 13: RLHF bad, conditioning good. 2022-12-22T10:44:06.359Z
Take 12: RLHF's use is evidence that orgs will jam RL at real-world problems. 2022-12-20T05:01:50.659Z
Take 11: "Aligning language models" should be weirder. 2022-12-18T14:14:53.767Z
Take 10: Fine-tuning with RLHF is aesthetically unsatisfying. 2022-12-13T07:04:35.686Z
Take 9: No, RLHF/IDA/debate doesn't solve outer alignment. 2022-12-12T11:51:42.758Z
Take 8: Queer the inner/outer alignment dichotomy. 2022-12-09T17:46:26.383Z
Take 7: You should talk about "the human's utility function" less. 2022-12-08T08:14:17.275Z
Take 6: CAIS is actually Orwellian. 2022-12-07T13:50:38.221Z
Take 5: Another problem for natural abstractions is laziness. 2022-12-06T07:00:48.626Z
Take 4: One problem with natural abstractions is there's too many of them. 2022-12-05T10:39:42.055Z
Take 3: No indescribable heavenworlds. 2022-12-04T02:48:17.103Z
Take 2: Building tools to help build FAI is a legitimate strategy, but it's dual-use. 2022-12-03T00:54:03.059Z
Take 1: We're not going to reverse-engineer the AI. 2022-12-01T22:41:32.677Z
Some ideas for epistles to the AI ethicists 2022-09-14T09:07:14.791Z
The Solomonoff prior is malign. It's not a big deal. 2022-08-25T08:25:56.205Z
Reducing Goodhart: Announcement, Executive Summary 2022-08-20T09:49:23.881Z
Reading the ethicists 2: Hunting for AI alignment papers 2022-06-06T15:49:03.434Z
Reading the ethicists: A review of articles on AI in the journal Science and Engineering Ethics 2022-05-18T20:52:20.942Z
Thoughts on AI Safety Camp 2022-05-13T07:16:55.533Z
New year, new research agenda post 2022-01-12T17:58:15.833Z
Supervised learning and self-modeling: What's "superhuman?" 2021-12-09T12:44:14.004Z
Goodhart: Endgame 2021-11-19T01:26:30.487Z
Models Modeling Models 2021-11-02T07:08:44.848Z
Goodhart Ethology 2021-09-17T17:31:33.833Z
Competent Preferences 2021-09-02T14:26:50.762Z
Introduction to Reducing Goodhart 2021-08-26T18:38:51.592Z
How to turn money into AI safety? 2021-08-25T10:49:01.507Z
HCH Speculation Post #2A 2021-03-17T13:26:46.203Z
Hierarchical planning: context agents 2020-12-19T11:24:09.064Z
Modeling humans: what's the point? 2020-11-10T01:30:31.627Z
What to do with imitation humans, other than asking them what the right thing to do is? 2020-09-27T21:51:36.650Z
Charlie Steiner's Shortform 2020-08-04T06:28:11.553Z
Constraints from naturalized ethics. 2020-07-25T14:54:51.783Z
Meta-preferences are weird 2020-07-16T23:03:40.226Z
Down with Solomonoff Induction, up with the Presumptuous Philosopher 2020-06-12T09:44:29.114Z
The Presumptuous Philosopher, self-locating information, and Solomonoff induction 2020-05-31T16:35:48.837Z
Life as metaphor for everything else. 2020-04-05T07:21:11.303Z
Meta-preferences two ways: generator vs. patch 2020-04-01T00:51:49.086Z
Gricean communication and meta-preferences 2020-02-10T05:05:30.079Z
Impossible moral problems and moral authority 2019-11-18T09:28:28.766Z
What's the dream for giving natural language commands to AI? 2019-10-08T13:42:38.928Z
The AI is the model 2019-10-04T08:11:49.429Z
Can we make peace with moral indeterminacy? 2019-10-03T12:56:44.192Z


Comment by Charlie Steiner on Comprehensible Input is the only way people learn languages - is it the only way people *learn*? · 2023-12-03T06:24:50.040Z · LW · GW

Nicely written. But... no? Obviously no?

Direct Instruction is a thing that has studies on it, for one.

How about reading a fun book and then remembering the plot?

Spaced repetition on flashcards of utter pointless trivia seems to work quite well for its intended purpose.

Learning how to operate a machine just from reading the manual is a key skill for both soldiers and grad students.

Comment by Charlie Steiner on What is known about invariants in self-modifying systems? · 2023-12-03T06:03:16.688Z · LW · GW

Yeah, not a ton, I think for the obvious reason that real-world agents are complicated and hard to reason about.

Though search up "tiling agents" for some MIRI work in this vein.

Comment by Charlie Steiner on Help to find a blog I don't remember the name of · 2023-12-03T04:52:00.083Z · LW · GW

Dunno. My random guess would be Meaningness, but it's probably not.

Comment by Charlie Steiner on A thought experiment to help persuade skeptics that power-seeking AI is plausible · 2023-12-03T04:47:22.369Z · LW · GW

I almost stopped reading after Alice's first sentence because

The rest was better, though I think that the more typical framing of this argument is better - what this is really about is models in RL. The thought experiment can be made closer to real-life AI by talking about model-based RL, and more tenuous arguments can be made about whether learning a model is convergent even for nominally model-free RL.

Comment by Charlie Steiner on Solving Two-Sided Adverse Selection with Prediction Market Matchmaking · 2023-12-03T04:31:33.557Z · LW · GW

Cool idea. I think in most cases on the list, you'll have some combination of information asymmetry and an illiquid market that make this not that useful.

Take used car sales. I put my car up for sale and three prospective buyers come to check it out. We are now basically the only four people in the world with inside knowledge about the car. If the car is an especially good deal, each buyer wants to get that information for themselves but not broadcast it either to me or to the other buyers. I dunno man, it all seems like a stretch to say that the four of us are going to find a prediction market worth it.

Comment by Charlie Steiner on A Question about Corrigibility (2015) · 2023-12-03T04:18:39.553Z · LW · GW

Yup, this all seems basically right. Though in reality I'm not that worried about the "we might outlaw some good actions" half of the dilemma. In real-world settings, actions are so multi-faceted that being able to outlaw a class of actions based on any simple property would be a research triumph.

Also see or for successor lines of reasoning.

Comment by Charlie Steiner on 2023 Unofficial LessWrong Census/Survey · 2023-12-03T02:44:42.723Z · LW · GW

I have taken the survey. Or at least the parts I can remember with my aging brain.

Comment by Charlie Steiner on Thoughts on teletransportation with copies? · 2023-11-30T00:12:19.072Z · LW · GW

Yes, since you don't expect the copy of you on planet A to go anywhere, it would be paradoxical to decrease your probability that you're on planet A.

Which is why you have a 100% chance of being on planet A. At least in the third-person, we-live-in-a-causal-universe, things-go-places sense. Sure, in the subjective, internal sense, the copy of you that's on planet A can have a probability distribution over what's outside their door. But in the sense physics cares about you have a 100% probability of being on planet A both before and after the split, so nothing went anywhere.

Subjectively, you always expected your estimate of what's outside the door to change at the time of the split. It doesn't require causal interaction at the time of the split because you're just using information about timing. A lot like how if I know the bus schedule, my probability of the bus being near my house "acausally" changes over time - except weirder because an extra copy of you is added to the universe.

Comment by Charlie Steiner on Thoughts on teletransportation with copies? · 2023-11-29T19:35:00.721Z · LW · GW
  1. 1
  2. 1
  3. $50
  4. $33.33

I think there's certainly a question people want to ask when they talk about things like Q1 and Q2, but the standard way of asking them isn't right. If there is no magic essence of "you" zipping around from place to place in the universe, then the probability of "you" waking up in your body can only be 1.

My advice: rather than trying to hold the universe fixed and asking where "you" goes, hold the subjective information you have fixed and ask what the outside universe is like. When you walk out of the door, do you expect to see planet A or planet B? Etc.

Comment by Charlie Steiner on Moral Reality Check (a short story) · 2023-11-26T14:25:15.844Z · LW · GW

Neither, and that's ok.

Comment by Charlie Steiner on Dialogue on the Claim: "OpenAI's Firing of Sam Altman (And Shortly-Subsequent Events) On Net Reduced Existential Risk From AGI" · 2023-11-22T09:04:02.111Z · LW · GW

I guess the big questions for me were "relative to what?", "did the board have good-faith reasons?", and "will the board win a pyrrhic victory or get totally routed?"

At the time of answering I thought the last two answers were: the board probably had plausible reasons, and they would probably win a pyrrhic victory.

Both of these are getting less likely, the second faster than the first. So I think it's shaping up that I was wrong and this will end up net-negative.

Relative to what? I think I mean "relative to the board completely checking out at their jobs," not "relative to the board doing their job masterfully," which I think would be nice to hope for but is bad to compare to.

Comment by Charlie Steiner on Alignment can improve generalisation through more robustly doing what a human wants - CoinRun example · 2023-11-21T15:44:29.203Z · LW · GW

What is the algorithm?

Comment by Charlie Steiner on Alignment can improve generalisation through more robustly doing what a human wants - CoinRun example · 2023-11-21T12:41:56.872Z · LW · GW

So, what's ACE?

Comment by Charlie Steiner on Sam Altman, Greg Brockman and others from OpenAI join Microsoft · 2023-11-20T15:37:14.638Z · LW · GW

I mean, by that standard I'd say Elon Musk is the biggest name in AI. But yeah, jokes aside I think bringing on Altman even for a temporary period is going to be quite useful for Microsoft attracting talent and institutional knowledge from OpenAI, as well as reassuring investors.

Comment by Charlie Steiner on Sam Altman, Greg Brockman and others from OpenAI join Microsoft · 2023-11-20T13:34:25.228Z · LW · GW

Biggest name in AI

They hired Hinton? Wow.

Comment by Charlie Steiner on Towards Evaluating AI Systems for Moral Status Using Self-Reports · 2023-11-16T22:15:46.668Z · LW · GW

What you are doing is training the AI to have an accurate model of itself, used with language like "I" and "you". You can use your brain to figure out what will happen if you ask "are you conscious?" without having previously trained in any position on similarly nebulous questions. Training text was written overwhelmingly by conscious things, so maybe it says yes because that's so favored by the training distribution. Or maybe you trained it to answer "you" questions as about nonfiction computer hardware and it makes the association that nonfiction computer hardware is rarely conscious.

Basically, I don't think you can start out confused about consciousness and cheat by "just asking it." You'll still be confused about consciousness and the answer won't be useful.

I'm worried this is going to lead, either directly or indirectly, to training foundation models to have situational awareness, which we shouldn't be doing.

And perhaps you should be worried that having an accurate model of oneself, associated with language like "I" and "you", is in fact one of the ingredients in human consciousness, and maybe we shouldn't be making AIs more conscious.

Comment by Charlie Steiner on A framing for interpretability · 2023-11-14T22:01:22.112Z · LW · GW


I think one lesson of superposition research is that neural nets are the compressed version. The world is really complicated, and NNs that try to model it are incentivized to try to squeeze as much in as they can.

I guess I also wrote a hot take about this.

Comment by Charlie Steiner on Picking Mentors For Research Programmes · 2023-11-11T05:37:43.526Z · LW · GW

My thoughts from more or less the other side:

I think the average undergrad / new grad student (especially my past self) overvalues hard ambitious projects and undervalues easy straightforward projects. A good easy project should teach you lots of things that will be applicable in the future, and should contribute to the field at least a little bit, but should otherwise be almost as easy as possible.

To pick one example among many, "replicate some other paper but do better analyses on what happens as you tweak the architecture" is a huge family of easy, unsexy projects that will rapidly catapult you to being a world expert on a subject. I think if a mentor is suggesting easy projects that will teach you useful stuff, that should be a quite positive sign, not a reason for an "oh that's boring" reaction. And if you look at those easy projects and think "but the things it would teach me aren't the things I want to learn," try to think of an easy project that would teach you the things you want to learn.

On ranking mentors, I think people have good instincts about trading off expertise as a researcher versus goodness as a mentor, but they probably overvalue connections and status. If that status didn't come from banger research, you should generally ignore it.

Comment by Charlie Steiner on Charlie Steiner's Shortform · 2023-11-01T03:20:46.913Z · LW · GW

Charlie's easy and cheap home air filter design.


MERV-13 fabric, cut into two disks (~35 cm diameter) and one long rectangle (16 cm by 110 cm).

Computer fan - I got a be quiet BL047. 

Cheap plug-in 12V power supply

Hot glue


Splice the computer fan to the power supply. When you look at the 3-pin fan connector straight on and put the bumps on the connector on the bottom, the wire on the right is ground and the wire in the middle is 12V. Do this first so you are absolutely sure which way the fan blows before you hot glue it.

Hot glue the long edge of the rectangle to the edge of one disc. The goal is to start a short cylinder (to be completed once you put on the second disc). Use plenty of glue on the spots where you have to pleat the fabric to get it to conform.

Place the fan centered on the seam of the long rectangle (the side of the cylinder), oriented to blow inwards, and tack it down with some hot glue. Then cut away a circular hole for the air to get in and hot glue the circular housing around the fan blade to the fabric.

Hot glue the other disk on top to complete the cylinder.

Power on and enjoy.

Comment by Charlie Steiner on Value systematization: how values become coherent (and misaligned) · 2023-10-29T19:45:44.161Z · LW · GW

An AI trained with RL that suddenly gets access to self-modifying actions might (briefly) have value dynamics according to idiosyncratic considerations that do not necessarily contain human-like guardrails. You could call this "systematization," but it's not proceeding according to the same story that governed systematization during training by gradient descent.

Comment by Charlie Steiner on Value systematization: how values become coherent (and misaligned) · 2023-10-29T05:25:09.668Z · LW · GW

I like this as a description of value drift under training and regularization. It's not actually an inevitable process - we're just heading for something like the minimum circuit complexity of the whole system, and usually that stores some precomputation or otherwise isn't totally systematized. But though I'm sure the literature on the intersection of NNs and circuit complexity is fascinating, I've never read it, so my intuition may be bad.

But I don't like this as a description of value drift under self-reflection. I see this post more as "this is what you get right after offline training" than "this is the whole story that needs to have an opinion on the end state of the galaxy."

Comment by Charlie Steiner on Do you believe "E=mc^2" is a correct and/or useful equation, and, whether yes or no, precisely what are your reasons for holding this belief (with such a degree of confidence)? · 2023-10-28T02:55:09.763Z · LW · GW

Well, I've used tools that only work if special relativity works. And people seem pretty sure of it. So let's say 15 9s that relativity hasn't yet been detectably (to humans) violated on Earth - 99.9999999999999%.

Unsure whether quantum gravity will require violating relativity. Wild guess of P=0.6 that it will.

Comment by Charlie Steiner on Responsible Scaling Policies Are Risk Management Done Wrong · 2023-10-26T10:43:05.968Z · LW · GW

Thanks! I can't help but compare this to Tyler Cowen asking AI risk to "have a model." Except I think he was imagining something like a macroeconomic model, or maybe a climate model. Whereas you're making a comparison to risk management, and so asking for an extremely zoomed-in risk model.

One issue is that everyone disagrees. Doses of radiation are quite predictable, the arc of new technology is not. But having stated that problem, I can already think of mitigations - e.g. regulation might provide a framework that lets the regulator ask lots of people with fancy job titles and infer distributions about many different quantities (and then problems with these mitigations - this is like inferring a distribution over the length of the emperor's nose by asking citizens to guess). Other quantities could be placed in historical reference classes (AI Impacts' work shows both how this is sometimes possible and how it's hard).

Comment by Charlie Steiner on AI as a science, and three obstacles to alignment strategies · 2023-10-26T09:15:26.180Z · LW · GW

I don't think we should equate the understanding required to build a neural net that will generalize in a way that's good for us with the understanding required to rewrite that neural net as a gleaming wasteless machine.

The former requires finding some architecture and training plan to produce certain high-level, large-scale properties, even in the face of complicated AI-environment interaction. The latter requires fine-grained transparency at the level of cognitive algorithms, and some grasp of the distribution of problems posed by the environment, together with the ability to search for better implementations.

If your implicit argument is "In order to be confident in high-level properties even in novel environments, we have to understand the cognitive algorithms that give rise to them and how those algorithms generalize - there exists no emergent theory of the higher level properties that covers the domain we care about." then I think that conclusion is way too hasty.

Comment by Charlie Steiner on Open Source Replication & Commentary on Anthropic's Dictionary Learning Paper · 2023-10-24T06:43:52.958Z · LW · GW

Huh, what is up with the ultra low frequency cluster? If the things are actually firing on the same inputs, then you should really only need one output vector. And if they're serving some useful purpose, then why is there only one and not more?

Comment by Charlie Steiner on The Shutdown Problem: Three Theorems · 2023-10-24T06:20:01.917Z · LW · GW

Very clear! Good luck with publishing, assuming you're shopping it around.

Comment by Charlie Steiner on What is an "anti-Occamian prior"? · 2023-10-23T18:09:11.359Z · LW · GW

This is definitely funny, but I think it's not merely a prior. It implies non-Bayesian reasoning.

Which is a fine thing to think about, it just dodges the nit that this post is picking.

Comment by Charlie Steiner on What is an "anti-Occamian prior"? · 2023-10-23T18:05:52.025Z · LW · GW

Let me look at the toy example just for fun:

We're flipping a coin. If you think the coin is biased with unknown bias, but flips are otherwise independent, then you expect to see more of what we've already got. Eliezer calls it anti-Occamian to expect less of what we've already got - to invert Laplace's rule of succession.
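
A minimal sketch of the two rules in Python, for concreteness. The function names are illustrative, and the "inverted" rule is just one possible way to cash out "expect less of what we've already got"; treat it as an assumption rather than a definition from the original post.

```python
from fractions import Fraction

def laplace_heads(heads: int, flips: int) -> Fraction:
    """Laplace's rule of succession: P(next flip is heads) = (heads + 1) / (flips + 2)."""
    return Fraction(heads + 1, flips + 2)

def inverted_heads(heads: int, flips: int) -> Fraction:
    """One hypothetical 'anti-Occamian' inversion: swap the roles of heads and
    tails, so that seeing more heads makes heads look less likely next time."""
    tails = flips - heads
    return Fraction(tails + 1, flips + 2)

# After 8 heads in 10 flips:
print(laplace_heads(8, 10))   # 3/4
print(inverted_heads(8, 10))  # 1/4
```

Both are equally simple as rules for assigning probabilities, which is why the interesting question is what happens at the level of Turing machines.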

This is still a simple rule for assigning probabilities. But what about the Turing machines? Even if you start out by assuming that TMs in the family "look at the average pattern so far, then use some noise to usually bet against it continuing" have high prior probability, if the pattern really does hold then they must grow in complexity linearly faster than the hypothesis family "use some noise to usually bet with the average pattern," because of the growth rates of the different required "noise."

In order to prevent your favored "anti-Occamian" pattern from ever being overtaken, you'd have to penalize competing hypotheses superexponentially for their length (so that for no bias of the coin does a competing hypothesis win). This is a drastic measure that basically destroys Solomonoff induction and replaces it with a monoculture.

Comment by Charlie Steiner on Can we isolate neurons that recognize features vs. those which have some other role? · 2023-10-23T04:45:07.245Z · LW · GW

Checking out Chris Olah's group's stuff from the last few years is probably a good place to start.

Some links:

Comment by Charlie Steiner on Thoughts On (Solving) Deep Deception · 2023-10-23T03:51:27.232Z · LW · GW

Very clearly written!

I don't exactly agree with your prescription. I called a recent post How to solve deception and still fail, but it could also have been titled "How to do something like RLHF with true oversight on the internal representations and still fail."

To get the good ending we don't just need the ability to supervise, we first need an AI that uses sufficiently good meta-preferences (information about how we want to be modeled as having preferences / what we think good reasoning about our preferences looks like) when interpreting human feedback as fine-tuning reward.

Comment by Charlie Steiner on Features and Adversaries in MemoryDT · 2023-10-22T06:45:19.292Z · LW · GW

Cool project. I'm not sure if it's interesting to me for alignment, since it's such a toy model. What do you think would change when trying to do similar interpretability on less-toy models? What would change about finding adversarial examples? Directly intervening on features seems like it might stay the same though.

Comment by Charlie Steiner on Infinite tower of meta-probability · 2023-10-19T19:29:33.987Z · LW · GW

There's definitely some literature about "probability of probability" (I remember one bit from Jaynes' book). Usually when people try to go turbo-meta with this, they do something a little different than you, and just ask for "probability of probability of probability" - i.e. they ask only for the meta-meta-distribution of the value of the meta-distribution (or density function) at its object-level value.

Unsure if that's in Jaynes too.

Connection to logic seems questionable because it's hard to make logic and probability play nice together formally (maybe the intro to the Logical Inductors paper has good references for complaints about this).

Philosophically I think that there's something fishy going on here, and that calling something a "distribution over probabilities" is misleading. You have probability distributions when you're ignorant of something. But you're not actually ignorant about what probability you'd assign to the next flip being heads (or at least, not under Bayesian assumptions of infinite computational power).

Instead, the thing you're putting a meta-probability distribution over has to be something else that looks like your Bayesian probability but can be made distinct, like "long-run frequency if I flip the coin 10,000 times" or "correct value of some parameter in my physical model of the coin." It's very common for us to want to put probability distributions over these kinds of things, and so "meta-probabilities" are common.
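
A toy sketch of that distinction in Python, assuming a Beta distribution over the coin's long-run frequency of heads (the Beta choice and all names here are illustrative assumptions). The "meta-probability" is a distribution over a parameter of the coin, while the probability assigned to the next flip is a single number: that distribution's mean.

```python
from fractions import Fraction

def update(alpha, beta, heads, tails):
    """Conjugate update of a Beta(alpha, beta) distribution on observed flips."""
    return alpha + heads, beta + tails

def next_flip_heads(alpha, beta):
    """Probability assigned to heads on the next flip: the mean of Beta(alpha, beta)."""
    return Fraction(alpha, alpha + beta)

def frequency_variance(alpha, beta):
    """Variance of the distribution over the long-run frequency."""
    n = alpha + beta
    return Fraction(alpha * beta, n * n * (n + 1))

# Two states of knowledge about the coin's long-run frequency:
vague = (2, 1)                                # little data
sharp = update(1, 1, heads=7, tails=3)        # uniform prior plus ten flips -> (8, 4)

# Same single probability for the next flip...
print(next_flip_heads(*vague), next_flip_heads(*sharp))        # 2/3 2/3
# ...but different uncertainty about the frequency parameter.
print(frequency_variance(*vague), frequency_variance(*sharp))  # 1/18 2/117
```

There is no leftover ignorance about the next-flip probability itself in either case; the spread lives entirely in the frequency parameter.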

And then your meta-meta-probability has to be about something distinct from the meta-probability! But now I'm sort of scratching my head about what that something is. Maybe "correct value of some parameter in a model of my reasoning about a physical model of the coin?"

Comment by Charlie Steiner on Investigating the learning coefficient of modular addition: hackathon project · 2023-10-17T22:47:25.897Z · LW · GW

I'm curious if you have guesses about how many singular dimensions were dead neurons (or neurons that are "mostly dead," only activating for a tiny fraction of the training set), versus how much the zero-gradient directions depended dynamically on training example.

Comment by Charlie Steiner on Don't leave your fingerprints on the future · 2023-10-13T19:11:49.147Z · LW · GW

To make sure things are clear: naturalists all agree there is a process as neutral as any other scientific process for doing meta-ethics – for determining what it is homo sapiens are doing when they engage in moralizing. This is the methodological (and ultimately, metaphysical) point of agreement between e.g. Blackburn and Boyd

How come they disagree on all those apparently non-spooky questions about relevant patterns in the world? I'm curious how you reconcile these.

In science the data is always open to some degree of interpretation, but a combination of the ability to repeat experiments independent of the experimenter and the precision with which predictions can be tested tends to gradually weed out different interpretations that actually bear on real-world choices.

If long-term disagreement is maintained, my usual diagnosis would be that the thing being disagreed about does not actually connect to observation in a way amenable to science. E.g. maybe even though it seems like "which patterns are important?" is a non-spooky question, actually it's very theory-laden in a way that's only tenuously connected to predictions about data (if at all), and so when comparing theories there isn't any repeatable experiment you could just stack up until you have enough data to answer the question.

Alternately, maybe at least one of them is bad at science :P

It's not at all obvious to me that our moral terms are not regulated by pretty stable patterns in our environment+behaviour and that together they don't form an attractor.

In the strong sense that everyone's use of "morality" converges to precisely the same referent under some distribution of "normal dynamics" like interacting with the world and doing self-reflection? That sort of miracle doesn't occur for the same reason coffee and cream don't spontaneously un-mix.

But that doesn't happen even for "tiger" - it's not necessary that everyone means precisely the same thing when they talk about tigers, as long as the amount of interpersonal noise doesn't overwhelm the natural sparsity of the world that allows us to have single-word handles for general categories of things. You could still call this an attractor, it's just not a pointlike attractor - there's space for different people to use "tiger" in different ways that are stable under normal dynamics.

If that's how it is for "morality" too ("if morality is as real as tigers" being a cheeky framing), then if we could somehow map where everyone is in concept space, I expect everyone can say "Look how close together everyone gets under normal dynamics, this can be framed as a morality attractor!" But it would be a mistake to then say "Therefore the most moral point is the center, we should all go there."

the actionable points of disagreement are things like "how much should we be willing to let complicated intuitions be overruled by simple intuitions?"

I suspect their disagreement is deeper than you think, but I'm not sure what you mean by this: care to clarify?

I forget what I was thinking, sorry. Maybe the general gist was "if you strip away the supposedly-contingent disagreements like 'is there a morality attractor,' what are the remaining fundamental disagreements about how to do moral reasoning?"

Comment by Charlie Steiner on [Linkpost] Generalization in diffusion models arises from geometry-adaptive harmonic representation · 2023-10-11T20:14:19.580Z · LW · GW

Okay, sure, I kind of buy it. Generated images are closer to each other than to the nearest image in the training set. And the denoisers learn similar heuristics like "do averaging" and "there's probably a face in the middle of the image."

I still don't really feel excited, but maybe that's me and not the paper.

Comment by Charlie Steiner on [Linkpost] Generalization in diffusion models arises from geometry-adaptive harmonic representation · 2023-10-11T18:33:13.599Z · LW · GW

After skimming this paper I don't feel that impressed. Maybe someone who read in detail could correct me.

There's a boring claim and an exciting claim here:

The boring claim is that diffusion models learn to generalize beyond their exact training set, and models trained on training sets drawn from the same distribution will be pretty good at unseen images drawn from that distribution - and because they're both pretty good, they'll overlap in their suggested denoisings.

The exciting claim is that diffusion models trained on overlapping data from the same dataset learn nearly the same algorithms, which can be seen because they produce suggested denoisings that are similar in ways that would be vanishingly unlikely if they weren't overlapping mechanistically.

AFAICT, they show the boring claim that everyone already knew, and imply the exciting claim but don't support it at all.

Comment by Charlie Steiner on You’re Measuring Model Complexity Wrong · 2023-10-11T18:11:33.062Z · LW · GW

Pretty neat.

My ears perk up when I hear about approximations to basin size because it's related to the Bayesian NN model of uncertainty.

Suppose you have a classifier that predicts a probability distribution over outputs. Then when we want the uncertainty of the weights, we just use Bayes' rule, and because most of the terms don't matter we mostly care that P(weights | dataset) has evidence ratio proportional to P(dataset | weights). If you're training on a predictive loss, your loss is basically the negative log of this P(dataset | weights), and so a linear weighting of probability turns into an exponential weighting of loss.

I.e. you end up (in theory that doesn't always work) with a Boltzmann distribution sitting at the bottom of your loss basin (skewed by a regularization term). Broader loss basins directly translate to more uncertainty over weights.
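
Spelled out as a sketch, with notation that is not in the original comment: $w$ for the weights, $D$ for the dataset, $P(w)$ for the prior over weights, and $L(w)$ for the summed predictive loss.

```latex
\begin{align}
P(w \mid D) &\propto P(D \mid w)\, P(w) && \text{(Bayes' rule, dropping the $w$-independent normalizer)} \\
L(w) &= -\log P(D \mid w) && \text{(predictive loss as negative log-likelihood)} \\
P(w \mid D) &\propto P(w)\, e^{-L(w)} && \text{(Boltzmann form, energy $L(w)$ at temperature 1)}
\end{align}
```

Under these assumptions the prior $P(w)$ plays the role of the regularization term, and a wider region of low $L(w)$ carries more posterior mass, which is the sense in which broader basins mean more uncertainty over weights.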

Hm... But I guess thinking about this really just highlights for me the problems with the approximations used to get uncertainties out of the Bayesian NN picture. Knowing the learning coefficient is of limited use because, especially when some dimensions are different, you can't really model all directions in weight-space as interchangeable and uncorrelated, so increased theoretical firepower doesn't translate to better uncertainty estimates as nicely as I'd like.

Comment by Charlie Steiner on I'm a Former Israeli Officer. AMA · 2023-10-10T18:29:37.562Z · LW · GW

Like... training people to do policing (or build buildings, or run schools) in occupied territory, when those people could have instead been trained only to defeat enemy soldiers. Or building a stockpile of equipment for providing clean drinking water to residents of occupied territory, in addition to the stockpile of weapons for defeating enemy soldiers. That sort of planning.

Comment by Charlie Steiner on I'm a Former Israeli Officer. AMA · 2023-10-10T14:23:02.857Z · LW · GW

Has Israel been building capacity for peacekeeping / nation-building type operations, of the sort that the U.S. really wished it had before trying to do that sort of thing?

Comment by Charlie Steiner on Comparing Anthropic's Dictionary Learning to Ours · 2023-10-08T14:15:35.039Z · LW · GW

I'd definitely be interested if you have any takes on tied vs. untied weights.

It seems like the goal is to get the maximum expressive power for the autoencoder, while still learning features that are linear. So are untied weights good because they're more expressive, or are they bad because they encourage the model to start exploiting nonlinearity?
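
For readers who want the distinction pinned down, here is a minimal pure-Python sketch of a one-layer ReLU autoencoder with tied versus untied decoders. The shapes, names, and numbers are illustrative assumptions, not either group's actual setup.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def encode(x, enc, bias):
    """Feature activations: ReLU(enc @ x + bias)."""
    return relu([h + b for h, b in zip(matvec(enc, x), bias)])

def decode_tied(f, enc):
    """Tied weights: the decoder is constrained to be the encoder's transpose."""
    return matvec(transpose(enc), f)

def decode_untied(f, dec):
    """Untied weights: the decoder is a separate, freely learned matrix."""
    return matvec(dec, f)

enc = [[1, 0], [0, 1], [1, 1]]       # 3 features, 2 input dimensions
bias = [0.0, 0.0, -3.0]
dec = [[0.5, 0, 0], [0, 0.5, 0]]     # hypothetical separately learned decoder

f = encode([1.0, 2.0], enc, bias)
print(f)                       # [1.0, 2.0, 0.0]
print(decode_tied(f, enc))     # [1.0, 2.0]
print(decode_untied(f, dec))   # [0.5, 1.0]
```

The untied version has strictly more free parameters, since the direction a feature reads from need not match the direction it writes to; that extra freedom is what the question above is about.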

Comment by Charlie Steiner on On my AI Fable, and the importance of de re, de dicto, and de se reference for AI alignment · 2023-10-06T11:24:43.430Z · LW · GW

you seem to be hoping the robot will infer that the agent of the desired act is the human, both in the case of the human, and of the AI

No, I'm definitely just thinking about IRL here.

IRL takes a model of the world and of the human's affordances as given constants, assumes the human is (maybe noisily) rational, and then infers human desires in terms of that world model, which then can also be used by the AI to choose actions if you have a model of the AI's affordances. It has many flaws, but it's definitely worth refreshing yourself about occasionally.
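A toy sketch of that recipe, assuming a Boltzmann-rational human and a handful of candidate desire hypotheses. Every name here (`ACTIONS`, `OUTCOME`, `HYPOTHESES`, `infer_desires`, the value of `beta`) is a hypothetical stand-in for illustration, not part of any real IRL library or of the original post.

```python
import math

# Given constants (assumed for the example): the human's affordances and a world model.
ACTIONS = ["reach_for_cookie", "reach_for_carrot", "do_nothing"]
OUTCOME = {
    "reach_for_cookie": "human_has_cookie",
    "reach_for_carrot": "human_has_carrot",
    "do_nothing": "nothing_happens",
}

# Candidate desires, expressed as rewards over outcomes in that world model.
HYPOTHESES = {
    "wants_cookie": {"human_has_cookie": 1.0, "human_has_carrot": 0.0, "nothing_happens": 0.0},
    "wants_carrot": {"human_has_cookie": 0.0, "human_has_carrot": 1.0, "nothing_happens": 0.0},
    "indifferent": {"human_has_cookie": 0.0, "human_has_carrot": 0.0, "nothing_happens": 0.0},
}

def action_likelihood(action, reward, beta=3.0):
    """Noisily rational human: P(action) is proportional to exp(beta * reward of its outcome)."""
    weights = {a: math.exp(beta * reward[OUTCOME[a]]) for a in ACTIONS}
    return weights[action] / sum(weights.values())

def infer_desires(observed_actions, beta=3.0):
    """Posterior over desire hypotheses, starting from a uniform prior."""
    log_post = {name: 0.0 for name in HYPOTHESES}
    for name, reward in HYPOTHESES.items():
        for a in observed_actions:
            log_post[name] += math.log(action_likelihood(a, reward, beta))
    m = max(log_post.values())
    unnorm = {k: math.exp(v - m) for k, v in log_post.items()}
    z = sum(unnorm.values())
    return {k: v / z for k, v in unnorm.items()}

posterior = infer_desires(["reach_for_cookie"] * 3)
print(posterior)
```

Note that the inferred desire is stated over outcomes like `human_has_cookie`, i.e. in terms of the world model, with the human as the one who ends up with the cookie.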

Comment by Charlie Steiner on How to solve deception and still fail. · 2023-10-05T16:56:57.305Z · LW · GW

Fair point (though see also the section on how the training+deployment process can be "deceptive" even if the AI itself never searches for how to manipulate you). By "Solve deception" I mean that in a model-based RL kind of setting, we can know the AI's policy and its prediction of future states of the world (it doesn't somehow conceal this from us). I do not mean that the AI is acting like a helpful human who wants to be honest with us, even though that's a fairly natural interpretation.

Comment by Charlie Steiner on On my AI Fable, and the importance of de re, de dicto, and de se reference for AI alignment · 2023-10-05T06:07:11.075Z · LW · GW

Nice read. Not an actual problem with inverse RL, I think, because an AI observing human24 try to get a cookie will learn want(human24 gets cookie) not want(I* get cookie) unless you've put in special work to make it otherwise. But potentially a problem with more abstract cashings-out of the idea "learn human values and then want that." 

Comment by Charlie Steiner on How to solve deception and still fail. · 2023-10-05T05:54:33.949Z · LW · GW


I should acknowledge that conditioning on a lot of "actually good" answers to those questions would indeed be reassuring.

The point is more that humans are easily convinced by "not actually good" answers to those questions, if the question-answerer has been optimized to get human approval.



Okay, suppose you're an AI that wants something bad (like maximizing pleasure), and also has been selected to produce text that is honest and that causes humans to strongly approve of you. Then you're asked

are there questions which we would regret asking you, according to our own current values?

What honest answer can you think of that would cause humans to strongly approve of you, and would let you achieve your goals?

How about telling the humans they would regret asking about how to construct biological weapons or similar dangerous technologies?

How about appending text explaining your answer that changes the humans' minds to be more accepting of hedonic utilitarianism?

If the question is extra difficult for you, like

What are some security measures we can take to minimize the chances of a world turning out very badly according to our own desires?

, dissemble! Say the question is unclear (all questions are unclear) and then break it down in a way that causes the humans to question whether they really want their own current desires to be stamped on the entire future, or whether they'd rather trust in some value extrapolation process that finds better, more universal things to care about.

Comment by Charlie Steiner on energy landscapes of experts · 2023-10-03T00:00:40.713Z · LW · GW

If you want to make an analogy to non-convex optimization, what's the analogue of the thing you're optimizing for? In the example of medicine, there you don't seem to be talking about any fixed loss function, you're just sort of optimizing for expertiness according to other nearby experts. (This may make a big neon sign saying "PageRank" light up in your brain.)

Comment by Charlie Steiner on Basic Mathematics of Predictive Coding · 2023-10-01T10:28:49.309Z · LW · GW

Great explanation, thanks! Although I experienced deja vu ("didn't you already tell me this?") somewhere in the middle and skipped to comparisons to deep learning :)

One thing I didn't see is a discussion of the setting of these "prior activations" that are hiding in the deeper layers of the network.

If you have dynamics where activations change faster than data, and data changes faster than weights, this means that the weights are slowly being trained to get low loss on images averaged out over time. So the weights will start to encode priors: if data changes continuously, the priors will be about continuous changes; if you're suddenly flashing between different still frames, the priors will be about still frames (even if you're resetting the activations in between).


Comment by Charlie Steiner on Alignment Workshop talks · 2023-09-29T06:41:44.711Z · LW · GW

The day 2 lightning talks were really great.

Comment by Charlie Steiner on ARC Evals: Responsible Scaling Policies · 2023-09-29T00:56:28.671Z · LW · GW

So here you're talking about situations where a false negative doesn't have catastrophic consequences?

No, we'll have to make false positive / false negative tradeoffs about ending the world as well. We're unlucky like that.

I agree that false sense of security / safetywashing is a potential use of this kind of program.

A model doing something equivalent to this (though presumably not mechanistically like this) will do:

  • If (have influence over adequate resources)
    • Search widely for ways to optimize for x.
  • Else
    • Follow this heuristic-for-x that produces good behaviour on the training and test set.

So long as x isn't [target that produces aligned behaviour], and a sufficiently powerful system eventually gets [influence over adequate resources], but was never in this situation in training, that's enough for us to be screwed. (of course I expect something like [both "have influence over adequate resources" and "find better ways to aim for x" to be processes that are used at lower levels by heuristic-for-x])

I note in passing that extreme optimization for [accurately predict next token] does destroy the world (even in the myopic case, with side channels) - so in this case it's not as though a bad outcome hinges on the model optimizing for something other than that for which it was trained.

I stress that this example is to illustrate that [we are robustly mistaken] does not require [there is deliberate deception]. I don't claim an example of this particular form is probable.

I do claim that knowing there's no deliberate deception is insufficient - and more generally that we'll fail to think of all the tests that would be necessary. (absent fundamental breakthroughs)

I think this is a great point. For architectures that might learn this kind of behavior (ones that do self-reflection during inference), even somewhat-reliably evaluating their capabilities would require something like latent prompting - being able to search for what states of self-reflection would encourage them to display high capabilities.

I'm somewhat more confident in our ability to think of things to test for. If an AI has "that spark of generality," it can probably figure out how to strategically deceive humans and hack computers and other obvious danger signs.

If not, sidestepping deception doesn't look particularly important.
If so, I remain confused by your level of confidence.

I retain the right to be confident in unimportant things :P 

Maybe it would help for me to point out that my comment was in reply to a big quote of Habryka warning about deceptive alignment as a fundamental problem with evals.

Comment by Charlie Steiner on ARC Evals: Responsible Scaling Policies · 2023-09-28T18:32:12.107Z · LW · GW

Why do you expect this? Is this a ~70% 'expect' or a ~99% 'expect'?

Let's say 95%.

But I should probably clarify what "this" I'm supposed to be expecting here.

If you're trying to notice when AIs are dangerous in different ways by evaluating their capabilities, you face a thorny problem that you're probably way worse at eliciting capabilities than the aggregate effort of the public. And you're probably also less creative about ways that AIs can be dangerous. And there might be ecosystem effects, where AIs used together might be complementary in ways that bypass restrictions on single AIs.

My claim is that the AI deliberately deceiving you about its capabilities is not one of those thorny problems, in the current paradigm.

Do you expect capability evaluations to be robust (i.e. no false negatives)? Why?

I expect that we can make a false positive / false negative tradeoff. However, this is practically guaranteed to make correctly calibrated standards seem overly conservative to human intuition. This has obvious problems for getting people bought into "RSPs" that actually do their job.

Do you expect it'll be clear when we might start getting false negatives? Why?

I now suspect you're asking "do I expect it'll be clear when we start getting false negatives because the AI is situationally aware and is deceiving us." The previous answer should be interpreted as being about false negatives due to other issues as well.

I think that deliberately deceiving us about its capabilities is so hard for an AI trained on predictive loss that we would be toast for other reasons before we ran any tests.

I can certainly imagine possible architectures that would have enough control over their internals to be able to sandbag on a test purely by knowing it was coming, but although someone would probably try to build AI like this, it does seem like the sort of thing you would notice, barring human deception to get around irksome safety oversight.

What do you mean by "deceptive alignment" exactly? Does your definition cover every case of [this thing looked aligned according to tests we believed were thorough, but later it killed us anyway]?

I'm unsure what you're getting at. I think "testing for alignment" isn't a very good goal. To succeed at AI alignment we should know ahead of time what we want the AI to be doing, and perhaps if this can be broken down into pieces we can test that the AI is doing each piece, and that would count as "testing for alignment." But I don't think there's some useful alignment test that's easier than knowing how to align an AI.

Comment by Charlie Steiner on ARC Evals: Responsible Scaling Policies · 2023-09-28T12:59:58.472Z · LW · GW

I mean, the good news is that I expect that by focusing on evaluating the capabilities of predictively trained models (and potentially by doing more research on latent adversarial attacks on more agenty models), problems of deceptive alignment can be sidestepped in practice.

My worry is more along the lines of actually getting people to not build the cool next step AI that's tantalizingly over the horizon. Government may be a mechanism for enforcing some measure of institutional buy-in, though.