Posts

So You Want To Make Marginal Progress... 2025-02-07T23:22:19.825Z
Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals 2025-01-24T20:20:28.881Z
The Case Against AI Control Research 2025-01-21T16:03:10.143Z
What Is The Alignment Problem? 2025-01-16T01:20:16.826Z
The Plan - 2024 Update 2024-12-31T13:29:53.888Z
The Field of AI Alignment: A Postmortem, and What To Do About It 2024-12-26T18:48:07.614Z
What Have Been Your Most Valuable Casual Conversations At Conferences? 2024-12-25T05:49:36.711Z
The Median Researcher Problem 2024-11-02T20:16:11.341Z
Three Notions of "Power" 2024-10-30T06:10:08.326Z
Information vs Assurance 2024-10-20T23:16:25.762Z
Minimal Motivation of Natural Latents 2024-10-14T22:51:58.125Z
Values Are Real Like Harry Potter 2024-10-09T23:42:24.724Z
We Don't Know Our Own Values, but Reward Bridges The Is-Ought Gap 2024-09-19T22:22:05.307Z
Why Large Bureaucratic Organizations? 2024-08-27T18:30:07.422Z
... Wait, our models of semantics should inform fluid mechanics?!? 2024-08-26T16:38:53.924Z
Interoperable High Level Structures: Early Thoughts on Adjectives 2024-08-22T21:12:38.223Z
A Robust Natural Latent Over A Mixed Distribution Is Natural Over The Distributions Which Were Mixed 2024-08-22T19:19:28.940Z
What is "True Love"? 2024-08-18T16:05:47.358Z
Some Unorthodox Ways To Achieve High GDP Growth 2024-08-08T18:58:56.046Z
A Simple Toy Coherence Theorem 2024-08-02T17:47:50.642Z
A Solomonoff Inductor Walks Into a Bar: Schelling Points for Communication 2024-07-26T00:33:42.000Z
(Approximately) Deterministic Natural Latents 2024-07-19T23:02:12.306Z
Dialogue on What It Means For Something to Have A Function/Purpose 2024-07-15T16:28:56.609Z
3C's: A Recipe For Mathing Concepts 2024-07-03T01:06:11.944Z
Corrigibility = Tool-ness? 2024-06-28T01:19:48.883Z
What is a Tool? 2024-06-25T23:40:07.483Z
Towards a Less Bullshit Model of Semantics 2024-06-17T15:51:06.060Z
My AI Model Delta Compared To Christiano 2024-06-12T18:19:44.768Z
My AI Model Delta Compared To Yudkowsky 2024-06-10T16:12:53.179Z
Natural Latents Are Not Robust To Tiny Mixtures 2024-06-07T18:53:36.643Z
Calculating Natural Latents via Resampling 2024-06-06T00:37:42.127Z
Value Claims (In Particular) Are Usually Bullshit 2024-05-30T06:26:21.151Z
When Are Circular Definitions A Problem? 2024-05-28T20:00:23.408Z
Why Care About Natural Latents? 2024-05-09T23:14:30.626Z
Some Experiments I'd Like Someone To Try With An Amnestic 2024-05-04T22:04:19.692Z
Examples of Highly Counterfactual Discoveries? 2024-04-23T22:19:19.399Z
Why Would Belief-States Have A Fractal Structure, And Why Would That Matter For Interpretability? An Explainer 2024-04-18T00:27:43.451Z
Generalized Stat Mech: The Boltzmann Approach 2024-04-12T17:47:31.880Z
How We Picture Bayesian Agents 2024-04-08T18:12:48.595Z
Coherence of Caches and Agents 2024-04-01T23:04:31.320Z
Natural Latents: The Concepts 2024-03-20T18:21:19.878Z
The Worst Form Of Government (Except For Everything Else We've Tried) 2024-03-17T18:11:38.374Z
The Parable Of The Fallen Pendulum - Part 2 2024-03-12T21:41:30.180Z
The Parable Of The Fallen Pendulum - Part 1 2024-03-01T00:25:00.111Z
Leading The Parade 2024-01-31T22:39:56.499Z
A Shutdown Problem Proposal 2024-01-21T18:12:48.664Z
Some Vacation Photos 2024-01-04T17:15:01.187Z
Apologizing is a Core Rationalist Skill 2024-01-02T17:47:35.950Z
The Plan - 2023 Version 2023-12-29T23:34:19.651Z
Natural Latents: The Math 2023-12-27T19:03:01.923Z

Comments

Comment by johnswentworth on How might we safely pass the buck to AI? · 2025-02-23T05:39:58.612Z · LW · GW

That's a much more useful answer, actually. So let's bring it back to Eliezer's original question:

Can you tl;dr how you go from "humans cannot tell which alignment arguments are good or bad" to "we justifiably trust the AI to report honest good alignment takes"?  Like, not with a very large diagram full of complicated parts such that it's hard to spot where you've messed up.  Just whatever simple principle you think lets you bypass GIGO.

[...]

Broadly speaking, the standard ML paradigm lets you bootstrap somewhat from "I can verify whether this problem was solved" to "I can train a generator to solve this problem". 

So to summarize your short, simple answer to Eliezer's question: you want to "train AI agents that are [somewhat] smarter than ourselves with ground truth reward signals from synthetically generated tasks created from internet data + a bit of fine-tuning with scalable oversight at the end". And then you hope/expect/(??have arguments or evidence??) that this allows us to (?justifiably?) trust the AI to report honest good alignment takes sufficient to put shortly-posthuman AIs inside the basin of attraction of a good eventual outcome, despite (as Eliezer puts it) humans being unable to tell which alignment takes are good or bad.

Or, to compact the summary even further: you want to train the somewhat-smarter-than-human AI on easily-verifiable synthetically-generated tasks, and then hope/expect that its good performance on those tasks generalizes to a problem which is not easily verifiable or synthetically generated, namely the problem of checking that a next generation of AI is in the basin of attraction of a good-to-humans outcome.

(Note: I know you've avoided talking about the basin of attraction of a good-to-humans outcome, instead focusing on just some short-term goal like e.g. not being killed by the very next generation of AI. Not focusing on the basin of attraction is a mistake, and we can go into why it's a mistake if that turns out to be cruxy.)

In Eliezer's comment, he was imagining a training setup somewhat different from easily-verifiable synthetically-generated tasks:

Assume that whenever OpenPhil tries to run an essay contest for saying what they're getting wrong, their panel of judges ends up awarding the prize to somebody reassuringly saying that AI risk is an even smaller deal than OpenPhil thinks.  How does OpenPhil bootstrap from that pattern of thumbs-up/thumbs-down to an AI that actually has better-than-OpenPhil alignment takes?

... but the analogue of the problem Eliezer was highlighting, in the context of training on easily-verifiable synthetically-generated tasks, is the question: how and why would we justifiably trust that an AI trained on easily-verifiable synthetic tasks generalizes to not-easily-verifiable real-world tasks?

Comment by johnswentworth on How might we safely pass the buck to AI? · 2025-02-22T18:40:02.518Z · LW · GW

The 3 month Eliezer sim might spin up many copies of other 3 month Eliezer sims, which together produce outputs that a 6-month Eliezer sim might produce.

This seems very blatantly not viable-in-general, in both theory and practice.

On the theory side: there are plenty of computations which cannot be significantly accelerated via parallelism with less-than-exponential resources. (If we do have exponential resources, then all binary circuits can be reduced to depth 2, but in the real world we do not and will not have exponential resources.) Serial computation requirements are, in fact, a thing. So you can't just have a bunch of Eliezers do in 3 months what a single 6 month Eliezer could do, in general.

Even if you allow the 3-month Eliezer sims to act one-after-another, rather than trying to get everything done in 3 months simulated time via pure parallelism, there's still a tight communication bottleneck between each Eliezer sim. There are presumably plenty of circuits which cannot be implemented with tight information bottlenecks every n serial steps.

... of course you could circumvent all that theory by e.g. just having each Eliezer emulate a single logic gate, or some comparably trivial thing, but at that point you run afoul of the non-compositionality of safety properties: putting together a bunch of "safe" things (or "interpretable" things, or "aligned" things, or "corrigible" things, ...) does not in-general produce a safe thing.

So that's the theory side. What about the practice side?

Well, Ought did roughly that experiment years ago and it did not work at all. And that should not be surprising - as the link argues, we have extremely ample evidence from day-to-day life that such things do not work in general.

Comment by johnswentworth on Why Agent Foundations? An Overly Abstract Explanation · 2025-02-13T23:04:31.461Z · LW · GW

I think that's basically right, and good job explaining it clearly and compactly.

I would also highlight that it's not just about adversaries. One of the main powers of proof-given-assumptions is that it allows one to rule out large classes of unknown unknowns in one go. And, insofar as the things-proven-given-assumptions turn out to be false, it allows one to detect previously-unknown unknowns.

Comment by johnswentworth on So You Want To Make Marginal Progress... · 2025-02-09T17:07:56.591Z · LW · GW

Now to get a little more harsh...

Without necessarily accusing Kaj specifically, this general type of argument feels motivated to me. It feels like willful ignorance, like sticking one's head in the sand and ignoring the available information, because one wants to believe that All Research is Valuable or that one's own research is valuable or some such, rather than facing the harsh truth that much research (possibly one's own) is predictably-in-advance worthless.

Comment by johnswentworth on So You Want To Make Marginal Progress... · 2025-02-09T16:51:35.628Z · LW · GW

That sort of reasoning makes sense insofar as it's hard to predict which small pieces will be useful. And while that is hard to some extent, it is not full we-just-have-no-idea-so-use-a-maxent-prior hard. There is plenty of work (including lots of research which people sink their lives into today) which will predictably-in-advance be worthless. And robust generalizability is the main test I know of for that purpose.

Applying this to your own argument:

Often when I've had a hypothesis about something that interests me, I've been happy that there has been *so much* scientific research done on various topics, many of them seemingly insignificant. While most of it is of little interest to me, the fact that there's so much of it means that there's often some prior work on topics that do interest me.

It will predictably and systematically be the robustly generalizable things which are relevant to other people in unexpected ways.

Comment by johnswentworth on So You Want To Make Marginal Progress... · 2025-02-08T03:12:36.783Z · LW · GW

Yup, if you actually have enough knowledge to narrow it down to e.g. a 65% chance of one particular major route, then you're good. The challenging case is when you have no idea what the options even are for the major route, and the possibility space is huge.

Comment by johnswentworth on So You Want To Make Marginal Progress... · 2025-02-08T02:46:51.977Z · LW · GW

Yeah ok. Seems very unlikely to actually happen, and unsure whether it would even work in principle (as e.g. scaling might not take you there at all, or might become more resource intensive faster than the AIs can produce more resources). But I buy that someone could try to intentionally push today's methods (both AI and alignment) to far superintelligence and simply turn down any opportunity to change paradigm.

Comment by johnswentworth on So You Want To Make Marginal Progress... · 2025-02-08T02:43:39.069Z · LW · GW

Aliens kill you due to slop, humans depend on the details.

The basic issue here is that the problem of slop (i.e. outputs which look fine upon shallow review but aren't fine) plus the problem of aligning a parent-AI in such a way that its more-powerful descendants will robustly remain aligned, is already the core of the superintelligence alignment problem. You need to handle those problems in order to safely do the handoff, and at that point the core hard problems are done anyway. Same still applies to aliens: in order to safely do the handoff, you need to handle the "slop/nonslop is hard to verify" problem, and you need to handle the "make sure agents the aliens build will also be aligned, and their children, etc" problem.

Comment by johnswentworth on So You Want To Make Marginal Progress... · 2025-02-08T01:45:19.276Z · LW · GW

It's not clear to me we'll have (or will "need") new paradigms before fully handing over all technical and strategic work to AIs which are capable enough to obsolete humans at all cognitive tasks.

If you want to not die to slop, then "fully handing over all technical and strategic work to AIs which are capable enough to obsolete humans at all cognitive tasks" is not a thing which happens at all until the full superintelligence alignment problem is solved. That is how you die to slop.

Comment by johnswentworth on So You Want To Make Marginal Progress... · 2025-02-08T01:43:09.856Z · LW · GW

If by superintelligence, you mean wildly superhuman AI, it remains non-obvious to me that new paradigms are needed (though I agree they will pretty likely arise prior to this point due to AIs doing vast quantity of research if nothing else). I think thoughtful and laborious implementation of current paradigm strategies (including substantial experimentation) could directly reduce risk from handing off to superintelligence down to perhaps 25% and I could imagine being argued considerably lower.

I find it hard to imagine such a thing being at all plausible. Are you imagining that jupiter brains will be running neural nets? That their internal calculations will all be differentiable? That they'll be using strings of human natural language internally? I'm having trouble coming up with any "alignment" technique of today which would plausibly generalize to far superintelligence. What are you picturing?

Comment by johnswentworth on So You Want To Make Marginal Progress... · 2025-02-08T01:04:47.685Z · LW · GW

This post seems to assume that research fields have big hard central problems that are solved with some specific technique or paradigm.

This isn't always true. [...]

I would say it is basically-always true, but there are some fields (including deep learning today, for purposes of your comment) where the big hard central problems have already been solved, and therefore the many small pieces of progress on subproblems are all of what remains.

And insofar as there remains some problem which is simply not solvable within a certain paradigm, that is a "big hard central problem", and progress on the smaller subproblems of the current paradigm is unlikely by-default to generalize to whatever new paradigm solves that big hard central problem.

I agree that paradigm shifts can invalidate large amounts or prior work, but it isn't obvious whether this will occur in AI safety.

I claim it is extremely obvious and very overdetermined that this will occur in AI safety sometime between now and superintelligence. The question which you'd probably find more cruxy is not whether, but when - in particular, does it come before or after AI takes over most of the research?

... but (I claim) that shouldn't be the cruxy question, because we should not be imagining completely handing off the entire alignment-of-superintelligence problem to early transformative AI; that's a recipe for slop. We ourselves need to understand a lot about how things will generalize beyond the current paradigm, in order to recognize when that early transformative AI is itself producing research which will generalize beyond the current paradigm, in the process of figuring out how to align superintelligence. If an AI assistant produces alignment research which looks good to a human user, but won't generalize across the paradigm shifts between here and superintelligence, then that's a very plausible way for us to die.

Comment by johnswentworth on The Risk of Gradual Disempowerment from AI · 2025-02-07T20:17:31.611Z · LW · GW

Most importantly, current proposed technical plans are necessary but not sufficient to stop this. Even if the technical side fully succeeds no one knows what to do with that.

I don't think that's quite accurate. In particular, gradual disempowerment is exactly the sort of thing which corrigibility would solve. (At least for "corrigibility" in the sense David and I use the term, and probably Yudkowsky, but not Christiano's sense; he uses the term to mean a very different thing.)

A general-purpose corrigible AI (in the sense we use the term) is pretty accurately thought-of as an extension of the user. Building and using such an AI is much more like "uplifting" the user than like building an independent agent. It's the cognitive equivalent of gaining prosthetic legs, as opposed to having someone carry you around on a sedan. Another way to state it: a corrigible subsystem acts like it's a part of a larger agent, serving a particular purpose as a component of the larger agent, as opposed to acting like an agent in its own right.

... admittedly corrigibility is still very much in the "conceptual" stage, far from an actual technical plan. But it's at least a technical research direction which would pretty directly address the disempowerment problem.

Comment by johnswentworth on C'mon guys, Deliberate Practice is Real · 2025-02-07T17:03:07.547Z · LW · GW

I think that accurately summarizes the concrete things, but misses a mood.

The missing mood is less about "grappling with the enormity of it", and more about "grappling with the effort of it" or "accepting the need to struggle for real". Like, in terms of time spent, we're talking maybe a year or two of full-time effort, spread out across maybe 3-5 years. That's far more than the post grapples with, but not prohibitively enormous; it's comparable to getting a degree. The missing mood is more about "yup, it's gonna be hard, and I'm gonna have to buckle down and do the hard thing for reals, not try to avoid it". The technical knowledge is part of that - like, "yup, I'm gonna have to actually for-reals learn some gnarly technical stuff, not try to avoid it". But not just the technical study. For instance, actually for-reals noticing and admitting when I'm avoiding unpleasant truths, or when my plans won't work and I need to change tack, has a similar feel: "yup, my plans are actually for-reals trash, I need to actually for-reals update, and I don't yet have any idea what to do instead, and it looks hard".

Comment by johnswentworth on C'mon guys, Deliberate Practice is Real · 2025-02-07T00:56:37.146Z · LW · GW

I don't claim to know all the key pieces of whatever they did, but some pieces which are obvious from talking to them both and following their writings:

  • Both invested heavily in self-studying a bunch of technical material.
  • Both heavily practiced the sorts of things Joe described in his recent "fake thinking and real thinking" post. For instance, they both have a trained habit of noticing the places where most people would gloss over things (especially technical/mathematical), and instead looking for a concrete example.
  • Both heavily practiced the standard metacognitive moves, e.g. noticing when a mental move worked very well/poorly and reinforcing accordingly, asking "how could I have noticed that faster", etc.
  • Both invested in understanding their emotional barriers and figuring out sustainable ways to handle them. (I personally think both of them have important shortcomings in that department, but they've at least put some effort into it and are IMO doing better than they would otherwise.)
  • Both have a general mindset of actually trying to notice their own shortcomings and improve, rather than make excuses or hide.
  • And finally: both put a pretty ridiculous amount of effort into improving, in terms of raw time and energy, and in terms of emotional investment, and in terms of "actually doing the thing for real not just talking or playing".

Comment by johnswentworth on C'mon guys, Deliberate Practice is Real · 2025-02-06T21:31:33.607Z · LW · GW

I feel like the "this will just take so much time" section doesn't really engage with the full-strength version of the critique.

When I think of people I know who have successfully gone from unremarkable to reasonably impressive via some kind of deliberate practice and training, the list consists of Nate Soares and Alex Turner. That's it; that's the entire list. Notably, they both followed a pretty similar path to get there (not by accident). And that path is not short.

Sure, you can do a lightweight version of "feedbackloop rationality" which just involves occasional short reviews or whatever. But that does not achieve most of the value.

My pitch to someone concerned about the timesink would instead be roughly: "Look, you know deep down that you are incompetent and don't know what you're doing (in the spirit of Impostor Syndrome and the Great LARP). You're clinging to these stories about how the thing you're currently doing happens to be useful somehow, but some part of you knows perfectly well that you're doing your current work mainly just because you can do your current work without facing the scary fact of your own ineptitude when faced with any actually useful (and difficult) problem. The only way you will ever do something of real value is if you level the fuck up. Is that going to be a big timesink and take a lot of effort? Yes. But guess what? There isn't actually a trade-off here. Your current work is not useful, you are not going to do anything useful without getting stronger, so how about you stop hiding from your own ineptitude and just go get stronger already?".

(And yes, that is the sort of pitch I sometimes make to myself when considering whether to dump time and effort into some form of deliberate practice.)

Comment by johnswentworth on What do coherence arguments actually prove about agentic behavior? · 2025-02-03T16:57:29.298Z · LW · GW

If you already accept the concept of expected utility maximization, then you could also use mixed strategies to get the convexity-like assumption (but that is not useful if the point is to motivate using probabilities and expected utility maximization).

That is indeed what I had in mind when I said we'd need another couple sentences to argue that the agent maximizes expected utility under the distribution. It is less circular than it might seem at first glance, because two importantly different kinds of probabilities are involved: uncertainty over the environment (which is what we're deriving), and uncertainty over the agent's own actions arising from mixed strategies.

Comment by johnswentworth on The Field of AI Alignment: A Postmortem, and What To Do About It · 2025-01-31T18:23:51.066Z · LW · GW

Big crux here: I don't actually expect useful research to occur as a result of my control-critique post. Even having updated on the discussion remaining more civil than I expected, I still expect basically-zero people to do anything useful as a result.

As a comparison: I wrote a couple posts on my AI model delta with Yudkowsky and with Christiano. For each of them, I can imagine changing ~one big piece in my model and ending up with a model which looks basically like theirs.

By contrast, when I read the stuff written on the control agenda... it feels like there is no model there at all. (Directionally-correct but probably not quite accurate description:) it feels like whoever's writing, or whoever would buy the control agenda, is just kinda pattern-matching natural language strings without tracking the underlying concepts those strings are supposed to represent. (Joe's recent post on "fake vs real thinking" feels like it's pointing at the right thing here; the posts on control feel strongly like "fake" thinking.) And that's not a problem which gets fixed by engaging at the object level; that type of cognition will mostly not produce useful work, so getting useful work out of such people would require getting them to think in entirely different ways.

... so mostly I've tried to argue at a different level, like e.g. in the Why Not Just... posts. The goal there isn't really to engage the sort of people who would otherwise buy the control agenda, but rather to communicate the underlying problems to the sort of people who would already instinctively feel something is off about the control agenda, and give them more useful frames to work with. Because those are the people who might have any hope of doing something useful, without the whole structure of their cognition needing to change first.

Comment by johnswentworth on The Case Against AI Control Research · 2025-01-31T16:21:34.450Z · LW · GW

Even if all of those are true, the argument in the post would still imply that control research (at least of the sort people do today) cannot have very high expected value. Like, sure, let's assume for sake of discussion that most total AI safety research will be done by early transformative AI, that the only chance of aligning superintelligent AIs is to delegate, that control research is unusually tractable, and that for some reason we're going to use the AIs to pursue formal verification (not a good idea, but whatever).

Even if we assume all that, we still have the problem that control research of the sort people do today does basically-nothing to address slop; it is basically-exclusively focused on intentional scheming. Insofar as intentional scheming is not the main thing which makes outsourcing to early AIs fail, all that control research cannot have very high expected value. None of your bullet points address that core argument at all.

Comment by johnswentworth on johnswentworth's Shortform · 2025-01-30T17:14:30.071Z · LW · GW

Just because the number of almost-orthogonal vectors in d dimensions scales exponentially with d, doesn't mean one can choose all those signals independently. We can still only choose d real-valued signals at a time (assuming away the sort of tricks by which one encodes two real numbers in a single real number, which seems unlikely to happen naturally in the body). So "more intended behaviors than input-vector components" just isn't an option, unless you're exploiting some kind of low-information-density in the desired behaviors (like e.g. very "sparse activation" of the desired behaviors, or discreteness of the desired behaviors to a limited extent).
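
A quick numerical illustration of the distinction (my own sketch, not part of the original comment; the dimension and count are arbitrary): random unit vectors in d dimensions are nearly orthogonal even in large numbers, yet the whole collection still spans at most d directions, so only d real-valued signals can be chosen independently.

    # Sketch: many almost-orthogonal directions, but only d independent signals.
    import numpy as np

    rng = np.random.default_rng(0)
    d, n = 100, 1000  # dimension, and number of candidate signal directions (n >> d)

    V = rng.standard_normal((n, d))
    V /= np.linalg.norm(V, axis=1, keepdims=True)  # n random unit vectors in d dimensions

    cosines = np.abs(V @ V.T)[~np.eye(n, dtype=bool)]
    print("typical |cosine| between distinct directions:", cosines.mean())  # ~sqrt(2/(pi*d)), about 0.08 here

    # But the collection's rank is still only d: at most d real-valued signals can vary independently.
    print("rank of the whole collection:", np.linalg.matrix_rank(V))  # == d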

Comment by johnswentworth on Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals · 2025-01-30T17:07:27.122Z · LW · GW

TBC, I don't particularly expect hard constraints to show up; that was more a way of illustrating the underlying concept. The same underlying concept in the market-style picture would be: across many different top-level goals, there are convergent ways of carving up "property rights". So, a system can be generally corrigible by "respecting the convergent property rights", so to speak.

Comment by johnswentworth on Yudkowsky on The Trajectory podcast · 2025-01-28T21:57:06.130Z · LW · GW

40 standard deviations away from natural human IQ would have an IQ of 600

Nitpick: 700.
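
(Assuming the usual convention of IQ mean 100 and standard deviation 15: 100 + 40 × 15 = 700, not 600.)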

Comment by johnswentworth on Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals · 2025-01-27T18:38:56.887Z · LW · GW

Yup, exactly, and good job explaining it too.

Comment by johnswentworth on johnswentworth's Shortform · 2025-01-26T21:13:27.701Z · LW · GW

Yup, I'm familiar with that one. The big difference is that I'm backward-chaining, whereas that post forward chains; the hope of backward chaining would be to identify big things which aren't on peoples' radar as nootropics (yet).

(Relatedly: if one is following this sort of path, step 1 should be a broad nutrition panel and supplementing anything in short supply, before we get to anything fancier.)

Comment by johnswentworth on johnswentworth's Shortform · 2025-01-26T20:12:08.245Z · LW · GW

Here's a side project David and I have been looking into, which others might have useful input on...

Background: Thyroid & Cortisol Systems

As I understand it, thyroid hormone levels are approximately-but-accurately described as the body's knob for adjusting "overall metabolic rate" or the subjective feeling of needing to burn energy. Turn up the thyroid knob, and people feel like they need to move around, bounce their leg, talk fast, etc (at least until all the available energy sources are burned off and they crash). Turn down the thyroid knob, and people are lethargic.

That sounds like the sort of knob which should probably typically be set higher, today, than was optimal in the ancestral environment. Not cranked up to 11; hyperthyroid disorders are in fact dangerous and unpleasant. But at least set to the upper end of the healthy range, rather than the lower end.

... and that's nontrivial. You can just dump the relevant hormones (T3/T4) into your body, but there's a control system which tries to hold the level constant. Over the course of months, the thyroid gland (which normally produces T4) will atrophy, as it shrinks to try to keep T4 levels fixed. Just continuing to pump T3/T4 into your system regularly will keep you healthy - you'll basically have a hypothyroid disorder, and supplemental T3/T4 is the standard treatment. But you better be ready to manually control your thyroid hormone levels indefinitely if you start down this path. Ideally, one would intervene further up the control loop in order to adjust the thyroid hormone set-point, but that's more of a research topic than a thing humans already have lots of experience with.

So that's thyroid. We can tell a similar story about cortisol.

As I understand it, the cortisol hormone system is approximately-but-accurately described as the body's knob for adjusting/tracking stress. That sounds like the sort of knob which should probably be set lower, today, than was optimal in the ancestral environment. Not all the way down; problems would kick in. But at least set to the lower end of the healthy range.

... and that's nontrivial, because there's a control loop in place, etc. Ideally we'd intervene on the relatively-upstream parts of the control loop in order to change the set point.

We'd like to generalize this sort of reasoning, and ask: what are all the knobs of this sort which we might want to adjust relative to their ancestral environment settings?

Generalization

We're looking for signals which are widely broadcast throughout the body, and received by many endpoints. Why look for that type of thing? Because the wide usage puts pressure on the signal to "represent one consistent thing". It's not an accident that there are individual hormonal signals which are approximately-but-accurately described by the human-intuitive phrases "overall metabolic rate" or "stress". It's not an accident that those hormones' signals are not hopelessly polysemantic. If we look for widely-broadcast signals, then we have positive reason to expect that they'll be straightforwardly interpretable, and therefore the sort of thing we can look at and (sometimes) intuitively say "I want to turn that up/down".

Furthermore, since these signals are widely broadcast, they're the sort of thing which impacts lots of stuff (and is therefore impactful to intervene upon). And they're relatively easy to measure, compared to "local" signals.

The "wide broadcast" criterion helps focus our search a lot. For instance, insofar as we're looking for chemical signals throughout the whole body, we probably want species in the bloodstream; that's the main way a concentration could be "broadcast" throughout the body, rather than being a local signal. So, basically endocrine hormones.

Casting a slightly wider net, we might also be interested in:

  • Signals widely broadcast through the body by the nervous system.
  • Chemical signals widely broadcast through the brain specifically (since that's a particularly interesting/relevant organ).
  • Non-chemical signals widely broadcast through the brain specifically.

... and of course for all of these there will be some control system, so each has its own tricky question about how to adjust it.

Some Promising Leads, Some Dead Ends

With some coaxing, we got a pretty solid-sounding list of endocrine hormones out of the LLMs. There were some obvious ones on the list, including thyroid and cortisol systems, sex hormones, and pregnancy/menstruation signals. There were also a lot of signals for homeostasis of things we don't particularly want to adjust: salt balance, calcium, digestion, blood pressure, etc. There were several inflammation and healing signals, which we're interested in but haven't dug into yet. And then there were some cool ones: oxytocin (think mother-child bonding), endocannabinoids (think pot), satiety signals (think Ozempic). None of those really jumped out as clear places to turn a knob in a certain direction, other than obvious things like "take ozempic if you are even slightly overweight" and the two we already knew about (thyroid and cortisol).

Then there were neuromodulators. Here's the list we coaxed from the LLMs:

  • Dopamine: Tracks expected value/reward - how good things are compared to expectations.
  • Norepinephrine: Sets arousal/alertness level - how much attention and energy to devote to the current situation.
  • Serotonin: Regulates resource availability mindset - whether to act like resources are plentiful or scarce. Affects patience, time preference, and risk tolerance.
  • Acetylcholine: Controls signal-to-noise ratio in neural circuits - acts like a gain/precision parameter, determining whether to amplify precise differences (high ACh) or blur things together (low ACh).
  • Histamine: Manages the sleep/wake switch - promotes wakefulness and suppresses sleep when active.
  • Orexin: Acts as a stability parameter for brain states - increases the depth of attractor basins and raises transition barriers between states. Higher orexin = stronger attractors = harder to switch states.

Of those, serotonin immediately jumps out as a knob you'd probably want to turn to the "plentiful resources" end of the healthy spectrum, compared to the ancestral environment. That puts the widespread popularity of SSRIs in an interesting light!

Moving away from chemical signals, brain waves (alpha waves, theta oscillations, etc) are another potential category - they're oscillations at particular frequencies which (supposedly) are widely synced across large regions of the brain. I read up just a little, and so far have no idea how interesting they are as signals or targets.

Shifting gears, the biggest dead end so far has been parasympathetic tone, i.e. overall activation level of the parasympathetic nervous system. As far as I can tell, parasympathetic tone is basically Not A Thing: there are several different ways to measure it, and the different measurements have little correlation. It's probably more accurate to think of parasympathetic nervous activity as localized, without much meaningful global signal.

Anybody see obvious things we're missing?

Comment by johnswentworth on Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals · 2025-01-26T19:50:09.348Z · LW · GW

However, the corrigibility-via-instrumental-goals does have the feel of "make the agent uncertain regarding what goals it will want to pursue next".

That's an element, but not the central piece. The central piece (in the subagents frame) is about acting-as-though there are other subagents in the environment which are also working toward your terminal goal, so you want to avoid messing them up.

The "uncertainty regarding the utility function" enters here mainly when we invoke instrumental convergence, in hopes that the subagent can "act as though other subagents are also working torward its terminal goal" in a way agnostic to its terminal goal. Which is a very different role than the old "corrigibility via uncertainty" proposals.

Comment by johnswentworth on Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals · 2025-01-26T18:58:31.036Z · LW · GW

Note that the instrumental goal is importantly distinct from the subagent which pursues that instrumental goal. I think a big part of the insight in this post is to say "corrigibility is a property of instrumental goals, separate from the subagents which pursue those goals"; we can study the goals (i.e. problem factorization) rather than the subagents in order to understand corrigibility.

Comment by johnswentworth on Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals · 2025-01-26T18:55:55.370Z · LW · GW

I think this misunderstands the idea, mainly because it's framing things in terms of subagents rather than subgoals. Let me try to illustrate the picture in my head. (Of course at this stage it's just a hand-wavy mental picture, I don't expect to have the right formal operationalization yet.)

Imagine that the terminal goal is some optimization problem. Each instrumental goal is also an optimization problem, with a bunch of constraints operationalizing the things which must be done to avoid interfering with other subgoals. The instrumental convergence we're looking for here is mainly in those constraints; we hope to see that roughly the same constraints show up in many instrumental goals for many terminal goals. Insofar as we see convergence in the constraints, we can forget about the top-level goal, and expect that a (sub)agent which respects those constraints will "play well" in an environment with other (sub)agents trying to achieve other instrumental and/or terminal goals.

Then, addressing this part specifically:

I predict such an agent is relatively easy to make, and will convert the universe into batteries/black holes, computers, and robots. I fail to see why it would respect agents with other terminal goals.

... that would only happen insofar as converting the universe into batteries, computers and robots can typically be done without interfering with other subgoals, for a wide variety of terminal objectives. If it does interfere with other subgoals (for a wide variety of terminal objectives), then the constraints would say "don't do that".

And to be clear, maybe there would be some large-scale battery/computer/robot building! But it would be done in a way which doesn't step on the toes of other subplans, and makes the batteries/computers/robots readily available and easy to use for those other subplans.

Comment by johnswentworth on The Case Against AI Control Research · 2025-01-25T03:30:54.506Z · LW · GW

As in, the claim is that there is almost always a "schelling" mistake? Or is the argument mostly that scheming is largely unimportant because false answers will naturally perform much better than true answers in important cases such that considering the adversarial case doesn't make much difference?

The latter.

Comment by johnswentworth on The Case Against AI Control Research · 2025-01-24T17:36:29.170Z · LW · GW

To be clear, I am not claiming that this failure mode is very likely to be very hard to resolve. Just harder than "run it twice on the original question and a rephrasing/transformation of the question".

Comment by johnswentworth on The Case Against AI Control Research · 2025-01-24T16:33:29.824Z · LW · GW

But if you ask the AI for a justification of its reasoning, and run it twice on the original question and a rephrasing/transformation of the question, it seems way intuitively hard for it to come up with consistent explanations that lead to consistent shared wrong answers (e.g. I think Von Neumann would have substantial trouble with that).

Strong disagree with this. Probably not the most cruxy thing for us, but I'll briefly walk through the issue because it is a special case of a common cruxy issue with a lot of prosaic alignment work.

The reason this doesn't work is that the prototypical "blatant lie" doesn't look like "the model chooses a random number to output". The prototypical blatant lie is that there's a subtle natural mistake one could make in reasoning about the question, the model "knows" that it's a mistake, but the model just presents an argument with the subtle mistake in it.

Or, alternatively: the model "knows" what kind of argument would seem most natural to the user, and presents that kind of argument despite "knowing" that it systematically overlooks major problems.

Or, alternatively: human experts in the field generally believe a particular wrong answer. Maybe there's a few dissenters, and the model is clever enough to recognize that the dissenters are correct, but also the model was RLHF'd and got positive feedback for matching supposed-experts' answers even when those experts were wrong, so it outputs the answer and arguments the human experts in the field would generally give.

These sorts of pressures make it not only easy for the model to generate consistent shared wrong answers, but even (in some situations, like the incorrect human experts example) the default outcome for the model to generate consistent shared wrong answers, even when the model internally "knows" the right answer in some sense.

Comment by johnswentworth on The Case Against AI Control Research · 2025-01-23T20:24:29.406Z · LW · GW

The first three aren't addressable by any technical research or solution. Corporate leaders might be greedy, hubristic, and/or reckless. Or human organizations might not be nimble enough to effect development and deployment of the maximum safety we are technically capable of. No safety research portfolio addresses those risks. The other four are potential failures by us as a technical community that apply broadly. If too high a percentage of the people in our space are bad statisticians, can't think distributionally, are lazy or prideful, or don't understand causal reasoning well enough, that will doom all potential directions of AI safety research, not just AI control.

Technical research can have a huge impact on these things! When a domain is well-understood in general (think e.g. electrical engineering), it becomes far easier and cheaper for human organizations to successfully coordinate around the technical knowledge, for corporate leaders to use it, for bureaucracies to build regulations based on its models, for mid researchers to work in the domain without deceiving or confusing themselves, etc. But that all requires correct and deep technical understanding first.

Now, you are correct that a number of other AI safety subfields suffer from the same problems to some extent. But that's a different discussion for a different time.

Comment by johnswentworth on The Case Against AI Control Research · 2025-01-23T20:13:29.796Z · LW · GW

I think part of John's belief is more like "the current Control stuff won't transfer to the society of Von Neumanns."

That is also separately part of my belief, medium confidence, depends heavily on which specific thing we're talking about. Notably that's more a criticism of broad swaths of prosaic alignment work than of control specifically.

Comment by johnswentworth on The Case Against AI Control Research · 2025-01-23T20:09:43.646Z · LW · GW

Yeah, basically. Or at least unlikely that they're scheming enough or competently enough for it to be the main issue.

For instance, consider today's AIs. If we keep getting slop at roughly the current level, and scheming at roughly the current level, then slop is going to be the far bigger barrier to using these things to align superintelligence (or nearer-but-strong intelligence).

Comment by johnswentworth on The Case Against AI Control Research · 2025-01-23T19:47:58.849Z · LW · GW

No, more like a disjunction of possibilities along the lines of:

  • The critical AGIs come before huge numbers of von Neumann level AGIs.
  • At that level, really basic stuff like "just look at the chain of thought" turns out to still work well enough, so scheming isn't a hard enough problem to be a bottleneck.
  • Scheming turns out to not happen by default in a bunch of von Neumann level AGIs, or is at least not successful at equilibrium (e.g. because the AIs don't fully cooperate with each other).
  • "huge numbers of von Neumann level AGIs" and/or "scheming" turns out to be the wrong thing to picture in the first place, the future is Weirder than that in ways which make our intuitions about von Neumann society and/or scheming not transfer at all.

Pile together the probability mass on those sorts of things, and it seems far more probable than the prototypical scheming story.

Comment by johnswentworth on [Cross-post] Every Bay Area "Walled Compound" · 2025-01-23T17:19:37.572Z · LW · GW

Comment by johnswentworth on The Case Against AI Control Research · 2025-01-22T16:00:58.579Z · LW · GW

That's where most of the uncertainty is, I'm not sure how best to price it in (though my gut has priced in some estimate).

Comment by johnswentworth on AI Control: Improving Safety Despite Intentional Subversion · 2025-01-21T16:26:39.340Z · LW · GW

I think control research has relatively little impact on X-risk in general, and wrote up the case against here.

Basic argument: scheming of early transformative AGI is not a very large chunk of doom probability. The real problem is getting early AGI to actually solve the problems of aligning superintelligences, before building those superintelligences. That's a problem for which verification is hard, and solving the problem itself seems pretty hard too, so it's a particularly difficult type of problem to outsource to AI - and a particularly easy type of problem to trick oneself into thinking the AI has solved, when it hasn't.

Comment by johnswentworth on The Case Against AI Control Research · 2025-01-21T16:07:33.381Z · LW · GW

Addendum: Why Am I Writing This?

Because other people asked me to. I don’t particularly like getting in fights over the usefulness of other peoples’ research agendas; it’s stressful and distracting and a bunch of work, I never seem to learn anything actually useful from it, it gives me headaches, and nobody else ever seems to do anything useful as a result of such fights. But enough other people seemed to think this was important to write, so Maybe This Time Will Be Different.

Comment by johnswentworth on What's the Right Way to think about Information Theoretic quantities in Neural Networks? · 2025-01-19T17:37:05.876Z · LW · GW

The key question is: mutual information of what, with what, given what? The quantity is I(X;Y|Z); what are X, Y, and Z? Some examples:

  • Mutual information of the third and eighth token in the completion of a prompt by a certain LLM, given prompt and weights (but NOT the random numbers used to choose each token from the probability distribution output by the model for each token).
  • Mutual information of two patches of the image output by an image model, given prompt and weights (but NOT the random latents which get denoised to produce the image).
  • Mutual information of two particular activations inside an image net, given the prompt and weights (but NOT the random latents which get denoised).
  • Mutual information of the trained net's output on two different prompts, given the prompts and training data (but NOT the weights). Calculating this one would (in principle) involve averaging over values of the trained weights across many independent training runs.
  • Mutual information between the random latents and output image of an image model, given the prompt and weights. This is the thing which is trivial because the net is deterministic. All the others above are nontrivial.
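
To make these concrete, here is a minimal sketch (an illustration under my own assumptions, not something from the original thread) of a plug-in estimator for I(X;Y|Z) from repeated samples of discrete variables - e.g. X and Y could be the third and eighth completion tokens sampled many times for each prompt Z, as in the first bullet. The sampling loop that produces the "samples" list is hypothetical.

    # Sketch: plug-in estimate of conditional mutual information I(X;Y|Z), in bits,
    # from (x, y, z) samples of discrete variables (e.g. tokens given a prompt id).
    from collections import Counter
    from math import log2

    def conditional_mutual_information(samples):
        """samples: list of (x, y, z) triples with hashable entries."""
        n = len(samples)
        pz = Counter(z for _, _, z in samples)
        pxz = Counter((x, z) for x, _, z in samples)
        pyz = Counter((y, z) for _, y, z in samples)
        pxyz = Counter(samples)
        mi = 0.0
        for (x, y, z), c in pxyz.items():
            # p(x,y,z) * log2( p(x,y|z) / (p(x|z) p(y|z)) ), using empirical counts
            mi += (c / n) * log2((c * pz[z]) / (pxz[(x, z)] * pyz[(y, z)]))
        return mi

    # Hypothetical usage: for each prompt z, sample many completions and record tokens 3 and 8.
    # samples = [(tokens[3], tokens[8], prompt_id) for prompt_id, tokens in sampled_completions]
    # print(conditional_mutual_information(samples))
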
Comment by johnswentworth on What Is The Alignment Problem? · 2025-01-19T17:04:35.237Z · LW · GW

Wording that the way I'd normally think of it: roughly speaking, a human has well-defined values at any given time, and RL changes those values over time. Let's call that the change-over-time model.

One potentially confusing terminological point: the thing that the change-over-time model calls "values" is, I think, the thing I'd call "human's current estimate of their values" under my model. 

The main shortcoming I see in the change-over-time model is closely related to that terminological point. Under my model, there's this thing which represents the human's current estimate of their own values. And crucially, that thing is going to change over time in mostly the sort of way that beliefs change over time. (TBC, I'm not saying people's underlying values never change; they do. But I claim that most changes in practice are in estimates-of-values, not in actual values.) On the other hand, the change-over-time model is much more agnostic about how values change over time - it predicts that RL will change them somehow, but the specifics mostly depend on those reward signals. So these models make different predictions: my model makes a narrower prediction about how things change over time - e.g. I predict that "I thought I valued X but realized I didn't really" is the typical case, "I used to value X but now I don't" is an atypical case. Of course this is difficult to check because most people don't carefully track the distinction between those two (so just asking them probably won't measure what we want), but it's in-principle testable.

There are probably other substantive differences, but that's the one which jumps out to me right now.

Comment by johnswentworth on Alexander Gietelink Oldenziel's Shortform · 2025-01-19T01:02:35.854Z · LW · GW

Easiest test would be to zip some trained net params, and also zip some randomly initialized standard normals of the same shape as the net params (including e.g. parameter names if those are in the net params file), and see if they get about the same compression.
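
Concretely, that test could look something like the following (a minimal sketch under my own assumptions: zlib standing in for "zip", trained parameters held as a name-to-array dict, and a placeholder array so the sketch runs as-is):

    # Sketch: compare compressed size of trained net params vs. same-shape standard normals.
    import zlib
    import numpy as np

    def compressed_size(named_arrays):
        """Serialize {name: array} as names + float32 bytes and return zlib-compressed size."""
        raw = b"".join(name.encode() + np.asarray(a, dtype=np.float32).tobytes()
                       for name, a in named_arrays.items())
        return len(zlib.compress(raw, 9))

    # params = load_trained_params(...)   # hypothetical: {name: array} of actual trained weights
    params = {"w": np.zeros((256, 256))}  # placeholder; substitute the real trained params here
    rng = np.random.default_rng(0)
    controls = {name: rng.standard_normal(np.shape(a)) for name, a in params.items()}

    print("trained params :", compressed_size(params), "bytes")
    print("random normals :", compressed_size(controls), "bytes")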

Comment by johnswentworth on What Is The Alignment Problem? · 2025-01-17T17:27:49.086Z · LW · GW

There should be some part of your framework that’s hooked up to actual decision-making—some ingredient for which “I do things iff this ingredient has a high score” is tautological (cf. my “1.5.3 “We do things exactly when they’re positive-valence” should feel almost tautological”) . IIUC that’s “value function” in your framework. (Right?)

This part's useful, it does update me somewhat. In my own words: sometimes people do things which are obviously-even-in-the-moment not in line with their values. Therefore the thing hooked up directly to moment-to-moment decision making must not be values themselves, and can't be estimated-values either (since if it were estimated values, the divergence of behavior from values would not be obvious in the moment).

Previously my model was that it's the estimated values which are hooked up directly to decision making (though I incorrectly stated above that it was values), but I think I've now tentatively updated away from that. Thanks!

If your proposal is that some latent variable in the world-model gets a flag meaning “this latent variable is the Value Function”... then how does that flag wind up attached to that latent variable, rather than to some other latent variable? What if the world-model lacks any latent variable that looks like what the value function is supposed to look like?

I have lots of question marks around that part still. My current best guess is that the value-modeling part of the world model has a bunch of hardcoded structure to it (the details of which I don't yet know), so it's a lot less ontologically flexible than most of the human world model. That said, it would generally operate by estimating value-assignments to stuff in the underlying epistemic world-model, so it should at least be flexible enough to handle epistemic ontology shifts (even if the value-ontology on top is more rigid).

IMO the world model is in the cortex, the value function is in the striatum.

Minor flag which might turn out to matter: I don't think the value function is in the mind at all; the mind contains an estimate of the value function, but the "true" value function (insofar as it exists at all) is projected. (And to be clear, I don't mean anything unusually mysterious by this; it's just the ordinary way latent variables work. If a variable is latent, then the mind doesn't know the variable's "true" value at all, and the variable itself is projected in some sense, i.e. there may not be any "true" value of the variable out in the environment at all.)

Comment by johnswentworth on What Is The Alignment Problem? · 2025-01-16T21:57:44.414Z · LW · GW

Cool, I think we agree here more than I thought based on the comment at top of chain. I think the discussion in our other thread is now better aimed than this one.

Comment by johnswentworth on What Is The Alignment Problem? · 2025-01-16T21:45:13.019Z · LW · GW

One very important distinction which that description did not capture: in the value-RL-as-I-understand-it model, there's a ground-truth reward function, and there's a model of the reward function. A learning algorithm continually updates the model of the reward function in a way that tends to systematically make it a better approximation of the ground-truth reward function. But crucially, the model of the reward function is not itself the "value function". The "value function" is a latent variable/construct internal to the model of the reward function. That distinction matters, because it allows for the model of the reward function to include some stuff other than pure values - like e.g. the possibility that heroin "hacks" the reward function in a way which decouples it from "actual values".

As with your description, the value function is used for model-based planning at inference time, but because the value function is not updating over time to itself match the ground-truth reward function, the agent is not well-modeled as a reward optimizer (e.g. it might not wirehead).

Comment by johnswentworth on What Is The Alignment Problem? · 2025-01-16T19:11:47.964Z · LW · GW

“Is scratching your nose right now something you desire?” Yes. “Is scratching your nose right now something you value?” Not really, no.

I think I disagree with this example. And I definitely disagree with it in the relevant-to-the-broader-context sense that, under a value-aligned sovereign AI, if my nose is itchy then it should get scratched, all else equal. It may not be a very important value, but it's a value. (More generally, satisfying desires is itself a value, all else equal.)

I think “values”, as people use the term in everyday life, tends to be something more specific, where not only is the thing motivating, but it’s also motivating when you think about it in a self-reflective way. A.k.a. “X is motivating” AND “the-idea-of-myself-doing-X is motivating”. If I’m struggling to get out of bed, because I’m going to be late for work, then the feeling of my head remaining on the pillow is motivating, but the self-reflective idea of myself being in bed is demotivating. Consequently, I might describe the soft feeling of the pillow on my head as something I desire, but not something I value.

I do agree that that sort of reasoning is a pretty big central part of values. And it's especially central for cases where I'm trying to distinguish my "real" values from cases where my reward stream is "inaccurately" sending signals for some reason, like e.g. heroin.

Here's how I'd imagine that sort of thing showing up organically in a Value-RL-style system.

My brain is projecting all this value-structure out into the world; it's modelling reward signals as being downstream of "values" which are magically attached to physical stuff/phenomena/etc. Insofar as X has high value, the projection-machinery expects both that X itself will produce a reward, and that the-idea-of-myself-doing-X will produce a reward (... as well as various things of similar flavor, like e.g. "my friends will look favorably on me doing X, thereby generating a reward", "the idea of other people doing X will produce a reward", etc). If those things come apart, then the brain goes "hmm, something fishy is going on here, I'm not sure these rewards are generated by my True Values; maybe my reward stream is compromised somehow, or maybe my social scene is reinforcing things which aren't actually good for me, or...".

That's the sort of reasoning which should naturally show up in a Value RL style system capable of nontrivial model structure learning.

Comment by johnswentworth on What Is The Alignment Problem? · 2025-01-16T18:24:31.701Z · LW · GW

Seems false, though that specific experiment is not super cruxy. One central line of evidence: desires tend to be more myopic, values longer term or "deeper". That's exactly the sort of thing you'd expect if "desires" is mostly about immediate reward signals (or anticipated reward signals), whereas "values" is more about the (projected) upstream generators of those signals, potentially projected quite a ways back upstream.

(I suspect that you took my definition of "values" to be nearly synonymous with rewards or immediately anticipated rewards, which it very much is not; projected-upstream-generators-of-rewards are a quite different beast from rewards themselves, especially as we push further upstream.)

Comment by johnswentworth on What Is The Alignment Problem? · 2025-01-16T17:23:46.818Z · LW · GW

I could imagine an argument that the lack of ground truth makes multiagent aggregation disproportionately difficult, but that argument wouldn't run through different agents having different approximation-preferences. In general, I expect the sort of approximations relevant to this topic to be quite robustly universal, and specifically mostly information theoretic (think e.g. "distribution P approximates Q, quantified by a low KL divergence between the two").
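
(For reference: the KL divergence mentioned is D_KL(P || Q) = Σ_x P(x) log(P(x) / Q(x)); it is non-negative, zero exactly when P = Q, and depends only on the two distributions rather than on any agent-specific approximation preferences.)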

Comment by johnswentworth on What Is The Alignment Problem? · 2025-01-16T16:40:44.152Z · LW · GW

Meta note: strong upvoted, very good quality comment.

"aligned": how does the ontology translation between the representation of the "generated optimized-looking stuff" and the representation of human values look like?

Yup, IMO the biggest piece unaddressed in this post is what "aligned" means between two goals, potentially in different ontologies to some extent.

I think your model of humans is too simplistic. E.g. at the very least it's lacking a distinction like between "ego-syntonic" and "voluntary" as in this post, though I'd probably want a even significantly more detailed model. Also one might need different models for very smart and reflective people than for most people.

I think the model sketched in the post is at roughly the right level of detail to talk about human values specifically, while remaining agnostic to lots of other parts of how human cognition works.

We haven't described value extrapolation.

  • (Or from an alternative perspective, our model of humans doesn't identify their relevant metapreferences (which probably no human knows fully explicitly, and for some/many humans it they might not be really well defined).)

Yeah, my view on metapreferences is similar to my view on questions of how to combine the values of different humans: metapreferences are important, but their salience is way out of proportion to their importance. (... Though the disproportionality is much less severe for metapreferences than for interpersonal aggregation.)

Like, people notice that humans aren't always fully consistent, and think about what's the "right way" to resolve that, and one of the most immediate natural answers is "metapreferences!". And sometimes that is the right answer, but I view it as more of a last-line fallback for extreme cases. Most of time (I claim) the "right way" to resolve the inconsistency is to notice that people are frequently and egregiously wrong in their estimates of their own values (as evidenced by experiences like "I thought I wanted X, but in hindsight I didn't"), most of the perceived inconsistency comes from the estimates being wrong, and then the right question to focus on is instead "What does it even mean to be wrong about our own values? What's the ground truth?".

Comment by johnswentworth on What Is The Alignment Problem? · 2025-01-16T16:26:09.322Z · LW · GW

It is entirely plausible that the right unit of analysis is cultures/societies/humanity-as-a-whole rather than an individual. Exactly the same kinds of efficiency arguments then apply at the societal level, i.e. the society has to be doing something besides just brute-force trying heuristics and spreading tricks that work, in order to account for the efficiency we actually see.

The specific proposal has problems mostly orthogonal to the "agency of humans vs societies" question.

Comment by johnswentworth on johnswentworth's Shortform · 2025-01-13T20:17:51.019Z · LW · GW

True, but Buck's claim is still relevant as a counterargument to my claim about memetic fitness of the scheming story relative to all these other stories.