I disagree pretty strongly with the headline advice here. An ideal response would be to go through a typical sample of stories from some news source - for instance, I keep MIT Tech Review in my feedly because it has surprisingly useful predictive alpha if you invert all of its predictions. But that would take way more effort than I'm realistically going to invest right now, so absent that, I'll just abstractly lay out where I think this post's argument goes wrong.
The main thing most news tells me is what people are talking about, and what people are saying about it. Sadly, "what people are talking about" has very little correlation with what's important, and "what people are saying about it" is overwhelmingly noise, even when true (which it often isn't). In simulacrum terms, news is overwhelmingly a simulacrum 3 thing, and tells me very little about the underlying reality I'm actually interested in.
Sure, maybe there's some useful stuff buried in the pile of junk, but why sift through it? I do not need to know a few days or weeks earlier that the AISI is being gutted. I do not need to know a few weeks earlier that the slowdown OpenAI found in pretraining scaling had been formally reported-upon. (Also David and I had already noticed the signs of OpenAI having noticed that slowdown back in May of 2024, though even if we hadn't suspected until it was reported-upon in November, I still wouldn't need to know about it a few weeks earlier.) Just waiting for Zvi to put it in his newsletter is more than enough.
I have no idea. It's entirely plausible that one of us wrote the Claude bit in there months ago and then forgot about it.
The joke is that Claude somehow got activated on the editor, and added a line thanking itself for editing despite us not wanting it to edit anything and (as far as we've noticed) not editing anything else besides that one line.
Nope, we were in Overleaf.
... but also that's useful info, thanks.
Working on a paper with David, and our acknowledgments section includes a thank-you to Claude for editing. Neither David nor I remembers putting that acknowledgement there, and in fact we hadn't intended to use Claude for editing the paper at all, nor noticed it editing anything at all.
That's basically Do What I Mean.
None that I know of; it's a topic ripe for exploration.
Bell Labs is actually my go-to example of a much-hyped research institution whose work was mostly not counterfactual; see e.g. here. Shannon's information theory is the only major example I know of highly counterfactual research at Bell Labs. Most of the other commonly-cited advances, like e.g. transistors or communication satellites or cell phones, were clearly not highly counterfactual when we look at the relevant history: there were other groups racing to make the transistor, and the communication satellite and cell phones were both old ideas waiting on the underlying technology to make them practical.
That said, Hamming did sit right next to Shannon during the information theory days IIRC, so his words do carry substantial weight here.
Good idea, but... I would guess that basically everyone who knew me growing up would say that I'm exactly the right sort of person for that strategy. And yet, in practice, I still find it has not worked very well. My attention has in fact been unhelpfully steered by local memetic currents to a very large degree.
For instance, I do love proving everyone else wrong, but alas reversed stupidity is not intelligence. People mostly don't argue against the high-counterfactuality important things, they ignore the high-counterfactuality important things. Trying to prove them wrong about the things they do argue about is just another way of having one's attention steered by the prevailing memetic currents.
The claim is not that either "solution" is sufficient for counterfactuality, it's that either solution can overcome the main bottleneck to counterfactuality. After that, per Amdahl's Law, there will still be other (weaker) bottlenecks to overcome, including e.g. keeping oneself focused on something important.
Hypothesis: for smart people with a strong technical background, the main cognitive barrier to doing highly counterfactual technical work is that our brains' attention is mostly steered by our social circle. Our thoughts are constantly drawn to think about whatever the people around us talk about. And the things which are memetically fit are (almost by definition) rarely very counterfactual to pay attention to, precisely because lots of other people are also paying attention to them.
Two natural solutions to this problem:
- build a social circle which can maintain its own attention, as a group, without just reflecting the memetic currents of the world around it.
- "go off into the woods", i.e. socially isolate oneself almost entirely for an extended period of time, so that there just isn't any social signal to be distracted by.
These are both standard things which people point to as things-historically-correlated-with-highly-counterfactual-work. They're not mutually exclusive, but this model does suggest that they can substitute for each other - i.e. "going off into the woods" can substitute for a social circle with its own useful memetic environment, and vice versa.
I think you should address Thane's concrete example:
For example: "fully-connected neural networks → transformers" definitely wasn't continuous.
That seems to me a pretty damn solid knock-down counterargument. There were no continuous language model scaling laws before the transformer architecture, and not for lack of people trying to make language nets.
If that AI produces slop, it should be pretty explicitly aware that it's producing slop.
This part seems false.
As a concrete example, consider a very strong base LLM. By assumption, there exists some prompt such that the LLM will output basically the same alignment research you would. But with some other prompt, it produces slop, because it accurately predicts what lots of not-very-competent humans would produce. And when producing the sort of slop which not-very-competent humans produce, there's no particular reason for it to explicitly think about what a more competent human would produce. There's no particular reason for it to explicitly think "hmm, there probably exist more competent humans who would produce different text than this". It's just thinking about what token would come next, emulating the thinking of low-competence humans, without particularly thinking about more-competent humans at all.
How many of these failure modes still happen when there is an AI at least as smart as you, that is aware of these failure modes and actively trying to prevent them?
All of these failure modes apply when the AI is at least as smart as you and "aware of these failure modes" in some sense. It's the "actively trying to prevent them" part which is key. Why would the AI actively try to prevent them? Would actively trying to prevent them give lower perplexity or higher reward or a more compressible policy? Answer: no, trying to prevent them would not give lower perplexity or higher reward or a more compressible policy.
The alignment problem does not assume AI needs to be kept in check, it is not focused on control, and adaptation and learning in synergy are entirely compatible with everything said in this post. At a meta level, I would recommend actually reading rather than dropping GPT-2-level comments which clearly do not at all engage with what the post is talking about.
I do agree that the end of the last glacial period was the obvious immediate trigger for agriculture. But the "humans are the stupidest thing which could take off model" still holds, because evolution largely operates on a slower timescale than the glacial cycle.
Specifics: the last glacial period ran from roughly 115k years ago to 12k years ago. Whereas, if you look at a timeline of human evolution, most of the evolution from apes to humans happens on a timescale of 100k - 10M years. So it's really only the very last little bit where an ice age was blocking takeoff. In particular, if human intelligence has been at evolutionary equilibrium for some time, then we should wonder why humanity didn't take off 115k years ago, before the last ice age.
Noting for the sake of later evaluation: this rough picture matches my current median expectations. Not very high confidence; I'd give it roughly 60%.
The evidence I have mentally cached is brain size. The evolutionary trajectory of brain size is relatively easy to measure just by looking at skulls from archaeological sites, and IIRC it has increased steadily through human evolutionary history and does not seem to be in evolutionary equilibrium.
(Also on priors, even before any evidence, we should strongly expect humans to not be in evolutionary equilibrium. As the saying goes, "humans are the stupidest thing which could take off, otherwise we would have taken off sooner". I.e. since the timescale of our takeoff is much faster than evolution, the only way we could be at equilibrium is if a maximal-under-constraints intelligence level just happened to be exactly enough for humans to take off.)
There's probably other kinds of evidence as well; this isn't a topic I've studied much.
IIUC human intelligence is not in evolutionary equilibrium; it's been increasing pretty rapidly (by the standards of biological evolution) over the course of humanity's development, right up to "recent" evolutionary history. So difficulty-of-improving-on-a-system-already-optimized-by-evolution isn't that big of a barrier here, and we should expect to see plenty of beneficial variants which have not yet reached fixation just by virtue of evolution not having had enough time yet.
(Of course separate from that, there are also the usual loopholes to evolutionary optimality which you listed - e.g. mutation load or variants with tradeoffs in the ancestral environment. But on my current understanding those are a minority of the available gains from human genetic intelligence enhancement.)
Alternative model: French mathematicians don't overperform in an objective sense. Rather, French mathematicians happened to end up disproportionately setting fashion trends in pure mathematics for a while, for reasons which are mostly just signalling games and academic politics rather than mathematical merit.
The Bourbaki spring to mind here as a central example.
That's a much more useful answer, actually. So let's bring it back to Eliezer's original question:
Can you tl;dr how you go from "humans cannot tell which alignment arguments are good or bad" to "we justifiably trust the AI to report honest good alignment takes"? Like, not with a very large diagram full of complicated parts such that it's hard to spot where you've messed up. Just whatever simple principle you think lets you bypass GIGO.
[...]
Broadly speaking, the standard ML paradigm lets you bootstrap somewhat from "I can verify whether this problem was solved" to "I can train a generator to solve this problem".
So to summarize your short, simple answer to Eliezer's question: you want to "train AI agents that are [somewhat] smarter than ourselves with ground truth reward signals from synthetically generated tasks created from internet data + a bit of fine-tuning with scalable oversight at the end". And then you hope/expect/(??have arguments or evidence??) that this allows us to (?justifiably?) trust the AI to report honest good alignment takes sufficient to put shortly-posthuman AIs inside the basin of attraction of a good eventual outcome, despite (as Eliezer puts it) humans being unable to tell which alignment takes are good or bad.
Or, to compact the summary even further: you want to train the somewhat-smarter-than-human AI on easily-verifiable synthetically-generated tasks, and then hope/expect that its good performance on those tasks generalizes to a problem which is not easily verifiable or synthetically generated, namely the problem of checking that a next generation of AI is in the basin of attraction of a good-to-humans outcome.
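To make sure we're summarizing the same proposal, here's a minimal sketch of the kind of training loop I understand you to be describing; the task generator, verifier, and `model.generate`/`model.update` interfaces are hypothetical stand-ins, not anyone's actual pipeline:

```python
import random

def make_synthetic_task():
    # Hypothetical easily-verifiable synthetic task: the "ground truth
    # reward signal" comes from a programmatic check, not human judgment.
    a, b = random.randint(10, 99), random.randint(10, 99)
    prompt = f"What is {a} * {b}?"
    verify = lambda answer: answer.strip() == str(a * b)
    return prompt, verify

def training_step(model):
    prompt, verify = make_synthetic_task()
    answer = model.generate(prompt)          # hypothetical model interface
    reward = 1.0 if verify(answer) else 0.0
    model.update(prompt, answer, reward)     # hypothetical RL-style update
```

The question below is whether good performance under a verifier like that gives any justified confidence on tasks where no such verifier exists.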
(Note: I know you've avoided talking about the basin of attraction of a good-to-humans outcome, instead focused on just some short-term goal like e.g. not being killed by the very next generation of AI. Not focusing on the basin of attraction is a mistake, and we can go into why it's a mistake if that turns out to be cruxy.)
In Eliezer's comment, he was imagining a training setup somewhat different from easily-verifiable synthetically-generated tasks:
Assume that whenever OpenPhil tries to run an essay contest for saying what they're getting wrong, their panel of judges ends up awarding the prize to somebody reassuringly saying that AI risk is an even smaller deal than OpenPhil thinks. How does OpenPhil bootstrap from that pattern of thumbs-up/thumbs-down to an AI that actually has better-than-OpenPhil alignment takes?
... but the analogue of the problem Eliezer was highlighting, in the context of training on easily-verifiable synthetically-generated tasks, is the question: how and why would we justifiably trust that an AI trained on easily-verifiable synthetic tasks generalizes to not-easily-verifiable real-world tasks?
The 3 month Eliezer sim might spin up many copies of other 3 month Eliezer sims, which together produce outputs that a 6-month Eliezer sim might produce.
This seems very blatantly not viable-in-general, in both theory and practice.
On the theory side: there are plenty of computations which cannot be significantly accelerated via parallelism with less-than-exponential resources. (If we do have exponential resources, then all binary circuits can be reduced to depth 2, but in the real world we do not and will not have exponential resources.) Serial computation requirements are, in fact, a thing. So you can't just have a bunch of Eliezers do in 3 months what a single 6 month Eliezer could do, in general.
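As a toy illustration of the serial-computation point (not load-bearing for the argument), iterated hashing is a standard example of a computation with no known useful parallel speedup:

```python
import hashlib

def iterated_hash(seed: bytes, n_steps: int) -> bytes:
    # Each step consumes the previous step's output, so the n_steps
    # applications must happen one after another; adding more workers
    # doesn't help, since nobody can start step k+1 before step k is done.
    x = seed
    for _ in range(n_steps):
        x = hashlib.sha256(x).digest()
    return x

print(iterated_hash(b"seed", 1_000_000).hex())
```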
Even if you allow the 3-month Eliezer sims to act one-after-another, rather than trying to get everything done in 3 months simulated time via pure parallelism, there's still a tight communication bottleneck between each Eliezer sim. There are presumably plenty of circuits which cannot be implemented with tight information bottlenecks every n serial steps.
... of course you could circumvent all that theory by e.g. just having each Eliezer emulate a single logic gate, or some comparably trivial thing, but at that point you run afoul of the non-compositionality of safety properties: putting together a bunch of "safe" things (or "interpretable" things, or "aligned" things, or "corrigible" things, ...) does not in-general produce a safe thing.
So that's the theory side. What about the practice side?
Well, Ought did roughly that experiment years ago and it did not work at all. And that should not be surprising - as the link argues, we have extremely ample evidence from day-to-day life that such things do not work in general.
I think that's basically right, and good job explaining it clearly and compactly.
I would also highlight that it's not just about adversaries. One of the main powers of proof-given-assumptions is that it allows one to rule out large classes of unknown unknowns in one go. And, insofar as the things-proven-given-assumptions turn out to be false, it allows one to detect previously-unknown unknowns.
Now to get a little more harsh...
Without necessarily accusing Kaj specifically, this general type of argument feels motivated to me. It feels like willful ignorance, like sticking one's head in the sand and ignoring the available information, because one wants to believe that All Research is Valuable or that one's own research is valuable or some such, rather than facing the harsh truth that much research (possibly one's own) is predictably-in-advance worthless.
That sort of reasoning makes sense insofar as it's hard to predict which small pieces will be useful. And while that is hard to some extent, it is not full we-just-have-no-idea-so-use-a-maxent-prior hard. There is plenty of work (including lots of research which people sink their lives into today) which will predictably-in-advance be worthless. And robust generalizability is the main test I know of for that purpose.
Applying this to your own argument:
Often when I've had a hypothesis about something that interests me, I've been happy that there has been *so much* scientific research done on various topics, many of them seemingly insignificant. While most of it is of little interest to me, the fact that there's so much of it means that there's often some prior work on topics that do interest me.
It will predictably and systematically be the robustly generalizable things which are relevant to other people in unexpected ways.
Yup, if you actually have enough knowledge to narrow it down to e.g. a 65% chance of one particular major route, then you're good. The challenging case is when you have no idea what the options even are for the major route, and the possibility space is huge.
Yeah ok. Seems very unlikely to actually happen, and unsure whether it would even work in principle (as e.g. scaling might not take you there at all, or might become more resource intensive faster than the AIs can produce more resources). But I buy that someone could try to intentionally push today's methods (both AI and alignment) to far superintelligence and simply turn down any opportunity to change paradigm.
Aliens kill you due to slop, humans depend on the details.
The basic issue here is that the problem of slop (i.e. outputs which look fine upon shallow review but aren't fine) plus the problem of aligning a parent-AI in such a way that its more-powerful descendants will robustly remain aligned, is already the core of the superintelligence alignment problem. You need to handle those problems in order to safely do the handoff, and at that point the core hard problems are done anyway. Same still applies to aliens: in order to safely do the handoff, you need to handle the "slop/nonslop is hard to verify" problem, and you need to handle the "make sure agents the aliens build will also be aligned, and their children, etc" problem.
It's not clear to me we'll have (or will "need") new paradigms before fully handing over all technical and strategic work to AIs which are capable enough to obsolete humans at all cognitive tasks.
If you want to not die to slop, then "fully handing over all technical and strategic work to AIs which are capable enough to obsolete humans at all cognitive tasks" is not a thing which happens at all until the full superintelligence alignment problem is solved. That is how you die to slop.
If by superintelligence, you mean wildly superhuman AI, it remains non-obvious to me that new paradigms are needed (though I agree they will pretty likely arise prior to this point due to AIs doing vast quantity of research if nothing else). I think thoughtful and laborious implementation of current paradigm strategies (including substantial experimentation) could directly reduce risk from handing off to superintelligence down to perhaps 25% and I could imagine being argued considerably lower.
I find it hard to imagine such a thing being at all plausible. Are you imagining that jupiter brains will be running neural nets? That their internal calculations will all be differentiable? That they'll be using strings of human natural language internally? I'm having trouble coming up with any "alignment" technique of today which would plausibly generalize to far superintelligence. What are you picturing?
This post seems to assume that research fields have big hard central problems that are solved with some specific technique or paradigm.
This isn't always true. [...]
I would say it is basically-always true, but there are some fields (including deep learning today, for purposes of your comment) where the big hard central problems have already been solved, and therefore the many small pieces of progress on subproblems are all of what remains.
And insofar as there remains some problem which is simply not solvable within a certain paradigm, that is a "big hard central problem", and progress on the smaller subproblems of the current paradigm is unlikely by-default to generalize to whatever new paradigm solves that big hard central problem.
I agree that paradigm shifts can invalidate large amounts or prior work, but it isn't obvious whether this will occur in AI safety.
I claim it is extremely obvious and very overdetermined that this will occur in AI safety sometime between now and superintelligence. The question which you'd probably find more cruxy is not whether, but when - in particular, does it come before or after AI takes over most of the research?
... but (I claim) that shouldn't be the cruxy question, because we should not be imagining completely handing off the entire alignment-of-superintelligence problem to early transformative AI; that's a recipe for slop. We ourselves need to understand a lot about how things will generalize beyond the current paradigm, in order to recognize when that early transformative AI is itself producing research which will generalize beyond the current paradigm, in the process of figuring out how to align superintelligence. If an AI assistant produces alignment research which looks good to a human user, but won't generalize across the paradigm shifts between here and superintelligence, then that's a very plausible way for us to die.
Most importantly, current proposed technical plans are necessary but not sufficient to stop this. Even if the technical side fully succeeds no one knows what to do with that.
I don't think that's quite accurate. In particular, gradual disempowerment is exactly the sort of thing which corrigibility would solve. (At least for "corrigibility" in the sense David and I use the term, and probably Yudkowsky, but not Christiano's sense; he uses the term to mean a very different thing.)
A general-purpose corrigible AI (in the sense we use the term) is pretty accurately thought-of as an extension of the user. Building and using such an AI is much more like "uplifting" the user than like building an independent agent. It's the cognitive equivalent of gaining prosthetic legs, as opposed to having someone carry you around on a sedan. Another way to state it: a corrigible subsystem acts like it's a part of a larger agent, serving a particular purpose as a component of the larger agent, as opposed to acting like an agent in its own right.
... admittedly corrigibility is still very much in the "conceptual" stage, far from an actual technical plan. But it's at least a technical research direction which would pretty directly address the disempowerment problem.
I think that accurately summarizes the concrete things, but misses a mood.
The missing mood is less about "grappling with the enormity of it", and more about "grappling with the effort of it" or "accepting the need to struggle for real". Like, in terms of time spent, we're talking maybe a year or two of full-time effort, spread out across maybe 3-5 years. That's far more than the post grapples with, but not prohibitively enormous; it's comparable to getting a degree. The missing mood is more about "yup, it's gonna be hard, and I'm gonna have to buckle down and do the hard thing for reals, not try to avoid it". The technical knowledge is part of that - like, "yup, I'm gonna have to actually for-reals learn some gnarly technical stuff, not try to avoid it". But not just the technical study. For instance, actually for-reals noticing and admitting when I'm avoiding unpleasant truths, or when my plans won't work and I need to change tack, has a similar feel: "yup, my plans are actually for-reals trash, I need to actually for-reals update, and I don't yet have any idea what to do instead, and it looks hard".
I don't claim to know all the key pieces of whatever they did, but some pieces which are obvious from talking to them both and following their writings:
- Both invested heavily in self-studying a bunch of technical material.
- Both heavily practiced the sorts of things Joe described in his recent "fake thinking and real thinking" post. For instance, they both have a trained habit of noticing the places where most people would gloss over things (especially technical/mathematical), and instead looking for a concrete example.
- Both heavily practiced the standard metacognitive moves, e.g. noticing when a mental move worked very well/poorly and reinforcing accordingly, asking "how could I have noticed that faster", etc.
- Both invested in understanding their emotional barriers and figuring out sustainable ways to handle them. (I personally think both of them have important shortcomings in that department, but they've at least put some effort into it and are IMO doing better than they would otherwise.)
- Both have a general mindset of actually trying to notice their own shortcomings and improve, rather than make excuses or hide.
- And finally: both put a pretty ridiculous amount of effort into improving, in terms of raw time and energy, and in terms of emotional investment, and in terms of "actually doing the thing for real not just talking or playing".
I feel like the "this will just take so much time" section doesn't really engage with the full-strength version of the critique.
When I think of people I know who have successfully gone from unremarkable to reasonably impressive via some kind of deliberate practice and training, the list consists of Nate Soares and Alex Turner. That's it; that's the entire list. Notably, they both followed a pretty similar path to get there (not by accident). And that path is not short.
Sure, you can do a lightweight version of "feedbackloop rationality" which just involves occasional short reviews or whatever. But that does not achieve most of the value.
My pitch to someone concerned about the timesink would instead be roughly: "Look, you know deep down that you are incompetent and don't know what you're doing (in the spirit of Impostor Syndrome and the Great LARP). You're clinging to these stories about how the thing you're currently doing happens to be useful somehow, but some part of you knows perfectly well that you're doing your current work mainly just because you can do your current work without facing the scary fact of your own ineptitude when faced with any actually useful (and difficult) problem. The only way you will ever do something of real value is if you level the fuck up. Is that going to be a big timesink and take a lot of effort? Yes. But guess what? There isn't actually a trade-off here. Your current work is not useful, you are not going to do anything useful without getting stronger, so how about you stop hiding from your own ineptitude and just go get stronger already?".
(And yes, that is the sort of pitch I sometimes make to myself when considering whether to dump time and effort into some form of deliberate practice.)
If you already accept the concept of expected utility maximization, then you could also use mixed strategies to get the convexity-like assumption (but that is not useful if the point is to motivate using probabilities and expected utility maximization).
That is indeed what I had in mind when I said we'd need another couple sentences to argue that the agent maximizes expected utility under the distribution. It is less circular than it might seem at first glance, because two importantly different kinds of probabilities are involved: uncertainty over the environment (which is what we're deriving), and uncertainty over the agent's own actions arising from mixed strategies.
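For concreteness, the extra couple of sentences would look roughly like this (using only the agent's own randomization, not the environment-uncertainty being derived):

```latex
% Sketch: the agent's own mixed strategies supply the convexity.
% Let u(s) \in \mathbb{R}^n be the payoff vector of strategy s, with one
% coordinate per possible environment. Mixing s_1 and s_2 with weight
% \alpha gives, in expectation over the agent's own randomization,
\[
  u\big(\alpha s_1 + (1-\alpha)\, s_2\big)
    = \alpha\, u(s_1) + (1-\alpha)\, u(s_2),
\]
% so the set of achievable expected payoff vectors is convex.
```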
Big crux here: I don't actually expect useful research to occur as a result of my control-critique post. Even having updated on the discussion remaining more civil than I expected, I still expect basically-zero people to do anything useful as a result.
As a comparison: I wrote a couple posts on my AI model delta with Yudkowsky and with Christiano. For each of them, I can imagine changing ~one big piece in my model, and end up with a model which looks basically like theirs.
By contrast, when I read the stuff written on the control agenda... it feels like there is no model there at all. (Directionally-correct but probably not quite accurate description:) it feels like whoever's writing, or whoever would buy the control agenda, is just kinda pattern-matching natural language strings without tracking the underlying concepts those strings are supposed to represent. (Joe's recent post on "fake vs real thinking" feels like it's pointing at the right thing here; the posts on control feel strongly like "fake" thinking.) And that's not a problem which gets fixed by engaging at the object level; that type of cognition will mostly not produce useful work, so getting useful work out of such people would require getting them to think in entirely different ways.
... so mostly I've tried to argue at a different level, like e.g. in the Why Not Just... posts. The goal there isn't really to engage the sort of people who would otherwise buy the control agenda, but rather to communicate the underlying problems to the sort of people who would already instinctively feel something is off about the control agenda, and give them more useful frames to work with. Because those are the people who might have any hope of doing something useful, without the whole structure of their cognition needing to change first.
Even if all of those are true, the argument in the post would still imply that control research (at least of the sort people do today) cannot have very high expected value. Like, sure, let's assume for sake of discussion that most total AI safety research will be done by early transformative AI, that the only chance of aligning superintelligent AIs is to delegate, that control research is unusually tractable, and that for some reason we're going to use the AIs to pursue formal verification (not a good idea, but whatever).
Even if we assume all that, we still have the problem that control research of the sort people do today does basically-nothing to address slop; it is basically-exclusively focused on intentional scheming. Insofar as intentional scheming is not the main thing which makes outsourcing to early AIs fail, all that control research cannot have very high expected value. None of your bullet points address that core argument at all.
Just because the number of almost-orthogonal vectors in n dimensions scales exponentially with n, doesn't mean one can choose all those signals independently. We can still only choose n real-valued signals at a time (assuming away the sort of tricks by which one encodes two real numbers in a single real number, which seems unlikely to happen naturally in the body). So "more intended behaviors than input-vector components" just isn't an option, unless you're exploiting some kind of low-information-density in the desired behaviors (like e.g. very "sparse activation" of the desired behaviors, or discreteness of the desired behaviors to a limited extent).
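A quick numerical illustration of the distinction (random vectors standing in for signal directions; nothing here is specific to biology):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 500, 2000   # m candidate signal directions in an n-dimensional space, m >> n

# Random unit vectors in R^n are pairwise nearly orthogonal...
V = rng.standard_normal((m, n))
V /= np.linalg.norm(V, axis=1, keepdims=True)
gram = V @ V.T
off_diagonal = np.abs(gram[~np.eye(m, dtype=bool)])
print("typical |cos angle| between distinct directions:", np.median(off_diagonal))

# ...but the collection spans at most an n-dimensional space, so only n
# real-valued signal components can be chosen independently at any one time.
print("rank of the collection:", np.linalg.matrix_rank(V))  # == n
```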
TBC, I don't particularly expect hard constraints to show up, that was more a way of illustrating the underlying concept. The same underlying concept in the market-style picture would be: across many different top-level goals, there are convergent ways of carving up "property rights". So, a system can be generally corrigible by "respecting the convergent property rights", so to speak.
40 standard deviations away from natural human IQ would have an IQ of 600
Nitpick: 700.
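(Arithmetic behind the nitpick, assuming the usual convention of mean 100 and standard deviation 15:)

```latex
\[
  100 + 40 \times 15 = 700 .
\]
```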
Yup, exactly, and good job explaining it too.
Yup, I'm familiar with that one. The big difference is that I'm backward-chaining, whereas that post forward-chains; the hope of backward-chaining would be to identify big things which aren't on people's radar as nootropics (yet).
(Relatedly: if one is following this sort of path, step 1 should be a broad nutrition panel and supplementing anything in short supply, before we get to anything fancier.)
Here's a side project David and I have been looking into, which others might have useful input on...
Background: Thyroid & Cortisol Systems
As I understand it, thyroid hormone levels are approximately-but-accurately described as the body's knob for adjusting "overall metabolic rate" or the subjective feeling of needing to burn energy. Turn up the thyroid knob, and people feel like they need to move around, bounce their leg, talk fast, etc (at least until all the available energy sources are burned off and they crash). Turn down the thyroid knob, and people are lethargic.
That sounds like the sort of knob which should probably typically be set higher, today, than was optimal in the ancestral environment. Not cranked up to 11; hyperthyroid disorders are in fact dangerous and unpleasant. But at least set to the upper end of the healthy range, rather than the lower end.
... and that's nontrivial. You can just dump the relevant hormones (T3/T4) into your body, but there's a control system which tries to hold the level constant. Over the course of months, the thyroid gland (which normally produces T4) will atrophy, as the control system shrinks it in an attempt to keep T4 levels fixed. Just continuing to pump T3/T4 into your system regularly will keep you healthy - you'll basically have a hypothyroid disorder, and supplemental T3/T4 is the standard treatment. But you had better be ready to manually control your thyroid hormone levels indefinitely if you start down this path. Ideally, one would intervene further up the control loop in order to adjust the thyroid hormone set-point, but that's more of a research topic than a thing humans already have lots of experience with.
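To illustrate the control-loop point, here's a made-up toy simulation (arbitrary units and parameters, not calibrated physiology) of why exogenous T4 mostly shrinks the gland rather than raising the long-run level:

```python
def simulate_thyroid(months=36, exogenous_t4=0.0):
    """Toy negative-feedback loop: the pituitary raises TSH when T4 is
    below the set-point, gland capacity slowly tracks TSH, and circulating
    T4 is endogenous production (= capacity here) plus any exogenous dose.
    All numbers are arbitrary illustration, not physiology."""
    set_point, capacity = 1.0, 1.0
    for _ in range(months):
        t4 = capacity + exogenous_t4
        tsh = max(0.0, 1.0 + 5.0 * (set_point - t4))   # error signal
        capacity += 0.1 * (tsh - capacity)             # gland slowly adapts
    return capacity + exogenous_t4, capacity

print(simulate_thyroid(exogenous_t4=0.0))  # ~(1.0, 1.0): baseline
print(simulate_thyroid(exogenous_t4=1.0))  # T4 barely rises; gland capacity has atrophied
```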
So that's thyroid. We can tell a similar story about cortisol.
As I understand it, the cortisol hormone system is approximately-but-accurately described as the body's knob for adjusting/tracking stress. That sounds like the sort of knob which should probably be set lower, today, than was optimal in the ancestral environment. Not all the way down; problems would kick in. But at least set to the lower end of the healthy range.
... and that's nontrivial, because there's a control loop in place, etc. Ideally we'd intervene on the relatively-upstream parts of the control loop in order to change the set point.
We'd like to generalize this sort of reasoning, and ask: what are all the knobs of this sort which we might want to adjust relative to their ancestral environment settings?
Generalization
We're looking for signals which are widely broadcast throughout the body, and received by many endpoints. Why look for that type of thing? Because the wide usage puts pressure on the signal to "represent one consistent thing". It's not an accident that there are individual hormonal signals which are approximately-but-accurately described by the human-intuitive phrases "overall metabolic rate" or "stress". It's not an accident that those hormones' signals are not hopelessly polysemantic. If we look for widely-broadcast signals, then we have positive reason to expect that they'll be straightforwardly interpretable, and therefore the sort of thing we can look at and (sometimes) intuitively say "I want to turn that up/down".
Furthermore, since these signals are widely broadcast, they're the sort of thing which impacts lots of stuff (and is therefore impactful to intervene upon). And they're relatively easy to measure, compared to "local" signals.
The "wide broadcast" criterion helps focus our search a lot. For instance, insofar as we're looking for chemical signals throughout the whole body, we probably want species in the bloodstream; that's the main way a concentration could be "broadcast" throughout the body, rather than being a local signal. So, basically endocrine hormones.
Casting a slightly wider net, we might also be interested in:
- Signals widely broadcast through the body by the nervous system.
- Chemical signals widely broadcast through the brain specifically (since that's a particularly interesting/relevant organ).
- Non-chemical signals widely broadcast through the brain specifically.
... and of course for all of these there will be some control system, so each has its own tricky question about how to adjust it.
Some Promising Leads, Some Dead Ends
With some coaxing, we got a pretty solid-sounding list of endocrine hormones out of the LLMs. There were some obvious ones on the list, including thyroid and cortisol systems, sex hormones, and pregnancy/menstruation signals. There were also a lot of signals for homeostasis of things we don't particularly want to adjust: salt balance, calcium, digestion, blood pressure, etc. There were several inflammation and healing signals, which we're interested in but haven't dug into yet. And then there were some cool ones: oxytocin (think mother-child bonding), endocannabinoids (think pot), satiety signals (think Ozempic). None of those really jumped out as clear places to turn a knob in a certain direction, other than obvious things like "take Ozempic if you are even slightly overweight" and the two we already knew about (thyroid and cortisol).
Then there were neuromodulators. Here's the list we coaxed from the LLMs:
- Dopamine: Tracks expected value/reward - how good things are compared to expectations.
- Norepinephrine: Sets arousal/alertness level - how much attention and energy to devote to the current situation.
- Serotonin: Regulates resource availability mindset - whether to act like resources are plentiful or scarce. Affects patience, time preference, and risk tolerance.
- Acetylcholine: Controls signal-to-noise ratio in neural circuits - acts like a gain/precision parameter, determining whether to amplify precise differences (high ACh) or blur things together (low ACh).
- Histamine: Manages the sleep/wake switch - promotes wakefulness and suppresses sleep when active.
- Orexin: Acts as a stability parameter for brain states - increases the depth of attractor basins and raises transition barriers between states. Higher orexin = stronger attractors = harder to switch states.
Of those, serotonin immediately jumps out as a knob you'd probably want to turn to the "plentiful resources" end of the healthy spectrum, compared to the ancestral environment. That puts the widespread popularity of SSRIs in an interesting light!
Moving away from chemical signals, brain waves (alpha waves, theta oscillations, etc) are another potential category - they're oscillations at particular frequencies which (supposedly) are widely synced across large regions of the brain. I read up just a little, and so far have no idea how interesting they are as signals or targets.
Shifting gears, the biggest dead end so far has been parasympathetic tone, i.e. overall activation level of the parasympathetic nervous system. As far as I can tell, parasympathetic tone is basically Not A Thing: there are several different ways to measure it, and the different measurements have little correlation. It's probably more accurate to think of parasympathetic nervous activity as localized, without much meaningful global signal.
Anybody see obvious things we're missing?
However, the corrigibility-via-instrumental-goals does have the feel of "make the agent uncertain regarding what goals it will want to pursue next".
That's an element, but not the central piece. The central piece (in the subagents frame) is about acting-as-though there are other subagents in the environment which are also working toward your terminal goal, so you want to avoid messing them up.
The "uncertainty regarding the utility function" enters here mainly when we invoke instrumental convergence, in hopes that the subagent can "act as though other subagents are also working torward its terminal goal" in a way agnostic to its terminal goal. Which is a very different role than the old "corrigibility via uncertainty" proposals.
Note that the instrumental goal is importantly distinct from the subagent which pursues that instrumental goal. I think a big part of the insight in this post is to say "corrigibility is a property of instrumental goals, separate from the subagents which pursue those goals"; we can study the goals (i.e. problem factorization) rather than the subagents in order to understand corrigibility.
I think this misunderstands the idea, mainly because it's framing things in terms of subagents rather than subgoals. Let me try to illustrate the picture in my head. (Of course at this stage it's just a hand-wavy mental picture, I don't expect to have the right formal operationalization yet.)
Imagine that the terminal goal is some optimization problem. Each instrumental goal is also an optimization problem, with a bunch of constraints operationalizing the things which must be done to avoid interfering with other subgoals. The instrumental convergence we're looking for here is mainly in those constraints; we hope to see that roughly the same constraints show up in many instrumental goals for many terminal goals. Insofar as we see convergence in the constraints, we can forget about the top-level goal, and expect that a (sub)agent which respects those constraints will "play well" in an environment with other (sub)agents trying to achieve other instrumental and/or terminal goals.
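One very non-final way to write the picture down, just to fix notation (my illustration, not a proposed formal definition):

```latex
% Terminal goal: choose a full plan x = (x_1, \dots, x_k) to maximize f(x).
% Subgoal i only controls its own piece x_i, and comes with a constraint
% set C operationalizing "don't interfere with the other subplans":
\[
  \text{subgoal } i:\qquad \max_{x_i}\; f_i(x_i)
  \quad \text{subject to} \quad x_i \in C .
\]
% The hoped-for instrumental convergence is that roughly the same C shows
% up across many instrumental goals of many terminal goals, so a subagent
% can respect C while staying agnostic about the terminal goal f.
```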
Then, addressing this part specifically:
I predict such an agent is relatively easy to make, and will convert the universe into batteries/black holes, computers, and robots. I fail to see why it would respect agents with other terminal goals.
... that would only happen insofar as converting the universe into batteries, computers and robots can typically be done without interfering with other subgoals, for a wide variety of terminal objectives. If it does interfere with other subgoals (for a wide variety of terminal objectives), then the constraints would say "don't do that".
And to be clear, maybe there would be some large-scale battery/computer/robot building! But it would be done in a way which doesn't step on the toes of other subplans, and makes the batteries/computers/robots readily available and easy to use for those other subplans.
As in, the claim is that there is almost always a "Schelling" mistake? Or is the argument mostly that scheming is largely unimportant because false answers will naturally perform much better than true answers in important cases such that considering the adversarial case doesn't make much difference?
The latter.
To be clear, I am not claiming that this failure mode is very likely very hard to resolve. Just harder than "run it twice on the original question and a rephrasing/transformation of the question".
But if you ask the AI for a justification of its reasoning, and run it twice on the original question and a rephrasing/transformation of the question, it seems way intuitively hard for it to come up with consistent explanations that lead to consistent shared wrong answers (e.g. I think Von Neumann would have substantial trouble with that).
Strong disagree with this. Probably not the most cruxy thing for us, but I'll briefly walk through the issue because it is a special case of a common cruxy issue with a lot of prosaic alignment work.
The reason this doesn't work is that the prototypical "blatant lie" doesn't look like "the model chooses a random number to output". The prototypical blatant lie is that there's a subtle natural mistake one could make in reasoning about the question, the model "knows" that it's a mistake, but the model just presents an argument with the subtle mistake in it.
Or, alternatively: the model "knows" what kind of argument would seem most natural to the user, and presents that kind of argument despite "knowing" that it systematically overlooks major problems.
Or, alternatively: human experts in the field generally believe a particular wrong answer. Maybe there's a few dissenters, and the model is clever enough to recognize that the dissenters are correct, but also the model was RLHF'd and got positive feedback for matching supposed-experts' answers even when those experts were wrong, so it outputs the answer and arguments the human experts in the field would generally give.
These sorts of pressures make it not only easy for the model to generate consistent shared wrong answers, but even (in some situations, like the incorrect human experts example) the default outcome for the model to generate consistent shared wrong answers, even when the model internally "knows" the right answer in some sense.
The first three aren't addressable by any technical research or solution. Corporate leaders might be greedy, hubristic, and/or reckless. Or human organizations might not be nimble enough to effect development and deployment of the maximum safety we are technically capable of. No safety research portfolio addresses those risks. The other four are potential failures by us as a technical community that apply broadly. If too high a percentage of the people in our space are bad statisticians, can't think distributionally, are lazy or prideful, or don't understand causal reasoning well enough, that will doom all potential directions of AI safety research, not just AI control.
Technical research can have a huge impact on these things! When a domain is well-understood in general (think e.g. electrical engineering), it becomes far easier and cheaper for human organizations to successfully coordinate around the technical knowledge, for corporate leaders to use it, for bureaucracies to build regulations based on its models, for mid researchers to work in the domain without deceiving or confusing themselves, etc. But that all requires correct and deep technical understanding first.
Now, you are correct that a number of other AI safety subfields suffer from the same problems to some extent. But that's a different discussion for a different time.