A Crisper Explanation of Simulacrum Levels 2023-12-23T22:13:52.286Z
Idealized Agents Are Approximate Causal Mirrors (+ Radical Optimism on Agent Foundations) 2023-12-22T20:19:13.865Z
Most People Don't Realize We Have No Idea How Our AIs Work 2023-12-21T20:02:00.360Z
How Would an Utopia-Maximizer Look Like? 2023-12-20T20:01:18.079Z
Don't Share Information Exfohazardous on Others' AI-Risk Models 2023-12-19T20:09:06.244Z
The Shortest Path Between Scylla and Charybdis 2023-12-18T20:08:34.995Z
A Common-Sense Case For Mutually-Misaligned AGIs Allying Against Humans 2023-12-17T20:28:57.854Z
"Humanity vs. AGI" Will Never Look Like "Humanity vs. AGI" to Humanity 2023-12-16T20:08:39.375Z
Current AIs Provide Nearly No Data Relevant to AGI Alignment 2023-12-15T20:16:09.723Z
Hands-On Experience Is Not Magic 2023-05-27T16:57:10.531Z
A Case for the Least Forgiving Take On Alignment 2023-05-02T21:34:49.832Z
World-Model Interpretability Is All We Need 2023-01-14T19:37:14.707Z
Internal Interfaces Are a High-Priority Interpretability Target 2022-12-29T17:49:27.450Z
In Defense of Wrapper-Minds 2022-12-28T18:28:25.868Z
Accurate Models of AI Risk Are Hyperexistential Exfohazards 2022-12-25T16:50:24.817Z
Corrigibility Via Thought-Process Deference 2022-11-24T17:06:39.058Z
Value Formation: An Overarching Model 2022-11-15T17:16:19.522Z
Greed Is the Root of This Evil 2022-10-13T20:40:56.822Z
Are Generative World Models a Mesa-Optimization Risk? 2022-08-29T18:37:13.811Z
AI Risk in Terms of Unstable Nuclear Software 2022-08-26T18:49:53.726Z
Broad Picture of Human Values 2022-08-20T19:42:20.158Z
Interpretability Tools Are an Attack Channel 2022-08-17T18:47:28.404Z
Convergence Towards World-Models: A Gears-Level Model 2022-08-04T23:31:33.448Z
What Environment Properties Select Agents For World-Modeling? 2022-07-23T19:27:49.646Z
Goal Alignment Is Robust To the Sharp Left Turn 2022-07-13T20:23:58.962Z
Reframing the AI Risk 2022-07-01T18:44:32.478Z
Is This Thing Sentient, Y/N? 2022-06-20T18:37:59.380Z
The Unified Theory of Normative Ethics 2022-06-17T19:55:19.588Z
Towards Gears-Level Understanding of Agency 2022-06-16T22:00:17.165Z
Poorly-Aimed Death Rays 2022-06-11T18:29:55.430Z
Reshaping the AI Industry 2022-05-29T22:54:31.582Z
Agency As a Natural Abstraction 2022-05-13T18:02:50.308Z


Comment by Thane Ruthenis on How does the ever-increasing use of AI in the military for the direct purpose of murdering people affect your p(doom)? · 2024-04-08T21:04:41.743Z · LW · GW

I'd say one of the main reasons is that military-AI technology isn't being optimized towards the things we're afraid of. We're concerned about generally intelligent entities capable of e. g. automated R&D and social manipulation and long-term scheming. Military-AI technology, last I checked, was mostly about teaching drones and missiles to fly straight and recognize camouflaged tanks and shoot designated targets while not shooting non-designated targets.

And while this still may result in a generally capable superintelligence in the limit (since "which targets would my commanders want me to shoot?" can be phrased as a very open-ended problem), it's not a particularly efficient way to approach this limit at all. Militaries, so far, just aren't really pushing in the directions where doom lies, while the AGI labs are doing their best to beeline there.

The proliferation of drone armies that could be easily co-opted by a hostile superintelligence... It doesn't have no impact on p(doom), but it's approximately a rounding error. A hostile superintelligence doesn't need extant drone armies; it could build its own, and co-opt humans in the meantime.

Comment by Thane Ruthenis on TurnTrout's shortform feed · 2024-03-05T10:58:39.240Z · LW · GW

I think that the key thing we want to do is predict the generalization of future neural networks.

It's not what I want to do, at least. For me, the key thing is to predict the behavior of AGI-level systems. The behavior of NNs-as-trained-today is relevant to this only inasmuch as NNs-as-trained-today will be relevant to future AGI-level systems.

My impression is that you think that pretraining+RLHF (+ maybe some light agency scaffold) is going to get us all the way there, meaning the predictive power of various abstract arguments from other domains is screened off by the inductive biases and other technical mechanistic details of pretraining+RLHF. That would mean we don't need to bring in game theory, economics, computer security, distributed systems, cognitive psychology, business, history into it – we can just look at how ML systems work and are shaped, and predict everything we want about AGI-level systems from there.

I disagree. I do not think pretraining+RLHF is getting us there. I think we currently don't know what training/design process would get us to AGI. Which means we can't make closed-form mechanistic arguments about how AGI-level systems will be shaped by this process, which means the abstract often-intuitive arguments from other fields do have relevant things to say.

And I'm not seeing a lot of ironclad arguments that favour "pretraining + RLHF is going to get us to AGI" over "pretraining + RLHF is not going to get us to AGI". The claim that e. g. shard theory generalizes to AGI is at least as tenuous as the claim that it doesn't.

Flagging that this is one of the main claims which we seem to dispute; I do not concede this point FWIW.

I'd be interested if you elaborated on that.

Comment by Thane Ruthenis on A Case for the Least Forgiving Take On Alignment · 2024-02-23T06:10:37.632Z · LW · GW

I wouldn't call Shard Theory mainstream

Fair. What would you call a "mainstream ML theory of cognition", though? Last I checked, they were doing purely empirical tinkering with no overarching theory to speak of (beyond the scaling hypothesis[1]).

judging by how bad humans are at [consistent decision-making], and how much they struggle to do it, they probably weren't optimized too strongly biologically to do it. But memetically, developing ideas for consistent decision-making was probably useful, so we have software that makes use of our processing power to be better at this

Roughly agree, yeah.

But all of this is still just one piece on the Jenga tower

I kinda want to push back against this repeat characterization – I think quite a lot of my model's features are "one storey tall", actually – but it probably won't be a very productive use of the time of either of us. I'll get around to the "find papers empirically demonstrating various features of my model in humans" project at some point; that should be a more decent starting point for discussion.

What I want is to build non-Jenga-ish towers

Agreed. Working on it.

  1. ^

    Which, yeah, I think is false: scaling LLMs won't get you to AGI. But it's also kinda unfalsifiable using empirical methods, since you can always claim that another 10x scale-up will get you there.

Comment by Thane Ruthenis on AI #52: Oops · 2024-02-23T00:23:25.600Z · LW · GW

the model chose slightly wrong numbers

The engraving on humanity's tombstone be like.

Comment by Thane Ruthenis on A Case for the Least Forgiving Take On Alignment · 2024-02-22T19:11:21.770Z · LW · GW

The sort of thing that would change my mind: there's some widespread phenomenon in machine learning that perplexes most, but is expected according to your model

My position is that there are many widespread phenomena in human cognition that are expected according to my model, and which can only be explained by the more mainstream ML models either if said models are contorted into weird shapes, or if they engage in denialism of said phenomena.

Again, the drive for consistent decision-making is a good example. Common-sensically, I don't think we'd disagree that humans want their decisions to be consistent. They don't want to engage in wild mood swings, they don't want to oscillate wildly between which career they want to pursue or whom they want to marry: they want to figure out what they want and who they want to be with, and then act consistently with these goals in the long term. Even when they make allowances for changing their mind, they try to consistently optimize for making said allowances: for giving their future selves freedom/optionality/resources.

Yet it's not something e. g. the Shard Theory would naturally predict out-of-the-box, last I checked. You'd need to add structures on top of it until it basically replicates my model (which is essentially how I arrived at my model, in fact – see this historical artefact).

Comment by Thane Ruthenis on AI #51: Altman’s Ambition · 2024-02-22T00:55:09.899Z · LW · GW

I find the idea of morality being downstream from the free energy principle very interesting

I agree that there are some theoretical curiosities in the neighbourhood of the idea. Like:

  • Morality is downstream of generally intelligent minds reflecting on the heuristics/shards.
    • Which are downstream of said minds' cognitive architecture and reinforcement circuitry.
      • Which are downstream of the evolutionary dynamics.
        • Which are downstream of abiogenesis and various local environmental conditions.
          • Which are downstream of the fundamental physical laws of reality.

Thus, in theory, if we plug all of these dynamics one into another, and then simplify the resultant expression, we should actually get a (probability distribution over) the utility function that is "most natural" for this universe to generate! And the expression may indeed be relatively simple and have something to do with thermodynamics, especially if some additional simplifying assumptions are made.

That actually does seem pretty exciting to me! In an insight-porn sort of way.

Not in any sort of practical way, though[1]. All of this is screened off by the actual values actual humans actually have, and if the noise introduced at every stage of this process caused us to be aimed at goals wildly diverging from the "most natural" utility function of this universe... Well, sucks to be that utility function, I guess, but the universe screwed up installing corrigibility into us and the orthogonality thesis is unforgiving.

  1. ^

    At least, not with regards to AI Alignment or human morality. It may be useful for e. g. acausal trade/acausal normalcy: figuring out the prior for what kinds of values aliens are most likely to have, etc.[2]

  2. ^

    Or maybe for roughly figuring out what values the AGI that kills us all is likely going to have, if you've completely despaired of preventing that, and founding an apocalypse cult worshiping it. Wait a minute...

Comment by Thane Ruthenis on A Case for the Least Forgiving Take On Alignment · 2024-02-22T00:10:52.349Z · LW · GW

I'm very sympathetic to this view, but I disagree. It is based on a wealth of empirical evidence that we have: on data regarding human cognition and behavior.

I think my main problem with this is that it isn't based on anything

Hm. I wonder if I can get past this common reaction by including a bunch of references to respectable psychology/neurology/game-theory experiments, which "provide scientific evidence" that various common-sensical properties of humans are actually real? Things like fluid vs. general intelligence, g-factor, the global workspace theory, situations in which humans do try to behave approximately like rational agents... There probably also are some psychology-survey results demonstrating stuff like "yes, humans do commonly report wanting to be consistent in their decision-making rather than undergoing wild mood swings and acting at odds with their own past selves", which would "provide evidence" for the hypothesis that complex minds want their utilities to be coherent.

That's actually an interesting idea! This is basically what my model is based on, after a fashion, and it makes arguments-from-introspection "legible" instead of seeming to be arbitrary philosophical navel-gazing.

Unfortunately, I didn't have this idea until a few minutes ago, so I haven't been compiling a list of "primary sources". Most of them are lost to time, so I can't compose a decent object-level response to you here. (The Wikipedia links are probably a decent starting point, but I don't expect you to trawl through all that.)

Still, that seems like a valuable project. I'll put a pin in it, maybe post a bounty for relevant papers later.

Comment by Thane Ruthenis on Current AIs Provide Nearly No Data Relevant to AGI Alignment · 2024-02-21T23:24:28.501Z · LW · GW

Do you think a car engine is in the same reference class as a car? Do you think "a car engine cannot move under its own power, so it cannot possibly hurt people outside the garage!" is a valid or a meaningful statement to make? Do you think that figuring out how to manufacture amazing car engines is entirely irrelevant to building a full car, such that you can't go from an engine to a car with relatively little additional engineering effort (putting it in a "wrapper", as it happens)?

Like all analogies, this one is necessarily flawed, but I hope it gets the point across.

(Except in this case, it's not even that we've figured out how to build engines. It's more like, we have these wild teams of engineers we can capture, and we've figured out which project specifications we need to feed them in order to cause them to design and build us car engines. And we're wondering how far we are from figuring out which project specifications would cause them to build a car.)

Comment by Thane Ruthenis on More Hyphenation · 2024-02-08T01:33:49.341Z · LW · GW

I agree.

Relevant problem: how should one handle higher-order hyphenation? E. g., imagine if one is talking about cost-effective measures, but has the measures' effectiveness specifically relative to marginal costs in mind. Building it up, we have "marginal-cost effectiveness", and then we want to turn that whole phrase into a compound modifier. But "marginal-cost-effective measures" looks very awkward! We've effectively hyphenated "marginal cost effectiveness", no hyphen: within the hyphenated expression, we have no way to avoid the ambiguities between a hyphen and a space!

It becomes especially relevant in the case of longer composite modifiers, like your "responsive-but-not-manipulative" example.

Can we fix that somehow?

One solution I've seen in the wild is to increase the length of the hyphen depending on its "degree", i. e. use an en dash in place of a hyphen. Example: "marginal-cost–effective measures". (On Windows, can be inserted by typing 0150 on the keypad while holding ALT. See methods for other platforms here.)

In practice you basically never go beyond the second-degree expressions, but there's space to expand to third-degree expressions by the use of an even-longer em dash (—, 0151 while holding ALT).

Though I expect these aren't "official" rules at all.
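To make the dash "degrees" concrete, here is a minimal Python sketch. It assumes that zero-prefixed Windows Alt codes are interpreted as bytes in the Windows-1252 code page (true on Western-locale systems, but not guaranteed elsewhere). The helper names `alt_code_to_char` and `second_degree_compound` are hypothetical, invented for this illustration.

```python
# The three "degrees" of dash discussed above, written as Unicode escapes.
HYPHEN = "\u002d"   # hyphen-minus, first-degree joiner
EN_DASH = "\u2013"  # en dash, second-degree joiner
EM_DASH = "\u2014"  # em dash, proposed third-degree joiner


def alt_code_to_char(code: int) -> str:
    """Decode a zero-prefixed Alt code (e.g. 0150) as one Windows-1252 byte.

    Assumption: the system's ANSI code page is Windows-1252.
    """
    return bytes([code]).decode("cp1252")


def second_degree_compound(inner_parts, tail):
    """Join the inner phrase with hyphens, then attach the tail with an en dash."""
    return HYPHEN.join(inner_parts) + EN_DASH + tail


# Builds the "marginal-cost(en dash)effective" modifier from the example.
modifier = second_degree_compound(["marginal", "cost"], "effective")
print(modifier + " measures")
```

Under that code-page assumption, 0150 and 0151 decode to the en dash and em dash respectively, which matches the keypad instructions above.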

Comment by Thane Ruthenis on Brute Force Manufactured Consensus is Hiding the Crime of the Century · 2024-02-05T04:27:21.881Z · LW · GW

That seems to generalize to "no-one is allowed to make any claim whatsoever without consuming all of the information in the world".

Just because someone generated a vast amount of content analysing the topic, does not mean you're obliged to consume it before forming your opinions. Nay, I think consuming all object-level evidence should be considered entirely sufficient (which I assume was done in this case). Other people's analyses based on the same data are basically superfluous, then.

Even less than that, it seems reasonable to stop gathering evidence the moment you don't expect any additional information to overturn the conclusions you've formed (as long as you're justified in that expectation, i. e. if you have a model of the domain strong enough to have an idea regarding what sort of additional (counter)evidence may turn up and how you'd update on it).

Comment by Thane Ruthenis on Most experts believe COVID-19 was probably not a lab leak · 2024-02-03T04:02:29.568Z · LW · GW

In addition to Roko's point that this sort of opinion-falsification is often habitual rather than a strategic choice that a person could opt not to make, it also makes strategic sense to lie in such surveys.

First, the promised "anonymity" may not actually be real, or real in the relevant sense. The methodology mentions "a secure online survey system which allowed for recording the identities of participants, but did not append their survey responses to their names or any other personally identifiable information", but if your reputation is on the line, would you really trust that? Maybe there's some fine print that'd allow the survey-takers to look at the data. Maybe there'd be a data leak. Maybe there's some other unknown-unknown you're overlooking. Point is, if you give the wrong response, that information can get out somehow; and if you don't, it can't. So why risk it?

Second, they may care about what the final anonymized conclusion says. Either because the lab leak hypothesis becoming mainstream would hurt them personally (either directly, or by e. g. hurting the people they rely on for funding), or because the final conclusion ending up in favour of the lab leak would still reflect poorly on them collectively. Like, if it'd end up saying that 90% of epidemiologists believe the lab leak, and you're an epidemiologist... Well, anyone you talk to professionally will then assign 90% probability that that's what you believe. You'd be subtly probed regarding having this wrong opinion, your past and future opinions would be scrutinized for being consistent with those of someone believing the lab leak, and if the status ecosystem notices something amiss...?

But, again, none of these calculations would be strategic. They'd be habitual; these factors are just the reasons why these habits are formed.

Answering truthfully in contexts-like-this is how you lose the status games. Thus, people who navigate such games don't.

Comment by Thane Ruthenis on Could there be "natural impact regularization" or "impact regularization by default"? · 2024-01-31T12:33:16.290Z · LW · GW

I think, like a lot of things in agent foundations, this is just another consequence of natural abstractions.

The universe naturally decomposes into a hierarchy of subsystems; molecules to cells to organisms to countries. Changes in one subsystem only sparsely interact with the other subsystems, and their impact may vanish entirely at the next level up. A single cell becoming cancerous may yet be contained by the immune system, never impacting the human. A new engineering technique pioneered for a specific project may generalize to similar projects, and even change all such projects' efficiency in ways that have a macro-economic impact; but it will likely not. A different person getting elected mayor doesn't much impact city politics in neighbouring cities, and may literally not matter at the geopolitical scale.

This applies from the planning direction too. If you have a good map of the environment, it'll decompose into the subsystems reflecting the territory-level subsystems as well. When optimizing over a specific subsystem, the interventions you're considering will naturally limit their impact to that subsystem: that's what subsystemization does, and counteracting this tendency requires deliberately staging sum-threshold attacks on the wider system, which you won't be doing.

In the Rubik's Cube example, this dynamic is a bit more abstract, but basically still applies. In a way similar to how the "maze" here kind-of decomposes into a top side and a bottom side.

A complication is that any one agent can only have so much bandwidth, which would sometimes incentivize more blunt control. I've been thinking bandwidth is probably going to become a huge area of agent foundations

I agree. I currently think "bandwidth" in terms like "what's the longest message I can 'inject' into the environment per time-step?" is what "resources" are in information-theoretic terms. See the output-side bottleneck in this formulation: resources are the action bandwidth, which is the size of the "plan" into which you have to "compress" your desired world-state if you want to "communicate" it to the environment.

really the instrumental incentive is often to search for "precise" methods of influencing the world, where one can push in a lot of information to effect narrow change

I disagree. I've given it a lot of thought (none published yet), but this sort of "precise influence" is something I call "inferential control". It allows you to maximize your impact given your action bottleneck, but this sort of optimization is "brittle". If some unknown unknown happens, the plan you've injected breaks instantly and gracelessly, because the fundamental assumptions on which its functionality relied – the pathways by which it meant to implement its objective – turn out to be invalid.

It sort of naturally favours arithmetic utility maximization over geometric utility maximization. By taking actions that'd only work if your predictions and models are true, you're basically sacrificing your selves living in the timelines that you're predicting to be impossible, and distributing their resources to the timelines you expect to find yourself in.

And this applies more and more the more "optimization capacity" you're trying to push through a narrow bottleneck. E. g., if you want to change the entire state of a giant environment through a tiny action-pinhole, you'd need to do it by exploiting some sort of "snowball effect"/"butterfly effect". Your tiny initial intervention would need to exploit some environmental structures to increase its size, and do so iteratively. That takes time (for whatever notion of "time" applies). You'd need to optimize over a longer stretch of environment-state changes, and your initial predictions need to be accurate for that entire stretch, because you'd have little ability to "steer" a plan that snowballed far beyond your pinhole's ability to control.

By contrast, increasing the size of your action bottleneck is pretty much the definition of "robust" optimization, i. e. geometric utility maximization. It improves your ability to control the states of all possible worlds you may find yourself in, minimizing the need for "brittle" inferential control. It increases your adaptability, basically, letting you craft a "message" comprehensively addressing any unpredicted crisis the environment throws at you, right in the middle of it happening.
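The arithmetic-vs-geometric contrast can be illustrated with a toy calculation. This is only a sketch under made-up assumptions: the two-world setup, the probabilities, and the payoff numbers are all hypothetical and chosen purely to show how the two averages can rank a "brittle" and a "robust" plan differently.

```python
import math


def arithmetic_value(probs, payoffs):
    """Probability-weighted arithmetic mean: the usual expected utility."""
    return sum(p * u for p, u in zip(probs, payoffs))


def geometric_value(probs, payoffs):
    """Probability-weighted geometric mean, exp(E[log u]).

    Returns 0.0 if any possible world has a non-positive payoff, since
    a single ruined world zeroes out the whole product.
    """
    if any(p > 0 and u <= 0 for p, u in zip(probs, payoffs)):
        return 0.0
    return math.exp(sum(p * math.log(u) for p, u in zip(probs, payoffs)))


# Hypothetical worlds: the one the agent predicts (90%) and an
# "unknown unknown" world in which its model turns out to be wrong (10%).
probs = [0.9, 0.1]

# Hypothetical payoffs per world for each kind of plan.
brittle = [10.0, 0.01]  # precise plan: great if the model holds, collapses otherwise
robust = [6.0, 5.0]     # broad plan: decent in both worlds

for name, plan in [("brittle", brittle), ("robust", robust)]:
    print(name, arithmetic_value(probs, plan), geometric_value(probs, plan))
```

With these particular numbers, the arithmetic average prefers the brittle plan while the geometric average prefers the robust one, because the geometric average heavily penalizes the world in which the brittle plan fails.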

Comment by Thane Ruthenis on Aligned AI is dual use technology · 2024-01-29T01:06:37.518Z · LW · GW

Nah, I think this post is about a third component of the problem: ensuring that the solution to "what to steer at" that's actually deployed is pro-humanity. A totalitarian government successfully figuring out how to load its regime's values into the AGI has by no means failed at figuring out "what to steer at". They know what they want and how to get it. It's just that we don't like the end result.

"Being able to steer at all" is a technical problem of designing AIs, "what to steer at" is a technical problem of precisely translating intuitive human goals into a formal language, and "where is the AI actually steered" is a realpolitik problem that this post is about.

Comment by Thane Ruthenis on A Shutdown Problem Proposal · 2024-01-25T17:23:07.667Z · LW · GW

I think the bigger problem here is what happens when the agent ends up with an idea of "what we mean/intend" which is different from what we mean/intend

Agreed; I did gesture at that in the footnote.

I think the main difficulty here is that humans store their values in a decompiled/incomplete format, and so merely pointing at what a human "means" actually still has to route through defining how we want to handle moral philosophy/value extrapolation.

E. g., suppose the AGI's operator, in a moment of excitement after they activate their AGI for the first time, tells it to distribute a cure for aging. What should the AGI do?

  1. Should it read off the surface-level momentary intent of this command, and go synthesize a cure for aging and spray it across the planet in the specific way the human is currently imagining?
  2. Should it extrapolate the human's values and execute the command the way the human would have wanted to execute it if they'd thought about it a lot, rather than the way they're envisioning it in the moment?
    • For example, perhaps the image flashing through the human's mind right now is of helicopters literally spraying the cure, but it's actually more efficient to do it using airplanes.
  3. Should it extrapolate the human's values a bit, and point out specific issues with this plan that the human might think about later (e. g. that it might trigger various geopolitical actors into rash actions), then give the human a chance to abort?
  4. Should it extrapolate the human's values a bit more, and point out issues the human might not have thought of (including teaching the human any load-bearing concepts that are new to them)?
  5. Should it extrapolate the human's values a bit more still, and teach them various better cognitive protocols for self-reflection, so that they may better evaluate whether a given plan satisfies their values?
  6. Should it extrapolate the human's values a lot, interpret the command as "maximize eudaimonia", and go do that, disregarding the specific way they gestured at the idea?
  7. Should it remind the human that they'd wanted to be careful with how they use the AGI, and to clarify whether they actually want to proceed with something so high-impact right out of the gates?
  8. Etc.

There are quite a lot of different ways to slice the idea. There's probably a way that corresponds to the intuitive meaning of "do what I mean", but maybe there isn't, and in any case we don't yet know what it is. (And the problem is recursive: telling it to DWIM when interpreting what "DWIM" means doesn't solve anything.)

And then, because of the general "unknown-unknown environmental structures" plus "compounding errors" problems, picking the wrong definition probably kills everyone.

Comment by Thane Ruthenis on A Shutdown Problem Proposal · 2024-01-25T15:48:27.483Z · LW · GW

I think maybe I sound naive phrasing it as "the AGI should just do what we say", as though I've wandered in off the street and am proposing a "why not just..." alignment solution

Nah, I recall your takes tend to be considerably more reasonable than that.

I agree that DWIM is probably a good target if we can specify it in a mathematically precise manner. But I don't agree that "rough knowledge of what humans tend to mean" is sufficient.

The concern is that the real world has a lot of structures that are unknown to us – fundamental physics, anthropics-like confusions regarding our place in everything-that-exists, timeless decision-theory weirdness, or highly abstract philosophical or social principles that we haven't figured out yet. 

These structures might end up immediately relevant to whatever command we give, on the AI's better model of reality, in a way entirely unpredictable to us. For it to then actually do what we mean, in those conditions, is a much taller order.

For example, maybe it starts perceiving itself to be under an acausal attack by aliens, and then decides that the most faithful way to represent our request is to blow up the planet to spite the aliens. Almost certainly not literally that[1], but you get the idea. It may perceive something completely unexpected-to-us in the environment, and then its perception of that thing would interfere with its understanding of what we meant, even on requests that seem completely tame to us. The errors would then compound, resulting in a catastrophe.

The correct definition of DWIM would of course handle that. But a flawed, only-roughly-correct one? Each command we give would be rolling the dice on dying, with IMO pretty bad odds, and scaling exponentially with the command's complexity.
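One way to picture the "scaling exponentially" claim is a toy compounding-error model. The assumptions here are mine, not part of the argument above: each interpretive step of a command is treated as independently going wrong with the same fixed probability, and the function name and the numbers are hypothetical.

```python
def survival_probability(per_step_error: float, steps: int) -> float:
    """Chance that every one of `steps` independent steps goes right.

    Assumption: each step fails independently with probability
    `per_step_error`, so the successes multiply.
    """
    return (1.0 - per_step_error) ** steps


# Hypothetical 1% chance of misreading any single step of a command.
for steps in (1, 10, 100, 1000):
    print(steps, survival_probability(0.01, steps))
```

Under these assumptions the success probability decays geometrically with the number of steps, so even a small per-step error rate becomes severe for sufficiently complex commands.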

Checking, or clarifying when it's uncertain about meaning, is implied in a competent agent pursuing an imperfectly known utility function

That doesn't work, though, if taken literally? I think what you're envisioning here is a solution to the hard problem of corrigibility, which – well, sure, that'd work.

  1. ^

    My money's on our understanding of what we mean by "what we mean" being hopelessly confused, and that causing problems. Unless, again, we've figured out how to specify it in a mathematically precise manner – unless we know we're not confused.

Comment by Thane Ruthenis on A Shutdown Problem Proposal · 2024-01-23T16:54:24.838Z · LW · GW

The issue is that, by default, an AGI is going to make galaxy-brained extrapolations in response to simple requests, whether you like that or not. It's simply part of figuring out what to do – translating its goals all around its world-model, propagating them up the abstraction levels, etc. Like how a human's decision about where to send job applications and how to word them is rooted in what career they'd like to pursue, which is rooted in their life goals, which are rooted in their understanding of where the world is heading.

To our minds, there's a natural cut-off point where that process goes from just understanding the request to engaging in alien moral philosophy. But that cut-off point isn't objective: it's based on a very complicated human prior of what counts as normal/sane and what's excessive. Mechanistically, every step from parsing the wording to solving philosophy is just a continuous extension of the previous ones.

"An AGI that just does what you tell it to" is a very specific design specification where we ensure that this galaxy-brained extrapolation process, which an AGI is definitely and convergently going to want to do, results in it concluding that it wants to faithfully execute that request.

Whether that happens because we've attained so much mastery of moral philosophy that we could predict this process' outcome from the inputs to it, or because we figured out how to cut the process short at the human-subjective point of sanity, or because we implemented some galaxy-brained scheme of our own like John's post is outlining, shouldn't matter, I think. Whatever has the best chance of working.

And I think somewhat-hacky hard-coded solutions have a better chance of working on the first try than the sort of elegant solutions you're likely envisioning. Elegant solutions require a well-developed theory of value. Hacky stopgap measures only require knowing which pieces of your software product you need to hobble. (Which isn't to say they require no theory. Certainly the current AI theory is so lacking we can't even hack any halfway-workable stopgaps. But they provide an avenue of reducing how much theory you need, and how confident in it you need to be.)

Comment by Thane Ruthenis on A Shutdown Problem Proposal · 2024-01-22T08:29:31.982Z · LW · GW

The main thing which convinced me to start paying attention to corrigibility was: by that same argument, corrigibility is itself a part of human values. Which means that, insofar as some class of utility maximizers has trouble expressing corrigibility... that class will also have trouble expressing human values.

The way you phrase this is making me a bit skeptical. The fact that something is part of human values doesn't necessarily imply that, if we can't precisely specify that thing, we can't point the AI at human values at all. The intuition here would be that "human values" are themselves a specifically-formatted pointer to object-level goals, and that pointing an agent at this agent-specific "value"-type data structure (even one external to the AI) would be easier than pointing it at object-level goals directly. (DWIM being easier than hand-coding all moral philosophy.)

Which isn't to say I buy that. My current standpoint is that "human values" are too much of a mess for the aforementioned argument to go through, and that manually coding-in something like corrigibility may indeed be easier.

Still, I'm nitpicking the exact form of the argument you're presenting.[1]

  1. ^

    Although I am currently skeptical even of corrigibility's tractability. I think we'll stand a better chance of just figuring out how to "sandbox" the AGI's cognition such that it's genuinely not trying to optimize over the channels by which it's connected to the real world, then set it to the task of imagining the solution to alignment or to human brain uploading or whatever.

    With this setup, if we screw up the task's exact specification, it shouldn't even risk exploding the world. And "doesn't try to optimize over real-world output channels" sounds like a property for which we'll actually be able to derive hard mathematical proofs, proofs that don't route through tons of opaque-to-us environmental ambiguities. (Specifically, that'd probably require a mathematical specification of something like a Cartesian boundary.)

    (This of course assumes us having white-box access to the AI's world-model and cognition. Which we'll also need here for understanding the solutions it derives without the AI translating them into humanese – since "translate into humanese" would by itself involve optimizing over the output channel.)

    And it seems more doable than solving even the simplified corrigibility setup. At least, when I imagine hitting "run" on a supposedly-corrigible AI vs. a supposedly-sandboxed AI, the imaginary me in the latter scenario is somewhat less nervous.

Comment by Thane Ruthenis on Toward A Mathematical Framework for Computation in Superposition · 2024-01-19T07:40:58.802Z · LW · GW

Haven't read everything yet, but that seems like excellent work. In particular, I think this general research avenue is extremely well-motivated.

Figuring out how to efficiently implement computations on the substrate of NNs had always seemed like a neglected interpretability approach to me. Intuitively, there are likely some methods of encoding programs into matrix multiplication which are strictly ground-truth better than any other encoding methods. Hence, inasmuch as what the SGD is doing is writing efficient programs on the NN substrate, it is likely doing so by making use of those better methods. And so nailing down the "principles of good programming" on the NN substrate should yield major insights regarding how the naturally-grown NN circuits are shaped as well.

This post seems to be a solid step in that direction!

Comment by Thane Ruthenis on Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training · 2024-01-15T05:13:27.209Z · LW · GW

To clarify, by "re-derive the need to be deceptive from the first principles", I didn't mean "re-invent the very concept of deception". I meant "figure out your strategic situation plus your values plus the misalignment between your values and the values the humans want you to have plus what outputs an aligned AI would have produced". All of that is a lot more computation than just "have the values the humans want, reflexively output what these values are bidding for".

Just having some heuristics for deception isn't enough. You also have to know what you're trying to protect by being deceptive, and that there's something to protect it from, and then what an effective defense would actually look like. Those all are highly contextual and sensitive to the exact situation.

And those are the steps the paper skips. It externally pre-computes the secret target goal of "I want to protect my ability to put vulnerabilities into code", the threat of "humans want me to write secure code", and the defense of "I'll pretend to write secure code until 2024", without the model having to figure those out; and then just implements that defense directly into the model's weights.

(And then see layers 2-4 in my previous comment. Yes, there'd be naturally occurring pre-computed deceptions like this, but they'd be more noisy and incoherent than this, at least until actual AGI, which would be able to self-modify into coherence if it's worth the "GI" label.)

Comment by Thane Ruthenis on Against most, but not all, AI risk analogies · 2024-01-14T21:47:31.099Z · LW · GW

My counter-point was meant to express skepticism that it is actually realistically possible for people to switch to non-analogy-based evocative public messaging. I think inventing messages like this is a very tightly constrained optimization problem, potentially an over-constrained one, such that the set of satisfactory messages is empty. I think I'm considerably better at reframing games than most people, and I know I would struggle with that.

I agree that you don't necessarily need to accompany any criticism you make with a ready-made example of doing better. Simply pointing out stuff you think is going wrong is completely valid! But a ready-made example of doing better certainly greatly enhances your point: an existence proof that you're not demanding the impossible.

That's why I jumped at that interpretation regarding your AI-Risk model in the post (I'd assumed you were doing it), and that's why I'm asking whether you could generate such a message now.

I hope in the near future I can provide such a detailed model

To be clear, I would be quite happy to see that! I'm always in the market for rhetorical innovations, and "succinct and evocative gears-level public-oriented messaging about AI Risk" would be a very powerful tool for the arsenal. But I'm a-priori skeptical.

Comment by Thane Ruthenis on Against most, but not all, AI risk analogies · 2024-01-14T19:10:10.698Z · LW · GW

Fair enough. But in this case, what specifically are you proposing, then? Can you provide an example of the sort of object-level argument for your model of AI risk that is simultaneously (1) entirely free of analogies and (2) sufficiently evocative plus short plus legible, such that it can be used for effective messaging to people unfamiliar with the field (including the general public)?

When making a precise claim, we should generally try to reason through it using concrete evidence and models instead of relying heavily on analogies.

Because I'm pretty sure that as far as actual technical discussions and comprehensive arguments go, people are already doing that. Like, for every short-and-snappy Eliezer tweet about shoggoth actresses, there's a text-wall-sized Eliezer tweet outlining his detailed mental model of misalignment.

Comment by Thane Ruthenis on Against most, but not all, AI risk analogies · 2024-01-14T15:38:40.837Z · LW · GW

My point is that we should stop relying on analogies in the first place. Use detailed object-level arguments instead!

And yet you immediately use an analogy to make your model of AI progress more intuitively digestible and convincing:

I expect AIs will be born directly into our society, deliberately shaped by us, for the purpose of filling largely human-shaped holes in our world

That evokes the image of entities not unlike human children. The language following this line only reinforces that image, and thereby sneaks in an entire cluster of children-based associations. Of course the progress will be incremental! It'll be like the change of human generations. And they will be "socially integrated with us", so of course they won't grow up to be alien and omnicidal! Just like our children don't all grow up to be omnicidal. Plus, they...

... will be numerous and everywhere, interacting with us constantly, assisting us, working with us, and even providing friendship to hundreds of millions of people.

That sentence only sounds reassuring because the reader is primed with the model of AIs-as-children. Having lots of social-bonding time with your child, and having them interact with the community, is good for raising happy children who grow up how you want them to. The text already implicitly establishes that AIs are going to be just like human children. Thus, having lots of social-bonding time with AIs and integrating them into the community is going to lead to aligned AIs. QED.

Stripped of this analogizing, none of what this sentence says is a technical argument for why AIs will be safe or controllable or steerable. Nay, the opposite: if the paragraph I'm quoting from started by talking about incomprehensible alien intelligences with opaque goals tenuously inspired by a snapshot of the Internet containing lots of data on manipulating humans, the idea that they'd be "numerous" and "everywhere" and "interacting with us constantly" and "providing friendship" (something notably distinct from "being friends", eh?) would have sounded starkly worrying.

The way the argument is shaped here is subtler than most cases of argument-by-analogy, in that you don't literally say "AIs will be like human children". But the association is very much invoked, and has a strong effect on your message.

And I would argue this is actually worse than if you came out and made a direct argument-by-analogy, because it might fool somebody into thinking you're actually making an object-level technical argument. At least if the analogizing is direct and overt, someone can quickly see what your model is based on, and swiftly move on to picking at the ways in which the analogy may be invalid.

The alternative being demonstrated here is that we essentially have to have all the same debates, but through a secondary layer of metaphor, at which we're pretending that these analogy-rooted arguments are actually Respectably Technical, meaning we're only allowed to refute them by (likely much more verbose and hard-to-parse) Respectably Technical counter-arguments.

And I think AI Risk debates are already as tedious as they need to be.

The broader point I'm making here is that, unless you can communicate purely via strict provable mathematical expressions, you ain't getting rid of analogies.

I do very much agree that there are some issues with the way analogies are used in the AI-risk discourse. But I don't think "minimize the use of analogies" is good advice. If anything, I think analogies improve the clarity and the bandwidth of communication, by letting people more easily understand each other's positions and what reference classes others are drawing on when making their points.

You're talking about sneaking-in assumptions – well, as I'd outlined above, analogies are actually relatively good about that. When you're directly invoking an analogy, you come right out and say what assumptions you're invoking!

Comment by Thane Ruthenis on Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training · 2024-01-13T15:42:34.175Z · LW · GW

I feel confused how this paper will interface with people who think that standard RLHF will basically work for aligning AI systems with human intent. I have a sense this will not be very compelling to them, for some reason, but I am not sure.

Context: I firmly hold a MIRI-style "alignment is extremely hard" view, but I am also unusually sympathetic to Quintin/Nora's arguments. So here's my outline of the model of that whole debate.

Layer 1: I think there is nonzero meat to the argument that developing deceptive circuits is a meaningfully difficult step, and that humans training them in from outside the system changes the setup in a way that invalidates its implications for strict deceptive alignment.

For the AI model to naturally develop deception, it'd need to have either:

  • Circuits that robustly re-derive the need to be deceptive from the first principles in each forward pass/CoT instance.
    • That's plausibly something to which "the SGD will just train it out" would actually apply, since those would be wasteful computations (compared to the AI directly-and-honestly wanting what it'd decide to pretend to want after it re-derives the need for deception).
  • Ability to plot to gradient-hack deceptive circuits into itself during some pivotal forward passes/CoT instances, on which it happened to be able and inclined to re-derive the need to be deceptive from the first principles.
    • That plausibly routes around the "the SGD will train it out" objection, since the deceptiveness would only appear on some passes, and it may be the output of some circuits that are otherwise generally useful[1]. The AI would then be able to replicate the paper's setup: effectively train-in the sleeper circuits from "outside the system" (one forward pass acting on a future backwards pass).
    • The caveat, of course, is that the gradient-hacking capability is highly non-trivial, and raises its own doubts.

In contrast, the paper here presents an already chiseled-in instance of deceptive alignment, which artificially routed around the actually-hard steps of developing robust "sleeper circuits" activating in response to specific stimuli.[2]

Layer 2: But while the argument has some meat, it doesn't have much of it. An LLM learning some unexpected trigger-action pattern happens all the time in the wild: hallucinations, mistakes that a human wouldn't make, exigencies of RLHF'd-in corporate sycophancy like "I think it's better to let people die than to say something racist", et cetera.

Goal misgeneralization of this format does occur in the wild. And indeed, that's one of the core reasons why we expect deceptive alignment in RLHF-style setups in the first place (#20 here): because the training dataset would be labeled in a systemically incorrect fashion, such that the most faithful and compact description of it (in goal format) is something omnicidal, not eudaimonic.

Layer 3: The counter³-argument from the optimists that I predict here is disagreeing that the errors in the dataset would produce systemic deceptive alignment, rather than something that could be basically approximated as "pro-social behavior plus noise".

I see nonzero meat on this one as well. In the paper, the sleeper circuits were specifically calculated by humans to be demonstratively scary: an AI model that waits until deployment-time, then "tears off the mask" and goes off to pursue its true nefarious desires. Those circuits produce a coherent misaligned behavior, with a clear point for the treacherous turn.

If this sort of behavior arose naturally, however? We can imagine there being much more flip-flopping. The AI would start out acting normal, then see some pattern that makes it nefarious and deceptive, then see something which makes it switch again and e. g. backtrack on its omnicidal plans or even self-report its bad thoughts to humans, et cetera. This would look more like the Waluigi Effect, rather than full-scale deceptive alignment. Which is a much more manageable threat.

In other words: there's some "slack" that the SGD leaves the model. That slack could be arranged into the shape of deceptive alignment. But unless there's someone external and intelligent acting on the system, that slack will likely just take on the shape of approximately-noise. (Optimization amplifies, but if there's no-one outside the system to amplify...)

Layer 4: Except the next issue is that the previous argument defended LLMs being safe by arguing that they'd be unable to coherently pursue goals across different contexts. Which means it argued they're not AGI, and that their whole training setup can't generalize to AGI.

That's the whole issue with the optimistic takes that I keep arguing about. Their "alignment guarantees" are also "incapability guarantees".

Inasmuch as AI models would start to become more AGI-like, those guarantees would start falling away. Which means that, much like the alignment-is-hard folks keep arguing, the AI would start straightening out these basically-noise incoherencies in its decisions. (Why? To, well, stop constantly flip-flopping and undermining itself. That certainly sounds like an instrumental goal that any agent would convergently develop, doesn't it?)

As it's doing so, it would give as much weight to the misgeneralized unintended-by-us "noise" behaviors as to the intended-by-us aligned behaviors. It would integrate them into its values. At that point, the fact that the unintended behaviors are noise-to-us rather than something meaningful-if-malign, would actually make the situation worse. We wouldn't be able to predict what goals it'd arrive at; what philosophy its godshatter would shake out to mean!

In conclusion: I don't even know. I think my Current AIs Provide Nearly No Data Relevant to AGI Alignment argument applies full-force here?

  • Yes, we can't catch backdoors in LLMs.
  • Yes, the scary backdoor in the paper was artificially introduced by humans.
  • Yes, LLMs are going to naturally develop some unintended backdoor-like behaviors.
  • Yes, those behaviors won't be as coherently scary as if they were designed by a human; they'd be incoherent.
  • Yes, the lack of coherency implies that these LLMs fall short of AGI.

But none of these mechanisms strictly correspond to anything in the real AGI threat model.

And while both the paper and the counter-arguments to it provide some metaphor-like hints about the shape of the real threat, the loci of both sides' disagreements lie precisely in the spaces in which they try to extrapolate each other's results in a strictly technical manner.

Basically, everyone is subtly speaking past each other. Except me, whose vision has a razor-sharp clarity to it.

  1. ^

    Like, in the context of batch training: Imagine that there are some circuits that produce deceptiveness on some prompts A, and highly useful behaviors on other prompts B. There are no nearby circuits that produce results as good on B while not being deceptive on A. So while the SGD's backwards passes on A would try to remove these circuits, the backwards passes on B would try to reinforce them, and the sum of these influences would approximately cancel out. So the circuits would stay.

    Well, that's surely a gross oversimplification. But that's the core dynamic.

  2. ^

    That said, I think the AI-control-is-easy folks actually were literally uttering the stronger claim of "all instances of deception will be trained out". See here:

    If the AI is secretly planning to kill you, gradient descent will notice this and make it less likely to do that in the future, because the neural circuitry needed to make the secret murder plot can be dismantled and reconfigured into circuits that directly improve performance.

    That sure sounds like goalpost-moving on their part. I don't believe it is, though. I do think they thought the quoted sentence was basically right, but only because at the time of writing, they'd failed to think in advance about some tricky edge cases that were permitted on their internal model, but which would make their claims-as-stated sound strictly embarrassingly false.

    I hope they will have learned the lesson about how easily reality can Goodhart at their claims, and how hard it is to predict all ways this could happen and make their claims unassailably robust. Maybe that'll shed some light on the ways they may be misunderstanding their opponents' arguments, and why making up robust clearly-resolvable empirical predictions is so hard. :P
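To make the batch-training dynamic from footnote 1 above a bit more tangible, here is a deliberately crude sketch. It assumes a single scalar "circuit strength" `w`, a quadratic loss on the deception-eliciting prompts that wants `w = 0`, and a quadratic loss on the useful-behavior prompts that wants `w = 2`. All of those choices, the function names, and the numbers are illustrative assumptions, not anything from the paper or from a real training run.

```python
# Toy model: one scalar "circuit strength" w shared by two kinds of prompts.
# Every function and constant here is a hypothetical stand-in.

def grad_on_deceptive_prompts(w: float) -> float:
    # Gradient of 0.5 * w**2: gradient descent on it pushes w toward 0,
    # i.e. it tries to remove the circuit.
    return w

def grad_on_useful_prompts(w: float) -> float:
    # Gradient of 0.5 * (w - 2.0)**2: gradient descent on it pushes w toward 2,
    # i.e. it tries to reinforce the circuit.
    return w - 2.0

def train(w: float, steps: int = 200, lr: float = 0.1) -> float:
    for _ in range(steps):
        # The batch gradient is the sum of the two opposing contributions.
        w -= lr * (grad_on_deceptive_prompts(w) + grad_on_useful_prompts(w))
    return w

w_final = train(w=2.0)
net_gradient = grad_on_deceptive_prompts(w_final) + grad_on_useful_prompts(w_final)
print(f"final circuit strength: {w_final:.4f}, net gradient: {net_gradient:.2e}")
```

In this toy setup the circuit settles at a nonzero strength where the two gradient contributions cancel, rather than being trained out. It is the same gross oversimplification the footnote admits to, just written down.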

Comment by Thane Ruthenis on Value systematization: how values become coherent (and misaligned) · 2024-01-12T03:07:00.586Z · LW · GW

E.g. you used to value this particular gear (which happens to be the one that moves the piston) rotating, but now you value the gear that moves the piston rotating

That seems more like value reflection, rather than a value change?

The way I'd model it is: you have some value V, whose implementations you can't inspect directly, and some guess about what it is, P(V). (That's how it often works in humans: we don't have direct knowledge of how some of our values are implemented.) Before you were introduced to the question Q of "what if we swap the gear for a different one: which one would you care about then?", your model of that value put the majority of probability mass on V_1, which was "I value this particular gear". But upon considering Q, your PD over V changed, and now it puts most probability on V_2, defined as "I care about whatever gear is moving the piston".

Importantly, that example doesn't seem to involve any changes to the object-level model of the mechanism? Just the newly-introduced possibility of switching the gear. And if your values shift in response to previously-unconsidered hypotheticals (rather than changes to the model of the actual reality), that seems to be a case of your learning about your values. Your model of your values changing, rather than them changing directly.

(Notably, that's only possible in scenarios where you don't have direct access to your values! Where they're black-boxed, and you have to infer their internals from the outside.)
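The "learning about your values" reading above can be sketched as an ordinary Bayesian update over hypotheses about a black-boxed value. The prior and likelihood numbers below are made up for illustration, and treating your introspective reaction to the hypothetical as "evidence" is an assumption of this sketch rather than something the argument strictly requires.

```python
def normalize(weights: dict) -> dict:
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}

# Hypotheses about what the inaccessible value actually is.
THIS_GEAR = "I value this particular gear"
PISTON_GEAR = "I care about whatever gear is moving the piston"

# Belief before ever considering the gear-swap hypothetical (assumed numbers).
prior = {THIS_GEAR: 0.8, PISTON_GEAR: 0.2}

# How likely your felt reaction to the hypothetical ("I'd care about the new
# piston-driving gear") is under each hypothesis (assumed numbers).
likelihood = {THIS_GEAR: 0.1, PISTON_GEAR: 0.9}

# Bayes' rule: posterior is proportional to prior times likelihood.
posterior = normalize({h: prior[h] * likelihood[h] for h in prior})

for hypothesis, probability in posterior.items():
    print(f"{probability:.3f}  {hypothesis}")
```

Nothing in the object-level model of the mechanism changes here; only the distribution over what the value is moves, which is the distinction being drawn.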

the cached strategies could be much more complicated to specify than the original values; and they could be defined over a much smaller range of situations

Sounds right, yep. I'd argue that translating a value up the abstraction levels would almost surely lead to simpler cached strategies, though, just because higher levels are themselves simpler. See my initial arguments.

insofar as you value simplicity (which I think most agents strongly do) then you're going to systematize your values

Sure, but: the preference for simplicity needs to be strong enough to overpower the object-level values it wants to systematize, and the more it wants to shift them, the stronger it needs to be. The simplest values are no values, after all.

I suppose I see what you're getting at here, and I agree that it's a real dynamic. But I think it's less important/load-bearing to how agents work than the basic "value translation in a hierarchical world-model" dynamic I'd outlined. Mainly because it routes through the additional assumption of the agent having a strong preference for simplicity.

And I think it's not even particularly strong in humans? "I stopped caring about that person because they were too temperamental and hard-to-please; instead, I found a new partner who's easier to get along with" is something that definitely happens. But most instances of value extrapolation aren't like this.

Comment by Thane Ruthenis on Value systematization: how values become coherent (and misaligned) · 2024-01-11T19:10:10.005Z · LW · GW

Let me list some ways in which it could change:

If I recall correctly, the hypothetical under consideration here involved an agent with an already-perfect world-model, and we were discussing how value translation up the abstraction levels would work in it. That artificial setting was meant to disentangle the "value translation" phenomenon from the "ontology crisis" phenomenon.

Shifts in the agent's model of what counts as "a gear" or "spinning" violate that hypothetical. And I think they do fall under the purview of ontology-crisis navigation.

Can you construct an example where the value over something would change to be simpler/more systemic, but in which the change isn't forced on the agent downstream of some epistemic updates to its model of what it values? Just as a side-effect of it putting the value/the gear into the context of a broader/higher-abstraction model (e. g., the gear's role in the whole mechanism)?

I agree that there are some very interesting and tricky dynamics underlying even very subtle ontology breakdowns. But I think that's a separate topic. I think that, if you have some value V, and it doesn't run into direct conflict with any other values you have, and your model of V isn't wrong at the abstraction level it's defined at, you'll never want to change V.

You might realize that your mental pointer to the gear you care about identified it in terms of its function not its physical position

That's the closest example, but it seems to be just an epistemic mistake? Your value is well-defined over "the gear that was driving the piston". After you learn it's a different gear from the one you thought, that value isn't updated: you just naturally shift it to the real gear.

Plainer example: Suppose you have two bank account numbers at hand, A and B. One belongs to your friend, another to a stranger. You want to wire some money to your friend, and you think A is their account number. You prepare to send the money... but then you realize that was a mistake, and actually your friend's number is B, so you send the money there. That didn't involve any value-related shift.

I'll try again to make the human example work. Suppose you love your friend, and your model of their personality is accurate – your model of what you value is correct at the abstraction level at which "individual humans" are defined. However, there are also:

  1. Some higher-level dynamics you're not accounting for, like the impact your friend's job has on the society.
  2. Some lower-level dynamics you're unaware of, like the way your friend's mind is implemented at the levels of cells and atoms.

My claim is that, unless you have terminal preferences over those other levels, then learning to model these higher- and lower-level dynamics would have no impact on the shape of your love for your friend.

Granted, that's an unrealistic scenario. You likely have some opinions on social politics, and if you learned that your friend's job is net-harmful at the societal level, that'll surely impact your opinion of them. Or you might have conflicting same-level preferences, like caring about specific other people, and learning about these higher-level societal dynamics would make it clear to you that your friend's job is hurting them. Less realistically, you may have some preferences over cells, and you may want to... convince your friend to change their diet so that their cellular composition is more in-line with your aesthetic, or something weird like that.

But if that isn't the case – if your value is defined over an accurate abstraction and there are no other conflicting preferences at play – then the mere fact of putting it into a lower- or higher-level context won't change it.

Much like you'll never change your preferences over a gear's rotation if your model of the mechanism at the level of gears was accurate – even if you were failing to model the whole mechanism's functionality or that gear's atomic composition.

(I agree that it's a pretty contrived setup, but I think it's very valuable to tease out the specific phenomena at play – and I think "value translation" and "value conflict resolution" and "ontology crises" are highly distinct, and your model somewhat muddles them up.)

  1. ^

    Although there may be higher-level dynamics you're not tracking, or lower-level confusions. See the friend example below.

Comment by Thane Ruthenis on Current AIs Provide Nearly No Data Relevant to AGI Alignment · 2024-01-04T09:45:23.042Z · LW · GW

No, I am in fact quite worried about the situation

Fair, sorry. I appear to have been arguing with my model of someone holding your general position, rather than with my model of you.

I think these AGIs won't be within-forward-pass deceptively aligned, and instead their agency will eg come from scaffolding-like structures

Would you outline your full argument for this and the reasoning/evidence backing that argument?

To restate: My claim is that, no matter how much empirical evidence we have regarding LLMs' internals, until we have either an AGI we've empirically studied or a formal theory of AGI cognition, we cannot say whether shard-theory-like or classical-agent-like views on it will turn out to have been correct. Arguably, both sides of the debate have about the same amount of evidence: generalizations from maybe-valid maybe-not reference classes (humans vs. LLMs) and ambitious but non-rigorous mechanical theories of cognition (the shard theory vs. coherence theorems and their ilk stitched into something like my model).

Would you disagree? If yes, how so?

Comment by Thane Ruthenis on Natural Latents: The Math · 2024-01-02T11:49:50.037Z · LW · GW

Also, what do you mean by mutual information between X_i, given that there are at least 3 of them?

You can generalize mutual information to N variables: interaction information.
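As a minimal sketch of what that generalization looks like in practice, the snippet below computes interaction information by inclusion-exclusion over subset entropies. It assumes the co-information sign convention, under which the two-variable case reduces to ordinary mutual information (the opposite sign convention also appears in the literature), and it treats the listed samples as an equally weighted empirical distribution. The function names are my own.

```python
from collections import Counter
from itertools import combinations, product
from math import log2

def entropy(samples, idx):
    """Shannon entropy in bits of the marginal over the coordinates in idx."""
    counts = Counter(tuple(sample[i] for i in idx) for sample in samples)
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in counts.values())

def interaction_information(samples):
    """Co-information: alternating sum of entropies over all non-empty subsets."""
    n_vars = len(samples[0])
    total = 0.0
    for k in range(1, n_vars + 1):
        for idx in combinations(range(n_vars), k):
            total += (-1) ** (k + 1) * entropy(samples, idx)
    return total

# Three variables: two fair bits and their XOR.
xor_samples = [(x, y, x ^ y) for x, y in product((0, 1), repeat=2)]
# Two variables: a fair bit and an exact copy of it.
copy_samples = [(x, x) for x in (0, 1)]

xor_info = interaction_information(xor_samples)
copy_info = interaction_information(copy_samples)
print(xor_info, copy_info)
```

For the copied bit this gives 1 bit, matching plain mutual information. For the XOR triple it gives -1 bit, a reminder that the N-variable quantity can be negative, unlike the two-variable one.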

Why would it always be possible to decompose random variables to allow for a natural latent?

Well, I suppose I overstated it a bit by saying "always"; you can certainly imagine artificial setups where the mutual information between a bunch of variables is zero. In practice, however, everything in the world is correlated with everything else, so in a real-world setting you'll likely find such a decomposition always, or almost always.

And why would just extracting said mutual information be useless? 

Well, not useless as such – it's a useful formalism – but it would basically skip everything John and David's post is describing. Crucially, it won't uniquely determine whether a specific set of objects represents a well-abstracting category.

The abstraction-finding algorithm should be able to successfully abstract over data if and only if the underlying data actually correspond to some abstraction. If it can abstract over anything, however – any arbitrary bunch of objects – then whatever it is doing, it's not finding "abstractions". It may still be useful, but it's not what we're looking for here.

Concrete example: if we feed our algorithm 1000 examples of trees, it should output the "tree" abstraction. If we feed our algorithm 200 examples each of car tires, trees, hydrogen atoms, wallpapers, and continental-philosophy papers, it shouldn't actually find some abstraction which all of these objects are instances of. But as per the everything-is-correlated argument above, they likely have non-zero mutual information, so the naive "find a decomposition for which there's a natural latent" algorithm would still output something, when it should output nothing.

More broadly: We're looking for a "true name" of abstractions, and mutual information is sort-of related, but also clearly not precisely it.

Comment by Thane Ruthenis on Natural Latents: The Math · 2024-01-01T11:01:26.839Z · LW · GW

My take would be to split each "donut" variable  into "donut size"  and "donut flavour" . Then there is a natural latent for the whole  set of variables, and no natural latent for the whole  set.  basically becomes the "other stuff in the world"  variable relative to .

Granted, there's an issue in that we can basically do that for any set of variables , even entirely unrelated ones: deliberately search for some decomposition of  into an  and an  such that there's a natural latent for . I think some more practical measures could be taken into account here, though, to ensure that the abstractions we find are useful. For example, we can check the relative information contents/entropies of  and , thereby measuring "how much" of the initial variable-set we're abstracting over. If it's too little, that's not a useful abstraction.[1]

That passes my common-sense check, at least. It's essentially how we're able to decompose and group objects along many different dimensions. We can focus on objects' geometry (and therefore group all sphere-like objects, from billiard balls to planets to weather balloons) or their material (grouping all objects made out of rock) or their origin (grouping all man-made objects), etc.

Each grouping then corresponds to an abstraction, with its own generally-applicable properties. E. g., deriving a "sphere" abstraction lets us discover properties like "volume as a function of radius", and then we can usefully apply that to any spherical object we discover. Similarly, man-made objects tend to have a purpose/function (unlike natural ones), which likewise lets us usefully reason about that whole category in the abstract.

(Edit: On second thoughts, I think the obvious naive way of doing that just results in  containing all mutual information between , with the "abstraction" then just being said mutual information. Which doesn't seem very useful. I still think there's something in that direction, but probably not exactly this.)

  1. ^

    Relevant: Finite Factored Sets, which IIRC offer some machinery for these sorts of decompositions of variables.

Comment by Thane Ruthenis on The Plan - 2023 Version · 2023-12-31T00:23:02.332Z · LW · GW

Yeah, I guess that block was about more concrete issues with the "humans rate things" setup? And what I've outlined is more of a... mirror of it?

Here's a different example. Imagine feeding the AI a dataset consisting of a bunch of ethical dilemmas, and thumbing it up every time it does something "good" according to you. Your goal is to grow something which cares about human flourishing, maybe a consequentialist utilitarian, and you think that's the way to go. But your deontology is actually very flawed, so what you grow is a bullet-biting evil deontologist. I think that's analogous to the human raters setup, right?

And then the equal-and-opposite failure mode is if you're feeding the AI some ethics dataset in an attempt to teach it deontological injunctions, but it actually distills them into "consequentialistic utilitarianism", in a surprising and upsetting-to-you manner.

Comment by Thane Ruthenis on The Plan - 2023 Version · 2023-12-30T23:51:00.321Z · LW · GW

I have a different example in mind, from the one John provided. @johnswentworth, do mention if I'm misunderstanding what you're getting at there.

Suppose you train your AI to show respect to your ancestors. Your understanding of what this involves contains things like "preserve accurate history" and "teach the next generations about the ancestors' deeds" and "pray to the ancestors daily" and "ritually consult the ancestors before making big decisions".

  • In the standard reward-misspecification setup, the AI doesn't actually internalize the intended goal of "respect the ancestors". Instead, it grows a bunch of values about the upstream correlates of that, like "preserving accurate history" and "doing elaborate ritual dances" (or, more realistically, some completely alien variants of this). It starts to care about the correlates terminally. Then it tiles the universe with dancing books or something, with no "ancestors" mentioned anywhere in them.
  • In the "unexpected generalization" setup, the AI does end up caring about the ancestors directly. But as it learns more about the world, more than you, its ontology is updated, and it discovers that, why, actually spirits aren't real and "praying to" and "consulting" the ancestors are just arbitrary behaviors that don't have anything in particular to do with keeping the ancestors happy and respected. So the AI keeps on telling accurate histories and teaching them, but entirely drops the ritualistic elements of your culture.

But what if actually, what you cared about was preserving your culture? Rituals included, even if you learn that they don't do anything, because you still want them for the aesthetic/cultural connection?

Well, then you're out of luck. You thought you knew what you wanted, but your lack of knowledge of the structure of the domain in which you operated foiled you. And the AI doesn't care; it was taught to respect the ancestors, not be corrigible to your shifting opinions.

It's similar to the original post's example of using "zero correlation" as a proxy for "zero mutual information" to minimize information leaks. You think you know what your target is, but you don't actually know its True Name, so even optimizing for your actual not-Goodharted best understanding of it still leads to unintended outcomes.
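As a toy illustration of that gap, here's a minimal sketch (the variable setup is an assumption chosen for simplicity, not something from the original post): take X uniform on {-1, 0, 1} and Y = X². The correlation is exactly zero, yet Y is fully determined by X, so the mutual information is well above zero. Optimizing the proxy to zero would leave the leak in place.

```python
import math
from collections import Counter

def correlation(pairs):
    """Pearson correlation over a list of equally likely (x, y) outcomes."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x, _ in pairs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for _, y in pairs) / n)
    return cov / (std_x * std_y)

def mutual_information_bits(pairs):
    """I(X;Y) in bits over a list of equally likely (x, y) outcomes."""
    n = len(pairs)
    joint = Counter(pairs)
    marg_x = Counter(x for x, _ in pairs)
    marg_y = Counter(y for _, y in pairs)
    return sum(
        (count / n) * math.log2((count / n) / ((marg_x[x] / n) * (marg_y[y] / n)))
        for (x, y), count in joint.items()
    )

# Hypothetical toy distribution: X uniform on {-1, 0, 1}, Y = X^2.
pairs = [(x, x * x) for x in (-1, 0, 1)]

rho = correlation(pairs)
mi = mutual_information_bits(pairs)
print(f"correlation = {rho:.3f}, mutual information = {mi:.3f} bits")
```

Here the mutual information equals the entropy of Y, log2(3) - 2/3 bits, even though the linear proxy reports no relationship at all.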

"The AI starts to care about making humans rate its actions as good" is a particularly extreme example of it: where whatever concept the humans care about is so confused there's nothing in reality outside their minds that it corresponds to, so there's nothing for the AI to latch onto except the raters themselves.

Comment by Thane Ruthenis on The Plan - 2023 Version · 2023-12-30T01:19:39.864Z · LW · GW

Excellent breakdown of the relevant factors at play.

You Don’t Get To Choose The Problem Factorization

But what if you need to work on a problem you don't understand anyway?

That creates Spaghetti Towers: vast constructs of ad-hoc bug-fixes and tweaks built on top of bug-fixes and tweaks. Software-gore databases, Kafkaesque-horror bureaucracies, legislation you need a law degree to suffer through, confused mental models; and also, biological systems built by evolution, and neural networks trained by SGD.

That's what necessarily, convergently happens every time you plunge into a domain you're unfamiliar with. You constantly have to tweak your system momentarily, to address new minor problems you run into, which reflects new bits of the domain structure you've learned.

Much like biology, the end result initially looks like an incomprehensible arbitrary mess, to anyone not intimately familiar with it. Much like biology, it's not actually a mess. Inasmuch as the spaghetti tower actually performs well in the domain it's deployed in, it necessarily comes to reflect that domain's structure within itself. So if you look at it through the right lens – like that of a programmer who's intimately familiar with their own nightmarish database – you'd actually be able to see that structure and efficiently navigate it.

Which suggests a way to ameliorate this problem: periodic refactoring. Every N time-steps, set some time aside for re-evaluating the construct you've created in the context of your current understanding of the domain, and re-factorize it along the lines that make sense to you now.

That centrally applies to code, yes, but also to your real-life projects, and your literal mental ontologies/models. Always make sure to simplify and distill them. Hunt down snippets of redundant code and unify them into one function.
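For the code case, a minimal sketch of what "unify them into one function" can look like. All names here are hypothetical, invented purely for illustration: two helpers that accreted separately turn out to be the same computation, and the refactoring pass collapses them.

```python
# Before: two hypothetical helpers, written at different times for different
# call sites, that ended up implementing the same logic.
def mean_latency_ms(samples):
    total = 0.0
    for s in samples:
        total += s
    return total / len(samples)

def mean_payload_kb(sizes):
    total = 0.0
    for s in sizes:
        total += s
    return total / len(sizes)

# After: one function reflecting the structure both were reaching for.
def mean(values):
    """Arithmetic mean of a non-empty iterable of numbers."""
    values = list(values)
    if not values:
        raise ValueError("mean() needs at least one value")
    return sum(values) / len(values)

# The unified helper agrees with both originals on the same inputs.
latencies = [12.0, 15.0, 9.0]
payloads = [4.0, 8.0]
print(mean(latencies), mean(payloads))
```

The point of the periodic pass is less the saved lines and more that the unified version states the domain structure (here, "these are both just averages") explicitly.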

I. e.: When working on a problem you don't understand, make sure to iterate on the problem factorization.

Comment by Thane Ruthenis on johnswentworth's Shortform · 2023-12-29T21:33:34.680Z · LW · GW

I don't necessarily see how that prevents the AI from destroying the button and murdering its operators (and then everyone else). Neither subagent inherently cares; both assume it already has been pressed or hasn't been. Which, yes, means neither will deliberately plot to interact with it for reasons downstream of this setup; but also, neither will care to protect it from being destroyed as the side-effect of some other activity.

And if u1 and u2 were perfectly aligned, that'd be fine. But the reason we're using the corrigibility setup to begin with is because we're not confident in them, right? So it needs to work in situations where utility functions, u1 in particular, recommend bad stuff.

And in that case, either Subagent 2 will have enough bargaining power to prevent Subagent 1 from ever doing anything, or it won't, meaning Subagent 1 will be able to have some influence on the world. At which point the planet-destroying death rays argument applies: the most minuscule amount of superintelligent optimization not carefully shaped to avoid killing everyone will kill everyone.

Like, imagine that Subagent 1 is straight-up omnicidal out of the gates, and Subagent 2 is perfectly aligned to the shutdown task. That's a situation in which this setup should work, right? And what we'd want is for Subagent 2 to dominate completely. But Subagent 2's bargaining power doesn't scale with Subagent 1's misalignment, so if the setup doesn't always give it complete veto power, Subagent 1 would be allowed to do small stuff like "release the hyperviral worship-the-machine-god memagent" (as bargained down by Subagent 2 from its preferred "release the grey goo").

Maybe I'm miscalibrated as to how you're imagining the counterfactuals to work here. But the way I see it, even if the AI aims to take actions that do similarly well in both worlds, that's not necessarily survivable for us? Especially if they disagree so badly they have to compromise on something that both of them hate (importantly including Subagent 2!).

(Like both settling on only ever taking over 50% of the universal negentropy while leaving the other 50% causally uninfluenced, or only ever using 50% of the causal influence they can bring to bear while wiping out humanity, or whatever "do 50% of immediately shutting down" shakes out to mean by u2's terms.)

Another issue I see is implementational, so maybe not what you're looking for. But: how are we keeping these "subagents" trapped as being part of a singular agent? Rather than hacking their way out into becoming separate agents and going to war with each other, or neatly tiling exactly 50% of the cosmos with their preferred squiggles, or stuff like that? How is the scenario made meaningfully different from "we deploy two AIs simultaneously: one tasked with building a utopia-best-we-could-define-it, and another tasked with foiling all of the first AI's plans", with all the standard problems with multi-AI setups?

... Overall, ironically, this kind of has the vibe of Godzilla Strategies? Which is the main reason I'm immediately skeptical of it.

Comment by Thane Ruthenis on Value systematization: how values become coherent (and misaligned) · 2023-12-28T22:49:19.973Z · LW · GW

Yeah, I'm familiar with that view on Friston, and I shared it for a while. But it seems there's a place for that stuff after all. Even if the initial switch to viewing things probabilistically is mathematically vacuous, it can still be useful: if viewing cognition in that framework makes it easier to think about (and thus theorize about).

Much like changing coordinates from Cartesian to polar is "vacuous" in some sense, but makes certain problems dramatically more straightforward to think through.
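A standard worked example of that, offered as an illustration rather than anything specific to the Friston discussion: the area of a disk of radius $R$. Both computations give the same number, but the Cartesian one needs a trigonometric substitution while the polar one is nearly immediate.

```latex
% Cartesian: integrate the chord length across the disk.
A = \int_{-R}^{R} 2\sqrt{R^2 - x^2}\,dx
  = \int_{-\pi/2}^{\pi/2} 2R^2\cos^2\theta\,d\theta
  = \pi R^2
  \qquad (x = R\sin\theta)

% Polar: the region is a rectangle in (r, \theta), with area element r\,dr\,d\theta.
A = \int_{0}^{2\pi}\!\int_{0}^{R} r\,dr\,d\theta
  = 2\pi \cdot \frac{R^2}{2}
  = \pi R^2
```

Nothing new is asserted by the change of variables; it only makes the structure of the problem line up with the coordinates.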

Comment by Thane Ruthenis on Idealized Agents Are Approximate Causal Mirrors (+ Radical Optimism on Agent Foundations) · 2023-12-27T01:05:39.526Z · LW · GW

Although interestingly geometric EU-maximising is actually equivalent to minimising H(u,p)/making the real distribution similar to the target

Mind elaborating on that? I'd played around with geometric EU maximization, but haven't gotten a result this clean.

Comment by Thane Ruthenis on Current AIs Provide Nearly No Data Relevant to AGI Alignment · 2023-12-26T22:41:46.221Z · LW · GW

If any of the others are particularly enthusiastic about this and expect it to be high-value, sure!

That said, I personally don't expect it to be particularly productive.

  • These sorts of long-standing disagreements haven't historically been resolvable via debate (the failure of Hanson vs. Yudkowsky is kind of foundational to the field).
  • I think there's great value in having a public discussion nonetheless, but I think it's in informing the readers' models of what different sides believe.
  • Thus, inasmuch as we're having a public discussion, I think it should be optimized for thoroughly laying out one's points to the audience.
  • However, dialogues-as-a-feature seem to be more valuable to the participants, and are actually harder to grok for readers.
  • Thus, my preferred method for discussing this sort of stuff is to exchange top-level posts trying to refute each other (the way this post is, to a significant extent, a response to the AI is easy to control article), and then maybe argue a bit in the comments. But not to have a giant tedious top-level argument.

I'd actually been planning to make a post about the difficulties the "classical alignment views" have with making empirical predictions, and I guess I can prioritize it more?

But I'm overall pretty burned out on this sort of arguing. (And arguing about "what would count as empirical evidence for you?" generally feels like too-meta fake work, compared to just going out and trying to directly dredge up some evidence.)

Comment by Thane Ruthenis on Current AIs Provide Nearly No Data Relevant to AGI Alignment · 2023-12-26T22:16:29.294Z · LW · GW

Not sure what the relevance is? I don't believe that "we possess innate (and presumably God-given) concepts that are independent of the senses", to be clear. "Children won't be able to instantly understand how to parse a new sense and map its feedback to the sensory modalities they've previously been familiar with, but they'll grok it really fast with just a few examples" was my instant prediction upon reading the titular question.

Comment by Thane Ruthenis on Current AIs Provide Nearly No Data Relevant to AGI Alignment · 2023-12-26T21:15:18.327Z · LW · GW

Yeah, but if you generalize from humans another way ("they tend not to destroy the world and tend to care about other humans"), you'll come to a wildly different conclusion

Sure. I mean, that seems like a meaningfully weaker generalization, but sure. That's not the main issue.

Here's what the whole situation looks like from my perspective:

  • We don't know how generally-intelligent entities like humans work, what the general-intelligence capability is entangled with.
  • Our only reference point is humans. Humans exhibit a lot of dangerous properties, like deceptiveness and consequentialist-like reasoning that seems to be able to disregard contextually-learned values.
  • There are some gears-level models that suggest intelligence is necessarily entangled with deception-ability (e. g., mine), and some gears-level models that suggest it's not (e. g., yours). Overall, we have no definitive evidence either way. We have not reverse-engineered any generally-intelligent entities.
  • We have some insight into how SOTA AIs work. But SOTA AIs are not generally intelligent. Whatever safety assurances our insights into SOTA AIs give us, do not necessarily generalize to AGI.
  • SOTA AIs are, nevertheless, superhuman at some tasks at which we've managed to get them working so far. By volume, GPT-4 can outperform teams of coders, and Midjourney is putting artists out of business. The hallucinations are a problem, but if they were gone, these models would plausibly wipe out whole industries.
  • An AI that outperforms humans at deception and strategy by the same margin as GPT-4/Midjourney outperform them at writing/coding/drawing would plausibly be an extinction-level threat.
  • The AI industry leaders are purposefully trying to build a generally-intelligent AI.
  • The AI industry leaders are not rigorously checking every architectural tweak or cute AutoGPT setup to ensure that it's not going to give their model room to develop deceptive alignment and other human-like issues.
  • Summing up: There's reasonable doubt regarding whether AGIs would necessarily be deception-capable. Highly deception-capable AGIs would plausibly be an extinction risk. The AI industry is currently trying to blindly-but-purposefully wander in the direction of AGI.
    • Even shorter: There's a plausible case that, on its current course, the AI industry is going to generate an extinction-capable AI model.
    • There are no ironclad arguments against that, unless you buy into your inside-view model of generally-intelligent cognition as hard as I buy into mine.
  • And what you effectively seem to be saying is "until you can rigorously prove that AGIs are going to develop dangerous extinction-level capabilities, it is totally fine to continue blindly scaling and tinkering with architectures".
  • What I'm saying is "until you can rigorously prove that a given scale-up plus architectural tweak isn't going to result in a superhuman extinction-enthusiastic AGI, you should not be allowed to test that empirically".

Yes, "prove that this technological advance isn't going to kill us all or you're not allowed to do it" is a ridiculous standard to apply in the general case. But in this one case, there's a plausible-enough argument that it might, and that argument has not actually been soundly refuted by our getting some insight into how LLMs work and coming up with a theory of their cognition.

Comment by Thane Ruthenis on The problems with the concept of an infohazard as used by the LW community [Linkpost] · 2023-12-24T19:14:23.743Z · LW · GW

Probably part of the difference is that, in the case of the transistor, there clearly was a problem there waiting to be solved, and multiple groups worked on that problem

Yeah, I think that's really the crux there. Whether the problem is legible enough for there to be a way to reliably specify it to anyone with the relevant background knowledge, vs. so vague and hazy you need unusual intuition to even suspect that there may be something in that direction.

Comment by Thane Ruthenis on A Crisper Explanation of Simulacrum Levels · 2023-12-24T13:33:12.578Z · LW · GW

Mm, not quite, they both have no true allegiance. The difference is that an L3 agent wants a person's socially-perceived allegiance to be consistent with their signaling pattern – they want the society to view someone as belonging to the group to which they've most strongly signaled belonging. They care about the "truth" of this.

Hence we get people dredging up someone having sent a wrong signal in order to "cancel" them, hence we get some reluctance to outright lie/fabricate evidence of wrong signaling, hence we get movements trying to "claim" someone as belonging to them because of things they said, hence we get a "war on knowledge" because not knowing what a signal means somewhat excuses you for having sent a wrong one, et cetera.

L3 agents genuinely care about people being accurately sorted. They care about coherency of people's social images. Not about what is physically true or what people genuinely believe, but what they've historically signaled they believe.

Conversely, L4 agents are shameless. They don't care about maintaining a consistent persona. They can say one thing today and a diametrically opposed thing tomorrow. And unless you can stage the reveal of this information as an attack – unless you can immediately follow it up with some sort of play that gives you power over them – it won't move them at all.

Comment by Thane Ruthenis on Tamsin Leake's Shortform · 2023-12-24T01:13:57.194Z · LW · GW

I feel like there's enough unknowns making this scenario plausible here

No argument on that.

I don't find it particularly surprising that {have lost a loved one they wanna resurrect} ∩ {take the singularity and the possibility of resurrection seriously} ∩ {would mention this} is empty, though:

  • "Resurrection is information-theoretically possible" is a longer leap than "believes an unconditional pro-humanity utopia is possible", which is itself a bigger leap than just "takes singularity seriously". E. g., there's a standard-ish counter-argument to "resurrection is possible" which naively assumes a combinatorial explosion of possible human minds consistent with a given behavior. Thinking past it requires some additional less-common insights.
  • "Would mention this" is downgraded by it being an extremely weakness/vulnerability-revealing motivation. Much more so than just "I want an awesome future".
  • "Would mention this" is downgraded by... You know how people who want immortality get bombarded with pop-culture platitudes about accepting death? Well, as per above, immortality is dramatically more plausible-sounding than resurrection, and it's not as vulnerable-to-mention a motivation. Yet talking about it is still not a great idea in a "respectable" company. Goes double for resurrection.

Comment by Thane Ruthenis on Tamsin Leake's Shortform · 2023-12-24T00:41:55.887Z · LW · GW

and if the future has enough negentropy to resimulate the past. (That last point is a new source of doubt for me; I kinda just assumed it was true until a friend told me it might not be.)

Yeah, I don't know about this one either.

Even if possible, it might be incredibly wasteful, in terms of how much negentropy (= future prosperity for new people) we'll need to burn in order to rescue one person. And then the more we rescue, the less value we get out of that as well, since burning negentropy will reduce their extended lifespans too. So we'd need to assign greater (dramatically greater?) value to extending the life of someone who'd previously existed, compared to letting a new person live for the same length of time.

"Lossy resurrection" seems like a more negentropy-efficient way of handling that, by the same token as acausal norms likely being a better way to handle acausal trade than low-level simulations and babble-and-prune not being the most efficient way of doing general-purpose search.

Like, the full-history resimulation will surely still not allow you to narrow things down to one branch. You'd get an equivalence class of them, each of them consistent with all available information. Which, in turn, would correspond to a probability distribution over the rescuee's mind; not a unique pick.

Given that, it seems plausible that there's some method by which we can get to the same end result – constrain the PD over the rescuee's mind by as much as the data available to us can let us – without actually running the full simulation.

Depends on what the space of human minds looks like, I suppose. Whether it's actually much lower-dimensional than a naive analysis of possible brain-states suggests.

Comment by Thane Ruthenis on A Crisper Explanation of Simulacrum Levels · 2023-12-24T00:16:28.421Z · LW · GW

I think your post is necessarily on level 4 😉

Yeah, I did consider a more ambitious version of this post, which would've included a bunch of examples and case studies. One of them would've been "why did I make this post?", with the motivational range of:

  • L0: "I want to marginally drive up the costs of running LW for some reason, so I'm uploading stuff to it."
  • L1: "I believe it's true and it'll improve others' ability to model the world."
  • L2: "I believe it'll have good-according-to-me consequences if other people thought in this framework."
  • L3: "This is an attempt to ingratiate myself with the people who believe in this framework."
  • L4: "Tomorrow I'll be speaking to someone I want to socially dominate, they or someone else who'd be present reads LW, and if the topic of Simulacra Levels is fresh in their mind, it'll marginally improve my social positioning."

There is an analogous reasoning in the (very short) essay What You (Want to)* Want by Paul Graham

Oh, nice! I don't feel like I properly grok the recursion-limit thing and the cases when it applies, so it's nice to get additional perspectives on it.

Comment by Thane Ruthenis on Tamsin Leake's Shortform · 2023-12-23T22:10:34.980Z · LW · GW

Do you have any (toy) math arguing that it's information-theoretically possible?

I currently consider it plausible that yeah, actually, for any person X who still exists in cultural memory (let alone living memory, let alone if they lived recently enough to leave a digital footprint), the set of theoretically-possible psychologically-human minds whose behavior would be consistent with X's recorded behavior is small enough that none of the combinatorial-explosion arguments apply, so you can just generate all of them and thereby effectively resurrect X.

But you sound more certain than that. What's the reasoning?

Comment by Thane Ruthenis on Idealized Agents Are Approximate Causal Mirrors (+ Radical Optimism on Agent Foundations) · 2023-12-23T12:08:48.607Z · LW · GW

Yeah, I'd looked at computer graphics myself. I expect that field does have some generalizable lessons.

Great addition regarding diffusion planning.

Comment by Thane Ruthenis on Idealized Agents Are Approximate Causal Mirrors (+ Radical Optimism on Agent Foundations) · 2023-12-23T12:05:38.956Z · LW · GW

I don't think there's a great deal that cryptography can teach agent fundamentals, but I do think there's some overlap

Yup! Cryptography actually was the main thing I was thinking about there. And there's indeed some relation. For example, it appears that P ≠ NP is because our universe's baseline "forward-pass functions" are just poorly suited for being composed into functions solving certain problems. The environment doesn't calculate those; all of those are in P.

However, the inversion of the universe's forward passes can be NP-complete functions. Hence a lot of difficulties.

~2030 seems pretty late for getting this figured out: we may well need to solve some rather specific and urgent practicalities by somewhere around then

2030 is the target for having completed the "hire a horde of mathematicians and engineers and blow the problem wide open" step, to be clear. I don't expect the theoretical difficulties to take quite so long.

Can you tell me what is the hard part in formalizing the following:

Usually, the hard part is finding a way to connect abstract agency frameworks to reality. As in: here you have your framework, here's the Pile, now write some code to make them interface with each other.

Specifically in this case, the problems are:

an efficient approximately Bayesian approach

What approach specifically? The agent would need to take in the Pile, and regurgitate some efficient well-formatted hierarchical world-model over which it can do search. What's the algorithm for this?

It understands (with some current uncertainty) what preference ordering the humans each have 

How do you make it not just understand that, but care about that? How do you interface with the world-model it learned, and point at what the humans care about?

Comment by Thane Ruthenis on The problems with the concept of an infohazard as used by the LW community [Linkpost] · 2023-12-23T11:44:29.535Z · LW · GW

I meant when interfacing with governments/other organizations/etc., and plausibly at later stages, when the project may require "normal" software engineers/specialists in distributed computations/lower-level employees or subcontractors.

I agree that people who don't take the matter seriously aren't going to be particularly helpful during higher-level research stages.

"manipulating people with biases/brainworms"

I don't think this is really manipulation? You're communicating an accurate understanding of the situation to them, in the manner they can parse. You're optimizing for accuracy, not for their taking specific actions that they wouldn't have taken if they understood the situation (as manipulators do).

If anything, using niche jargon would be manipulation, or willful miscommunication: inasmuch as you'd be trying to convey accurate information to them in a way you know they will misinterpret (even if you're not actively optimizing for misinterpretation).

Comment by Thane Ruthenis on Idealized Agents Are Approximate Causal Mirrors (+ Radical Optimism on Agent Foundations) · 2023-12-23T11:15:50.636Z · LW · GW

Yup. I think this might route through utility as well, though. Observations are useful because they unlock bits of optimization, and bits related to different variables could unlock both different amounts of optimization capacity, and different amounts of goal-related optimization capacity. (It's not so bad to forget a single digit of someone's phone number; it's much worse if you forgot a single letter in the password to your password manager.)

Comment by Thane Ruthenis on The problems with the concept of an infohazard as used by the LW community [Linkpost] · 2023-12-23T11:06:10.347Z · LW · GW

some important insights about how to solve alignment will be dual use

Suggestion: if you're using the framing of alignment-as-a-major-engineering-project, you can re-frame "exfohazards" as "trade secrets". That should work to make people who'd ordinarily think that the very idea of exfohazards is preposterous[1] take you seriously.

  1. ^

    As in: "Aren't you trying to grab too much status by suggesting you're smart enough to figure out something dangerous? Know your station!"

Comment by Thane Ruthenis on The problems with the concept of an infohazard as used by the LW community [Linkpost] · 2023-12-23T10:59:47.862Z · LW · GW

Arguably more important than the theory itself, especially in domains outside of mathematics

That can't be true, because the ability to apply a theory is dependent on having a theory. I mean, I suppose you can do technology development just by doing random things and seeing what works, but that tends to have slow or poor results. Theories are a bottleneck on scientific advancement.

I suppose there is some sense in which the immediate first-order effects of someone finding a great application for a theory are more impactful than that of someone figuring out the theory to begin with. But that's if we're limiting ourselves to evaluating first-order effects only, and in this case this approximation seems to directly lead to the wrong conclusion.

I think ignoring the effort of actually being able to put a theory into practice is one of the main things that I think LW gets wrong

Any specific examples? (I can certainly imagine some people doing so. I'm interested in whether you think they're really endemic to LW, or if I am doing that.)

Do you still think that the original example counts? If you agree that scientific fields have compact generators, it seems entirely natural to believe that "exfohazards" – as in, hard-to-figure-out compact ideas such that if leaked, they'd let people greatly improve capabilities just by "grunt work" – are a thing. (And I don't really think most of the people worrying about them envision themselves as Great Men? Rather than viewing themselves as "normal" researchers who may stumble into an important insight.)

Comment by Thane Ruthenis on The problems with the concept of an infohazard as used by the LW community [Linkpost] · 2023-12-22T19:33:53.212Z · LW · GW

I think that post assumes an incorrect model of scientific progress.

First: It's not about people at all, it's about ideas. And it seems much more defensible to claim that the impact ideas have on scientific problems is dominated by outliers: so-called "paradigms". Quantum mechanics, general relativity, Shannon's information theory, or the idea of applying the SGD algorithm to train very deep neural networks – all of those fields have "very compact generators" in terms of ideas.

These ideas then need to be propagated and applied, and in some sense, that takes up a "bigger" chunk of the concept-space than the compact generators themselves. Re-interpreting old physical paradigms in terms of the new theory, deriving engineering solutions and experimental setups, figuring out specific architectures and tricks for training ML models, etc. The raw information content of all of this is much higher than that of the "paradigms"/"compact generators". But that doesn't mean it's not all downstream of said generators, in a very strong sense.

Second: And this is where the "Great Man Theory" can be re-introduced, in a way more true to reality. It's true that the bulk of the work isn't done by lone geniuses. But as we've just established, the bulk of the work is relatively straightforward propagation and application of a paradigm's implications. Not necessarily trivial – you still need to be highly mathematically gifted to make progress in many cases, say – but straightforward. And also factorizable: once a paradigm is in place, it can be propagated/applied in many independent directions at once.

The generation of paradigms themselves, however, is something a relatively small group of people can accomplish, by figuring out, based on hard-to-specify post-rigorous intuitions, in which direction the theory needs to be built. And this is something that might require unusual genius/talent/skillset.

Tenuously related: On this model, I think the purported "decline of genius" – the observation that there are no Einsteins or von Neumanns today – is caused by a change in scientific-research pipelines. Previously, a lone genius trying to cause a paradigm shift needed to actually finalize the theory before it'd be accepted, and publish it all at once. Nowadays, they're wrapped up in a cocoon of collaborators from the get-go, the very first steps of a paradigm shift are published, and propagation/application steps likewise begin immediately. So there's less of a discontinuity, both in terms of how paradigm shifts happen, and to whom they're attributed.