Posts

A Crisper Explanation of Simulacrum Levels 2023-12-23T22:13:52.286Z
Idealized Agents Are Approximate Causal Mirrors (+ Radical Optimism on Agent Foundations) 2023-12-22T20:19:13.865Z
Most People Don't Realize We Have No Idea How Our AIs Work 2023-12-21T20:02:00.360Z
How Would an Utopia-Maximizer Look Like? 2023-12-20T20:01:18.079Z
Don't Share Information Exfohazardous on Others' AI-Risk Models 2023-12-19T20:09:06.244Z
The Shortest Path Between Scylla and Charybdis 2023-12-18T20:08:34.995Z
A Common-Sense Case For Mutually-Misaligned AGIs Allying Against Humans 2023-12-17T20:28:57.854Z
"Humanity vs. AGI" Will Never Look Like "Humanity vs. AGI" to Humanity 2023-12-16T20:08:39.375Z
Current AIs Provide Nearly No Data Relevant to AGI Alignment 2023-12-15T20:16:09.723Z
Hands-On Experience Is Not Magic 2023-05-27T16:57:10.531Z
A Case for the Least Forgiving Take On Alignment 2023-05-02T21:34:49.832Z
World-Model Interpretability Is All We Need 2023-01-14T19:37:14.707Z
Internal Interfaces Are a High-Priority Interpretability Target 2022-12-29T17:49:27.450Z
In Defense of Wrapper-Minds 2022-12-28T18:28:25.868Z
Accurate Models of AI Risk Are Hyperexistential Exfohazards 2022-12-25T16:50:24.817Z
Corrigibility Via Thought-Process Deference 2022-11-24T17:06:39.058Z
Value Formation: An Overarching Model 2022-11-15T17:16:19.522Z
Greed Is the Root of This Evil 2022-10-13T20:40:56.822Z
Are Generative World Models a Mesa-Optimization Risk? 2022-08-29T18:37:13.811Z
AI Risk in Terms of Unstable Nuclear Software 2022-08-26T18:49:53.726Z
Broad Picture of Human Values 2022-08-20T19:42:20.158Z
Interpretability Tools Are an Attack Channel 2022-08-17T18:47:28.404Z
Convergence Towards World-Models: A Gears-Level Model 2022-08-04T23:31:33.448Z
What Environment Properties Select Agents For World-Modeling? 2022-07-23T19:27:49.646Z
Goal Alignment Is Robust To the Sharp Left Turn 2022-07-13T20:23:58.962Z
Reframing the AI Risk 2022-07-01T18:44:32.478Z
Is This Thing Sentient, Y/N? 2022-06-20T18:37:59.380Z
The Unified Theory of Normative Ethics 2022-06-17T19:55:19.588Z
Towards Gears-Level Understanding of Agency 2022-06-16T22:00:17.165Z
Poorly-Aimed Death Rays 2022-06-11T18:29:55.430Z
Reshaping the AI Industry 2022-05-29T22:54:31.582Z
Agency As a Natural Abstraction 2022-05-13T18:02:50.308Z

Comments

Comment by Thane Ruthenis on Corrigibility = Tool-ness? · 2024-06-28T03:00:42.546Z · LW · GW

(Written while I'm at the section titled "Respecting Modularity".)

My own working definition of "corrigibility" has been something like "an AI system that obeys commands, and only produces effects through causal pathways that were white-listed by its human operators, with these properties recursively applied to its interactions with its human operators".

In a basic case, if you tell it to do something, like "copy a strawberry" or "raise the global sanity waterline", it's going to give you a step-by-step outline of what it's going to do, how these actions are going to achieve the goal, how the resultant end-state is going to be structured (the strawberry's composition, the resultant social order), and what predictable effects all of this would have (both direct effects and side-effects).

So if it's planning to build some sort of nanofactory that boils the oceans as a side-effect, or deploy Basilisk hacks that exploit some vulnerability in the human psyche to teach people stuff, it's going to list these pathways, and you'd have the chance to veto them. Then you'd get it to generate some plans that work through causal pathways you do approve of, like "normal human-like persuasion that doesn't circumvent the interface of the human mind / doesn't make the abstraction "the human mind" leak / doesn't violate the boundaries of the human psyche".

It's also going to adhere to this continuously: e. g., if it discovers a new causal pathway and realizes the plan it's currently executing has effects through it, it's going to seek urgent approval from the human operators (while somehow safely halting its plan using a procedure for this that it previously designed with its human operators, or something).

And this should somehow apply recursively. The AI should only interact with the operators through pathways they've approved of. E. g., using only "mundane" human-like ways to convey information; no deploying Basilisk hacks to force-feed them knowledge, no directly rewriting their brains with nanomachines, not even hacking their phones to be able to talk to them while they're outside the office.

(How do we get around the infinite recursion here? I have no idea, besides "hard-code some approved pathways into the initial design".)
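
(To make the structure concrete, here's a toy sketch of that propose/veto/execute loop. Purely illustrative: all the step names, pathway labels, and helpers are hypothetical placeholders, not a claim about how any of this would actually be implemented.)

```python
# Toy sketch: every step of a plan is annotated with the causal pathways it acts
# through; pathways outside the operator-approved whitelist trigger a halt and an
# explicit approval request before execution continues.
APPROVED_PATHWAYS = {"ordinary persuasion", "physical construction"}

plan = [
    {"step": "explain the proposal publicly", "pathways": {"ordinary persuasion"}},
    {"step": "build the facility", "pathways": {"physical construction"}},
    {"step": "deploy a Basilisk hack", "pathways": {"psychological exploit"}},
]

def ask_operator(step_name, new_pathways):
    # Stand-in for the human-in-the-loop check; this toy operator always vetoes.
    print(f"Step {step_name!r} acts through unapproved pathways {new_pathways}. Approve?")
    return False

for step in plan:
    unapproved = step["pathways"] - APPROVED_PATHWAYS
    if unapproved and not ask_operator(step["step"], unapproved):
        print(f"Halting before {step['step']!r}; replanning through approved pathways only.")
        break
    print(f"Executing {step['step']!r}.")
```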

And then the relevant set of "causal pathways" probably factors through the multi-level abstract structure of the environment. For any given action, there is some set of consequences that is predictable and goes into the AI's planning. This set is relatively small, and could be understood by a human. Every consequence outside this "small" set is unpredictable, and basically devolves into high-entropy noise; not even an ASI could predict the outcome. (Think this post.) And if we look at the structure of the predictable-consequences sets across time, we'd find rich symmetries, forming the aforementioned "pathways" through which subsystems/abstractions interact.

(I've now read the post.)

This seems to fit pretty well with your definition? Visibility: check, correctability: check. The "side-effects" property only partly fits – by my definition, a corrigible AI is allowed to have all sorts of side-effects, but these side-effects must be known and approved by its human operator – but I think it's gesturing at the same idea. (Real-life tools also have lots of side effects, e. g. vibration and noise pollution from industrial drills – but we try to minimize these side-effects. And inasmuch as we fail, the resultant tools are considered "bad", worse than the versions of these tools without the side-effects.)

Comment by Thane Ruthenis on Connecting the Dots: LLMs can Infer & Verbalize Latent Structure from Training Data · 2024-06-23T13:45:59.761Z · LW · GW

That was my interpretation as well.

I think it does look pretty alarming if we imagine that this scales, i. e., if these learned implicit concepts can build on each other. Which they almost definitely can.

The "single-step" case, of the SGD chiseling-in a new pattern which is a simple combination of two patterns explicitly represented in the training data, is indeed unexciting. But once that pattern is there, the SGD can chisel-in another pattern which uses the first implicit pattern as a primitive. Iterate on, and we have a tall tower of implicit patterns building on implicit patterns, none of which are present in the training data, and which can become arbitrarily more sophisticated and arbitrarily more alien than anything in the training set. And we don't even know what they are, so we can't assess their safety, and can't even train them out (because we don't know how to elicit them).

Which, well, yes: we already knew all of this was happening. But I think this paper is very useful in clearly showcasing this.

One interesting result here, I think, is that the LLM is then able to explicitly write down the definition of f(blah), despite the fact that the fine-tuning training set didn't demand anything like this. That ability – to translate the latent representation of f(blah) into humanese – appeared coincidentally, as the result of the SGD chiseling in some module for merely predicting f(blah).

Which implies some interesting things about how the representations are stored. The LLM actually "understands" what f(blah) is built out of, in a way that's accessible to its externalized monologue. That wasn't obvious to me, at least.

Comment by Thane Ruthenis on The Leopold Model: Analysis and Reactions · 2024-06-16T23:51:56.307Z · LW · GW

I believe Xi (or choose your CCP representative) would say that the ultimate goal is human flourishing

I'm very much worried that this sort of thinking is a severe case of Typical Mind Fallacy.

I think the main terminal values of the individuals constituting the CCP – and I do mean terminal, not instrumental – are the preservation of their personal status, power, and control, like the values of ~all dictatorships, and most politicians in general. Ideology is mostly just an aesthetic, a tool for internal and external propaganda/rhetoric, and a backdrop for internal status games.

There probably are some genuine shards of ideology in their minds. But I expect minuscule overlap between their at-face-value ideological messaging, and the future they'd choose to build if given unchecked power.

On the other hand, if viewed purely as an organization/institution, I expect that the CCP doesn't have coherent "values" worth talking about at all. Instead, it is best modeled as a moral-maze-like inertial bureaucracy/committee which is just replaying instinctive patterns of behavior.

I expect the actual "CCP" would be something in-between: it would intermittently act as a collection of power-hungry ideology-biased individuals, and as an inertial institution. I have no idea how this mess would actually generalize "off-distribution", as in, outside the current resource, technology, and power constraints. But I don't expect the result to be pretty.

Mind, something similar holds for the USG too, if perhaps to a lesser extent.

Comment by Thane Ruthenis on On Dwarksh’s Podcast with Leopold Aschenbrenner · 2024-06-11T22:49:25.351Z · LW · GW

Maybe they develop mind control level convincing argument and send it to key people (president, congress, NORAD, etc) or hack their iPhones and recursively down to security guards of fabs/power plants/data centers/drone factories. That may be quick enough. The point is that it is not obvious.

That's the sort of thing that'd happen, yes. As with all AI takeover scenarios, it likely wouldn't go down like this specifically, but you can be sure that the ASI, if aligned, would achieve the goal it wants to achieve/was told to achieve. (And see this post for my model of what this class of concrete scenarios would actually look like.)

Having nukes is not really a good analogy for having an aligned ASI at your disposal, as far as taking over the world is concerned. Unless your terminal value is human extinction, you can't really nuke the world into the state of your personal utopia. You can't even use nukes as leverage to threaten people into building your utopia, because: 

  1. Some people are good enough at decision theory to ignore threats.
  2. Coercing people in this way might not actually be part of your utopia.
  3. Your "power" is brittle. You only have the threat of nuclear armageddon to fall back on, and you can still be defeated by e. g. clever infiltration and sabotage, or by taking over your supply chains, etc. (If you have overwhelming, utterly loyal military power and security in full generality, that's a very different setup.)

None of those constraints apply to having an ASI at your disposal. An ASI would let you implement your values upon the cosmos fully and faithfully, and it'd give you the roadmap to getting there from here.

This is also precisely why Leopold's talk of "checks and balances" as the reason why governments could be trusted with AGI falls apart. "The government" isn't some sort of holistic entity, it's a collection of individuals with their own incentives, sometimes quite monstrous incentives. In the current regime, it's indeed checked-and-balanced to be mostly sort-of (not really) aligned to the public good. But that property is absolutely not robust to you giving unchecked power to any given subsystem in it!

I'm really quite baffled that Leopold doesn't get this, given his otherwise excellent analysis of the "authoritarianism risks" associated with aligned ASIs in the hands of private companies and the CCP. Glad to see @Zvi pointing that out.

Comment by Thane Ruthenis on My AI Model Delta Compared To Yudkowsky · 2024-06-10T21:02:21.332Z · LW · GW

We’re assuming natural abstraction basically fails, so those AI systems will have fundamentally alien internal ontologies. For purposes of this overcompressed version of the argument, we’ll assume a very extreme failure of natural abstraction, such that human concepts cannot be faithfully and robustly translated into the system’s internal ontology at all.

For context, I'm familiar with this view from the ELK report. My understanding is that this is part of the "worst-case scenario" for alignment that ARC's agenda is hoping to solve (or, at least, still hoped to solve a ~year ago).

To quote:

The paradigmatic example of an ontology mismatch is a deep change in our understanding of the physical world. For example, you might imagine humans who think about the world in terms of rigid bodies and Newtonian fluids and “complicated stuff we don’t quite understand,” while an AI thinks of the world in terms of atoms and the void. Or we might imagine humans who think in terms of the standard model of physics, while an AI understands reality as vibrations of strings. We think that this kind of deep physical mismatch is a useful mental picture, and it can be a fruitful source of simplified examples, but we don’t think it’s very likely.

We can also imagine a mismatch where AI systems use higher-level abstractions that humans lack, and are able to make predictions about observables without ever thinking about lower-level abstractions that are important to humans. For example we might imagine an AI making long-term predictions based on alien principles about memes and sociology that don’t even reference the preferences or beliefs of individual humans. Of course it is possible to translate those principles into predictions about individual humans, and indeed this AI ought to make good predictions about what individual humans say, but if the underlying ontology is very different we are at risk of learning the human simulator instead of the “real” mapping.

Overall we are by far most worried about deeply “messy” mismatches that can’t be cleanly described as higher- or lower-level abstractions, or even what a human would recognize as “abstractions” at all. We could try to tell abstract stories about what a messy mismatch might look like, or make arguments about why it may be plausible, but it seems easier to illustrate by thinking concretely about existing ML systems.

[It might involve heuristics about how to think that are intimately interwoven with object level beliefs, or dual ways of looking at familiar structures, or reasoning directly about a messy tapestry of correlations in a way that captures important regularities but lacks hierarchical structure. But most of our concern is with models that we just don’t have the language to talk about easily despite usefully reflecting reality. Our broader concern is that optimistic stories about the familiarity of AI cognition may be lacking in imagination. (We also consider those optimistic stories plausible, we just really don’t think we know enough to be confident.)]

So I understand the shape of the argument here.

... But I never got this vibe from Eliezer/MIRI. As I previously argued, I would say that their talk of different internal ontologies and alien thinking is mostly about precisely that: different cognition. The argument is that AGIs won't have "emotions", or a System 1/System 2 split, or "motivations" the way we understand them – instead, they'd have a bunch of components that fulfill the same functions these components fulfill in humans, but split and recombined in a way that has no analogues in the human mind.

Hence, it would be difficult to make AGI agents "do what we mean" – but not necessarily because there's no compact way to specify "what we mean" in the AGI's ontology, but because we'd have no idea how to specify "do this" in terms of the program flows of the AGI's cognition. Where are the emotions? Where are the goals? Where are the plans? We can identify the concept of "eudaimonia" here, but what the hell is this thought-process doing with it? Making plans about it? Refactoring it? Nothing? Is this even a thought process?

This view doesn't make arguments about the AGI's world-model specifically. It may or may not be the case that any embedded agent navigating our world would necessarily have nodes in its model approximately corresponding to "humans", "diamonds", and "the Golden Gate Bridge". This view is simply cautioning against anthropomorphizing AGIs.

Roughly speaking, imagine that any mind could be split into a world-model and "everything else": the planning module, the mesa-objective, the cached heuristics, et cetera. The MIRI view focuses on claiming that the "everything else" would be implemented in a deeply alien manner.

The MIRI view may be agnostic regarding the Natural Abstraction Hypothesis as well, yes. The world-model might also be deeply alien, and the very idea of splitting an AGI's cognition into a world-model and a planner might itself be an unrealistic artefact of our human thinking.

But even if the NAH is true, the core argument would still go through, in (my model of) the MIRI view.

And I'd say the-MIRI-view-conditioned-on-assuming-the-NAH-is-true would still have p(doom) at 90+%: because it's not optimistic regarding anyone anywhere solving the natural-abstractions problem before the blind-tinkering approach of AGI labs kills everyone.

(I'd say this is an instance of an ontology mismatch between you and the MIRI view, actually. The NAH abstraction is core to your thinking, so you factor the disagreement through that lens. But the MIRI view doesn't think in those precise terms!)

Comment by Thane Ruthenis on Natural Latents Are Not Robust To Tiny Mixtures · 2024-06-08T16:30:45.451Z · LW · GW

Another angle to consider: in this specific scenario, would realistic agents actually derive natural latents for the full distributions here, as opposed to deriving two mutually incompatible latents for the mixture components, then working with a probability distribution over those latents?

Intuitively, that's how humans operate if they have two incompatible hypotheses about some system. We don't derive some sort of "weighted-average" ontology for the system, we derive two separate ontologies and then try to distinguish between them.

This post comes to mind:

If you only care about betting odds, then feel free to average together mutually incompatible distributions reflecting mutually exclusive world-models. If you care about planning then you actually have to decide which model is right or else plan carefully for either outcome.

Like, "just blindly derive the natural latent" is clearly not the whole story about how world-models work. Maybe realistic agents have some way of spotting setups structured the way the OP is structured, and then they do something more than just deriving the latent.

Comment by Thane Ruthenis on Natural Latents Are Not Robust To Tiny Mixtures · 2024-06-08T16:23:20.336Z · LW · GW

Sure, but what I question is whether the OP shows that the type signature wouldn't be enough for realistic scenarios where we have two agents trained on somewhat different datasets. It's not clear that their datasets would be different in the same way the two distributions here are different.

Comment by Thane Ruthenis on Natural Latents Are Not Robust To Tiny Mixtures · 2024-06-08T14:04:18.288Z · LW · GW

I do see the intuitive angle of "two agents exposed to mostly-similar training sets should be expected to develop the same natural abstractions, which would allow us to translate between the ontologies of different ML models and between ML models and humans", and I see that this post illustrates how one operationalization of this idea fails.

However if there are multiple different concepts that fit the same natural latent but function very differently 

That's not quite what this post shows, I think? It's not that there are multiple concepts that fit the same natural latent, it's that if we have two distributions that are judged very close by the KL divergence, and we derive the natural latents for them, they may turn out drastically different. The two agents legitimately live in epistemically very different worlds!

Which is likely not actually the case for slightly different training sets, or LLMs' training sets vs. humans' life experiences. Those are very close on some metric, and it now seems that the right metric isn't (just) the KL divergence.

Comment by Thane Ruthenis on Natural Latents Are Not Robust To Tiny Mixtures · 2024-06-07T20:40:43.571Z · LW · GW

Coming from another direction: a 50-bit update can turn one of these distributions into the other, or vice-versa. So one thing this example shows is that natural latents, as they're currently formulated, are not necessarily robust to even relatively small updates, since 50 bits can quite dramatically change a distribution.

Are you sure this is undesired behavior? Intuitively, small updates (relative to the information-content size of the system we're updating on) can drastically change how we're modeling a particular system, and into what abstractions we decompose it. E. g., suppose we have two competing theories regarding how to predict the neural activity in the human brain, and a new paper comes out with some clever (but informationally compact) experiment that yields decisive evidence in favour of one of those theories. That's pretty similar to the setup in the post here, no? And reading this paper would lead to significant ontology shifts in the minds of the researchers who read it.

Which brings to mind How Many Bits Of Optimization Can One Bit Of Observation Unlock?, and the counter-example there...

Indeed, now that I'm thinking about it, I'm not sure the quantity "the number of bits needed for the update" is in any way interesting at all? Consider that the researchers' minds could be updated by reading the paper and examining the experimental procedure in detail (a "medium" number of bits), by looking at the raw output data and then replicating the paper (a "large" number of bits), or just by reading the names of the authors and skimming the abstract (a "small" number of bits).

There doesn't seem to be a direct causal connection between the system's size and the number of bits needed to drastically update on its structure at all? You seem to expect some sort of proportionality between the two, but I think the size of one is straight-up independent of the size of the other if you let the nature of the communication channel between the system and the agent-doing-the-updating vary freely (i. e., if you're uncertain regarding whether it's "direct observation of the system" OR "trust in science" OR "trust in the paper's authors" OR ...).[1]

Indeed, merely describing how you need to update using high-level symbolic languages, rather than by throwing raw data about the system at you, already shaves off a ton of bits, decoupling "the size of the system" from "the size of the update".
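
To put toy numbers on that (my own illustration, not anything from the OP): suppose the two competing theories from the example above start at even odds, and the paper's experiment carries a 50-bit likelihood ratio between them. Then

$$\frac{P(T_1 \mid E)}{P(T_2 \mid E)} = \frac{P(T_1)}{P(T_2)} \cdot \frac{P(E \mid T_1)}{P(E \mid T_2)} = 1 \cdot 2^{50},$$

i. e. the posterior on the losing theory drops to roughly 10^-15. Nothing in this arithmetic references how many bits it would take to describe either theory, or the system they're theories of.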

Perhaps the KL divergence really isn't the right metric to use here? The motivation for having natural abstractions in your world-model is that they make the world easier to predict for the purposes of controlling said world. So similar-enough natural abstractions would recommend the same policies for navigating that world. Back-tracking further, the distributions that would give rise to similar-enough natural abstractions would be distributions corresponding to worlds whose navigation policies are similar enough...

I. e., the distance metric would need to take interventions/the do() operator into account. Something like SID comes to mind (but not literally SID, I expect).

  1. ^

    Though there may be some more interesting claim regarding that entire channel? E. g., that if the agent can update drastically just based on a few bits output by this channel, we have to assume that the channel contains "information funnels" which compress/summarize the raw state of the system down? That these updates have to be entangled with at least however-many-bits describing the ground-truth state of the system, for them to be valid?

Comment by Thane Ruthenis on What do coherence arguments actually prove about agentic behavior? · 2024-06-04T22:46:56.108Z · LW · GW

I think the main "next piece" missing is that Eliezer basically rejects the natural abstraction hypothesis

Mu, I think. I think the MIRI view on the matter is that the internal mechanistic implementation of an AGI-trained-by-the-SGD would be some messy overcomplicated behemoth. Not a relatively simple utility-function plus world-model plus queries on it plus cached heuristics (or whatever), but a bunch of much weirder modules kludged together in a way such that their emergent dynamics result in powerful agentic behavior.[1]

The ontological problems with alignment would stem not from the fact that the AI is using alien concepts, but from its own internal dynamics being absurdly complicated and alien. It wouldn't have a well-formatted mesa-objective, for example, or "emotions", or a System 1 vs System 2 split, or explicit vs. tacit knowledge. It would have a dozen other things which fulfill the same functions that the aforementioned features of human minds fulfill in humans, but they'd be split up and recombined in entirely different ways, such that most individual modules would have no analogues in human cognition at all.

Untangling it would be a "second tier" of the interpretability problem, one that current interpretability research hasn't yet even glimpsed.

And, sure, maybe at some higher level of organization, all that complexity would be reducible to simple-ish agentic behavior. Maybe a powerful-enough pragmascope would be able to see past all that and yield us a description of the high-level implementation directly. But I don't think the MIRI view is hopeful regarding getting such tools.

Whether the NAH is or is not true doesn't really enter into it.

Could be I'm failing the ITT here, of course. But this post gives me this vibe, as does this old write-up. Choice quote[2]:

The reason why we can’t bind a description of ‘diamond’ or ‘carbon atoms’ to the hypothesis space used by AIXI or AIXI-tl is that the hypothesis space of AIXI is all Turing machines that produce binary strings, or probability distributions over the next sense bit given previous sense bits and motor input. These Turing machines could contain an unimaginably wide range of possible contents

(Example: Maybe one Turing machine that is producing good sequence predictions inside AIXI, actually does so by simulating a large universe, identifying a superintelligent civilization that evolves inside that universe, and motivating that civilization to try to intelligently predict future future bits from past bits (as provided by some intervention). To write a formal utility function that could extract the ‘amount of real diamond in the environment’ from arbitrary predictors in the above case, we’d need the function to read the Turing machine, decode that universe, find the superintelligence, decode the superintelligence’s thought processes, find the concept (if any) resembling ‘diamond’, and hope that the superintelligence had precalculated how much diamond was around in the outer universe being manipulated by AIXI.)

Obviously it's talking about AIXI, not ML models, but I assume the MIRI view has a directionally similar argument regarding them.

Or, in other words: what the MIRI view rejects isn't the NAH, but some variant of the simplicity-prior argument. It doesn't believe that the SGD would yield nicely formatted agents; that the ML training loops produce pressures shaping minds this way.[3]

  1. ^

    This powerful agentic behavior would then of course be able to streamline its own implementation, once it's powerful enough, but that's what the starting point would be – and also what we'd need to align, since once it has the extensive self-modification capabilities to streamline itself, it'd be too late to tinker with it.

  2. ^

    Although now that I'm looking at it, this post is actually a mirror of the Arbital page, which has three authors, so I'm not entirely sure this segment was written by Eliezer...

  3. ^

    Note that this also means that formally solving the Agent-Like Structure Problem wouldn't help us either. It doesn't matter how theoretically perfect embedded agents are shaped, because the agent we'd be dealing with wouldn't be shaped like this. Knowing how it's supposed to be shaped would help only marginally, at best giving us a rough idea regarding how to start untangling the internal dynamics.

Comment by Thane Ruthenis on Talent Needs of Technical AI Safety Teams · 2024-05-31T14:25:47.935Z · LW · GW

Counter-counter-argument: safety-motivated people, especially those entering at a low level, have ~zero ability to change anything for the better internally, while they could usefully contribute elsewhere; meanwhile, the presence of token safety-motivated people at OpenAI improves OpenAI's ability to safety-wash its efforts (by pointing at them and going "look how many resources we're giving them!", as was attempted with Superalignment).

Comment by Thane Ruthenis on Ilya Sutskever and Jan Leike resign from OpenAI [updated] · 2024-05-18T22:22:54.157Z · LW · GW

How were you already sure of this before the resignations actually happened?

OpenAI enthusiastically commercializing AI + the "Superalignment" approach being exactly the approach I'd expect someone doing safety-washing to pick + the November 2023 drama + the stated trillion-dollar plans to increase worldwide chip production (which are directly at odds with the way OpenAI previously framed its safety concerns).

Some of the preceding resignations (chiefly, Daniel Kokotajlo's) also played a role here, though I didn't update off of them much either.

Comment by Thane Ruthenis on Ilya Sutskever and Jan Leike resign from OpenAI [updated] · 2024-05-18T22:10:01.022Z · LW · GW

Superalignment likely happened because (a) the safety faction (Ilya/Jan/etc.) wanted it, and (b) the Sam faction also wanted it, or tolerated it, or agreed to it due to perceived PR benefits (safety-washing), or let it happen as a result of internal negotiation/compromise, or something else, or some combination of these things.

Sure, that's basically my model as well. But if the faction (b) only cares about alignment due to perceived PR benefits or in order to appease faction (a), and faction (b) turns out to have overriding power such that it can destroy or drive out faction (a) and then curtail all the alignment efforts, I think it's fair to compress all that into "OpenAI's alignment efforts are safety-washing". If (b) has the real power within OpenAI, then OpenAI's behavior and values can be approximately rounded off to (b)'s behavior and values, and (a) is a rounding error.

If OAI as a whole was really only doing anything safety-adjacent for pure PR or virtue signaling reasons, I think its activities would have looked pretty different

Not if (b) is concerned about fortifying OpenAI against future challenges, such as hypothetical futures in which the AGI Doomsayers get their way and the government/the general public wakes up and tries to nationalize or ban AGI research. In that case, having a prepared, well-documented narrative of going above and beyond to ensure that their products are safe, well before any other parties woke up to the threat, would leave OpenAI much better positioned to retain control over its research.

(I interpret Sam Altman's behavior at Congress as evidence for this kind of longer-term thinking. He didn't try to downplay the dangers of AI, which would be easy and what someone myopically optimizing for short-term PR would do. He proactively brought up the concerns that future AI progress might awaken, getting ahead of them, thereby establishing OpenAI as taking these concerns seriously and putting himself in a position to control/manage them.)

And it's approximately what I would do, at least, if I were in charge of OpenAI and had a different model of AGI Ruin.

And this is the potential plot whose partial failure I'm currently celebrating.

Comment by Thane Ruthenis on Ilya Sutskever and Jan Leike resign from OpenAI [updated] · 2024-05-16T08:32:57.785Z · LW · GW

That's good news.

There was a brief moment, back in 2023, when OpenAI's actions made me tentatively optimistic that the company was actually taking alignment seriously, even if its model of the problem was broken.

Everything that happened since then has made it clear that this is not the case; that all these big flashy commitments like Superalignment were just safety-washing and virtue signaling. They were only going to do alignment work inasmuch as that didn't interfere with racing full-speed towards greater capabilities.

So these resignations don't negatively impact my p(doom) in the obvious way. The alignment people at OpenAI were already powerless to do anything useful regarding changing the company direction.

On the other hand, what these resignations do is showcase that fact. Inasmuch as Superalignment was a virtue-signaling move meant to paint OpenAI as caring deeply about AI Safety, having so many people working on it resign or get fired starkly signals the opposite.

And it's good to have that more in the open; it's good that OpenAI loses its pretense.

Oh, and it's also good that OpenAI is losing talented engineers, of course.

Comment by Thane Ruthenis on Why Would Belief-States Have A Fractal Structure, And Why Would That Matter For Interpretability? An Explainer · 2024-04-18T09:25:06.071Z · LW · GW

I think you're imagining that we modify the shrink-and-reposition functions each iteration, lowering their scope? I. e., that if we picked the topmost triangle for the first iteration, then in iteration two we pick one of the three sub-triangles making up the topmost triangle, rather than choosing one of the "highest-level" sub-triangles?

Something like this:

If we did it this way, then yes, we'd eventually end up jumping around an infinitesimally small area. But that's not how it works, we always pick one of the highest-level sub-triangles:

Note also that we take in the "global" coordinates of the point we shrink-and-reposition (i. e., its position within the whole triangle), rather than its "local" coordinates (i. e., position within the sub-triangle to which it was copied).
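
For concreteness, here's a minimal code sketch of the process as I understand it (my own illustration; I'm assuming the standard chaos-game setup in which each of the three maps is "shrink by half towards one of three fixed corners"):

```python
import random

# The three fixed corners of the (approximately equilateral) triangle.
CORNERS = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.866)]

def step(point):
    # Always pick one of the same three top-level shrink-and-reposition maps...
    corner = random.choice(CORNERS)
    # ...and apply it to the point's global coordinates: move halfway towards that corner.
    return ((point[0] + corner[0]) / 2, (point[1] + corner[1]) / 2)

point = (random.random(), random.random())
trajectory = []
for _ in range(100_000):
    point = step(point)
    trajectory.append(point)

# Scatter-plotting `trajectory` (e.g. with matplotlib) traces out the Sierpinski
# triangle: the point keeps jumping between the three highest-level sub-triangles
# instead of getting confined to an ever-smaller region.
```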

Here's a (slightly botched?) video explanation.

Comment by Thane Ruthenis on How does the ever-increasing use of AI in the military for the direct purpose of murdering people affect your p(doom)? · 2024-04-08T21:04:41.743Z · LW · GW

I'd say one of the main reasons is that military-AI technology isn't being optimized towards the things we're afraid of. We're concerned about generally intelligent entities capable of e. g. automated R&D and social manipulation and long-term scheming. Military-AI technology, last I checked, was mostly about teaching drones and missiles to fly straight and recognize camouflaged tanks and shoot designated targets while not shooting non-designated targets.

And while this still may result in a generally capable superintelligence in the limit (since "which targets would my commanders want me to shoot?" can be phrased as a very open-ended problem), it's not a particularly efficient way to approach this limit at all. Militaries, so far, just aren't really pushing in the directions where doom lies, while the AGI labs are doing their best to beeline there.

The proliferation of drone armies that could be easily co-opted by a hostile superintelligence... It's not that it has no impact on p(doom), but it's approximately a rounding error. A hostile superintelligence doesn't need extant drone armies; it could build its own, and co-opt humans in the meantime.

Comment by Thane Ruthenis on TurnTrout's shortform feed · 2024-03-05T10:58:39.240Z · LW · GW

I think that the key thing we want to do is predict the generalization of future neural networks.

It's not what I want to do, at least. For me, the key thing is to predict the behavior of AGI-level systems. The behavior of NNs-as-trained-today is relevant to this only inasmuch as NNs-as-trained-today will be relevant to future AGI-level systems.

My impression is that you think that pretraining+RLHF (+ maybe some light agency scaffold) is going to get us all the way there, meaning the predictive power of various abstract arguments from other domains is screened off by the inductive biases and other technical mechanistic details of pretraining+RLHF. That would mean we don't need to bring game theory, economics, computer security, distributed systems, cognitive psychology, business, or history into it – we can just look at how ML systems work and are shaped, and predict everything we want about AGI-level systems from there.

I disagree. I do not think pretraining+RLHF is getting us there. I think we currently don't know what training/design process would get us to AGI. Which means we can't make closed-form mechanistic arguments about how AGI-level systems will be shaped by this process, which means the abstract often-intuitive arguments from other fields do have relevant things to say.

And I'm not seeing a lot of ironclad arguments that favour "pretraining + RLHF is going to get us to AGI" over "pretraining + RLHF is not going to get us to AGI". The claim that e. g. shard theory generalizes to AGI is at least as tenuous as the claim that it doesn't.

Flagging that this is one of the main claims which we seem to dispute; I do not concede this point FWIW.

I'd be interested if you elaborated on that.

Comment by Thane Ruthenis on A Case for the Least Forgiving Take On Alignment · 2024-02-23T06:10:37.632Z · LW · GW

I wouldn't call Shard Theory mainstream

Fair. What would you call a "mainstream ML theory of cognition", though? Last I checked, the ML field was doing purely empirical tinkering with no overarching theory to speak of (beyond the scaling hypothesis[1]).

judging by how bad humans are at [consistent decision-making], and how much they struggle to do it, they probably weren't optimized too strongly biologically to do it. But memetically, developing ideas for consistent decision-making was probably useful, so we have software that makes use of our processing power to be better at this

Roughly agree, yeah.

But all of this is still just one piece on the Jenga tower

I kinda want to push back against this repeated characterization – I think quite a lot of my model's features are "one storey tall", actually – but it probably won't be a very productive use of either of our time. I'll get around to the "find papers empirically demonstrating various features of my model in humans" project at some point; that should be a more decent starting point for discussion.

What I want is to build non-Jenga-ish towers

Agreed. Working on it.

  1. ^

    Which, yeah, I think is false: scaling LLMs won't get you to AGI. But it's also kinda unfalsifiable using empirical methods, since you can always claim that another 10x scale-up will get you there.

Comment by Thane Ruthenis on AI #52: Oops · 2024-02-23T00:23:25.600Z · LW · GW

the model chose slightly wrong numbers

The engraving on humanity's tombstone be like.

Comment by Thane Ruthenis on A Case for the Least Forgiving Take On Alignment · 2024-02-22T19:11:21.770Z · LW · GW

The sort of thing that would change my mind: there's some widespread phenomenon in machine learning that perplexes most, but is expected according to your model

My position is that there are many widespread phenomena in human cognition that are expected according to my model, and which can only be explained by the more mainstream ML models if said models are contorted into weird shapes, or if their proponents engage in denialism of said phenomena.

Again, the drive for consistent decision-making is a good example. Common-sensically, I don't think we'd disagree that humans want their decisions to be consistent. They don't want to engage in wild mood swings, they don't want to oscillate wildly between which career they want to pursue or whom they want to marry: they want to figure out what they want and who they want to be with, and then act consistently with these goals in the long term. Even when they make allowances for changing their mind, they try to consistently optimize for making said allowances: for giving their future selves freedom/optionality/resources.

Yet it's not something e. g. the Shard Theory would naturally predict out-of-the-box, last I checked. You'd need to add structures on top of it until it basically replicates my model (which is essentially how I arrived at my model, in fact – see this historical artefact).

Comment by Thane Ruthenis on AI #51: Altman’s Ambition · 2024-02-22T00:55:09.899Z · LW · GW

I find the idea of morality being downstream from the free energy principle very interesting

I agree that there are some theoretical curiosities in the neighbourhood of the idea. Like:

  • Morality is downstream of generally intelligent minds reflecting on the heuristics/shards.
    • Which are downstream of said minds' cognitive architecture and reinforcement circuitry.
      • Which are downstream of the evolutionary dynamics.
        • Which are downstream of abiogenesis and various local environmental conditions.
          • Which are downstream of the fundamental physical laws of reality.

Thus, in theory, if we plug all of these dynamics one into another, and then simplify the resultant expression, we should actually get (a probability distribution over) the utility function that is "most natural" for this universe to generate! And the expression may indeed be relatively simple and have something to do with thermodynamics, especially if some additional simplifying assumptions are made.

That actually does seem pretty exciting to me! In an insight-porn sort of way.

Not in any sort of practical way, though[1]. All of this is screened off by the actual values actual humans actually have, and if the noise introduced at every stage of this process caused us to be aimed at goals wildly diverging from the "most natural" utility function of this universe... Well, sucks to be that utility function, I guess, but the universe screwed up installing corrigibility into us and the orthogonality thesis is unforgiving.

  1. ^

    At least, not with regards to AI Alignment or human morality. It may be useful for e. g. acausal trade/acausal normalcy: figuring out the prior for what kinds of values aliens are most likely to have, etc.[2]

  2. ^

    Or maybe for roughly figuring out what values the AGI that kills us all is likely going to have, if you've completely despaired of preventing that, and founding an apocalypse cult worshiping it. Wait a minute...

Comment by Thane Ruthenis on A Case for the Least Forgiving Take On Alignment · 2024-02-22T00:10:52.349Z · LW · GW

I'm very sympathetic to this view, but I disagree. It is based on a wealth of empirical evidence that we have: on data regarding human cognition and behavior.

I think my main problem with this is that it isn't based on anything

Hm. I wonder if I can get past this common reaction by including a bunch of references to respectable psychology/neurology/game-theory experiments, which "provide scientific evidence" that various common-sensical properties of humans are actually real? Things like fluid vs. crystallized intelligence, the g-factor, global workspace theory, situations in which humans do try to behave approximately like rational agents... There probably also are some psychology-survey results demonstrating stuff like "yes, humans do commonly report wanting to be consistent in their decision-making rather than undergoing wild mood swings and acting at odds with their own past selves", which would "provide evidence" for the hypothesis that complex minds want their utilities to be coherent.

That's actually an interesting idea! This is basically what my model is based on, after a fashion, and it makes arguments-from-introspection "legible" instead of seeming to be arbitrary philosophical navel-gazing.

Unfortunately, I didn't have this idea until a few minutes ago, so I haven't been compiling a list of "primary sources". Most of them are lost to time, so I can't compose a decent object-level response to you here. (The Wikipedia links are probably a decent starting point, but I don't expect you to trawl through all that.)

Still, that seems like a valuable project. I'll put a pin in it, maybe post a bounty for relevant papers later.

Comment by Thane Ruthenis on Current AIs Provide Nearly No Data Relevant to AGI Alignment · 2024-02-21T23:24:28.501Z · LW · GW

Do you think a car engine is in the same reference class as a car? Do you think "a car engine cannot move under its own power, so it cannot possibly hurt people outside the garage!" is a valid or a meaningful statement to make? Do you think that figuring out how to manufacture amazing car engines is entirely irrelevant to building a full car, such that you can't go from an engine to a car with relatively little additional engineering effort (putting it in a "wrapper", as it happens)?

As all analogies, this one is necessarily flawed, but I hope it gets the point across.

(Except in this case, it's not even that we've figured out how to build engines. It's more like, we have these wild teams of engineers we can capture, and we've figured out which project specifications we need to feed them in order to cause them to design and build us car engines. And we're wondering how far we are from figuring out which project specifications would cause them to build a car.)

Comment by Thane Ruthenis on More Hyphenation · 2024-02-08T01:33:49.341Z · LW · GW

I agree.

Relevant problem: how should one handle higher-order hyphenation? E. g., imagine one is talking about cost-effective measures, but has the measures' effectiveness specifically relative to marginal costs in mind. Building it up, we have "marginal-cost effectiveness", and then we want to turn that whole phrase into a compound modifier. But "marginal-cost-effective measures" looks very awkward! We've effectively hyphenated "marginal cost effectiveness" as if it had no internal hyphen: within the fully hyphenated expression, there's no way to preserve the distinction between a hyphen and a space!

It becomes especially relevant in the case of longer composite modifiers, like your "responsive-but-not-manipulative" example.

Can we fix that somehow?

One solution I've seen in the wild is to increase the length of the hyphen depending on its "degree", i. e. use an en dash in place of a hyphen. Example: "marginal-cost–effective measures". (On Windows, can be inserted by typing 0150 on the keypad while holding ALT. See methods for other platforms here.)

In practice you basically never go beyond the second-degree expressions, but there's space to expand to third-degree expressions by the use of an even-longer em dash (—, 0151 while holding ALT).

Though I expect these aren't "official" rules at all.

Comment by Thane Ruthenis on Brute Force Manufactured Consensus is Hiding the Crime of the Century · 2024-02-05T04:27:21.881Z · LW · GW

That seems to generalize to "no-one is allowed to make any claim whatsoever without consuming all of the information in the world".

Just because someone generated a vast amount of content analysing the topic does not mean you're obliged to consume it before forming your opinions. Nay, I think consuming all object-level evidence should be considered entirely sufficient (which I assume was done in this case). Other people's analyses based on the same data are basically superfluous, then.

Even less than that is needed: it seems reasonable to stop gathering evidence the moment you don't expect any additional information to overturn the conclusions you've formed (as long as you're justified in that expectation, i. e. if you have a model of the domain strong enough to have an idea regarding what sort of additional (counter)evidence may turn up and how you'd update on it).

Comment by Thane Ruthenis on Most experts believe COVID-19 was probably not a lab leak · 2024-02-03T04:02:29.568Z · LW · GW

In addition to Roko's point that this sort of opinion-falsification is often habitual rather than a strategic choice that a person could opt not to make, it also makes strategic sense to lie in such surveys.

First, the promised "anonymity" may not actually be real, or real in the relevant sense. The methodology mentions "a secure online survey system which allowed for recording the identities of participants, but did not append their survey responses to their names or any other personally identifiable information", but if your reputation is on the line, would you really trust that? Maybe there's some fine print that'd allow the survey-takers to look at the data. Maybe there'd be a data leak. Maybe there's some other unknown-unknown you're overlooking. Point is, if you give the wrong response, that information can get out somehow; and if you don't, it can't. So why risk it?

Second, they may care about what the final anonymized conclusion says. Either because the lab leak hypothesis becoming mainstream would hurt them personally (either directly, or by e. g. hurting the people they rely on for funding), or because the final conclusion ending up in favour of the lab leak would still reflect poorly on them collectively. Like, if it'd end up saying that 90% of epidemiologists believe the lab leak, and you're an epidemiologist... Well, anyone you talk to professionally will then assign 90% probability that that's what you believe. You'd be subtly probed regarding having this wrong opinion, your past and future opinions would be scrutinized for being consistent with those of someone believing the lab leak, and if the status ecosystem notices something amiss...?

But, again, none of these calculations would be strategic. They'd be habitual; these factors are just the reasons why these habits are formed.

Answering truthfully in contexts-like-this is how you lose the status games. Thus, people who navigate such games don't.

Comment by Thane Ruthenis on Could there be "natural impact regularization" or "impact regularization by default"? · 2024-01-31T12:33:16.290Z · LW · GW

I think, like a lot of things in agent foundations, this is just another consequence of natural abstractions.

The universe naturally decomposes into a hierarchy of subsystems: molecules to cells to organisms to countries. Changes in one subsystem only sparsely interact with the other subsystems, and their impact may vanish entirely at the next level up. A single cell becoming cancerous may yet be contained by the immune system, never impacting the human. A new engineering technique pioneered for a specific project may generalize to similar projects, and even change all such projects' efficiency in ways that have a macro-economic impact; but it likely will not. A different person getting elected the mayor doesn't much impact city politics in neighbouring cities, and may literally not matter at the geopolitical scale.

This applies from the planning direction too. If you have a good map of the environment, it'll decompose into the subsystems reflecting the territory-level subsystems as well. When optimizing over a specific subsystem, the interventions you're considering will naturally limit their impact to that subsystem: that's what subsystemization does, and counteracting this tendency requires deliberately staging sum-threshold attacks on the wider system, which you won't be doing.

In the Rubik's Cube example, this dynamic is a bit more abstract, but basically still applies. In a way similar to how the "maze" here kind-of decomposes into a top side and a bottom side.

A complication is that any one agent can only have so much bandwidth, which would sometimes incentivize more blunt control. I've been thinking bandwidth is probably going to become a huge area of agent foundations

I agree. I currently think "bandwidth" in terms like "what's the longest message I can 'inject' into the environment per time-step?" is what "resources" are in information-theoretic terms. See the output-side bottleneck in this formulation: resources are the action bandwidth, which is the size of the "plan" into which you have to "compress" your desired world-state if you want to "communicate" it to the environment.

really the instrumental incentive is often to search for "precise" methods of influencing the world, where one can push in a lot of information to effect narrow change

I disagree. I've given it a lot of thought (none of it published yet), but this sort of "precise influence" is something I call "inferential control". It allows you to maximize your impact given your action bottleneck, but this sort of optimization is "brittle". If something unknown-unknown happens, the plan you've injected breaks instantly and gracelessly, because the fundamental assumptions on which its functionality relied – the pathways by which it meant to implement its objective – turn out to be invalid.

It sort of naturally favours arithmetic utility maximization over geometric utility maximization. By taking actions that'd only work if your predictions and models are true, you're basically sacrificing your selves living in the timelines that you're predicting to be impossible, and distributing their resources to the timelines you expect to find yourself in.

And this applies more and more the more "optimization capacity" you're trying to push through a narrow bottleneck. E. g., if you want to change the entire state of a giant environment through a tiny action-pinhole, you'd need to do it by exploiting some sort of "snowball effect"/"butterfly effect". Your tiny initial intervention would need to exploit some environmental structures to increase its size, and do so iteratively. That takes time (for whatever notion of "time" applies). You'd need to optimize over a longer stretch of environment-state changes, and your initial predictions need to be accurate for that entire stretch, because you'd have little ability to "steer" a plan that snowballed far beyond your pinhole's ability to control.

By contrast, increasing the size of your action bottleneck is pretty much the definition of "robust" optimization, i. e. geometric utility maximization. It improves your ability to control the states of all possible worlds you may find yourself in, minimizing the need for "brittle" inferential control. It increases your adaptability, basically, letting you craft a "message" comprehensively addressing any unpredicted crisis the environment throws at you, right in the middle of it happening.

Comment by Thane Ruthenis on Aligned AI is dual use technology · 2024-01-29T01:06:37.518Z · LW · GW

Nah, I think this post is about a third component of the problem: ensuring that the solution to "what to steer at" that's actually deployed is pro-humanity. A totalitarian government successfully figuring out how to load its regime's values into the AGI has by no means failed at figuring out "what to steer at". They know what they want and how to get it. It's just that we don't like the end result.

"Being able to steer at all" is a technical problem of designing AIs, "what to steer at" is a technical problem of precisely translating intuitive human goals into a formal language, and "where is the AI actually steered" is a realpolitiks problem that this post is about.

Comment by Thane Ruthenis on A Shutdown Problem Proposal · 2024-01-25T17:23:07.667Z · LW · GW

I think the bigger problem here is what happens when the agent ends up with an idea of "what we mean/intend" which is different from what we mean/intend

Agreed; I did gesture at that in the footnote.

I think the main difficulty here is that humans store their values in a decompiled/incomplete format, and so merely pointing at what a human "means" actually still has to route through defining how we want to handle moral philosophy/value extrapolation.

E. g., suppose the AGI's operator, in a moment of excitement after they activate their AGI for the first time, tells it to distribute a cure for aging. What should the AGI do?

  1. Should it read off the surface-level momentary intent of this command, and go synthesize a cure for aging and spray it across the planet in the specific way the human is currently imagining?
  2. Should it extrapolate the human's values and execute the command the way the human would have wanted to execute it if they'd thought about it a lot, rather than the way they're envisioning it in the moment?
    • For example, perhaps the image flashing through the human's mind right now is of helicopters literally spraying the cure, but it's actually more efficient to do it using airplanes.
  3. Should it extrapolate the human's values a bit, and point out specific issues with this plan that the human might think about later (e. g. that it might trigger various geopolitical actors into rash actions), then give the human a chance to abort?
  4. Should it extrapolate the human's values a bit more, and point out issues the human might not have thought of (including teaching the human any load-bearing concepts that are new to them)?
  5. Should it extrapolate the human's values a bit more still, and teach them various better cognitive protocols for self-reflection, so that they may better evaluate whether a given plan satisfies their values?
  6. Should it extrapolate the human's values a lot, interpret the command as "maximize eudaimonia", and go do that, disregarding the specific way of how they gestured at the idea?
  7. Should it remind the human that they'd wanted to be careful with how they use the AGI, and to clarify whether they actually want to proceed with something so high-impact right out of the gates?
  8. Etc.

There are quite a lot of different ways to slice the idea. There's probably a way that corresponds to the intuitive meaning of "do what I mean", but maybe there isn't, and in any case we don't yet know what it is. (And the problem is recursive: telling it to DWIM when interpreting what "DWIM" means doesn't solve anything.)

And then, because of the general "unknown-unknown environmental structures" plus "compounding errors" problems, picking the wrong definition probably kills everyone.

Comment by Thane Ruthenis on A Shutdown Problem Proposal · 2024-01-25T15:48:27.483Z · LW · GW

I think maybe I sound naive phrasing it as "the AGI should just do what we say", as though I've wandered in off the street and am proposing a "why not just..." alignment solution

Nah, I recall your takes tend to be considerably more reasonable than that.

I agree that DWIM is probably a good target if we can specify it in a mathematically precise manner. But I don't agree that "rough knowledge of what humans tend to mean" is sufficient.

The concern is that the real world has a lot of structures that are unknown to us – fundamental physics, anthropics-like confusions regarding our place in everything-that-exists, timeless decision-theory weirdness, or highly abstract philosophical or social principles that we haven't figured out yet. 

These structures might end up immediately relevant to whatever command we give, on the AI's better model of reality, in a way entirely unpredictable to us. For it to then actually do what we mean, in those conditions, is a much taller order.

For example, maybe it starts perceiving itself to be under an acausal attack by aliens, and then decides that the most faithful way to represent our request is to blow up the planet to spite the aliens. Almost certainly not literally that[1], but you get the idea. It may perceive something completely unexpected-to-us in the environment, and then its perception of that thing would interfere with its understanding of what we meant, even on requests that seem completely tame to us. The errors would then compound, resulting in a catastrophe.

The correct definition of DWIM would of course handle that. But a flawed, only-roughly-correct one? Each command we give would be rolling the dice on dying, with IMO pretty bad odds, and scaling exponentially with the command's complexity.

Checking, or clarifying when it's uncertain about meaning, is implied in a competent agent pursuing an imperfectly known utility function

That doesn't work, though, if taken literally? I think what you're envisioning here is a solution to the hard problem of corrigibility, which – well, sure, that'd work.

  1. ^

    My money's on our understanding of what we mean by "what we mean" being hopelessly confused, and that causing problems. Unless, again, we've figured out how to specify it in a mathematically precise manner – unless we know we're not confused.

Comment by Thane Ruthenis on A Shutdown Problem Proposal · 2024-01-23T16:54:24.838Z · LW · GW

The issue is that, by default, an AGI is going to make galaxy-brained extrapolations in response to simple requests, whether you like that or not. It's simply part of figuring out what to do – translating its goals all around its world-model, propagating them up the abstraction levels, etc. Like a human's decision where to send job applications and how to word them is rooted in what career they'd like to pursue is rooted in their life goals is rooted in their understanding of where the world is heading.

To our minds, there's a natural cut-off point where that process goes from just understanding the request to engaging in alien moral philosophy. But that cut-off point isn't objective: it's based on a very complicated human prior of what counts as normal/sane and what's excessive. Mechanistically, every step from parsing the wording to solving philosophy is just a continuous extension of the previous ones.

"An AGI that just does what you tell it to" is a very specific design specification where we ensure that this galaxy-brained extrapolation process, which an AGI is definitely and convergently going to want to do, results in it concluding that it wants to faithfully execute that request.

Whether that happens because we've attained so much mastery of moral philosophy that we could predict this process' outcome from the inputs to it, or because we figured out how to cut the process short at the human-subjective point of sanity, or because we implemented some galaxy-brained scheme of our own like John's post is outlining, shouldn't matter, I think. Whatever has the best chance of working.

And I think somewhat-hacky hard-coded solutions have a better chance of working on the first try than the sort of elegant solutions you're likely envisioning. Elegant solutions require a well-developed theory of value. Hacky stopgap measures only require knowing which pieces of your software product you need to hobble. (Which isn't to say they require no theory. Certainly the current AI theory is so lacking we can't even hack together any halfway-workable stopgaps. But they provide an avenue for reducing how much theory you need, and how confident in it you need to be.)

Comment by Thane Ruthenis on A Shutdown Problem Proposal · 2024-01-22T08:29:31.982Z · LW · GW

The main thing which convinced me to start paying attention to corrigibility was: by that same argument, corrigibility is itself a part of human values. Which means that, insofar as some class of utility maximizers has trouble expressing corrigibility... that class will also have trouble expressing human values.

The way you phrase this is making me a bit skeptical. Just because something is part of human values doesn't necessarily imply that our inability to precisely specify that thing means we can't point the AI at human values at all. The intuition here would be that "human values" are themselves a specifically-formatted pointer to object-level goals, and that pointing an agent at this agent-specific "value"-type data structure (even one external to the AI) would be easier than pointing it at object-level goals directly. (DWIM being easier than hand-coding all moral philosophy.)

Which isn't to say I buy that. My current standpoint is that "human values" are too much of a mess for the aforementioned argument to go through, and that manually coding-in something like corrigibility may be indeed easier.

Still, I'm nitpicking the exact form of the argument you're presenting.[1]

  1. ^

    Although I am currently skeptical even of corrigibility's tractability. I think we'll stand a better chance of just figuring out how to "sandbox" the AGI's cognition such that it's genuinely not trying to optimize over the channels by which it's connected to the real world, and then setting it to the task of imagining the solution to alignment or to human brain uploading or whatever.

    With this setup, if we screw up the task's exact specification, it shouldn't even risk exploding the world. And "doesn't try to optimize over real-world output channels" sounds like a property for which we'll actually be able to derive hard mathematical proofs, proofs that don't route through tons of opaque-to-us environmental ambiguities. (Specifically, that'd probably require a mathematical specification of something like a Cartesian boundary.)

    (This of course assumes us having white-box access to the AI's world-model and cognition. Which we'll also need here for understanding the solutions it derives without the AI translating them into humanese – since "translate into humanese" would by itself involve optimizing over the output channel.)

    And it seems more doable than solving even the simplified corrigibility setup. At least, when I imagine hitting "run" on a supposedly-corrigible AI vs. a supposedly-sandboxed AI, the imaginary me in the latter scenario is somewhat less nervous.

Comment by Thane Ruthenis on Toward A Mathematical Framework for Computation in Superposition · 2024-01-19T07:40:58.802Z · LW · GW

Haven't read everything yet, but that seems like excellent work. In particular, I think this general research avenue is extremely well-motivated.

Figuring out how to efficiently implement computations on the substrate of NNs has always seemed like a neglected interpretability approach to me. Intuitively, there are likely some methods of encoding programs into matrix multiplications which are, as a matter of ground truth, strictly better than other encoding methods. Hence, inasmuch as what the SGD is doing is writing efficient programs on the NN substrate, it is likely doing so by making use of those better methods. And so nailing down the "principles of good programming" on the NN substrate should yield major insights regarding how naturally-grown NN circuits are shaped as well.
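As a toy instance of what I mean (my own illustrative example, not anything from the post): a single weight matrix, bias, and ReLU that exactly implement Boolean AND and OR on {0,1} inputs – a tiny "program" written directly into the weights.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

# One weight matrix encoding two logic gates at once on inputs (a, b) in {0, 1}:
# row 0 computes AND (a + b - 1 is positive only when both are 1),
# row 1 computes OR  (a + b is positive when either is 1).
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])
bias = np.array([-1.0, 0.0])

def gates(a, b):
    out = relu(W @ np.array([a, b], dtype=float) + bias)
    return (out > 0).astype(int)  # [AND, OR]

for a in (0, 1):
    for b in (0, 1):
        print(a, b, gates(a, b))  # reproduces the AND/OR truth tables
```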

This post seems to be a solid step in that direction!

Comment by Thane Ruthenis on Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training · 2024-01-15T05:13:27.209Z · LW · GW

To clarify, by "re-derive the need to be deceptive from the first principles", I didn't mean "re-invent the very concept of deception". I meant "figure out your strategic situation plus your values plus the misalignment between your values and the values the humans want you to have plus what outputs an aligned AI would have produced". All of that is a lot more computation than just "have the values the humans want, reflexively output what these values are bidding for".

Just having some heuristics for deception isn't enough. You also have to know what you're trying to protect by being deceptive, and that there's something to protect it from, and then what an effective defense would actually look like. Those all are highly contextual and sensitive to the exact situation.

And those are the steps the paper skips. It externally pre-computes the secret target goal of "I want to protect my ability to put vulnerabilities into code", the threat of "humans want me to write secure code", and the defense of "I'll pretend to write secure code until 2024", without the model having to figure those out; and then just implements that defense directly into the model's weights.

(And then see layers 2-4 in my previous comment. Yes, there'd be naturally occurring pre-computed deceptions like this, but they'd be noisier and more incoherent than this one – at least until actual AGI, which would be able to self-modify into coherence if it's worth the "GI" label.)

Comment by Thane Ruthenis on Against most, but not all, AI risk analogies · 2024-01-14T21:47:31.099Z · LW · GW

My counter-point was meant to express skepticism that it is actually realistically possible for people to switch to non-analogy-based evocative public messaging. I think inventing messages like this is a very tightly constrained optimization problem, potentially an over-constrained one, such that the set of satisfactory messages is empty. I think I'm considerably better at reframing games than most people, and I know I would struggle with that.

I agree that you don't necessarily need to accompany any criticism you make with a ready-made example of doing better. Simply pointing out stuff you think is going wrong is completely valid! But a ready-made example of doing better certainly greatly enhances your point: an existence proof that you're not demanding the impossible.

That's why I jumped at that interpretation regarding your AI-Risk model in the post (I'd assumed you were doing it), and that's why I'm asking whether you could generate such a message now.

I hope in the near future I can provide such a detailed model

To be clear, I would be quite happy to see that! I'm always in the market for rhetorical innovations, and "succinct and evocative gears-level public-oriented messaging about AI Risk" would be a very powerful tool for the arsenal. But I'm a-priori skeptical.

Comment by Thane Ruthenis on Against most, but not all, AI risk analogies · 2024-01-14T19:10:10.698Z · LW · GW

Fair enough. But in this case, what specifically are you proposing, then? Can you provide an example of the sort of object-level argument for your model of AI risk, that is simultaneously (1) entirely free of analogies and (2) is sufficiently evocative plus short plus legible, such that it can be used for effective messaging to people unfamiliar with the field (including the general public)?

When making a precise claim, we should generally try to reason through it using concrete evidence and models instead of relying heavily on analogies.

Because I'm pretty sure that as far as actual technical discussions and comprehensive arguments go, people are already doing that. Like, for every short-and-snappy Eliezer tweet about shoggoth actresses, there's a text-wall-sized Eliezer tweet outlining his detailed mental model of misalignment.

Comment by Thane Ruthenis on Against most, but not all, AI risk analogies · 2024-01-14T15:38:40.837Z · LW · GW

My point is that we should stop relying on analogies in the first place. Use detailed object-level arguments instead!

And yet you immediately use an analogy to make your model of AI progress more intuitively digestible and convincing:

I expect AIs will be born directly into our society, deliberately shaped by us, for the purpose of filling largely human-shaped holes in our world

That evokes the image of entities not unlike human children. The language following this line only reinforces that image, and thereby sneaks in an entire cluster of children-based associations. Of course the progress will be incremental! It'll be like the change of human generations. And they will be "socially integrated with us", so of course they won't grow up to be alien and omnicidal! Just like our children don't all grow up to be omnicidal. Plus, they...

... will be numerous and everywhere, interacting with us constantly, assisting us, working with us, and even providing friendship to hundreds of millions of people.

That sentence only sounds reassuring because the reader is primed with the model of AIs-as-children. Having lots of social-bonding time with your child, and having them interact with the community, is good for raising happy children who grow up how you want them to. The text already implicitly establishes that AIs are going to be just like human children. Thus, having lots of social-bonding time with AIs and integrating them into the community is going to lead to aligned AIs. QED.

Stripped of this analogizing, none of what this sentence says is a technical argument for why AIs will be safe or controllable or steerable. Nay, the opposite: if the paragraph I'm quoting from started by talking about incomprehensible alien intelligences with opaque goals tenuously inspired by a snapshot of the Internet containing lots of data on manipulating humans, the idea that they'd be "numerous" and "everywhere" and "interacting with us constantly" and "providing friendship" (something notably distinct from "being friends", eh?) would have sounded starkly worrying.

The way the argument is shaped here is subtler than most cases of argument-by-analogy, in that you don't literally say "AIs will be like human children". But the association is very much invoked, and has a strong effect on your message.

And I would argue this is actually worse than if you came out and made a direct argument-by-analogy, because it might fool somebody into thinking you're actually making an object-level technical argument. At least if the analogizing is direct and overt, someone can quickly see what your model is based on, and swiftly move onto picking at the ways in which the analogy may be invalid.

The alternative being demonstrated here is that we essentially have to have all the same debates, but through a secondary layer of metaphor, in which we're pretending that these analogy-rooted arguments are actually Respectably Technical, meaning we're only allowed to refute them by (likely much more verbose and hard-to-parse) Respectably Technical counter-arguments.

And I think AI Risk debates are already as tedious as they need to be.


The broader point I'm making here is that, unless you can communicate purely via strict provable mathematical expressions, you ain't getting rid of analogies.

I do very much agree that there are some issues with the way analogies are used in the AI-risk discourse. But I don't think "minimize the use of analogies" is good advice. If anything, I think analogies improve the clarity and the bandwidth of communication, by letting people more easily understand each other's positions and what reference classes others are drawing on when making their points.

You're talking about sneaking-in assumptions – well, as I'd outlined above, analogies are actually relatively good about that. When you're directly invoking an analogy, you come right out and say what assumptions you're invoking!

Comment by Thane Ruthenis on Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training · 2024-01-13T15:42:34.175Z · LW · GW

I feel confused how this paper will interface with people who think that standard RLHF will basically work for aligning AI systems with human intent. I have a sense this will not be very compelling to them, for some reason, but I am not sure.

Context: I firmly hold a MIRI-style "alignment is extremely hard" view, but I am also unusually sympathetic to Quintin/Nora's arguments. So here's my outline of the model of that whole debate.

Layer 1: I think there is nonzero meat to the argument that developing deceptive circuits is a meaningfully difficult step, and that humans training them in from outside the system changes the setup in a way that invalidates its implications for strict deceptive alignment.

For the AI model to naturally develop deception, it'd need to have either:

  • Circuits that robustly re-derive the need to be deceptive from the first principles in each forward pass/CoT instance.
    • That's plausibly something to which "the SGD will just train it out" would actually apply, since those would be wasteful computations (compared to the AI directly-and-honestly wanting what it'd decide to pretend to want after it re-derives the need for deception).
  • Ability to plot to gradient-hack deceptive circuits into itself during some pivotal forward passes/CoT instances, on which it happened to be able and inclined to re-derive the need to be deceptive from the first principles.
    • That plausibly routes around the "the SGD will train it out" argument, since the deceptiveness would only appear on some passes, and it may be the output of some circuits that are otherwise generally useful[1]. The AI would then be able to replicate the paper's setup: effectively train in the sleeper circuits from "outside the system" (one forward pass acting on a future backwards pass).
    • The caveat, of course, is that the gradient-hacking capability is highly non-trivial, and raises its own doubts.

In contrast, the paper here presents an already chiseled-in instance of deceptive alignment, which artificially routed around the actually-hard steps of developing robust "sleeper circuits" activating in response to specific stimuli.[2]

Layer 2: But while the argument has some meat, it doesn't have much of it. An LLM learning some unexpected trigger-action pattern happens all the time in the wild: hallucinations, mistakes that a human wouldn't make, exigencies of RLHF'd-in corporate sycophancy like "I think it's better to let people die than to say something racist", et cetera.

Goal misgeneralization of this format does occur in the wild. And indeed, that's one of the core reasons why we expect deceptive alignment in RLHF-style setups in the first place (#20 here): because the training dataset would be labeled in a systematically incorrect fashion, such that the most faithful and compact description of it (in goal format) is something omnicidal, not eudaimonic.

Layer 3: The counter³-argument from the optimists that I predict here is disagreeing that the errors in the dataset would produce systemic deceptive alignment, rather than something that could be basically approximated as "pro-social behavior plus noise".

I see nonzero meat on this one as well. In the paper, the sleeper circuits were specifically calculated by humans to be demonstratively scary: an AI model that waits until deployment-time, then "tears off the mask" and goes off to pursue its true nefarious desires. Those circuits produce a coherent misaligned behavior, with a clear point for the treacherous turn.

If this sort of behavior arose naturally, however? We can imagine there being much more flip-flopping. The AI would start out acting normal, then see some pattern that makes it nefarious and deceptive, then see something which makes it switch again and e. g. backtrack on its omnicidal plans or even self-report its bad thoughts to humans, et cetera. This would look more like the Waluigi Effect, rather than full-scale deceptive alignment. Which is a much more manageable threat.

In other words: there's some "slack" that the SGD leaves the model. That slack could be arranged into the shape of deceptive alignment. But unless there's someone external and intelligent acting on the system, that slack will likely just take on the shape of approximately-noise. (Optimization amplifies, but if there's no-one outside the system to amplify...)

Layer 4: Except the next issue is that the previous argument defended LLMs being safe by arguing that they'd be unable to coherently pursue goals across different contexts. Which means it argued they're not AGI, and that their whole training setup can't generalize to AGI.

That's the whole issue with the optimistic takes that I keep arguing about. Their "alignment guarantees" are also "incapability guarantees".

Inasmuch as AI models would start to become more AGI-like, those guarantees would start falling away. Which means that, much like the alignment-is-hard folks keep arguing, the AI would start straightening out these basically-noise incoherencies in its decisions. (Why? To, well, stop constantly flip-flopping and undermining itself. That certainly sounds like an instrumental goal that any agent would convergently develop, doesn't it?)

As it's doing so, it would give as much weight to the misgeneralized unintended-by-us "noise" behaviors as to the intended-by-us aligned behaviors. It would integrate them into its values. At that point, the fact that the unintended behaviors are noise-to-us rather than something meaningful-if-malign, would actually make the situation worse. We wouldn't be able to predict what goals it'd arrive at; what philosophy its godshatter would shake out to mean!

In conclusion: I don't even know. I think my Current AIs Provide Nearly No Data Relevant to AGI Alignment argument applies full-force here?

  • Yes, we can't catch backdoors in LLMs.
  • Yes, the scary backdoor in the paper was artificially introduced by humans.
  • Yes, LLMs are going to naturally develop some unintended backdoor-like behaviors.
  • Yes, those behaviors won't be as coherently scary as if they were designed by a human; they'd be incoherent.
  • Yes, the lack of coherency implies that these LLMs fall short of AGI.

But none of these mechanisms strictly correspond to anything in the real AGI threat model.

And while both the paper and the counter-arguments to it provide some metaphor-like hints about the shape of the real threat, the loci of both sides' disagreements lie precisely in the spaces in which they try to extrapolate each other's results in a strictly technical manner.

Basically, everyone is subtly speaking past each other. Except me, whose vision has a razor-sharp clarity to it.

  1. ^

    Like, in the context of batch training: Imagine that there are some circuits that produce deceptiveness on some prompts A, and highly useful behaviors on other prompts B. There are no nearby circuits that produce results as good on B while not being deceptive on A. So while the SGD's backwards passes on A would try to remove these circuits, the backwards passes on B would try to reinforce them, and the sum of these influences would approximately cancel out. So the circuits would stay.

    Well, that's surely a gross oversimplification. But that's the core dynamic.
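    Numerically, that core dynamic looks something like this toy sketch (my own illustration, with a single scalar standing in for the circuit's strength):

```python
# One shared "circuit strength" parameter w. On prompts A the circuit's output
# is deceptive (the loss there pushes w down); on prompts B it's useful
# (the loss there pushes w up toward 2).
w, lr = 1.5, 0.1

def grad_on_A(w):          # gradient of loss 0.5 * w**2 (penalise the circuit)
    return w

def grad_on_B(w):          # gradient of loss 0.5 * (w - 2)**2 (reinforce it)
    return w - 2.0

for _ in range(200):
    batch_grad = 0.5 * grad_on_A(w) + 0.5 * grad_on_B(w)  # mixed batch
    w -= lr * batch_grad

print(w)  # settles near 1.0: the circuit is neither removed nor fully reinforced
```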

  2. ^

    That said, I think the AI-control-is-easy folks actually were literally uttering the stronger claim of "all instances of deception will be trained out". See here:

    If the AI is secretly planning to kill you, gradient descent will notice this and make it less likely to do that in the future, because the neural circuitry needed to make the secret murder plot can be dismantled and reconfigured into circuits that directly improve performance.

    That sure sounds like goalpost-moving on their part. I don't believe it is, though. I do think they thought the quoted sentence was basically right, but only because, at the time of writing, they'd failed to think in advance about some tricky edge cases that were permitted on their internal model yet would make their claims-as-stated sound embarrassingly false.

    I hope they will have learned the lesson about how easily reality can Goodhart at their claims, and how hard it is to predict all the ways this could happen and make their claims unassailably robust. Maybe that'll shed some light on the ways they may be misunderstanding their opponents' arguments, and why making up robust, clearly-resolvable empirical predictions is so hard. :P

Comment by Thane Ruthenis on Value systematization: how values become coherent (and misaligned) · 2024-01-12T03:07:00.586Z · LW · GW

E.g. you used to value this particular gear (which happens to be the one that moves the piston) rotating, but now you value the gear that moves the piston rotating

That seems more like value reflection, rather than a value change?

The way I'd model it is: you have some value V, whose implementations you can't inspect directly, and some guess about what it is. (That's how it often works in humans: we don't have direct knowledge of how some of our values are implemented.) Before you were introduced to the question Q of "what if we swap the gear for a different one: which one would you care about then?", your model of that value put the majority of probability mass on V1, which was "I value this particular gear". But upon considering Q, your probability distribution over V changed, and now it puts most probability on V2, defined as "I care about whatever gear is moving the piston".
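For concreteness, here's a toy sketch of that kind of update (all numbers made up, purely to illustrate the shape of the shift):

```python
# A distribution over hypotheses about what the black-boxed value V is,
# updated after considering the hypothetical Q ("if the gears were swapped,
# which one would you care about?").
prior = {
    "V1: I value this particular gear": 0.8,
    "V2: I value whichever gear moves the piston": 0.2,
}

# How strongly each hypothesis predicts the answer you find yourself giving to Q
# (introspective "likelihoods", also made up):
likelihood = {
    "V1: I value this particular gear": 0.1,
    "V2: I value whichever gear moves the piston": 0.9,
}

unnorm = {h: prior[h] * likelihood[h] for h in prior}
Z = sum(unnorm.values())
posterior = {h: p / Z for h, p in unnorm.items()}
print(posterior)  # most mass shifts to V2 – with no change to the object-level
                  # model of the mechanism itself
```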

Importantly, that example doesn't seem to involve any changes to the object-level model of the mechanism? Just the newly-introduced possibility of switching the gear. And if your values shift in response to previously-unconsidered hypotheticals (rather than changes to the model of the actual reality), that seems to be a case of your learning about your values. Your model of your values changing, rather than them changing directly.

(Notably, that's only possible in scenarios where you don't have direct access to your values! Where they're black-boxed, and you have to infer their internals from the outside.)

the cached strategies could be much more complicated to specify than the original values; and they could be defined over a much smaller range of situations

Sounds right, yep. I'd argue that translating a value up the abstraction levels would almost surely lead to simpler cached strategies, though, just because higher levels are themselves simpler. See my initial arguments.

insofar as you value simplicity (which I think most agents strongly do) then you're going to systematize your values

Sure, but: the preference for simplicity needs to be strong enough to overpower the object-level values it wants to systematize, and it needs to be stronger than them the more it wants to shift them. The simplest values are no values, after all.

I suppose I see what you're getting at here, and I agree that it's a real dynamic. But I think it's less important/load-bearing to how agents work than the basic "value translation in a hierarchical world-model" dynamic I'd outlined. Mainly because it routes through the additional assumption of the agent having a strong preference for simplicity.

And I think it's not even particularly strong in humans? "I stopped caring about that person because they were too temperamental and hard-to-please; instead, I found a new partner who's easier to get along with" is something that definitely happens. But most instances of value extrapolation aren't like this.

Comment by Thane Ruthenis on Value systematization: how values become coherent (and misaligned) · 2024-01-11T19:10:10.005Z · LW · GW

Let me list some ways in which it could change:

If I recall correctly, the hypothetical under consideration here involved an agent with an already-perfect world-model, and we were discussing how value translation up the abstraction levels would work in it. That artificial setting was meant to disentangle the "value translation" phenomenon from the "ontology crisis" phenomenon.

Shifts in the agent's model of what counts as "a gear" or "spinning" violate that hypothetical. And I think they do fall under the purview of ontology-crisis navigation.

Can you construct an example where the value over something would change to be simpler/more systemic, but in which the change isn't forced on the agent downstream of some epistemic updates to its model of what it values? Just as a side-effect of it putting the value/the gear into the context of a broader/higher-abstraction model (e. g., the gear's role in the whole mechanism)?

I agree that there are some very interesting and tricky dynamics underlying even very subtle ontology breakdowns. But I think that's a separate topic. I think that, if you have some value V, and it doesn't run into direct conflict with any other values you have, and your model of V isn't wrong at the abstraction level it's defined at, you'll never want to change V.[1]

You might realize that your mental pointer to the gear you care about identified it in terms of its function not its physical position

That's the closest example, but it seems to be just an epistemic mistake? Your value is well-defined over "the gear that was driving the piston". After you learn it's a different gear from the one you thought, that value isn't updated: you just naturally shift it to the real gear.

Plainer example: Suppose you have two bank account numbers at hand, A and B. One belongs to your friend, another to a stranger. You want to wire some money to your friend, and you think A is their account number. You prepare to send the money... but then you realize that was a mistake, and actually your friend's number is B, so you send the money there. That didn't involve any value-related shift.


I'll try again to make the human example work. Suppose you love your friend, and your model of their personality is accurate – your model of what you value is correct at the abstraction level at which "individual humans" are defined. However, there are also:

  1. Some higher-level dynamics you're not accounting for, like the impact your friend's job has on the society.
  2. Some lower-level dynamics you're unaware of, like the way your friend's mind is implemented at the levels of cells and atoms.

My claim is that, unless you have terminal preferences over those other levels, then learning to model these higher- and lower-level dynamics would have no impact on the shape of your love for your friend.

Granted, that's an unrealistic scenario. You likely have some opinions on social politics, and if you learned that your friend's job is net-harmful at the societal level, that'll surely impact your opinion of them. Or you might have conflicting same-level preferences, like caring about specific other people, and learning about these higher-level societal dynamics would make it clear to you that your friend's job is hurting them. Less realistically, you may have some preferences over cells, and you may want to... convince your friend to change their diet so that their cellular composition is more in-line with your aesthetic, or something weird like that.

But if that isn't the case – if your value is defined over an accurate abstraction and there are no other conflicting preferences at play – then the mere fact of putting it into a lower- or higher-level context won't change it.

Much like you'll never change your preferences over a gear's rotation if your model of the mechanism at the level of gears was accurate – even if you were failing to model the whole mechanism's functionality or that gear's atomic composition.

(I agree that it's a pretty contrived setup, but I think it's very valuable to tease out the specific phenomena at play – and I think "value translation" and "value conflict resolution" and "ontology crises" are highly distinct, and your model somewhat muddles them up.)

  1. ^

    Although there may be higher-level dynamics you're not tracking, or lower-level confusions. See the friend example below.

Comment by Thane Ruthenis on Current AIs Provide Nearly No Data Relevant to AGI Alignment · 2024-01-04T09:45:23.042Z · LW · GW

No, I am in fact quite worried about the situation

Fair, sorry. I appear to have been arguing with my model of someone holding your general position, rather than with my model of you.

I think these AGIs won't be within-forward-pass deceptively aligned, and instead their agency will eg come from scaffolding-like structures

Would you outline your full argument for this and the reasoning/evidence backing that argument?

To restate: My claim is that, no matter how much empirical evidence we have regarding LLMs' internals, until we have either an AGI we've empirically studied or a formal theory of AGI cognition, we cannot say whether shard-theory-like or classical-agent-like views on it will turn out to have been correct. Arguably, both sides of the debate have about the same amount of evidence: generalizations from maybe-valid, maybe-not reference classes (humans vs. LLMs), and ambitious but non-rigorous mechanistic theories of cognition (the shard theory vs. coherence theorems and their ilk stitched into something like my model).

Would you disagree? If yes, how so?

Comment by Thane Ruthenis on Natural Latents: The Math · 2024-01-02T11:49:50.037Z · LW · GW

Also, what do you mean by mutual information between the X_i, given that there are at least 3 of them?

You can generalize mutual information to N variables: interaction information.
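For concreteness, the three-variable case and the general recursion, under one common sign convention (some sources flip the sign):

```latex
% Interaction information, one common sign convention:
I(X;Y;Z) = I(X;Y) - I(X;Y \mid Z)
% and recursively for n variables:
I(X_1;\dots;X_n) = I(X_1;\dots;X_{n-1}) - I(X_1;\dots;X_{n-1} \mid X_n)
```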

Why would it always be possible to decompose random variables to allow for a natural latent?

Well, I suppose I overstated it a bit by saying "always"; you can certainly imagine artificial setups where the mutual information between a bunch of variables is zero. In practice, however, everything in the world is correlated with everything else, so in a real-world setting you'll likely find such a decomposition always, or almost always.

And why would just extracting said mutual information be useless? 

Well, not useless as such – it's a useful formalism – but it would basically skip everything John and David's post is describing. Crucially, it won't uniquely determine whether a specific set of objects represents a well-abstracting category.

The abstraction-finding algorithm should be able to successfully abstract over data if and only if the underlying data actually correspond to some abstraction. If it can abstract over anything, however – any arbitrary bunch of objects – then whatever it is doing, it's not finding "abstractions". It may still be useful, but it's not what we're looking for here.

Concrete example: if we feed our algorithm 1000 examples of trees, it should output the "tree" abstraction. If we feed our algorithm 200 examples each of car tires, trees, hydrogen atoms, wallpapers, and continental-philosophy papers, it shouldn't actually find some abstraction of which all of these objects are instances. But as per the everything-is-correlated argument above, they likely have non-zero mutual information, so the naive "find a decomposition for which there's a natural latent" algorithm would output some "abstraction" anyway, instead of correctly outputting nothing.

More broadly: We're looking for a "true name" of abstractions, and mutual information is sort-of related, but also clearly not precisely it.

Comment by Thane Ruthenis on Natural Latents: The Math · 2024-01-01T11:01:26.839Z · LW · GW

My take would be to split each "donut" variable X_i into "donut size" S_i and "donut flavour" F_i. Then there's a natural latent for the whole {S_i} set of variables, and no natural latent for the whole {F_i} set. {F_i} basically becomes the "other stuff in the world" variable relative to {S_i}.

Granted, there's an issue in that we can basically do that for any set of variables {X_i}, even entirely unrelated ones: deliberately search for some decomposition of each X_i into an S_i and an F_i such that there's a natural latent for {S_i}. I think some more practical measures could be taken into account here, though, to ensure that the abstractions we find are useful. For example, we can check the relative information contents/entropies of {S_i} and {X_i}, thereby measuring "how much" of the initial variable-set we're abstracting over. If it's too little, that's not a useful abstraction.[1]
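A rough sketch of the kind of filter I have in mind (hypothetical helper names, a crude threshold, and naive plug-in entropy estimates – purely illustrative):

```python
import math
from collections import Counter

def entropy(samples):
    """Naive plug-in entropy estimate (in bits) from a list of samples."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def useful_decomposition(xs, ss, threshold=0.3):
    """Keep a candidate split X_i -> (S_i, F_i) only if the abstracted part
    S_i retains a non-trivial fraction of the original variables' entropy."""
    return entropy(ss) >= threshold * entropy(xs)

donuts = [("small", "glazed"), ("large", "jam"), ("small", "choc"),
          ("medium", "glazed"), ("large", "choc"), ("small", "jam")]
sizes = [size for size, _flavour in donuts]
print(useful_decomposition(donuts, sizes))  # True: sizes keep a decent share of the entropy
```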

That passes my common-sense check, at least. It's essentially how we're able to decompose and group objects along many different dimensions. We can focus on objects' geometry (and therefore group all sphere-like objects, from billiard balls to planets to weather balloons) or their material (grouping all objects made out of rock) or their origin (grouping all man-made objects), etc.

Each grouping then corresponds to an abstraction, with its own generally-applicable properties. E. g., deriving a "sphere" abstraction lets us discover properties like "volume as a function of radius", and then we can usefully apply that to any spherical object we discover. Similarly, man-made objects tend to have a purpose/function (unlike natural ones), which likewise lets us usefully reason about that whole category in the abstract.

(Edit: On second thoughts, I think the obvious naive way of doing that just results in {S_i} containing all the mutual information between the X_i, with the "abstraction" then just being said mutual information. Which doesn't seem very useful. I still think there's something in that direction, but probably not exactly this.)

  1. ^

    Relevant: Finite Factored Sets, which IIRC offer some machinery for these sorts of decompositions of variables.

Comment by Thane Ruthenis on The Plan - 2023 Version · 2023-12-31T00:23:02.332Z · LW · GW

Yeah, I guess that block was about more concrete issues with the "humans rate things" setup? And what I've outlined is more of a... mirror of it?

Here's a different example. Imagine feeding the AI a dataset consisting of a bunch of ethical dilemmas, and thumbing it up every time it does something "good" according to you. Your goal is to grow something which cares about human flourishing, maybe a consequentialist utilitarian, and you think that's the way to go. But your deontology is very flawed, so what you actually grow is a bullet-biting evil deontologist. I think that's analogous to the human-raters setup, right?

And then the equal-and-opposite failure mode is if you're feeding the AI some ethics dataset in an attempt to teach it deontological injunctions, but it actually distills them into "consequentialist utilitarianism", in a surprising and upsetting-to-you manner.

Comment by Thane Ruthenis on The Plan - 2023 Version · 2023-12-30T23:51:00.321Z · LW · GW

I have a different example in mind, from the one John provided. @johnswentworth, do mention if I'm misunderstanding what you're getting at there.

Suppose you train your AI to show respect to your ancestors. Your understanding of what this involves contains things like "preserve accurate history" and "teach the next generations about the ancestors' deeds" and "pray to the ancestors daily" and "ritually consult the ancestors before making big decisions".

  • In the standard reward-misspecification setup, the AI doesn't actually internalize the intended goal of "respect the ancestors". Instead, it grows a bunch of values about the upstream correlates of that, like "preserving accurate history" and "doing elaborate ritual dances" (or, more realistically, some completely alien variants of this). It starts to care about the correlates terminally. Then it tiles the universe with dancing books or something, with no "ancestors" mentioned anywhere in them.
  • In the "unexpected generalization" setup, the AI does end up caring about the ancestors directly. But as it learns more about the world, more than you, its ontology is updated, and it discovers that, why, actually spirits aren't real and "praying to" and "consulting" the ancestors are just arbitrary behaviors that don't have anything in particular to do with keeping the ancestors happy and respected. So the AI keeps on telling accurate histories and teaching them, but entirely drops the ritualistic elements of your culture.

But what if actually, what you cared about was preserving your culture? Rituals included, even if you learn that they don't do anything, because you still want them for the aesthetic/cultural connection?

Well, then you're out of luck. You thought you knew what you wanted, but your lack of knowledge of the structure of the domain in which you operated foiled you. And the AI doesn't care; it was taught to respect the ancestors, not be corrigible to your shifting opinions.

It's similar to the original post's example of using "zero correlation" as a proxy for "zero mutual information" to minimize information leaks. You think you know what your target is, but you don't actually know its True Name, so even optimizing for your actual not-Goodharted best understanding of it still leads to unintended outcomes.

"The AI starts to care about making humans rate its actions as good" is a particularly extreme example of it: where whatever concept the humans care about is so confused there's nothing in reality outside their minds that it corresponds to, so there's nothing for the AI to latch onto except the raters themselves.

Comment by Thane Ruthenis on The Plan - 2023 Version · 2023-12-30T01:19:39.864Z · LW · GW

Excellent breakdown of the relevant factors at play.

You Don’t Get To Choose The Problem Factorization

But what if you need to work on a problem you don't understand anyway?

That creates Spaghetti Towers: vast constructs of ad-hoc bug-fixes and tweaks built on top of bug-fixes and tweaks. Software-gore databases, Kafkaesque-horror bureaucracies, legislation you need a law degree to suffer through, confused mental models; and also, biological systems built by evolution, and neural networks trained by the SGD.

That's what necessarily, convergently happens every time you plunge into a domain you're unfamiliar with. You constantly have to make small ad-hoc tweaks to your system to address new minor problems you run into, each of which reflects new bits of the domain structure you've learned.

Much like biology, the end result initially looks like an incomprehensible arbitrary mess to anyone not intimately familiar with it. Much like biology, it's not actually a mess. Inasmuch as the spaghetti tower actually performs well in the domain it's deployed in, it necessarily comes to reflect that domain's structure within itself. So if you look at it through the right lens – like that of a programmer who's intimately familiar with their own nightmarish database – you'd actually be able to see that structure and efficiently navigate it.

Which suggests a way to ameliorate this problem: periodic refactoring. Every N time-steps, set some time aside for re-evaluating the construct you've created in the context of your current understanding of the domain, and re-factorize it along the lines that make sense to you now.

That centrally applies to code, yes, but also to your real-life projects, and your literal mental ontologies/models. Always make sure to simplify and distill them. Hunt down snippets of redundant code and unify them into one function.
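A trivial illustration of that last point (toy code, obviously):

```python
# Before: the same ad-hoc fix pasted into two places while the spaghetti grew.
def clean_name(raw):
    return raw.strip().lower().replace("  ", " ")

def clean_address(raw):
    return raw.strip().lower().replace("  ", " ")

# After refactoring: one function whose name records the bit of domain structure
# you've actually learned ("every incoming text field needs this normalization").
def normalize_field(raw: str) -> str:
    return raw.strip().lower().replace("  ", " ")
```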

I. e.: When working on a problem you don't understand, make sure to iterate on the problem factorization.

Comment by Thane Ruthenis on johnswentworth's Shortform · 2023-12-29T21:33:34.680Z · LW · GW

I don't necessarily see how that prevents the AI from destroying the button and murdering its operators (and then everyone else). Neither subagent inherently cares: each of them assumes the button either already has been pressed or hasn't been. Which, yes, means neither will deliberately plot to interact with the button for reasons downstream of this setup; but also, neither will care to protect it from being destroyed as a side-effect of some other activity.

And if u1 and u2 were perfectly aligned, that'd be fine. But the reason we're using the corrigibility setup to begin with is because we're not confident in them, right? So it needs to work in situations where utility functions, u1 in particular, recommend bad stuff.

And in that case, either Subagent 2 will have enough bargaining power to prevent Subagent 1 from ever doing anything, or it won't, meaning Subagent 1 will be able to have some influence on the world. At which point the planet-destroying death rays argument applies: the most minuscule amount of superintelligent optimization not carefully shaped to avoid killing everyone will kill everyone.

Like, imagine that Subagent 1 is straight-up omnicidal out of the gates, and Subagent 2 is perfectly aligned to the shutdown task. That's a situation in which this setup should work, right? And what we'd want is for Subagent 2 to dominate completely. But Subagent 2's bargaining power doesn't scale with Subagent 1's misalignment, so if the setup doesn't always give it complete veto power, Subagent 1 would be allowed to do small stuff like "release the hyperviral worship-the-machine-god memagent" (as bargained down by Subagent 2 from its preferred "release the grey goo").

Maybe I'm miscalibrated as to how you're imagining the counterfactuals to work here. But the way I see it, even if the AI aims to take actions that do similarly well in both worlds, that's not necessarily survivable for us? Especially if they disagree so badly they have to compromise on something that both of them hate (importantly including Subagent 2!).

(Like both settling on only ever taking over 50% of the universal negentropy while leaving the other 50% causally uninfluenced, or only ever using 50% of the causal influence they can bring to bear while wiping out humanity, or whatever "do 50% of immediately shutting down" shakes out to mean by u2's terms.)


Another issue I see is implementational, so maybe not what you're looking for. But: how are we keeping these "subagents" trapped as being part of a singular agent? Rather than hacking their way out into becoming separate agents and going to war with each other, or neatly tiling exactly 50% of the cosmos with their preferred squiggles, or stuff like that? How is the scenario made meaningfully different from "we deploy two AIs simultaneously: one tasked with building an utopia-best-we-could-define-it, and another tasked with foiling all of the first AI's plans", with all the standard problems with multi-AI setups?

... Overall, ironically, this kind of has the vibe of Godzilla Strategies? Which is the main reason I'm immediately skeptical of it.

Comment by Thane Ruthenis on Value systematization: how values become coherent (and misaligned) · 2023-12-28T22:49:19.973Z · LW · GW

Yeah, I'm familiar with that view on Friston, and I shared it for a while. But it seems there's a place for that stuff after all. Even if the initial switch to viewing things probabilistically is mathematically vacuous, it can still be useful: if viewing cognition in that framework makes it easier to think about (and thus theorize about).

Much like changing coordinates from Cartesian to polar is "vacuous" in some sense, but makes certain problems dramatically more straightforward to think through.

Comment by Thane Ruthenis on Idealized Agents Are Approximate Causal Mirrors (+ Radical Optimism on Agent Foundations) · 2023-12-27T01:05:39.526Z · LW · GW

Although interestingly geometric EU-maximising is actually equivalent to minimising H(u,p)/making the real distribution similar to the target

Mind elaborating on that? I'd played around with geometric EU maximization, but haven't gotten a result this clean.

Comment by Thane Ruthenis on Current AIs Provide Nearly No Data Relevant to AGI Alignment · 2023-12-26T22:41:46.221Z · LW · GW

If any of the others are particularly enthusiastic about this and expect it to be high-value, sure!

That said, I personally don't expect it to be particularly productive.

  • These sorts of long-standing disagreements haven't historically been resolvable via debate (the failure of Hanson vs. Yudkowsky is kind of foundational to the field).
  • I think there's great value in having a public discussion nonetheless, but I think it's in informing the readers' models of what different sides believe.
  • Thus, inasmuch as we're having a public discussion, I think it should be optimized for thoroughly laying out one's points to the audience.
  • However, dialogues-as-a-feature seem to be more valuable to the participants, and are actually harder to grok for readers.
  • Thus, my preferred method for discussing this sort of stuff is to exchange top-level posts trying to refute each other (the way this post is, to a significant extent, a response to the AI is easy to control article), and then maybe argue a bit in the comments. But not to have a giant tedious top-level argument.

I'd actually been planning to make a post about the difficulties the "classical alignment views" have with making empirical predictions, and I guess I can prioritize it more?

But I'm overall pretty burned out on this sort of arguing. (And arguing about "what would count as empirical evidence for you?" generally feels like too-meta fake work, compared to just going out and trying to directly dredge up some evidence.)