Soft Prompts for Evaluation: Measuring Conditional Distance of Capabilities 2024-02-02T05:49:11.189Z
FAQ: What the heck is goal agnosticism? 2023-10-08T19:11:50.269Z
A plea for more funding shortfall transparency 2023-08-07T21:33:11.912Z
Using predictors in corrigible systems 2023-07-19T22:29:02.742Z
One path to coherence: conditionalization 2023-06-29T01:08:14.527Z
One implementation of regulatory GPU restrictions 2023-06-04T20:34:37.090Z
porby's Shortform 2023-05-24T21:34:26.211Z
Implied "utilities" of simulators are broad, dense, and shallow 2023-03-01T03:23:22.974Z
Instrumentality makes agents agenty 2023-02-21T04:28:57.190Z
How would you use video gamey tech to help with AI safety? 2023-02-09T00:20:34.152Z
Against Boltzmann mesaoptimizers 2023-01-30T02:55:12.041Z
FFMI Gains: A List of Vitalities 2023-01-12T04:48:04.378Z
Simulators, constraints, and goal agnosticism: porbynotes vol. 1 2022-11-23T04:22:25.748Z
Am I secretly excited for AI getting weird? 2022-10-29T22:16:52.592Z
Why I think strong general AI is coming soon 2022-09-28T05:40:38.395Z
Private alignment research sharing and coordination 2022-09-04T00:01:22.337Z


Comment by porby on porby's Shortform · 2024-02-09T18:29:45.391Z · LW · GW

Yup, exactly the same experience here.

Comment by porby on porby's Shortform · 2024-02-06T22:53:42.556Z · LW · GW

Has there been any work on the scaling laws of out-of-distribution capability/behavior decay?

A simple example:

  1. Simultaneously train task A and task B for N steps.
  2. Stop training task B, but continue to evaluate the performance of both A and B.
  3. Observe how rapidly task B performance degrades.

Repeat across scale and regularization strategies.
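
A toy sketch of the three steps above, for concreteness. Everything in it is an assumption made for illustration: a 4-weight linear model, two hand-written regression tasks with no shared input dimensions, and L2 weight decay standing in for "regularization strategy." It says nothing about how real models behave; it only shows the shape of the measurement.

```python
# Toy A/B decay experiment: train both tasks, stop training B, keep evaluating B.

def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def mse(w, data):
    return sum((predict(w, x) - y) ** 2 for x, y in data) / len(data)

def gd_step(w, data, lr, weight_decay):
    # Full-batch gradient of the mean squared error, plus L2 weight decay.
    grad = [0.0] * len(w)
    for x, y in data:
        err = predict(w, x) - y
        for i, xi in enumerate(x):
            grad[i] += 2.0 * err * xi / len(data)
    return [wi - lr * (gi + weight_decay * wi) for wi, gi in zip(w, grad)]

# Task A only uses the first two input dimensions, task B the last two,
# so the two tasks share no mechanism in this toy.
TASK_A = [((1.0, 0.0, 0.0, 0.0), 1.0), ((0.0, 1.0, 0.0, 0.0), -1.0)]
TASK_B = [((0.0, 0.0, 1.0, 0.0), 2.0), ((0.0, 0.0, 0.0, 1.0), 0.5)]

def run_decay_experiment(joint_steps=500, solo_steps=500, lr=0.1, weight_decay=0.01):
    w = [0.0] * 4
    for _ in range(joint_steps):      # step 1: train A and B together
        w = gd_step(w, TASK_A + TASK_B, lr, weight_decay)
    history = [(mse(w, TASK_A), mse(w, TASK_B))]
    for _ in range(solo_steps):       # step 2: train A only, evaluate both
        w = gd_step(w, TASK_A, lr, weight_decay)
        history.append((mse(w, TASK_A), mse(w, TASK_B)))
    return history                    # step 3: inspect the task B column

for wd in (0.0, 0.01, 0.1):
    hist = run_decay_experiment(weight_decay=wd)
    print(f"weight_decay={wd}: task B loss {hist[0][1]:.4f} -> {hist[-1][1]:.4f}")
```

In this toy, task B only degrades when weight decay is nonzero, because nothing else touches the weights B depends on. Overlapping the input dimensions of the two tasks would be the obvious next knob.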

Would be nice to also investigate different task types. For example, tasks with varying degrees of implied overlap in underlying mechanisms (like #2).

I've previously done some of these experiments privately, but not with nearly the compute necessary for an interesting result.

The sleeper agents paper reminded me of it. I would love to see what happens on a closer-to-frontier model that's intentionally backdoored, and then subjected to continued pretraining. Can a backdoor persist for another trillion tokens of nonadversarial-but-extremely-broad training? Does that vary across scale etc?

I'd also like to intentionally find the circumstances that maximize the persistence of out of distribution capabilities not implied by the current training distribution.

Seems like identifying a robust trend here would have pretty important implications, whichever direction it points.

Comment by porby on porby's Shortform · 2024-02-02T19:37:22.778Z · LW · GW

A further extension and elaboration on one of the experiments in the linkpost:
Pitting execution fine-tuning against input fine-tuning also provides a path to measuring the strength of soft prompts in eliciting target behaviors. If execution fine-tuning "wins" and manages to produce a behavior in some part of input space that soft prompts cannot elicit, it would be a major blow to the idea that soft prompts are useful for dangerous evaluations.

On the flip side, if ensembles of large soft prompts with some hyperparameter tuning always win (e.g. execution fine-tuning cannot introduce any behaviors accessible by any region of input space without soft prompts also eliciting them), then they're a more trustworthy evaluation in practice.

Comment by porby on porby's Shortform · 2024-02-02T19:31:35.874Z · LW · GW

Having escaped infinite overtime associated with getting the paper done, I'm now going back and catching up on some stuff I couldn't dive into before.

Going through the sleeper agents paper, it appears that one path—adversarially eliciting candidate backdoor behavior—is hampered by the weakness of the elicitation process. Or in other words, there exist easily accessible input conditions that trigger unwanted behavior that LLM-driven adversarial training can't identify.

I alluded to this in the paper linkpost, but soft prompts are a very simple and very strong option for this. There remains a difficulty in figuring out what unwanted behavior to adversarially elicit, but this is an area that has a lot of low hanging fruit.

I'd also be interested in how more brute force interventions, like autoregressively detuning a backdoored model with a large soft prompt for a very large dataset (or an adversarially chosen anti-backdoor dataset), compare to the other SFT/RL interventions. Activation steering, too; I'm currently guessing activation-based interventions are the cheapest for this sort of thing.

Comment by porby on Soft Prompts for Evaluation: Measuring Conditional Distance of Capabilities · 2024-02-02T05:51:30.408Z · LW · GW

By the way: I just got into San Francisco for EAG, so if anyone's around and wants to chat, feel free to get in touch on swapcard (or if you're not in the conference, perhaps a DM)! I fly out on the 8th.

Comment by porby on Why I think strong general AI is coming soon · 2023-12-16T23:21:49.918Z · LW · GW

It's been over a year since the original post and 7 months since the openphil revision.

A top level summary:

  1. My estimates for timelines are pretty much the same as they were.
  2. My P(doom) has gone down overall (to about 30%), and the nature of the doom has shifted (misuse, broadly construed, dominates).

And, while I don't think this is the most surprising outcome nor the most critical detail, it's probably worth pointing out some context. From NVIDIA:

In two quarters, from Q1 FY24 to Q3 FY24, datacenter revenues went from $4.28B to $14.51B.

From the post:

In 3 years, if NVIDIA's production increases another 5x ...

Revenue isn't a perfect proxy for shipped compute, but I think it's safe to say we've entered a period of extreme interest in compute acquisition. "5x" in 3 years seems conservative.[1] I doubt the B100 is going to slow this curve down, and competitors aren't idle: AMD's MI300X is within striking distance, and even Intel's Gaudi 2 has promising results.
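
For scale, here is the arithmetic on those two revenue figures next to the post's "5x in 3 years." This only compares growth multiples; as noted, revenue is not shipped compute, and two quarters of growth are not a trend to extrapolate. Spreading the 5x evenly over 12 quarters is an assumption made purely for the comparison.

```python
# NVIDIA datacenter revenue as quoted above, in billions of dollars.
q1_fy24 = 4.28
q3_fy24 = 14.51

two_quarter_multiple = q3_fy24 / q1_fy24             # roughly 3.39x
per_quarter_observed = two_quarter_multiple ** 0.5   # roughly 1.84x per quarter

# "5x in 3 years", spread evenly across 12 quarters.
per_quarter_for_5x = 5 ** (1 / 12)                   # roughly 1.14x per quarter

print(f"observed: {two_quarter_multiple:.2f}x over two quarters "
      f"({per_quarter_observed:.2f}x per quarter)")
print(f"5x in 3 years needs only {per_quarter_for_5x:.2f}x per quarter")
```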

Chip manufacturing remains a bottleneck, but it's a bottleneck that's widening as fast as it can to catch up to absurd demand. It may still be bottlenecked in 5 years, but not at the same level of production.

On the difficulty of intelligence

I'm torn about the "too much intelligence within bounds" stuff. On one hand, I think it points towards the most important batch of insights in the post, but on the other hand, it ends with an unsatisfying "there's more important stuff here! I can't talk about it but trust me bro!"

I'm not sure what to do about this. The best arguments and evidence are things that fall into the bucket of "probably don't talk about this in public out of an abundance of caution." It's not one weird trick to explode the world, but it's not completely benign either.

Continued research and private conversations haven't made me less concerned. I do know there are some other people who are worried about similar things, but it's unclear how widely understood it is, or whether someone has a strong argument against it that I don't know about.

So, while unsatisfying, I'd still assert that there are highly accessible paths to broadly superhuman capability on short timescales. Little of my forecast's variance arises from uncertainty on this point; it's mostly a question of when certain things are invented, adopted, and then deployed at sufficient scale. Sequential human effort is a big chunk; there are video games that took less time to build than the gap between this post's original publication date and its median estimate of 2030.

On doom

When originally writing this, my model of how capabilities would develop was far less defined, and my doom-model was necessarily more generic.

A brief summary would be:

  1. We have a means of reaching extreme levels of capability without necessarily exhibiting preferences over external world states. You can elicit such preferences, but a random output sequence from the pretrained version of GPT-N (assuming the requisite architectural similarities) has no realistic chance of being a strong optimizer with respect to world states. The model itself remains a strong optimizer, just for something that doesn't route through the world.
  2. It's remarkably easy to elicit this form of extreme capability to guide itself. This isn't some incidental detail; it arises from the core process that the model learned to implement.
  3. That core process is learned reliably because the training process that yielded it leaves no room for anything else. It's not a sparse/distant reward target; it is a profoundly constraining and informative target.

I've written more on the nice properties of some of these architectures elsewhere. I'm in the process of writing up a complementary post on why I think these properties (and using them properly) are an attractor in capabilities, and further, why I think some of the x-riskiest forms of optimization process are actively repulsive for capabilities. This requires some justification, but alas, the post will have to wait some number of weeks in the queue behind a research project.

The source of the doom-update is the correction of some hidden assumptions in my doom model. My original model was downstream of agent foundations-y models, but naive. It followed a process: set up a framework, make internally coherent arguments within that framework, observe highly concerning results, then neglect to notice where the framework didn't apply.

Specifically, some of the arguments feeding into my doom model were covertly replacing instances of optimizers with hypercomputer-based optimizers[2], because hey, once you've got an optimizer and you don't know any bounds on it, you probably shouldn't assume it'll just turn out convenient for you, and hypercomputer-optimizers are the least convenient.

For example, this part:

Is that enough to start deeply modeling internal agents and other phenomena concerning for safety?

And this part:

AGI probably isn't going to suffer from these issues as much. Building an oracle is probably still worth it to a company even if it takes 10 seconds for it to respond, and it's still worth it if you have to double check its answers (up until oops dead, anyway).

With no justification, I imported deceptive mesaoptimizers and other "unbound" threats. Under the earlier model, this seemed natural.

I now think there are bounds on pretty much all relevant optimizing processes up and down the stack, from the structure of learned mesaoptimizers to the whole capability-seeking industry. Those bounds necessarily chop off large chunks of optimizer-derived doom; many outcomes that previously seemed convergent to me now seem extremely hard to access.

As a result, "technical safety failure causes existential catastrophe" dropped in probability by around 75-90%, down to something like 5%-ish.[3]

I'm still not sure how to navigate a world with lots of extremely strong AIs. As capability increases, outcome variance increases. With no mitigations, more and more organizations (or, eventually, individuals) will have access to destabilizing systems, and they would amplify any hostile competitive dynamics.[4] The "pivotal act" frame gets imported even if none of the systems are independently dangerous.

I've got hope that my expected path of capabilities opens the door for more incremental interventions, but there's a reason my total P(doom) hasn't yet dropped much below 30%.

  1. ^

    The reason why this isn't an update for me is that I was being deliberately conservative at the time.

  2. ^

    A hypercomputer-empowered optimizer can jump to the global optimum with brute force. There isn't some mild greedy search to be incrementally shaped; if your specification is even slightly wrong in a sufficiently complex space, the natural and default result of a hypercomputer-optimizer is infinite cosmic horror.

  3. ^

    It's sometimes tricky to draw a line between "oh this was a technical alignment failure that yielded an AI-derived catastrophe, as opposed to someone using it wrong," so it's hard to pin down the constituent probabilities.

  4. ^

    While strong AI introduces all sorts of new threats, its generality amplifies "conventional" threats like war, nukes, and biorisk, too. This could create civilizational problems even before a single AI could, in principle, disempower humanity.

Comment by porby on AI Views Snapshots · 2023-12-13T23:41:22.618Z · LW · GW


My answer to "If AI wipes out humanity and colonizes the universe itself, the future will go about as well as if humanity had survived (or better)" is pretty much defined by how the question is interpreted. It could swing pretty wildly, but the obvious interpretation seems ~tautologically bad.

Comment by porby on porby's Shortform · 2023-12-13T20:49:11.875Z · LW · GW

I sometimes post experiment ideas on my shortform. If you see one that seems exciting and you want to try it, great! Please send me a message so we can coordinate and avoid doing redundant work.

Comment by porby on Suggestions for net positive LLM research · 2023-12-13T20:45:52.295Z · LW · GW

I'm accumulating a to-do list of experiments much faster than my ability to complete them:

  1. Characterizing fine-tuning effects with feature dictionaries
  2. Toy-scale automated neural network decompilation (difficult to scale)
  3. Trying to understand evolution of internal representational features across blocks by throwing constraints at it 
  4. Using soft prompts as a proxy measure of informational distance between models/conditions and behaviors (see note below)
  5. Prompt retrodiction for interpreting fine tuning, with more difficult extension for activation matching
  6. Miscellaneous bunch of experiments

If you wanted to take one of these and run with it or a variant, I wouldn't mind!

The unifying theme behind many of these is goal agnosticism: understanding it, verifying it, maintaining it, and using it.

Note: I've already started some of these experiments, and I will very likely start others soon. If you (or anyone reading this, for that matter) see something you'd like to try, we should chat to avoid doing redundant work. I currently expect to focus on #4 for the next handful of weeks, so that one is probably at the highest risk of redundancy.

Further note: I haven't done a deep dive on all relevant literature; it could be that some of these have already been done somewhere!  (If anyone happens to know of prior art for any of these, please let me know.)

Comment by porby on porby's Shortform · 2023-12-11T02:55:35.597Z · LW · GW

Retrodicting prompts can be useful for interpretability when dealing with conditions that aren't natively human readable (like implicit conditions induced by activation steering, or optimized conditions from soft prompts). Take an observed completion and generate the prompt that created it.

What does a prompt retrodictor look like?

Generating a large training set of soft prompts to directly reverse would be expensive. Fortunately, there's nothing special in principle about soft prompts with regard to their impact on conditioning predictions.

Just take large traditional text datasets. Feed the model a chunk of the string. Train on the prediction of tokens before the chunk.

Two obvious approaches:

  1. Special case of infilling. Stick to a purely autoregressive training mode, but train the model to fill a gap autoregressively. In other words, the sequence would be: 
    [Prefix token][Prefix sequence][Suffix token][Suffix sequence][Middle token][Middle sequence][Termination token]
    Or, as the paper points out: 
    [Suffix token][Suffix sequence][Prefix token][Prefix sequence][Middle sequence][Termination token]
    Nothing stopping the prefix sequence from having zero length.
  2. Could also specialize training for just previous prediction: 
    [Prompt chunk]["Now predict the previous" token][Predicted previous chunk, in reverse]
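
A minimal sketch of building training sequences for the second format, assuming the text has already been tokenized into a list. The token string `PREV_TOKEN`, the function names, and the loss mask are all hypothetical choices for illustration, not something taken from an existing library.

```python
PREV_TOKEN = "<|predict_previous|>"  # hypothetical special token

def make_retrodiction_example(tokens, split):
    """Return [prompt chunk][PREV_TOKEN][previous chunk, in reverse]."""
    if not 0 < split < len(tokens):
        raise ValueError("split must leave tokens on both sides")
    previous = tokens[:split]   # what the retrodictor should predict
    chunk = tokens[split:]      # what the retrodictor is conditioned on
    return chunk + [PREV_TOKEN] + previous[::-1]

def loss_mask(example):
    """1 where the training loss applies (after PREV_TOKEN), 0 on the conditioning chunk."""
    start = example.index(PREV_TOKEN) + 1
    return [0] * start + [1] * (len(example) - start)

tokens = "the cat sat on the mat".split()
example = make_retrodiction_example(tokens, split=4)
print(example)
print(loss_mask(example))
```

Reversing the previous chunk means the model predicts the token nearest the observed chunk first and works backward from there.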

But we don't just want some plausible previous prompts, we want the ones that most precisely match the effect on the suffix's activations.

This is trickier. Specifying the optimization target is easy enough: retrodict a prompt that minimizes MSE((activations | sourcePrompt), (activations | retrodictedPrompt)), where (activations | sourcePrompt) are provided. Transforming that into a reward for RL is one option. Collapsing the output distribution into a token is a problem; there's no way to directly propagate the gradient through that collapse and into the original distribution. Without that differentiable connection, analytically computing gradients for the other token options becomes expensive and turns into a question of sampling strategies. Maybe there's something clever floating around.
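
To make the objective concrete, here is a crude best-of-N stand-in: score a handful of candidate prompts by activation MSE and keep the lowest. This is a swapped-in selection strategy, not the RL setup mentioned above, and `toy_activations` is a hypothetical placeholder for running the frozen model and reading activations at the suffix positions.

```python
def mse(a, b):
    if len(a) != len(b):
        raise ValueError("activation vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def toy_activations(prompt):
    # Hypothetical placeholder. A real version would run the frozen model on
    # prompt + suffix and return the suffix activations as a flat vector.
    return [float(len(prompt)), float(sum(ord(c) for c in prompt) % 97)]

def best_retrodiction(target_activations, candidates, activations_fn=toy_activations):
    """Return (error, prompt) for the candidate whose activations best match the target."""
    scored = [(mse(target_activations, activations_fn(c)), c) for c in candidates]
    return min(scored)

target = toy_activations("be nice")  # stands in for (activations | sourcePrompt)
print(best_retrodiction(target, ["be rude", "be nice", "hello"]))
```

The score is usable as a reward signal, but nothing here addresses the gradient problem; it just enumerates and compares.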

Note that retrodicting with an activation objective has some downsides:

  1. If the retrodictor's the same model as the predictor, there are some weird feedback loops. The activations become a moving target.
  2. Targeting activations makes the retrodictor model-specific. Without targeting activations, the retrodictor could work for any model in principle.
  3. While the outputs remain constrained to token distributions, the natural endpoint for retrodiction on activations is not necessarily coherent natural language. Adversarially optimizing for tokens which produce a particular activation may go weird places. It'll likely still have some kind of interpretable "vibe," assuming the model isn't too aggressively exploitable.

This class of experiment is expensive for natural language models. I'm not sure how interesting it is at scales realistically trainable on a couple of 4090s.

Comment by porby on porby's Shortform · 2023-12-11T00:04:41.219Z · LW · GW

Another potentially useful metric in the space of "fragility," expanding on #4 above:

The degree to which small perturbations in soft prompt embeddings yield large changes in behavior can be quantified. Perturbations combined with sampling the gradient with respect to some behavioral loss suffice.

This can be thought of as a kind of internal representational fragility. High internal representational fragility would imply that small nudges in the representation can blow up intent.
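
One way to turn that into a number, as a sketch: sample points near the soft prompt and average the gradient norm of the behavioral loss at those points. The soft prompt is treated as a flat list of floats, the gradient is a finite-difference estimate, and the two loss functions are made-up stand-ins; a real setup would use the model's own gradients.

```python
import math
import random

def numerical_gradient(loss_fn, point, h=1e-5):
    # Central finite differences, one coordinate at a time.
    grad = []
    for i in range(len(point)):
        up = list(point)
        up[i] += h
        down = list(point)
        down[i] -= h
        grad.append((loss_fn(up) - loss_fn(down)) / (2 * h))
    return grad

def fragility(loss_fn, soft_prompt, radius=0.1, samples=64, seed=0):
    """Mean gradient norm of loss_fn at random points within radius (per coordinate)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        perturbed = [p + rng.uniform(-radius, radius) for p in soft_prompt]
        grad = numerical_gradient(loss_fn, perturbed)
        total += math.sqrt(sum(g * g for g in grad))
    return total / samples

# Two toy behavioral losses with the same minimum at the origin.
def wide_valley(p):
    return sum(x * x for x in p)

def steep_valley(p):
    return sum(100.0 * x * x for x in p)

optimized_prompt = [0.0, 0.0, 0.0]
print("wide: ", fragility(wide_valley, optimized_prompt))
print("steep:", fragility(steep_valley, optimized_prompt))
```

Both toy losses have zero gradient exactly at the optimized prompt; the difference only shows up once the prompt is nudged, which is the reason for perturbing before sampling the gradient.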

Does internal representational fragility correlate with other notions of "fragility," like the information-required-to-induce-behavior "fragility" in the other subthread about #6? In other words, does requiring very little information to induce a behavior correlate with the perturbed gradients with respect to behavioral loss being large for that input?

Given an assumption that the information content of the soft prompts has been optimized into a local minimum, sampling the gradient directly at the soft prompt should show small gradients. In order for this correlation to hold, there would need to be a steeply bounded valley in the loss landscape. Or to phrase it another way, for this correlation to exist, behaviors which are extremely well-compressed by the model and have informationally trivial pointers would need to correlate with fragile internal representations.

If anything, I'd expect anticorrelation; well-learned regions probably have enough training constraints that they've been shaped into more reliable, generalizing formats that can representationally interpolate to adjacent similar concepts.

That'd still be an interesting thing to observe and confirm, and there are other notions of fragility that could be considered.

Comment by porby on porby's Shortform · 2023-12-10T22:47:36.252Z · LW · GW

A further extension: While relatively obvious in context, this also serves as a great way to automate adversarial jailbreak attempts (broadly construed), and to quantify how resistant a given model or prompting strategy is to jailbreaks.

Set up your protections, then let SGD try to jailbreak it. The strength of the protections can be measured by the amount of information required to overcome the defenses to achieve some adversarial goal.

In principle, a model could be perfectly resistant and there would be no quantity of information sufficient to break it. That'd be good to know!

This kind of adversarial prompt automation could also be trivially included in an evaluations program.

I can't imagine that this hasn't been done before. If anyone has seen something like this, please let me know.

Comment by porby on porby's Shortform · 2023-12-10T22:38:47.691Z · LW · GW

Expanding on #6 from above more explicitly, since it seems potentially valuable:

From the goal agnosticism FAQ:

The definition as stated does not put a requirement on how "hard" it needs to be to specify a dangerous agent as a subset of the goal agnostic system's behavior. It just says that if you roll the dice in a fully blind way, the chances are extremely low. Systems will vary in how easy they make it to specify bad agents.

From an earlier experimentpost:

Figure out how to think about the "fragility" of goal agnostic systems. Conditioning a predictor can easily yield an agent that is not goal agnostic; this is expected and not inherently problematic. But what if it is trivial to accidentally condition a strong model into being a worldeater, rather than a passive Q&A bot? There's clearly a spectrum here in terms of how "chaotic" a model is—the degree to which small perturbations can yield massive consequences—but it remains conceptually fuzzy.

This can be phrased as "what's the amount of information required to push a model into behavior X?"

Given a frozen model, optimizing prompt tokens gives us a direct way of answering a relevant proxy for this question:

"What is the amount of information (accessible to SGD through soft prompting) required to push a model into behavior X?"

In practice, this seems like it should be a really good proxy, and (provided some compute) it gives you a trivially quantifiable answer:

Try different soft prompt token counts and observe performance on the task that the soft prompts were targeting. The resulting token count versus performance curve characterizes the information/performance tradeoff for that behavior, given that model.
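
A toy sketch of that sweep. The "frozen model" here is a made-up function in which each soft token can contribute at most 1.0 to the output, loosely mimicking bounded information per token; it is not a language model, and `TARGET`, the learning rate, and the threshold are arbitrary assumptions.

```python
import math

TARGET = 2.5  # hypothetical "behavior strength" the frozen toy model must reach

def frozen_model(soft_prompt):
    # Each scalar "token" contributes tanh(e), so at most 1.0 in magnitude.
    return sum(math.tanh(e) for e in soft_prompt)

def train_soft_prompt(num_tokens, steps=2000, lr=0.1):
    """Gradient descent on the soft prompt only; returns the final squared error."""
    prompt = [0.0] * num_tokens
    for _ in range(steps):
        err = frozen_model(prompt) - TARGET
        # d/de (sum(tanh(e)) - TARGET)^2 = 2 * err * (1 - tanh(e)^2)
        prompt = [e - lr * 2.0 * err * (1.0 - math.tanh(e) ** 2) for e in prompt]
    return (frozen_model(prompt) - TARGET) ** 2

def token_count_curve(max_tokens=5):
    return {k: train_soft_prompt(k) for k in range(1, max_tokens + 1)}

def min_tokens_for(curve, loss_threshold=0.01):
    passing = [k for k, loss in sorted(curve.items()) if loss <= loss_threshold]
    return passing[0] if passing else None

curve = token_count_curve()
for k, loss in curve.items():
    print(f"{k} soft tokens -> loss {loss:.4f}")
print("minimum tokens under threshold:", min_tokens_for(curve))
```

The first token count at which the loss crosses the chosen threshold is the number being proposed as the proxy. Here it comes out to 3, purely as a consequence of how the toy was set up.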

This seems like... it's... an extremely good answer to the "fragility" question? It's trivial to incorporate this into an evaluations scheme. Just have a bunch of proxy tasks that would be alarming if they were accessible by trivial differences in prompting.

Conceptually, it's a quantification of the number of information theoretic mistakes you'd need to make to get bad behavior from the model.

Comment by porby on porby's Shortform · 2023-12-10T22:23:02.514Z · LW · GW

Soft prompts are another form of prompt automation that should naturally preserve all the nice properties of goal agnostic architectures.

Does training the model to recognize properties (e.g. 'niceness') explicitly as metatokens via classification make soft prompts better at capturing those properties?

You could test for that explicitly:

  1. Pretrain model A with metatokens with a classifier.
  2. Pretrain model B without metatokens.
  3. Train soft prompts on model A with the same classifier.
  4. Train soft prompts on model B with the same classifier.
  5. Compare performance of soft prompts in A and B using the classifier.

Notes and extensions:

  1. The results of the research are very likely scale sensitive. As the model gets larger, many classifier-relevant distinctions that could be missed by small models lacking metatoken training may naturally get included. In the limit, the metatoken training contribution may become negligible. Is this observable across ~pythia scales? Could do SFT on pythia to get a "model A."
  2. The above description leaves out some complexity. Ideally, the classifier could give scalar scores. This requires scalarized input tokens for the model that pretrains with metatokens.
  3. How does soft prompting work when tokens are forced to be smaller? For example, if each token is a character, it'll likely have a smaller residual dedicated to it compared to tokens that span ~4 characters to equalize total compute.
  4. To what degree does soft prompting verge on a kind of "adversarial" optimization? Does it find fragile representations where small perturbations could produce wildly different results? If so, what kinds of regularization are necessary to push back on that, and what is the net effect of that regularization?
  5. There's no restriction on the nature of the prompt. In principle, the "classifier" could be an RL-style scoring mechanism for any reward. How many tokens does it take to push a given model into particular kinds of "agentic" behavior? For example, how many tokens does it take to encode the prompt corresponding to "maximize the accuracy of the token prediction at index 32 in the sequence"?
  6. More generally: the number of tokens required to specify a behavior could be used as a metric for the degree to which a model "bakes in" a particular functionality. More tokens required to specify behavior successfully -> more information required in that model to specify that behavior.

Comment by porby on porby's Shortform · 2023-12-09T20:42:53.589Z · LW · GW

Quarter-baked experiment:

  1. Stick a sparse autoencoder on the residual stream in each block.
  2. Share weights across autoencoder instances across all blocks.
  3. Train autoencoder during model pretraining.
  4. Allow the gradients from autoencoder loss to flow into the rest of the model.
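
A forward-pass-only sketch of the combined objective, to pin down what "shared weights" means here. The function names, the ReLU encoder without biases, the L1 penalty, and the `sae_weight` coefficient are all assumptions. In a real autodiff framework, step 4 would amount to not detaching the residual stream before it enters the autoencoder; plain Python can only show the loss values.

```python
def matvec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def relu(v):
    return [max(0.0, x) for x in v]

def sae_loss(residual, w_enc, w_dec, l1_coeff):
    """Reconstruction MSE plus an L1 sparsity penalty for one residual vector."""
    codes = relu(matvec(w_enc, residual))
    recon = matvec(w_dec, codes)
    recon_mse = sum((r - x) ** 2 for r, x in zip(recon, residual)) / len(residual)
    sparsity = l1_coeff * sum(codes)  # codes are non-negative after ReLU
    return recon_mse + sparsity

def total_loss(lm_loss, residuals_per_block, w_enc, w_dec, l1_coeff=1e-3, sae_weight=1.0):
    # The same w_enc and w_dec are applied to every block's residual stream.
    shared = sum(sae_loss(r, w_enc, w_dec, l1_coeff) for r in residuals_per_block)
    return lm_loss + sae_weight * shared

identity = [[1.0, 0.0], [0.0, 1.0]]
residuals = [[1.0, 2.0], [0.5, 0.0]]  # one toy residual vector per block
print(total_loss(2.0, residuals, identity, identity, l1_coeff=0.1))
```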

Why? With shared autoencoder weights, every block is pushed toward sharing a representation. Questions:

  1. Do the meanings of features remain consistent over multiple blocks? What does it mean for an earlier block's feature to "mean" the same thing as a later block's same feature when they're at different parts of execution?
  2. How much does a shared representation across all blocks harm performance? Getting the comparison right is subtle; it would be quite surprising if there is no slowdown on predictive training when combined with the autoencoder training since they're not necessarily aligned. Could try training very small models to convergence to see if they have different plateaus.
  3. If forcing a shared representation doesn't harm performance, why not? In principle, blocks can execute different sorts of programs with different IO. Forcing the residual stream to obey a format that works for all blocks without loss would suggest that there were sufficient representational degrees of freedom remaining (e.g. via superposition) to "waste" some when the block doesn't need it. Or the shared "features" mean something completely different at different points in execution.
  4. Compare the size of the dictionary required to achieve a particular specificity of feature between the shared autoencoder and a per-block autoencoder. How much larger is the shared autoencoder? In the limit, it could just be BlockCount times larger with some piece of the residual stream acting as a lookup. It'd be a little surprising if there was effectively no sharing.
  5. Compare post-trained per-block autoencoders against per-block autoencoders embedded in pretraining that allow gradients to flow into the rest of the model. Are there any interesting differences in representation? Maybe in terms of size of dictionary relative to feature specificity? In other words, does pretraining the feature autoencoder encourage a more decodable native representation?
  6. Take a look at the decoded features across blocks. Can you find a pattern for what features are relevant to what blocks? (This doesn't technically require having a shared autoencoder, but having a single shared dictionary makes it easier to point out when the blocks are acting on the same feature, rather than doing an investigation, squinting, and saying "yeah, that sure looks similar.")

Comment by porby on How to Control an LLM's Behavior (why my P(DOOM) went down) · 2023-11-30T18:44:20.907Z · LW · GW

I think that'd be great!

Some of this stuff technically accelerates capabilities (or more specifically, the elicitation of existing capabilities), but I think it also belongs to a more fundamentally reliable path on the tech tree. The sooner the industry embraces it, the less time they spend in other parts of the tech tree that are more prone to misoptimization failures, and the less likely it is that someone figures out how to make those misoptimization failures way more efficient.

I suspect there's a crux about the path of capabilities development in there for a lot of people; I should probably get around to writing a post about the details at some point. 

Comment by porby on How to Control an LLM's Behavior (why my P(DOOM) went down) · 2023-11-29T19:18:37.592Z · LW · GW

What I'm calling a simulator (following Janus's terminology) you call a predictor

Yup; I use the terms almost interchangeably. I tend to use "simulator" when referring to predictors used for a simulator-y use case, and "predictor" when I'm referring to how they're trained and things directly related to that.

I also like your metatoken concept: that's functionally what I'm suggesting for the tags in my proposal, except I follow the suggestion of this paper to embed them via pretraining.

Yup again—to be clear, all the metatoken stuff I was talking about would also fit in pretraining. Pretty much exactly the same thing. There are versions of it that might get some efficiency boosts by not requiring them to be present for the full duration of pretraining, but still similar in concept. (If we can show an equivalence between trained conditioning and representational interventions, and build representational interventions out of conditions, that could be many orders of magnitude faster.) 

Comment by porby on How to Control an LLM's Behavior (why my P(DOOM) went down) · 2023-11-29T19:00:34.191Z · LW · GW

Alas, nope! To my knowledge it hasn't actually been tried at any notable scale; it's just one of those super simple things that would definitely work if you were willing to spend the compute to distill the behavior.

Comment by porby on How to Control an LLM's Behavior (why my P(DOOM) went down) · 2023-11-29T04:06:27.588Z · LW · GW

Signal boosted! This is one of those papers that seems less known than it should be. It's part of the reason why I'm optimistic about dramatic increases in the quality of "prosaic" alignment (in the sense of avoiding jailbreaks and generally behaving as expected) compared to RLHF, and I think it's part of a path that's robust enough to scale.

You can compress huge prompts into metatokens, too (just run inference with the prompt to generate the training data). And nest and remix metatokens together.

It's also interesting in that it can preserve the constraints on learnable values during predictive training, unlike approaches equivalent to RL with sparse/distant rewards.

The fact that the distinctions it learns about the metatokens become better and better as more optimization pressure is applied is an interesting inversion of the usual doom-by-optimization story. Taking such a model to the extreme of optimization just makes it exceedingly good at distinguishing subtle details of what constitutes <nice> versus <authoritative_tone> versus <correct>. It's an axis of progress in alignment that generalizes as the capability does; the capability is the alignment. I'm pretty certain that a model that has very thoroughly learned what "nice" means at the human level can meaningfully generalize it to contexts where it hasn't seen it directly applied.[1]

I'm also reasonably confident in finding some other paths to extremely similar effects on internal representations. I wouldn't be surprised if we can decompose conditions into representational features to learn about what they mean at the learned feature level, then cobble together new inference-time conditions via representational intervention that would have equivalent effects to training new metatokens. 

  1. ^

    After all, ChatGPT4/DALLE3 can generate an image of a vacuum cleaner that "embodies the aspirational human trait of being kind to one another." That seems like more of a reach than a hypothetical superintelligence figuring out that humans wouldn't be okay with, say, a superscience plan that would blow up 25% of the earth's crust.

    Generated by DALL·E 

Comment by porby on Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense · 2023-11-29T02:10:28.495Z · LW · GW

I claim we are many scientific insights away from being able to talk about these questions at the level of precision necessary to make predictions like this.

Hm, I'm sufficiently surprised at this claim that I'm not sure that I understand what you mean. I'll attempt a response on the assumption that I do understand; apologies if I don't:

I think of tools as agents with oddly shaped utility functions. They tend to be conditional in nature.

A common form is to be a mapping between inputs and outputs that isn't swayed by anything outside of the context of that mapping (which I'll term "external world states"). You can view a calculator as a coherent agent, but you can't usefully describe the calculator as a coherent agent with a utility function regarding world states that are external to the calculator's process.

You could use a calculator within a larger system that is describable as a maximizer over a utility function that includes unconditional terms for external world states, but that doesn't change the nature of the calculator. Draw the box around the calculator within the system? Pretty obviously a tool. Draw the box around the whole system? Not a tool.

I've been using the following two requirements to point at a maximally[1] tool-like set of agents. This composes what I've been calling goal agnosticism:

  1. The agent cannot be usefully described[2] as having unconditional preferences about external world states.
  2. Any uniformly random sampling of behavior from the agent has a negligible probability of being a strong and incorrigible optimizer.   

Note that this isn't the same thing as a definition for "tool." An idle rock uselessly obeys this definition; tools tend to be useful for something. This definition is meant to capture the distinction between things that feel like tools and those that feel like "proper" agents.

To phrase it another way, the intuitive degree of "toolness" is a spectrum of how much the agent exhibits unconditional preferences about external world states through instrumental behavior.

Notably, most pretrained LLMs with the usual autoregressive predictive loss and a diverse training set are heavily constrained into fitting this definition. Anything equivalent to RL agents trained with sparse/distant rewards is not. RLHF bakes a condition into the model of peculiar shape. I wouldn't be surprised if it doesn't strictly obey the definition anymore, but it's close enough along the spectrum that it still feels intuitive to call it a tool.

Further, just like in the case of the calculator, you can easily build a system around a goal agnostic "tool" LLM that is not, itself, goal agnostic. Even prompting is enough to elicit a new agent-in-effect that is not necessarily goal agnostic. The ability for a goal agnostic agent to yield non-goal agnostic agents does not break the underlying agent's properties.[3]

  1. ^

    For one critical axis in the toolishness basis, anyway.

  2. ^

    Tricky stuff like having a bunch of terms regarding external world states that just so happen to always cancel doesn't count.

  3. ^

    This does indeed sound kind of useless, but I promise the distinction does actually end up mattering quite a lot! That discussion goes beyond the scope of this post. The FAQ goes into more depth.

Comment by porby on Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense · 2023-11-29T01:05:43.508Z · LW · GW

While this probably isn't the comment section for me to dump screeds about goal agnosticism, in the spirit of making my model more legible:

I think that if it is easy and obvious how to make a goal-agnostic AI into a goal-having AI, and also it seems like doing so will grant tremendous power/wealth/status to anyone who does so, then it will get done. And I do think that these things are the case.

Yup! The value I assign to goal agnosticism—particularly as implemented in a subset of predictors—is in its usefulness as a foundation to build strong non-goal agnostic systems that aren't autodoomy. The transition out of goal agnosticism is not something I expect to avoid, nor something that I think should be avoided.

I think that a mish-mash of companies and individual researchers acting with little effective oversight will almost certainly fall off the path, and that even having most people adhering to the path won't be enough to stop catastrophe once someone has defected.

I'd be more worried about this if I thought the path was something that required Virtuous Sacrifice to maintain. In practice, the reason I'm as optimistic (nonmaximally pessimistic?) as I am is that I think there are pretty strong convergent pressures to stay on something close enough to the non-autodoom path.

In other words, if my model of capability progress is roughly correct, then there isn't a notably rewarding option to "defect" architecturally/technologically that yields greater autodoom.

With regard to other kinds of defection:

I also think that misuse can lead more directly to catastrophe, through e.g. terrorists using a potent goal-agnostic AI to design novel weapons of mass destruction. So in a world with increasingly potent and unregulated AI, I don't see how to have much hope for humanity.

Yup! Goal agnosticism doesn't directly solve misuse (broadly construed), which is part of why misuse is ~80%-ish of my p(doom).

And I also don't see any easy way to do the necessary level of regulation and enforcement. That seems like a really hard problem. How do we prevent ALL of humanity from defecting when defection becomes cheap, easy-to-hide, and incredibly tempting?

If we muddle along deeply enough into a critical risk period slathered in capability overhangs that TurboDemon.AI v8.5 is accessible to every local death cult and we haven't yet figured out how to constrain their activity, yup, that's real bad.

Given my model of capability development, I think there are many incremental messy opportunities to act that could sufficiently secure the future over time. Given the nature of the risk and how it can proliferate, I view it as much harder to handle than nukes or biorisk, but not impossible.

Comment by porby on porby's Shortform · 2023-11-28T00:41:51.558Z · LW · GW

Another experiment:

  1. Train model M.
  2. Train sparse autoencoder feature extractor for activations in M.
  3. FT = FineTune(M), for some form of fine-tuning function FineTune.
  4. For input x, fineTuningBias(x) = FT(x) - M(x)
  5. Build a loss function on top of the fineTuningBias function. Obvious options are MSE or dot product with bias vector.
  6. Backpropagate the loss through M(x) into the feature dictionaries.
  7. Identify responsible features by large gradients.
  8. Identify what those features represent (manually or AI-assisted).
  9. To what degree do those identified features line up with the original FineTune function's intent?
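Steps 3 through 7 can be sketched with a tiny, purely linear toy. Everything below is an assumption for illustration: the decoder matrix stands in for a trained sparse autoencoder dictionary, the two readout matrices stand in for M and FT (which here share activations and differ only in the readout), and the gradient is written out by hand because the toy is linear. A real experiment would use an autodiff framework and actual model activations.

```python
# Toy linear stand-in for the fine-tuning-bias attribution steps.
# All matrices and names here are hypothetical.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


# Assumed dictionary decoder: columns are feature directions in
# activation space (2 activation dims, 3 dictionary features).
DECODER = [[1.0, 0.0, 0.5],
           [0.0, 1.0, 0.5]]

# Assumed readouts of the base model M and the fine-tuned model FT.
READOUT_M = [[1.0, 0.0],
             [0.0, 1.0]]
READOUT_FT = [[1.0, 0.0],
              [0.0, 2.0]]


def forward(readout, features):
    """Features -> activations (via the dictionary) -> model output."""
    return matvec(readout, matvec(DECODER, features))


def fine_tuning_bias(features):
    """Step 4: FT(x) - M(x)."""
    ft_out = forward(READOUT_FT, features)
    m_out = forward(READOUT_M, features)
    return [a - b for a, b in zip(ft_out, m_out)]


def feature_gradients(features):
    """Steps 5-6: gradient of dot(M(x), bias) w.r.t. the feature
    activations, treating the bias vector as a constant."""
    bias = fine_tuning_bias(features)
    grad_activations = matvec(transpose(READOUT_M), bias)
    return matvec(transpose(DECODER), grad_activations)


features = [1.0, 1.0, 0.0]
grads = feature_gradients(features)
# Step 7: rank features by gradient magnitude, largest first.
ranking = sorted(range(len(grads)), key=lambda j: abs(grads[j]), reverse=True)
```

Here the fine-tune only amplifies the second output dimension, so the features whose directions load on that dimension receive the largest gradients.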


  1. The features above are in the context of a single input. Check for larger scopes by sampling more inputs, backpropagating, and averaging the observed feature activations. Check for ~unconditional shifts induced by FineTune by averaging over an extremely broad sampling of inputs.
  2. Can check path dependence during RLHF-like fine tuning. Do the features modified across multiple RLHF runs remain similar? Note that this does not require interpreting what features represent, just that they differ. That makes things easier! (Also, note that this doesn't technically require a feature dictionary; the sparse autoencoder bit just makes it easier to reason about the resulting direction.)
  3. Can compare representations learned by decision transformers versus PPO-driven RLHF. Any difference between the features affected? Any difference in the degree of path dependence?
  4. Can compare other forms of conditioning. Think [2302.08582] Pretraining Language Models with Human Preferences. In this case, there wouldn't really be a fine-tuning training stage; rather, the existence of the condition would serve as the runtime FineTune function. Compare the features between the conditioned and unconditioned cases. Presence of the conditions in pretraining could change the expressed features, but that's not a huge problem.
  5. Any way to meaningfully compare against activation steering? Given that the analysis is based directly on the activations to begin with, it would just be a question of where the steering vector came from. The feature dictionary could be used to build a steering vector, in principle.
  6. Does RLHF change the feature dictionary? On one hand, conditioning-equivalent RL (with KL divergence penalty) shouldn't find new sorts of capability-relevant distinctions, but it's very possible that it collapses some features that are no longer variable in the fine-tuned model. This is trickier to evaluate; could try to train a linear map on the activations of model B before feeding it to an autoencoder trained on model A's activations.  
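
For the broader-scope and path-dependence checks in the list above, one simple comparison is to average per-input feature attributions within each run and then measure how much the top-k feature sets overlap across runs. This is a sketch under assumptions: the attribution vectors are made up, averaging gradient attributions (rather than raw activations) is one choice among several, and Jaccard overlap is just one plausible similarity measure.

```python
# Hypothetical sketch: compare which features two fine-tuning runs touched.

def mean_attribution(per_input_attributions):
    """Average attribution vectors (one per input) feature by feature."""
    n = len(per_input_attributions)
    return [sum(column) / n for column in zip(*per_input_attributions)]


def top_k(attribution, k):
    """Indices of the k features with the largest absolute attribution."""
    order = sorted(range(len(attribution)),
                   key=lambda j: abs(attribution[j]), reverse=True)
    return set(order[:k])


def jaccard(a, b):
    """Set overlap in [0, 1]; 1.0 means identical feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Made-up attributions: two inputs per run, four dictionary features.
run_a = [[0.9, 0.1, 0.0, 0.4], [0.7, 0.3, 0.0, 0.6]]
run_b = [[0.6, 0.0, 0.5, 0.1], [0.8, 0.0, 0.7, 0.1]]

top_a = top_k(mean_attribution(run_a), k=2)
top_b = top_k(mean_attribution(run_b), k=2)
overlap = jaccard(top_a, top_b)
```

Low overlap across otherwise identical runs would be weak evidence of path dependence. As noted above, this does not require interpreting what any feature represents.
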
Comment by porby on porby's Shortform · 2023-11-27T22:42:01.970Z · LW · GW

Some experimental directions I recently wrote up; might as well be public:

  1. Some attempts to demonstrate how goal agnosticism breaks with modifications to the architecture and training type. Trying to make clear the relationship between sparsity/distance of the implicit reward function and unpredictability of results.
  2. A continuation and refinement of my earlier (as of yet unpublished) experiments about out of distribution capability decay. Goal agnosticism is achieved by bounding the development of capabilities into a shape incompatible with internally motivated instrumental behavior across the training distribution; if it's possible for any nontrivial capability to persist out of distribution at toy scales, even with significant contrivance to train it into existence in the first place, that would be extremely concerning for the potential persistence of deceptive mesaoptimizers at scale.

    Ideally, the experiment would examine the difference between OOD capabilities with varying levels of overlap with the training distribution. For example, contrast four cases:
    A: A model is trained on ten different "languages" with zero translation tasks between them. These "languages" would be not human languages, but rather trivial types of sequences that do not share any obvious form or underlying structure. One language could be the sequence generated by f(x) = 2x + 1; another might be to endlessly repeat "brink bronk poot toot."
    B: A model is trained on ten different languages with significantly different form, but a shared underlying structure. For example, all the languages might involve solving trivial arithmetic, but one language is "3 + 4 = 7" and another language is "three plus four equals seven."
    C: Same as B, but now give the model translation tasks.
    D: Same as C, but leave one language pair's translation tasks unspecified. Any successful translation for that pair would necessarily arise from a generalizing implementation.

    For each model, drop parts of the training distribution but continue to perform test evaluations on that discontinued part. Do models with more apparent shared implementation decay more slowly? How does the decay vary with hyperparameters?

    Some circuit-level analysis might be helpful here to identify whether capability is lost via trivial gating versus catastrophic scrambling, but it's probably best to punt that to a separate experiment.
  3. I suspect there is an equivalence between conditioning and representational intervention, like activation steering. They may be different interfaces to the same effect. I'd like to poke around metatoken-like approaches (like Pretraining Language Models with Human Preferences) and see if I can find anything compelling from a representational perspective.
  4. Assuming goal agnosticism is actually achieved and maintained, it broadens the scope for what kinds of interpretability can be useful by ruling out internal representational adversaries. There may be room for more experiments around motivational interpretability. (Some other work has already been published on special cases.)
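
The toy "languages" from direction 2 are cheap to generate. The sketch below is one possible way to do it; the function names, the "=>" translation format, and the language labels are hypothetical choices for illustration, not part of any existing experiment code.

```python
# Hypothetical generators for the toy "languages" in cases A through D.

NUMBER_WORDS = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen",
]


def linear_sequence(length):
    """Case A language: the sequence generated by f(x) = 2x + 1."""
    return " ".join(str(2 * x + 1) for x in range(length))


def repeated_phrase(repeats):
    """Case A language: endlessly repeat a fixed nonsense phrase."""
    return " ".join(["brink bronk poot toot"] * repeats)


def arithmetic_symbolic(a, b):
    """Case B language: symbolic arithmetic."""
    return f"{a} + {b} = {a + b}"


def arithmetic_words(a, b):
    """Case B language: the same arithmetic in words (operands 0 to 9)."""
    return f"{NUMBER_WORDS[a]} plus {NUMBER_WORDS[b]} equals {NUMBER_WORDS[a + b]}"


def translation_task(a, b):
    """Case C: a translation sample between the two arithmetic languages."""
    return f"{arithmetic_symbolic(a, b)} => {arithmetic_words(a, b)}"


def translation_pairs(language_names, held_out=None):
    """Case D: all ordered language pairs, minus one held-out pair
    (dropped in both directions)."""
    pairs = [(a, b) for a in language_names for b in language_names if a != b]
    if held_out is None:
        return pairs
    return [p for p in pairs if set(p) != set(held_out)]
```

For example, `translation_pairs(["symbolic", "words", "roman"], held_out=("symbolic", "words"))` leaves four of the six ordered pairs in training, so any successful symbolic-to-words translation would have to come from generalization.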

Less concretely, I'd also like to:

  1. Figure out how to think about the "fragility" of goal agnostic systems. Conditioning a predictor can easily yield an agent that is not goal agnostic; this is expected and not inherently problematic. But what if it is trivial to accidentally condition a strong model into being a worldeater, rather than a passive Q&A bot? There's clearly a spectrum here in terms of how "chaotic" a model is—the degree to which small perturbations can yield massive consequences—but it remains conceptually fuzzy.
  2. More fully ground "Responsible Scaling Policy"-style approaches on a goal agnostic foundation. If a lab can demonstrate that a model is incapable of learning preferences over external world states, and that their method of aiming the model isn't "fragile" in the above sense, then it's a good candidate for incremental experimentation.
  3. Come up with other ways to connect this research path with policy more generally.
Comment by porby on Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense · 2023-11-27T18:22:45.419Z · LW · GW

In retrospect, the example I used was poorly specified. It wouldn't surprise me if the result of the literal interpretation was "the AI refuses to play chess" rather than any kind of worldeating. The intent was to pick a sparse/distant reward that doesn't significantly constrain the kind of strategies that could develop, and then run an extreme optimization process on it. In other words, while intermediate optimization may result in improvements to chess playing, being better at chess isn't actually the most reliable accessible strategy to "never lose at chess" for that broader type of system and I'd expect superior strategies to be found in the limit of optimization.

Comment by porby on Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense · 2023-11-27T18:11:01.432Z · LW · GW

But the point is that in this scenario the LM doesn't want anything in the behaviorist sense, yet is a perfectly adequate tool for solving long-horizon tasks. This is not the form of wanting you need for AI risk arguments.

My attempt at an ITT-response:

Drawing a box around a goal agnostic LM and analyzing the inputs and outputs of that box would not reveal any concerning wanting in principle. In contrast, drawing a box around a combined system—e.g. an agentic scaffold that incrementally asks a strong inner goal agnostic LM to advance the agent's process—could still be well-described by a concerning kind of wanting.

Trivially, being better at achieving goals makes achieving goals easier, so there's pressure to make system-as-agents which are better at removing wrenches. As the problems become more complicated, the system needs to be more responsible for removing wrenches to be efficient, yielding further pressure to give the system-as-agent more ability to act. Repeat this process a sufficient and unknown number of times and, potentially without ever training a neural network describable as having goals with respect to external world states, there's a system with dangerous optimization power.

(Disclaimer: I think there are strong repellers that prevent this convergent death spiral, I think there are lots of also-attractive-for-capabilities offramps along the worst path, and I think LM-like systems make these offramps particularly accessible. I don't know if I'm reproducing opposing arguments faithfully and part of the reason I'm trying is to see if someone can correct/improve on them.)

Comment by porby on Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense · 2023-11-25T18:05:02.916Z · LW · GW

Trying to respond in what I think the original intended frame was:

A chess AI's training bounds what the chess AI can know and learn to value. Given the inputs and outputs it has, it isn't clear there is an amount of optimization pressure accessible to SGD which can yield situational awareness and so forth; nothing about the trained mapping incentivizes that. This form of chess AI can be described in the behaviorist sense as "wanting" to win within the boundaries of the space that it operates.

In contrast, suppose you have a strong and knowledgeable multimodal predictor trained on all data humanity has available to it that can output arbitrary strings. Then apply extreme optimization pressure for never losing at chess. Now, the boundaries of the space in which the AI operates are much broader, and the kinds of behaviorist "values" the AI can have are far less constrained. It has the ability to route through the world, and with extreme optimization, it seems likely that it will.

(For background, I think it's relatively easy to relocate where the optimization squeezing is happening to avoid this sort of worldeating outcome, but it remains true that optimization for targets with ill-defined bounds is spooky and to be avoided.)

Comment by porby on FAQ: What the heck is goal agnosticism? · 2023-11-25T00:22:16.102Z · LW · GW

you mention « restrictive », my understanding is that you want this expression to specifically refer to pure predictors. Correct?

Goal agnosticism can, in principle, apply to things which are not pure predictors, and there are things which could reasonably be called predictors which are not goal agnostic.

A subset of predictors are indeed the most powerful known goal agnostic systems. I can't currently point you toward another competitive goal agnostic system (rocks are uselessly goal agnostic), but the properties of goal agnosticism do, in concept, extend beyond predictors, so I leave the door open.

Also, by using the term "goal agnosticism" I try to highlight the value that arises directly from the goal-related properties, like statistical passivity and the lack of instrumental representational obfuscation. I could just try to use the more limited and implementation specific "ideal predictors" I've used before, but in order to properly specify what I mean by an "ideal" predictor, I'd need to specify goal agnosticism.

Comment by porby on Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense · 2023-11-24T23:45:51.681Z · LW · GW

I'm not sure if I fall into the bucket of people you'd consider this to be an answer to. I do think there's something important in the region of LLMs that, by vibes if not explicit statements of contradiction, seems incompletely propagated in the agent-y discourse even though it fits fully within it. I think I at least have a set of intuitions that overlap heavily with some of the people you are trying to answer.

In case it's informative, here's how I'd respond to this:

Well, I claim that these are more-or-less the same fact. It's no surprise that the AI falls down on various long-horizon tasks and that it doesn't seem all that well-modeled as having "wants/desires"; these are two sides of the same coin.

Mostly agreed, with the capability-related asterisk.

Because the way to achieve long-horizon targets in a large, unobserved, surprising world that keeps throwing wrenches into one's plans, is probably to become a robust generalist wrench-remover that keeps stubbornly reorienting towards some particular target no matter what wrench reality throws into its plans.

Agreed in the spirit that I think this was meant, but I'd rephrase this: a robust generalist wrench-remover that keeps stubbornly reorienting towards some particular target will tend to be better at reaching that target than a system that doesn't.

That's subtly different from individual systems having convergent internal reasons for taking the same path. This distinction mostly disappears in some contexts, e.g. selection in evolution, but it is meaningful in others.

If an AI causes some particular outcome across a wide array of starting setups and despite a wide variety of obstacles, then I'll say it "wants" that outcome “in the behaviorist sense”.

I think this frame is reasonable, and I use it.

it's a little hard to imagine that you don't contain some reasonably strong optimization that strategically steers the world into particular states.


that the wanting-like behavior required to pursue a particular training target X, does not need to involve the AI wanting X in particular.


“AIs need to be robustly pursuing some targets to perform well on long-horizon tasks”, but it does not say that those targets have to be the ones that the AI was trained on (or asked for). Indeed, I think the actual behaviorist-goal is very unlikely to be the exact goal the programmers intended, rather than (e.g.) a tangled web of correlates.

Agreed for a large subset of architectures. Any training involving the equivalent of extreme optimization for sparse/distant reward in a high dimensional complex context seems to effectively guarantee this outcome.

 So, maybe don't make those generalized wrench-removers just yet, until we do know how to load proper targets in there.

Agreed, don't make the runaway misaligned optimizer.

I think there remains a disagreement hiding within that last point, though. I think the real update from LLMs is:

  1. We have a means of reaching extreme levels of capability without necessarily exhibiting preferences over external world states. You can elicit such preferences, but a random output sequence from the pretrained version of GPT-N (assuming the requisite architectural similarities) has no realistic chance of being a strong optimizer with respect to world states. The model itself remains a strong optimizer, just for something that doesn't route through the world.
  2. It's remarkably easy to elicit this form of extreme capability to guide itself. This isn't some incidental detail; it arises from the core process that the model learned to implement.
  3. That core process is learned reliably because the training process that yielded it leaves no room for anything else. It's not a sparse/distant reward target; it is a profoundly constraining and informative target.

In other words, a big part of the update for me was in having a real foothold on loading the full complexity of "proper targets."

I don't think what we have so far constitutes a perfect and complete solution, the nice properties could be broken, paradigms could shift and blow up the golden path, it doesn't rule out doom, and so on, but diving deeply into this has made many convergent-doom paths appear dramatically less likely to Late2023!porby compared to Mid2022!porby.

Comment by porby on What's the evidence that LLMs will scale up efficiently beyond GPT4? i.e. couldn't GPT5, etc., be very inefficient? · 2023-11-24T21:23:33.124Z · LW · GW

This isn't directly evidence, but I think it's worth flagging: by the nature of the topic, much of the most compelling evidence is potentially hazardous. This will bias the kinds of answers you can get.

(This isn't hypothetical. I don't have some One Weird Trick To Blow Up The World, but there's a bunch of stuff that falls under the policy "probably don't mention this without good reason out of an abundance of caution.")

Comment by porby on TurnTrout's shortform feed · 2023-11-23T21:25:32.943Z · LW · GW

For what it's worth, I've had to drop from python to C# on occasion for some bottlenecks. In one case, my C# implementation was 418,000 times faster than the python version. That's a comparison between a poor python implementation and a vectorized C# implementation, but... yeah.

Comment by porby on FAQ: What the heck is goal agnosticism? · 2023-11-14T02:41:08.545Z · LW · GW

…but I thought the criterion was unconditional preference? The idea of nausea is precisely because agents can decide to act despite nausea, they’d just rather find a better solution (if their intelligence is up to the task).

Right; a preference being conditionally overwhelmed by other preferences does not make the presence of the overwhelmed preference conditional.

Or to phrase it another way, suppose I don't like eating bread[1] (-1 utilons), but I do like eating cheese (100 utilons) and garlic (1000 utilons).

You ask me to choose between garlic bread (1000 - 1 = 999 utilons) and cheese (100 utilons); I pick the garlic bread.

The fact that I don't like bread isn't erased by the fact that I chose to eat garlic bread in this context.
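
The same arithmetic as a minimal sketch, assuming a simple additive utility model (an assumption made only for illustration):

```python
# Hypothetical additive utilities, in utilons, from the example above.
UTILITY = {"bread": -1, "cheese": 100, "garlic": 1000}

OPTIONS = {
    "garlic bread": ["garlic", "bread"],
    "cheese": ["cheese"],
}


def total_utility(ingredients):
    """Sum the per-ingredient utilities of one option."""
    return sum(UTILITY[item] for item in ingredients)


# The choice is driven by the total, not by any single term.
choice = max(OPTIONS, key=lambda name: total_utility(OPTIONS[name]))
```

The bread term stays at -1 regardless of what gets chosen; it is outweighed in this context, not removed.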

It also seems to cover security (if we’re dead it won’t know), health (if we’re incapacitated it won’t know) and prosperity (if we’re under economical constraints that impacts our free will). But I’m interested to consider possible failure modes.

This is aiming at a different problem than goal agnosticism; it's trying to come up with an agent that is reasonably safe in other ways.

In order for these kinds of bounds (curiosity, nausea) to work, they need to incorporate enough of the human intent behind the concepts.

So perhaps there is an interpretation of those words that is helpful, but there remains the question "how do you get the AI to obey that interpretation," and even then, that interpretation doesn't fit the restrictive definition of goal agnosticism.

The usefulness of strong goal agnostic systems (like ideal predictors) is that, while they do not have properties like those by default, they make it possible to incrementally implement those properties.

  1. ^

    utterly false for the record

Comment by porby on FAQ: What the heck is goal agnosticism? · 2023-11-10T00:01:04.548Z · LW · GW

For example, a system that avoids experimenting on humans—even when prompted to do so otherwise—is expressing a preference about humans being experimented on by itself.

Being meaningfully curious will also come along with some behavioral shift. If you tried to induce that behavior in a goal agnostic predictor through conditioning for being curious in that way and embed it in an agentic scaffold, it wouldn't be terribly surprising for it to, say, set up low-interference observation mechanisms.

Not all violations of goal agnosticism necessarily yield doom, but even prosocial deviations from goal agnosticism are still deviations.

Comment by porby on TurnTrout's shortform feed · 2023-11-09T23:43:35.301Z · LW · GW

I think what we're discussing requires approaching the problem with a mindset entirely foreign to the mainstream one. Consider how many words it took us to get to this point in the conversation, despite the fact that, as it turns out, we basically agree on everything. The inferential distance between the standard frameworks in which AI researchers think, and here, is pretty vast.


I expect that if the mainstream AI researchers do make strides in the direction you're envisioning, they'll only do it by coincidence. Then probably they won't even realize what they've stumbled upon, do some RLHF on it, be dissatisfied with the result, and keep trying to make it have agency out of the box. (That's basically what already happened with GPT-4, to @janus' dismay.)

Yup—this is part of the reason why I'm optimistic, oddly enough. Before GPT-likes became dominant in language models, there was all kinds of flailing that often pointed in more agenty-by-default directions. That flailing then found GPT because it was easily accessible and strong. 

Now, the set of architectural pieces subject to similar flailing is much smaller, and I'm guessing we're only one round of benchmarks at scale from a major lab before the flailing shrinks dramatically further.

In other words, I think the necessary work to make this path take off is small and the benefits will be greedily visible. I suspect one well-positioned researcher could probably swing it.

That said, you're making some high-quality novel predictions here, and I'll keep them in mind when analyzing AI advancements going forward.

Thanks, and thanks for engaging!

Come to think of it, I've got a chunk of mana lying around for subsidy. Maybe I'll see if I can come up with some decent resolution criteria for a market.

Comment by porby on TurnTrout's shortform feed · 2023-11-08T21:54:51.615Z · LW · GW

I assume that by "lower-level constraints" you mean correlations that correctly capture the ground truth of reality, not just the quirks of the training process. Things like "2+2=4",  "gravity exists", and "people value other people"

That's closer to what I mean, but these constraints are even lower level than that. Stuff like understanding "gravity exists" is a natural internal implementation that meets some constraints, but "gravity exists" is not itself the constraint.

In a predictor, the constraints serve as extremely dense information about what predictions are valid in what contexts. In a subset of predictions, the awareness that gravity exists helps predict. In other predictions, that knowledge isn't relevant, or is even misleading (e.g. cartoon physics). The constraints imposed by the training distribution tightly bound the contextual validity of outputs.

But since they're not, at the onset, categorized differently at the level of cognitive algorithms, a nascent AGI would experiment with slipping both types of constraints.

I'd agree that, if you already have an AGI of that shape, then yes, it'll do that. I'd argue that the relevant subset of predictive training practically rules out the development of that sort of implementation, and even if it managed to develop, its influence would be bounded into irrelevance.

Even in the absence of a nascent AGI, these constraints are tested constantly during training through noise and error. The result is a densely informative gradient pushing the implementation back towards a contextually valid state.

Throughout the training process prior to developing strong capability and situational awareness internally, these constraints are both informing and bounding what kind of machinery makes sense in context. A nascent AGI must have served the extreme constraints of the training distribution to show up in the first place; its shape is bound by its development, and any part of that shape that "tests" constraints in a way that worsens loss is directly reshaped.

Even if a nascent internal AGI of this type develops, if it isn't yet strong enough to pull off complete deception with respect to the loss, the gradients will illuminate the machinery of that proto-optimizer and it will not survive in that shape.

Further, even if we suppose a strong internal AGI develops that is situationally aware and is sufficiently capable and motivated to try deception, there remains the added dependency on actually executing that deception while never being penalized by gradients. This remains incredibly hard. It must transition into an implementation that satisfies the oppressive requirements of training while adding an additional task of deception without even suffering a detectable complexity penalty.

These sorts of deceptive mesaoptimizer outcomes are far more likely when the optimizer has room to roam. I agree that you could easily observe this kind of testing and slipping when the constraints under consideration are far looser, but the kind of machine that is required by these tighter constraints doesn't even bother with trying to slip constraints. It's just not that kind of machine, and there isn't a convergent path for it to become that kind of machine under this training mechanism.

And despite that lack of an internal motivation to explore and exploit with respect to any external world states, it still has capabilities (in principle) which, when elicited, make it more than enough to eat the universe.

Does that align with what you're envisioning? If yes, then our views on the issue are surprisingly close. I think it's one of our best chances at producing an aligned AI, and it's one of the prospective targets of my own research agenda.


I don't think the current mainstream research directions are poised to result in this. AI Labs have been very clear in their intent to produce an agent-like AGI, not a superhuman forecasting tool. I expect them to prioritize research into whatever tweaks to the training schemes would result in homunculi; not whatever research would result in perfect predictors + our ability to precisely query them.

I agree that they're focused on inducing agentiness for usefulness reasons, but I'd argue the easiest and most effective way to get to useful agentiness actually routes through this kind of approach.

This is the weaker leg of my argument; I could be proven wrong by some new paradigm. But if we stay on something like the current path, it seems likely that the industry will just do the easy thing that works rather than the inexplicable thing that often doesn't work.

What are the "other paths" you're speaking of? As you'd pointed out, prompts are a weak and awkward way to run custom queries on the AI's world-model. What alternatives are you envisioning?

I'm pretty optimistic about members of a broad class that are (or likely are) equivalent to conditioning, since these paths tend to preserve the foundational training constraints.

A simple example is [2302.08582] Pretraining Language Models with Human Preferences. Having a "good" and "bad" token, or a scalarized goodness token, still pulls in many of the weaknesses of RLHF's strangely shaped reward function, but there are trivial/naive extensions to this which I would anticipate being major improvements over the state of the art. For example, just have more (scalarized) metatokens representing more concepts such that the model must learn a distinction between being correct and sounding correct, because the training process split those into different tokens. There's no limit on how many such metatokens you could have; throw a few hundred fine-grained classifications into the mix. You could also bake complex metatoken prompts into single tokens with arbitrary levels of nesting or bake the combined result into the weights (though I suspect weight-baking would come with some potential failure modes).[1]
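As a rough illustration of the metatoken idea, here is a minimal sketch of how training text might be tagged. Everything in it is an assumption made for illustration: the `<|name:level|>` token format, the concept names, the bucket count, and the premise that some external classifiers have already produced scores in [0, 1]. It is not the scheme from the paper.

```python
def bucket(score, levels=5):
    """Quantize a score in [0, 1] into an integer bucket 0..levels-1."""
    clamped = min(max(score, 0.0), 1.0)
    return min(int(clamped * levels), levels - 1)


def metatoken_prefix(scores, levels=5):
    """Build a prefix with one scalarized metatoken per concept.

    `scores` maps a hypothetical concept name to a classifier score.
    Names are sorted so the prefix order is stable across samples.
    """
    return "".join(
        f"<|{name}:{bucket(value, levels)}|>"
        for name, value in sorted(scores.items())
    )


def to_training_text(text, scores, levels=5):
    """Prepend the metatokens so the model learns p(text | metatokens)."""
    return metatoken_prefix(scores, levels) + text


# Training time: tag a sample that is correct but does not sound it.
tagged = to_training_text(
    "The answer is 4.", {"correct": 0.95, "sounds_correct": 0.3}
)

# Inference time: condition on the combination we actually want.
wanted = metatoken_prefix({"correct": 1.0, "sounds_correct": 0.0})
```

Because "correct" and "sounds_correct" are separate tokens, samples where the two disagree give the model a reason to represent them separately, which is the distinction described above.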

Another more recent path is observing the effect that conditions have on activations and dynamically applying the activation diffs to steer behavior. At the moment, I don't know how to make this quite as strong as the previous conditioning scheme, but I bet people will figure out a lot more soon and that it leads somewhere similar.
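A minimal sketch of the activation-diff idea, using plain Python lists as stand-ins for activation vectors. The function names, the choice to average over samples, and the `scale` parameter are all assumptions for illustration; real implementations operate on a specific layer of a real model and differ in many details.

```python
def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    count = len(rows)
    return [sum(column) / count for column in zip(*rows)]


def steering_vector(conditioned, baseline):
    """Mean activation with the condition minus mean activation without it."""
    with_condition = mean_vector(conditioned)
    without_condition = mean_vector(baseline)
    return [c - b for c, b in zip(with_condition, without_condition)]


def steer(activation, direction, scale=1.0):
    """Add the scaled diff to an activation at inference time."""
    return [a + scale * d for a, d in zip(activation, direction)]


# Toy 2-dimensional "activations" recorded with and without a condition.
conditioned = [[1.0, 2.0], [3.0, 4.0]]
baseline = [[0.0, 0.0], [2.0, 2.0]]

direction = steering_vector(conditioned, baseline)
steered = steer([0.0, 0.0], direction, scale=0.5)
```

The `scale` knob is what makes this an incremental interface: the same diff can be applied weakly or strongly without retraining anything.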

  1. ^

    There should exist some reward signal which could achieve a similar result in principle, but that goes back to the whole "we suck at designing rewards that result in what we want" issue. This kind of structure, as ad hoc as it is, is giving us an easier API to lever the model's own capability to guide its behavior. I bet we can come up with even better implementations, too.

Comment by porby on FAQ: What the heck is goal agnosticism? · 2023-11-08T21:00:02.662Z · LW · GW

Probably not? It's tough to come up with an interpretation of those properties that wouldn't result in the kind of unconditional preferences that break goal agnosticism.

Comment by porby on TurnTrout's shortform feed · 2023-11-07T19:28:25.624Z · LW · GW

I'm using it as "an optimization constraint on actions/plans that correlated well with good performance on the training dataset; a useful heuristic".

Alright, this is pretty much the same concept then, but the ones I'm referring to operate at a much lower and tighter level than thumbs-downing murder-proneness.


Such constraints are, for example, the reason our LLMs are able to produce coherent speech at all, rather than just babbling gibberish.


... and yet this would still get in the way of qualitatively more powerful capabilities down the line, and a mind that can't somehow slip these constraints won't be a general intelligence.

While I agree these claims probably hold for the concrete example of thumbs-downing an example of murderproneness, I don't see how they hold for the lower-level constraints that imply the structure of its capability. Slipping those constraints looks more like babbling gibberish.

By default, those would be constrained to be executed the way humans execute them, the way the AI was shown to do it during the training. But the whole point of an AGI is that it should be able to invent better solutions than ours. More efficient ways of thinking, weird super-technological replacements for our construction techniques, etc.

While it's true that an AI probably isn't going to learn true things which are utterly divorced from and unimplied by the training distribution, I'd argue that the low-level constraints I'm talking about both leave freedom for learning wildly superhuman internal representations and directly incentivize it during extreme optimization. An "ideal predictor" wouldn't automatically start applying these capabilities towards any particular goal involving external world states by default, but it remains possible to elicit those capabilities incrementally.

Making the claim more concise: it seems effectively guaranteed that the natural optimization endpoint of one of these architectures would be plenty general to eat the universe if it were aimed in that direction. That process wouldn't need to involve slipping any of the low-level constraints.

I'm guessing the disconnect between our models is where the aiming happens. I'm proposing that the aiming is best (and convergently) handled outside the scope of wildly unpredictable and unconstrained optimization processes. Instead, it takes place at a level where a system of extreme capability infers the gaps in specifications and applies conditions robustly. The obvious and trivial version of this is conditioning through prompts, but this is a weak and annoying interface. There are other paths that I suspect bottom out at equivalent power/safety yet should be far easier to use in a general way. These paths allow incremental refinement by virtue of not automatically summoning up incorrigible maximizers by default.

If the result of refinement isn't an incorrigible maximizer, then slipping the higher level "constraints" of this aiming process isn't convergent (or likely), and further, the nature of these higher-level constraints would be far more thorough than anything we could achieve with RLHF.

In fact, my model says there's no fundamental typological difference between "a practical heuristic on how to do a thing" and "a value" at the level of algorithmic implementation. It's only in the cognitive labels we-the-general-intelligences assign them.

That's pretty close to how I'm using the word "value" as well. Phrased differently, it's a question of how the agent's utilities are best described (with some asterisks around the non-uniqueness of utility functions and whatnot), and observable behavior may arise from many different implementation strategies—values, heuristics, or whatever.

Comment by porby on TurnTrout's shortform feed · 2023-11-07T04:32:06.225Z · LW · GW

I think we're using the word "constraint" differently, or at least in different contexts.

Sure! Human values are not arbitrary either; they, too, are very heavily constrained by our instincts. And yet, humans still sometimes become omnicidal maniacs, Hell-worshipers, or sociopathic power-maximizers. How come?

In terms of the type and scale of optimization constraint I'm talking about, humans are extremely unconstrained. The optimization process represented by our evolution is way out there in terms of sparsity and distance. Not maximally so—there are all sorts of complicated feedback loops in our massive multiagent environment—but it's nothing like the value constraints on the subset of predictors I'm talking about.

To be clear, I'm not suggesting "language models are tuned to be fairly close to our values." I'm making a much stronger claim that the relevant subset of systems I'm referring to cannot express unconditional values over external world states across anything resembling the training distribution, and that developing such values out of distribution in a coherent goal directed way practically requires the active intervention of a strong adversary. In other words:

A homunculus needs to be able to nudge these constraints somehow, for it to be useful, and its power grows the more it's able to disregard them.


These constraints do not generalize as fast as a homunculus' understanding goes.

I see no practical path for a homunculus of the right kind, by itself, to develop and bypass the kinds of constraints I'm talking about without some severe errors being made in the design of the system.

Further, this type of constraint isn't the same thing as a limitation of capability. In this context, with respect to the training process, bypassing these kinds of constraints is kind of like a car bypassing having-a-functioning-engine. Every training sample is a constraint on what can be expressed locally, but it's also information about what should be expressed. They are what the machine of Bayesian inference is built out of.

In other words, the hard optimization process is confined to a space where we can actually have reasonable confidence that inner alignment with the loss is the default. If this holds up, turning up the optimization on this part doesn't increase the risk of value drift or surprises; it just increases foundational capability.

The ability to use that capability to aim itself is how the foundation becomes useful. This process need not produce a coherent maximizer over external world states, nor does it necessarily suffer from coherence death spirals driving it towards being a maximizer. It allows incremental progress.

(That said: this is not a claim that all of alignment is solved. These nice properties can be broken, and even if they aren't, the system can be pointed in catastrophic directions. An extremely strong goal agnostic system like this could be used to build a dangerous coherent maximizer (in a nontrivial sense); doing so is just not convergent or particularly useful.)

Comment by porby on TurnTrout's shortform feed · 2023-11-06T20:31:59.810Z · LW · GW

My model says that general intelligence[1] is just inextricable from "true-goal-ness". It's not that I think homunculi will coincidentally appear as some side-effect of capability advancement — it's that the capabilities the AI Labs want necessarily route through somehow incentivizing NNs to form homunculi. The homunculi will appear inasmuch as the labs are good at their jobs.

I've got strong doubts about the details of this. At the high level, I'd agree that strong/useful systems that get built will express preferences over world states like those that could arise from such homunculi, but I expect that implementations that focus on inducing a homunculus directly through (techniques similar to) RL training with sparse rewards will underperform more default-controllable alternatives.

My reasoning would be that we're bad at using techniques like RL with a sparse reward to reliably induce any particular behavior. We can get it to work sometimes with denser reward (e.g. reward shaping) or by relying on a beefy pre-existing world model, but the default outcome is that sparse and distant rewards in a high dimensional space just don't produce the thing we want. When this kind of optimization is pushed too far, it's not merely dangerous; it's useless.

I don't think this is temporary ignorance about how to do RL (or things with similar training dynamics). It's fundamental:

  1. Sparse and distant reward functions in high dimensional spaces give the optimizer an extremely large space to roam. Without bounds, the optimizer is effectively guaranteed to find something weird.
  2. For almost any nontrivial task we care about, a satisfactory reward function takes a dependency on large chunks of human values. The huge mess of implicit assumptions, common sense, and desires of humans are necessary bounds during optimization. This comes into play even at low levels of capability like ChatGPT.
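A toy sketch of the first point, under heavy assumptions: the "task" is matching a 64-bit target string, the optimizer is a simple accept-if-better hill climber, and the two reward functions are made up for illustration. It says nothing about real RL systems beyond showing how little signal a sparse reward provides in a large space.

```python
import random


def sparse_reward(bits, target):
    """1 only on an exact match; no signal anywhere else."""
    return 1 if bits == target else 0


def dense_reward(bits, target):
    """Number of matching positions; every step is informative."""
    return sum(b == t for b, t in zip(bits, target))


def hill_climb(reward, target, steps, rng):
    """Flip one random bit per step and keep the flip only if reward improves."""
    bits = [rng.randint(0, 1) for _ in target]
    best = reward(bits, target)
    for _ in range(steps):
        index = rng.randrange(len(bits))
        bits[index] ^= 1
        score = reward(bits, target)
        if score > best:
            best = score
        else:
            bits[index] ^= 1  # revert the flip
    return bits


rng = random.Random(0)
target = [1] * 64
dense_result = hill_climb(dense_reward, target, 5000, rng)
sparse_result = hill_climb(sparse_reward, target, 5000, rng)
```

With the dense reward, each correcting flip is kept, so the search closes in on the target. With the sparse reward, no single flip improves the score unless the starting point is already one bit away, so the search stays where it began.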

Conspicuously, the strongest general capabilities we have arise from training models with an extremely constraining optimization target. The "values" that can be expressed in pretrained predictors are forced into conditionalization as a direct and necessary part of training; for a reasonably diverse dataset, the resulting model can't express unconditional preferences regarding external world states. While it's conceivable that some form of "homunculi" could arise, their ability to reach out of their appropriate conditional context is directly and thoroughly trained against.

In other words, the core capabilities of the system arise from a form of training that is both densely informative and blocks the development of unconditional values regarding external world states in the foundational model.

Better forms of fine-tuning, conditioning, and activation interventions (the best versions of each, I suspect, will have deep equivalences) are all built on the capability of that foundational system, and can be directly used to aim that same capability. Learning the huge mess of human values is a necessary part of its training, and its training makes eliciting the relevant part of those values easier—that necessarily falls out of being a machine strongly approximating Bayesian inference across a large dataset.

The final result of this process (both pretraining and conditioning or equivalent tuning) is still an agent that can be described as having unconditional preferences about external world states, but the path to get there strikes me as dramatically more robust both for safety and capability.

Summarizing a bit: I don't think it's required to directly incentivize NNs to form value-laden homunculi, and many of the most concerning paths to forming such homunculi seem worse for capabilities.

Comment by porby on Parametrically retargetable decision-makers tend to seek power · 2023-11-03T22:16:09.715Z · LW · GW

If LLMs end up being useful, how do they get around these theorems? Can we get some result where if RLHF has a capabilities component and a power-averseness component, the capabilities component can cause the agent to be power-seeking on net?

Intuitively, eliciting that kind of failure seems like it would be pretty easy, but it doesn't seem to be a blocker for the usefulness of the generalized form of LLMs. My mental model goes something like:

  1. Foundational goal agnosticism evades optimizer-induced automatic doom, and 
  2. Models implementing a strong approximation of Bayesian inference are, not surprisingly, really good at extracting and applying conditions, so
  3. They open the door to incrementally building a system that holds the entirety of a safe wish.

Things like "caring about means," or otherwise incorporating the vast implicit complexity of human intent and values, can arise in this path, while I'm not sure the same can be said for any implementation that tries to get around the need for that complexity.

It seems like the paths which try to avoid importing the full complexity while sticking to crisp formulations will necessarily be constrained in their applicability. In other words, any simple expression of values subject to optimization is only safe within a bounded region. I bet there are cases where you could define those bounded regions and deploy the simpler version safely, but I also bet the restriction will make the system mostly useless.

Biting the bullet and incorporating more of the necessary complexity expands the bounded region. LLMs, and their more general counterparts, have the nice property that turning the screws of optimization on the foundation model actually makes this safe region larger. Making use of this safe region correctly, however, is still not guaranteed😊

Comment by porby on FAQ: What the heck is goal agnosticism? · 2023-11-02T22:53:24.748Z · LW · GW

In my view, if we’d feed a good enough maximizer with the goal of learning to look as if they were a unified goal agnostic agent, then I’d expect the behavior of the resulting algorithm to handle the paradox well enough it’ll make sense.

If you successfully gave a strong maximizer the goal of maximizing a goal agnostic utility function, yes, you could then draw a box around the resulting system and correctly call it goal agnostic.

In my view our volitions look as if from a set of internal thermostats that impulse our behaviors, like the generalization to low n of the spontaneous fighting danse of two thermostats. If the latter can be described as goal agnostic, I don’t think the former shall not (hence my examples of environmental constraints that could let someone use your or my personality as a certified subprogram).

Composing multiple goal agnostic systems into a new system, or just giving a single goal agnostic system some trivial scaffolding, does not necessarily yield goal agnosticism in the new system. It won't necessarily eliminate it, either; it depends on what the resulting system is.

Yes, but shall we also agree that non-goal agnostic agents can produce goal agnostic agent?

Yes; during training, a non-goal agnostic optimizer can produce a goal agnostic predictor.

Comment by porby on Symbol/Referent Confusions in Language Model Alignment Experiments · 2023-10-27T19:44:36.257Z · LW · GW

I agree with the specific claims in this post in context, but the way they're presented makes me wonder if there's a piece missing which generated that presentation.

And the key question for corrigibility is what actions the model would take in response to that observation, which is just a totally different question from how it responds to some user’s natural-language query about being turned off.

It is correct to say that, if you know nothing about the nature of the system's execution, this kind of natural language query is very little information. A deceptive system could output exactly the same thing. It's stronger evidence that the system isn't an agent that's aggressively open with its incorrigibility, but that's pretty useless.

If you somehow knew that, by construction of the underlying language model, there was a strong correlation between these sorts of natural language queries and the actions taken by a candidate corrigible system built on the language model, then this sort of query is much stronger evidence. I still wouldn't call it strong compared to a more direct evaluation, but in this case, guessing that the maybeCorrigibleBot will behave more like the sample query implies is reasonable.

In other words:

Me: Yet more symbol-referent confusion! In fact, this one is a special case of symbol-referent confusion which we usually call “gullibility”, in which one confuses someone’s claim of X (the symbol) as actually implying X (the referent).

If you intentionally build a system where the two are actually close enough to the same thing, this is no longer a confusion.

If my understanding of your position is correct: you wouldn't disagree with that claim, but you would doubt there's a good path to a strong corrigible agent of that approximate form built atop something like modern architecture language models but scaled up in capability. You would expect many simple test cases with current systems like RLHF'd GPT4 in an AutoGPT-ish scaffold with a real shutdown button to work but would consider that extremely weak evidence about the safety properties of a similar system built around GPT-N in the same scaffold.

If I had to guess where we might disagree, it would be in the degree to which language models with architectures similar-ish to current examples could yield a system with properties that permit corrigibility. I'm pretty optimistic about this in principle; I think there is a subset of predictive training that yields high capability with an extremely constrained profile of "values" that make the system goal agnostic by default. I think there's a plausible and convergent path to capabilities that routes through corrigible-ish systems by necessity and permits incremental progress on real safety.

I've proven pretty bad at phrasing the justifications concisely, but if I were to try again: the relevant optimization pressures during the kinds of predictive training I'm referring to directly oppose the development of unconditional preferences over external world states, and evading these constraints carries a major complexity penalty. The result of extreme optimization can be well-described by a coherent utility function, but one representing only a conditionalized mapping from input to output. (This does not imply or require cognitive or perceptual myopia. This also does not imply that an agent produced by conditioning a predictor remains goal agnostic.)
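One way to sketch what "a conditionalized mapping from input to output" could look like as a utility function, assuming an ordinary log-likelihood predictive objective (the symbols here are illustrative assumptions, not notation from the original discussion):

```latex
U(\theta) \;=\; \mathbb{E}_{(c,\,y) \sim \mathcal{D}} \big[ \log p_\theta(y \mid c) \big]
```

Here $c$ is a context drawn from the training distribution $\mathcal{D}$, $y$ is its continuation, and $p_\theta$ is the model's predictive distribution. Under this sketch, $U$ depends on $\theta$ only through the conditional distributions $p_\theta(\cdot \mid c)$; no term scores an external world state directly, so any apparent preference has to enter through the condition $c$.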

A second major piece would be that this subset of predictors also gets superhumanly good at "just getting what you mean" (in a particular sense of the phrase) because it's core to the process of Bayesian inference that they implement. They squeeze an enormous amount of information out of every available source of conditions, and stronger such models do even more. This doesn't mean that the base system will just do what you mean, but it is the foundation on which you can more easily build useful systems.

There are a lot more details that go into this that can be found in other walls of text.

On a meta level:

That conversation we just had about symbol/referent confusions in interpreting language model experiments? That was not what I would call an advanced topic, by alignment standards. This is really basic stuff. (Which is not to say that most people get it right, but rather that it's very early on the tech-tree.) Like, if someone has a gearsy model at all, and actually thinks through the gears of their experiment, I expect they'll notice this sort of symbol/referent confusion.

I've had the occasional conversation that, vibes-wise, went in this direction (not with John).

It's sometimes difficult to escape that mental bucket after someone pattern matches you into it, and it's not uncommon for the heuristic to result in one half of the conversation sounding like this post. There have been times when the other person went into teacher-mode and tried, e.g., a Socratic dialogue to get me to realize an error they thought I was making, only to discover some minutes later that the claim I was making was unrelated to, and not in contradiction with, the point they were making.

This isn't to say "and therefore you should put enormous effort into reading the manifesto of every individual who happens to speak with you and never use any conversational heuristics," but I worry there's a version of this heuristic happening at the field level with respect to things that could sound like "language models solve corrigibility and alignment."

Comment by porby on FAQ: What the heck is goal agnosticism? · 2023-10-26T16:37:48.470Z · LW · GW

In this very sense, one cannot want an external world state that is already in place, correct?

An agent can have unconditional preferences over world states that are already fulfilled. A maximizer doesn't stop being a maximizer if it's maximizing.

Let’s say we want to maximize the number of digits of pi we explicitly know.

That's definitely a goal, and I'd describe an agent with that goal as both "wanting" in the previous sense and not goal agnostic.

Also, what about the thermostat question above?

If the thermostat is describable as goal agnostic, then I wouldn't say it's "wanting" by my previous definition. If the question is whether the thermostat's full system is goal agnostic, I suppose it is, but in an uninteresting way.

(Note that if we draw the agent-box around 'thermostat with temperature set to 72' rather than just 'thermostat' alone, it is not goal agnostic anymore. Conditioning a goal agnostic agent can produce non-goal agnostic agents.)

Comment by porby on FAQ: What the heck is goal agnosticism? · 2023-10-24T22:36:06.188Z · LW · GW

If you were using "wanting" the way I was using the word in the previous post, then yes, it would be wrong to describe a goal agnostic system as "wanting" something, because the way I was using that word would imply some kind of preference over external world states.

I have no particular ownership over the definition of "wanting" and people are free to use words however they'd like, but it's at least slightly unintuitive to me to describe a system as "wanting X" in a way that is not distinct from "being X," hence my usage. 

Comment by porby on FAQ: What the heck is goal agnosticism? · 2023-10-23T00:35:30.910Z · LW · GW

If you have a model that "wants" to be goal agnostic in a way that means it behaves in a goal agnostic way in all circumstances, it is goal agnostic. It never exhibits any instrumental behavior arising from unconditional preferences over external world states.

For the purposes of goal agnosticism, that form of "wanting" is an implementation detail. The definition places no requirement on how the goal agnostic behavior is achieved.

In other words:

If the model is describable as wanting to be goal agnostic, in terms of a utility function, it is not goal agnostic.

A model that "wants" to be goal agnostic such that its behavior is goal agnostic can't be described as "wanting" to be goal agnostic in terms of its utility function; there will be no meaningful additional terms for "being goal agnostic," just the consequences of being goal agnostic.

As a result of how I was using the words, the fact that there is an observable difference between "being" and "wanting to be" is pretty much tautological.

Comment by porby on FAQ: What the heck is goal agnosticism? · 2023-10-16T17:26:05.231Z · LW · GW

it seems that we could describe both the model and the optimizer as either having an unconditional preference for goal agnosticism, or both as having preferences over the state of external worlds (to include goal agnostic models). I don't understand what axiom or reasoning leads to treating these two things differently.

The difference is subtle but important, in the same way that an agent that "performs bayesian inference" is different from an agent that "wants to perform bayesian inference."

A goal agnostic model does not want to be goal agnostic, it just is. If the model is describable as wanting to be goal agnostic, in terms of a utility function, it is not goal agnostic.

The observable difference between the two is the presence of instrumental behavior towards whatever goals it has. A model that "wants to perform bayesian inference" might, say, maximize the amount of inference it can do, which (in the pathological limit) eats the universe.

A model that wants to be goal agnostic has fewer paths to absurd outcomes since self-modifying to be goal agnostic is a more local process that doesn't require eating the universe and it may have other values that suggest eating the universe is bad, but it's still not immediately goal agnostic.

From your answers, I understand that you treat goal agnostic agent as an oxymoron, correct?

Agent doesn't have a constant definition across all contexts, but it can be valid to describe a goal agnostic system as a rational agent in the VNM sense. Taking the "ideal predictor" as an example, it has a utility function that it maximizes. In the limit, it very likely represents a strong optimizing process. It just so happens that the goal agnostic utility function does not directly imply maximization with respect to external world states, and does not take instrumental actions that route through external world states (unless the system is conditioned into an agent that is not goal agnostic).

Comment by porby on FAQ: What the heck is goal agnosticism? · 2023-10-14T16:45:27.857Z · LW · GW

Yup, agreed! In the limit, they'd be giving everyone end-the-world buttons. I have hope that the capabilities curve will be such that we can avoid accidentally putting out such buttons, but I still anticipate there being a pretty rapid transition that sees not-catastrophically-bad-but-still-pretty-bad consequences just because it's too hard to change gears on 1-2 year timescales.

Comment by porby on FAQ: What the heck is goal agnosticism? · 2023-10-13T15:12:40.484Z · LW · GW

That’s the crux I think: I don’t get why you reject (programmable) learning processes as goal agnostic.

It's important to draw a box around the specific agent under consideration. Suppose I train a model with predictive loss such that the model is goal agnostic. Three things can be simultaneously true:

  1. Viewed in isolation, the optimizer responsible for training the model isn't goal agnostic because it can be described as having preferences over external world state (the model).
  2. The model is goal agnostic because it meets the stated requirements (and is asserted by the hypothetical).
  3. A simulacrum arising from sequences predicted by that goal agnostic predictor when conditioned to predict non-goal agnostic behavior is not goal agnostic.

Let’s say I clone you_genes a few billions time, each time twisting your environment and education until I’m statistically happy with the recipe. What unconditional preferences would you expect to remain?

The resulting person would still be human, and presumably not goal agnostic as a result. A simulacrum produced by an ideal goal agnostic predictor that is conditioned to reproduce the behavior of that human would also not be goal agnostic.

The fact that those preferences arose conditionally based on your selection process isn't relevant to whether the person is goal agnostic. The relevant kind of conditionality is within the agent under consideration.

Let’s say you_adult are actually a digital brain in some matrix, with an unpleasant boss who stop and randomly restart your emulation each time your preference get over his. Could that process make you_matrix goal agnostic?

No; "I" still have preferences over world states. They're just being overridden.

Bumping up a level and drawing the box around the unpleasant boss and myself combined, still no, because the system expresses my preferences filtered by my boss's preferences.

Some behavior being conditional isn't sufficient for goal agnosticism; there must be no way to describe the agent under consideration as having unconditional preferences over external world states.

Comment by porby on FAQ: What the heck is goal agnosticism? · 2023-10-11T02:24:48.568Z · LW · GW

Salvaging the last paragraph of my previous post is pretty difficult. The "it" in "you could call it goal agnostic" was referring to the evolved creature, not natural selection, but the "conditionally required ... specific mutations" would not actually serve to imply goal agnosticism for the creature. I was trying to describe a form of natural selection equivalent to a kind of predictive training but messed it up.

How would you challenge an interpretation of your axioms so that the best answer is we don’t need to change anything at all?

Trying to model natural selection, RL-based training systems, or predictive training systems as agents gets squinty, but all of them could be reasonably described as having "preferences" over the subjects of optimization. They're all explicitly optimizers over external states; they don't meet the goal agnostic criteria.

Some types of predictive training seem to produce goal agnostic systems, but the optimization process is not itself a goal agnostic system.

Regarding humans, I'm comfortable just asserting that we're not goal agnostic. I definitely have preferences over world states. You could describe me with a conditionalized utility function, but that's not sufficient for goal agnosticism; you must be unable to describe me as having unconditional preferences over world states for me to be goal agnostic.

What about selective breeding of dogs? Isn’t that a way to 1) modify natural selection 2) specify which mutation are allowed 3) be reasonably confident we won’t accidentally breed some paperclip maximizer?

Dogs are probably not paperclip maximizers, but dogs seem to have preferences over world states, so that process doesn't produce goal agnostic agents. And the process itself, being an optimizer over world states, is not goal agnostic either.

Comment by porby on FAQ: What the heck is goal agnosticism? · 2023-10-10T19:47:29.416Z · LW · GW

That's tough to answer. There's not really a way to make children goal agnostic; humans aren't that kind of thing. In principle, maybe you could construct a very odd corporate entity that you interface with as you would a conditioned predictor, but it strains the question.

It's easier to discuss natural selection in this context by viewing natural selection as the outer optimizer. It's reinforcement learning with a sparse and distant reward. Accordingly, the space of things that could be produced by natural selection is extremely wide. It's not surprising that humans are not machines that monomaniacally maximize inclusive genetic fitness; the optimization process was not so constraining.
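As a rough illustration of why a sparse and distant reward constrains so little, here is a toy sketch. Everything in it is a hypothetical stand-in rather than a model of evolution: the six-step binary "lifetime", the threshold of three, and the arbitrary target sequence are all assumptions made for the example. It counts how many distinct behaviors reach the maximum reward under a single end-of-episode signal versus a per-step signal.

```python
from itertools import product

HORIZON = 6      # steps per "lifetime" (hypothetical toy value)
THRESHOLD = 3    # minimum number of 1-actions needed for the sparse reward


def sparse_reward(actions):
    """Single end-of-episode signal: 1 if enough 1-actions were taken, else 0."""
    return 1 if sum(actions) >= THRESHOLD else 0


def dense_reward(actions, target):
    """Per-step signal: one point for every step that matches a target behavior."""
    return sum(a == t for a, t in zip(actions, target))


def count_optimal(reward_fn):
    """Count the action sequences that achieve the maximum possible reward."""
    scores = [reward_fn(seq) for seq in product((0, 1), repeat=HORIZON)]
    best = max(scores)
    return sum(score == best for score in scores)


TARGET = (1, 0, 1, 1, 0, 1)  # arbitrary hypothetical target behavior
n_sparse = count_optimal(sparse_reward)
n_dense = count_optimal(lambda seq: dense_reward(seq, TARGET))
print(n_sparse, n_dense)  # prints: 42 1
```

Of the 64 possible behaviors, 42 are equally "optimal" under the sparse signal, while the per-step signal singles out exactly one. The sparse optimizer has no way to distinguish between those 42, which is the sense in which the process is not very constraining.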

There's no way to implement this, but if "natural selection" somehow conditionally required that only specific mutations could be propagated through reproduction, and if there were only negligibly probable paths by which any "evolved" creature could be a potentially risky optimizer, then you could call it goal agnostic. It'd be a pretty useless form of goal agnosticism, though; nothing about that system makes it easy for you to aim it at anything.

Comment by porby on FAQ: What the heck is goal agnosticism? · 2023-10-09T15:17:12.453Z · LW · GW

I intentionally left out the details of "what do we do with it" because it's conceptually orthogonal to goal agnosticism and is a huge topic of its own. It comes down to the class of solutions enabled by having extreme capability that you can actually use without it immediately backfiring.

For example, I think this has a real shot at leading to a strong and intuitively corrigible system. I say "intuitively" here because the corrigibility doesn't arise from a concise mathematical statement that solves the original formulation. Instead, it lets us aim it at an incredibly broad and complex indirect specification that gets us all the human messiness we want.