Modeling Risks From Learned Optimization 2021-10-12T20:54:18.555Z
Do mesa-optimizer risk arguments rely on the train-test paradigm? 2020-09-10T15:36:37.629Z
Ben Cottier's Shortform 2020-05-12T11:03:24.265Z
Clarifying some key hypotheses in AI alignment 2019-08-15T21:29:06.564Z


Comment by Ben Cottier (ben-cottier) on AI learns betrayal and how to avoid it · 2021-10-20T09:07:49.174Z · LW · GW

I'm excited about this project. I've been thinking along similar lines about inducing a model to learn deception, in the context of inner alignment. It seems really valuable to have concrete (but benign) examples of a problem to poke at and test potential solutions on. So far there seem to be fewer concrete examples of deception, betrayal and the like to work with in ML compared to, say, distributional shift or negative side effects.

Comment by Ben Cottier (ben-cottier) on AI learns betrayal and how to avoid it · 2021-10-20T09:07:13.817Z · LW · GW

Previous high level projects have tried to define concepts like "trustworthiness" (or the closely related "truthful") and motivated the AI to follow them. Here we will try the opposite: define "betrayal", and motivate the AIs to avoid it.

Why do you think the betrayal approach is more tractable or useful? It's not clear from the post.

Comment by Ben Cottier (ben-cottier) on Clarifying some key hypotheses in AI alignment · 2021-01-27T18:58:19.030Z · LW · GW

Google Drawings

Comment by Ben Cottier (ben-cottier) on Clarifying some key hypotheses in AI alignment · 2021-01-24T15:38:43.086Z · LW · GW

To your first point - I agree both with why we limited the scope (but also, it was partly just personal interests), and that there should be more of this kind of work on other classes of risk. However, my impression is the literature and "public" engagement (e.g. EA forum, LessWrong) on catastrophic AI misuse/structural risk is too small to even get traction on work like this. We might first need more work to lay out the best arguments. Having said that, I'm aware of a fair amount of writing which I haven't got around to reading. So I am probably misjudging the state of the field.

To your second point - that seems like a real crux and I agree it would be good to expand in that direction. I know some people working on expanded and more in-depth models like this post. It would be great to get your thoughts when they're ready.

Comment by Ben Cottier (ben-cottier) on Clarifying some key hypotheses in AI alignment · 2021-01-24T15:01:52.609Z · LW · GW

It's great to hear your thoughts on the post!

I'd also like to see more posts that do this sort of "mapping". I think that mapping AI risk arguments is too neglected - more discussion and examples in this post by Gyrodiot. I'm continuing to work collaboratively in this area in my spare time, and I'm excited that more people are getting involved.

We weren't trying to fully account for AGI timelines - our choice of scope was based on a mix of personal interest and importance. I know people currently working on posts similar to this that will go in-depth on timelines, discontinuity, paths to AGI, the nature of intelligence, etc. which I'm excited about!

I agree with all your points. You're right that this post's scope does not include broader alternatives for reducing AI risk. It was not even designed to guide what people should work on, though it can serve that purpose. We were really just trying to clearly map out some of the discourse, as a starting point and example for future work.

Comment by Ben Cottier (ben-cottier) on Conditions for Mesa-Optimization · 2020-12-27T22:40:19.791Z · LW · GW

A system capable of reasoning about optimization is likely also capable of reusing that same machinery to do optimization itself

I'm confused about this. I tried substituting different words for "optimisation":

"A system capable of reasoning about photosynthesis is likely also capable of reusing that same machinery to do photosynthesis itself." Nope.

"A system capable of reasoning about arithmetic is likely also capable of reusing that same machinery to do arithmetic itself". Maybe? The rules of arithmetic can be reused, but the machinery to reason abstractly about arithmetic is probably different to the machinery to run a specific calculation, especially with a learned model with lots of free parameters, like a neural network.

Maybe optimisation is not like the above examples because it is so generic? Or I misunderstood the claim.

Comment by Ben Cottier (ben-cottier) on Do mesa-optimizer risk arguments rely on the train-test paradigm? · 2020-09-20T13:27:21.168Z · LW · GW

Thanks. I think I understand, but I'm still confused about the effect on the risk of catastrophe (i.e. not just being pseudo-aligned, but having a catastrophic real-world effect). It may help to clarify that I was mainly thinking of deceptive alignment, not other types of pseudo-alignment. And I'll admit now that I phrased the question more strongly than I actually believe, to elicit more response :)

I agree that the probability of pseudo-alignment will be the same, and that an unrecoverable action could occur despite the threat of modification. I'm interested in whether online learning generally makes it less likely for a deceptively aligned model to defect. I think it does, because (I expect, in most cases) online learning adds a threat of modification that is faster-acting and easier for a mesa-optimizer to recognise than the alternatives (e.g. a human shutting it down).

If I'm not just misunderstanding and there is a crux here, maybe it relates to how promising worst-case guarantees are. Worst-case guarantees are great to have, and save us from worrying about precisely how likely a catastrophic action is. Maybe I am more pessimistic than you about obtaining worst-case guarantees. I think we should do more to model the risks probabilistically.

Comment by Ben Cottier (ben-cottier) on Deceptive Alignment · 2020-07-28T10:17:27.396Z · LW · GW

In the limit of training on a diverse set of tasks, we expect joint optimization of both the base and mesa- objectives to be unstable. Assuming that the mesa-optimizer converges towards behavior that is optimal from the perspective of the base optimizer, the mesa-optimizer must somehow learn the base objective.

Joint optimization may be unstable, but if the model is not trained to convergence, might it still be jointly optimizing at the end of training? This occurred to me after reading which finds that "Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence." If convergence is becoming less common in practical systems, it's important to think about the implications of that for mesa-optimization.

Comment by Ben Cottier (ben-cottier) on Ben Cottier's Shortform · 2020-05-18T16:58:52.540Z · LW · GW


Church's views on AI seem far from my own and from most people's views in the AI risk community, and they really intrigued me. It would be great to try to distil and summarise these views so I can update on them properly.

Comment by Ben Cottier (ben-cottier) on Ben Cottier's Shortform · 2020-05-12T11:03:24.582Z · LW · GW

Model of the threat and interventions for mesa-optimization

  • Consider a chain model
    • Base optimizer
    • -> Mesa optimizer
      • Produced through optimization of base objective in training environment
    • -> Misalignment (base objective != mesa objective)
      • Different kinds of misalignment
        • Proxy: mesa objective is a proxy for the base objective in the training environment
          • Side-effect: optimizing mesa objective happens to optimize base objective
          • Instrumental: optimizing base objective happens to optimize mesa objective
        • Approximate: objectives differ due to an approximation error, caused by limits of representation in mesa-optimizer's model
        • Suboptimal: optimizing mesa objective happens to optimize base objective due to a flaw
      • What makes this particularly concerning relative to the "standard" alignment problem? It's that the misalignment may be particularly difficult to detect (see next step). The mesa-optimizer has been "screened" by the base optimizer during training; it has to align to the base objective to some extent, at least in the training environment. This means it is either aligned or deceptively misaligned. The base probability of a misaligned mesa-optimizer is lower than that of a misaligned system in general, but conditioned on it happening, it could carry greater risk.
    • -> Deployment
      • Not enough evidence of risk based on training and testing in controlled environment
      • Mesa optimizer is deceptive, to increase its chance of deployment
    • -> Distribution shift
      • Deployment environment differs from training environment or controlled testing environment
    • -> Unanticipated behaviour
      • The distribution shift causes the misalignment to manifest in different behaviour
      • There is a "standard" problem of robustness to distribution shift that applies to any machine learning model, not just mesa-optimizers. What makes a mesa-optimizer particularly concerning? It's that optimization will tend to make the unanticipated behaviour more adversarial and influence-seeking. It can freely pursue an objective rather than merely perform a specified task.
    • -> Catastrophe
      • The unanticipated behaviour is so bad that it leads to a permanent, huge loss of value.
      • Ways this doesn't happen:
        • The behaviour is counter-productive: perhaps self-destructive, not directed or consistent enough to lead to anything bad (depends on how sensitive the domain is to random error)
        • Intervention from other actors (depends on decisive strategic advantage)
  • Interventions
    • Base -> Mesa: prevent mesa-optimizer arising
      • Alternative methods of subsystem generation beyond search
    • Mesa -> Misalignment: prevent mesa-optimizer from being misaligned
      • Standard alignment methods
      • Specific methods for base optimizer to align mesa optimizer
    • Misalignment -> Deployment: don't deploy misaligned mesa-optimizer
      • Transparency
      • Interpretability
    • Deployment -> Distribution shift: ensure deployment environment does not differ significantly from training environment or controlled testing environment
      • The converse is probably more tractable, in which case intervention comes before Base -> Mesa
    • Distribution shift -> Unanticipated behaviour: prevent unanticipated behaviour by having the system recognise the shift, then defer or fail gracefully
      • Out-of-distribution detection
      • Well-calibrated uncertainty estimation
      • Corrigibility
    • Unanticipated behaviour -> Catastrophe: prevent catastrophe
      • Internal
        • Impact measurement
      • External
        • Shut down
        • Stop or steer the behaviour in various ways, depending on the nature of the behaviour
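
A rough sketch of how the chain model above could be treated probabilistically: each arrow becomes a conditional probability given the earlier steps, and an intervention scales down one link. The `Link` class, the function names and every number here are hypothetical placeholders picked to make the arithmetic visible. They are not estimates of anything.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class Link:
    """One arrow in the chain, with its probability given all earlier steps."""
    name: str
    p_given_previous: float  # hypothetical placeholder, not an estimate


# Placeholder values only: 0.5 everywhere keeps the arithmetic easy to follow.
CHAIN = [
    Link("base -> mesa-optimizer", 0.5),
    Link("mesa -> misalignment", 0.5),
    Link("misalignment -> deployment", 0.5),
    Link("deployment -> distribution shift", 0.5),
    Link("distribution shift -> unanticipated behaviour", 0.5),
    Link("unanticipated behaviour -> catastrophe", 0.5),
]


def p_catastrophe(chain):
    """Probability that every link holds: the product of the conditionals."""
    return prod(link.p_given_previous for link in chain)


def with_intervention(chain, name, factor):
    """Return a new chain in which one link's probability is scaled by `factor`."""
    if not 0.0 <= factor <= 1.0:
        raise ValueError("factor must be in [0, 1]")
    if name not in {link.name for link in chain}:
        raise KeyError(name)
    return [
        Link(link.name, link.p_given_previous * factor) if link.name == name else link
        for link in chain
    ]


if __name__ == "__main__":
    baseline = p_catastrophe(CHAIN)
    # Assumed effect size: an intervention that halves one link.
    safer = with_intervention(CHAIN, "misalignment -> deployment", 0.5)
    print(baseline, p_catastrophe(safer))
```

In a pure product like this, halving any single link halves the end probability, so the model alone cannot say which intervention matters most. That depends on how cheaply each link can be reduced, and on whether the links really are independent once we condition on the earlier steps, which the deception step suggests they are not.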

Comment by Ben Cottier (ben-cottier) on Will AI undergo discontinuous progress? · 2020-02-23T20:36:51.511Z · LW · GW

Thanks, that makes sense. To clarify, I realise there are references/links throughout. But I forgot that the takeoff speeds post was basically making the same claim as that quote, and so I was expecting a reference more from the biology space. And there are other places where I'm curious what informed you, e.g. the progress of guns, though that's easier to read up on myself.

Comment by Ben Cottier (ben-cottier) on Gary Marcus: Four Steps Towards Robust Artificial Intelligence · 2020-02-22T23:58:45.615Z · LW · GW

A team of people including Smolensky and Schmidhuber have produced better results on a mathematics problem set by combining BERT with tensor products (Smolensky et al., 2016), a formal system for representing symbolic variables and their bindings (Schlag et al., 2019), creating a new system called TP-Transformer.

Notable that the latter paper was rejected from ICLR 2020, partly for unfair comparison. It seems unclear at present whether TP-Transformer is better than the baseline transformer.

Comment by Ben Cottier (ben-cottier) on Will AI undergo discontinuous progress? · 2020-02-22T23:27:50.347Z · LW · GW

I think this is a good analysis, and I'm really glad to see this kind of deep dive on an important crux. The most clarifying thing for me was connecting old and new arguments - they seem to have more common ground than I thought.

One thing I would appreciate is more in-text references. There are a bunch of claims here about e.g. history, evolution with no explicit reference. Maybe it seems like common knowledge, but I wasn't sure whether to believe some things, e.g.

Evolution was optimizing for fitness, and driving increases in intelligence only indirectly and intermittently by optimizing for winning at social competition. What happened in human evolution is that it briefly switched to optimizing for increased intelligence, and as soon as that happened our intelligence grew very rapidly but continuously.

Could you clarify? I thought biological evolution always optimizes for inclusive genetic fitness.

Comment by Ben Cottier (ben-cottier) on Clarifying some key hypotheses in AI alignment · 2019-12-21T15:13:06.561Z · LW · GW

Thanks! Comments are much appreciated.

Why the arrow from "agentive AI" to "humans are economically outcompeted"? The explanation makes it sound like it should point to "target loading fails"?

It's been a few months and I didn't write in detail why that arrow is there, so I can't be certain of the original reason. My understanding now: humans getting economically outcompeted means AI systems are competing with humans, and therefore optimising against humans on some level. Goal-directedness enables/worsens this.

Looking back at the linked explanation of the target loading problem, I understand it as more "at the source": coming up with a procedure that makes AI actually behave as intended. As Richard said there, one can think of it as a more general version of the inner-optimiser (mesa-optimiser) problem. This is why e.g. there's an arrow from "incidental agentive AGI" to "target loading fails". Pointing this arrow to it might make sense, but to me the connection isn't strong enough to be within the "clutter budget" of the diagram.

Suggestion: make the blue boxes without parents more apparent? e.g. a different shade of blue? Or all sitting above the other ones? (e.g. "broad basin of corrigibility" could be moved up and left).

Changing the design of those boxes sounds good. I don't want to move them because the arrows would get more cluttered.

Comment by Ben Cottier (ben-cottier) on Clarifying some key hypotheses in AI alignment · 2019-08-18T01:24:17.034Z · LW · GW

Noted and updated.