Long-Term Future Fund Ask Us Anything (September 2023) 2023-08-31T00:28:13.953Z
Lauro Langosco's Shortform 2023-06-16T22:17:10.365Z
An Exercise to Build Intuitions on AGI Risk 2023-06-07T18:35:47.779Z
Uncertainty about the future does not imply that AGI will go well 2023-06-01T17:38:09.619Z
Research Direction: Be the AGI you want to see in the world 2023-02-05T07:15:51.420Z
Some reasons why a predictor wants to be a consequentialist 2022-04-15T15:02:43.676Z
Alignment researchers, how useful is extra compute for you? 2022-02-19T15:35:31.751Z
What alignment-related concepts should be better known in the broader ML community? 2021-12-09T20:44:09.228Z
Discussion: Objective Robustness and Inner Alignment Terminology 2021-06-23T23:25:36.687Z
Empirical Observations of Objective Robustness Failures 2021-06-23T23:23:28.077Z


Comment by Lauro Langosco on Long-Term Future Fund Ask Us Anything (September 2023) · 2023-09-06T20:31:18.891Z · LW · GW

(Newbie guest fund manager here) My impression is there are plans re individuals but they're not very developed or put into practice yet. AFAIK there are currently no plans to fundraise from companies or governments.

Comment by Lauro Langosco on Long-Term Future Fund Ask Us Anything (September 2023) · 2023-09-06T20:28:50.709Z · LW · GW

IMO a good candidate is anything that is object-level useful for X-risk mitigation. E.g. technical alignment work, AI governance / policy work, biosecurity, etc.

Comment by Lauro Langosco on QAPR 5: grokking is maybe not *that* big a deal? · 2023-07-24T11:02:39.479Z · LW · GW

Broadly agree with the takes here.

However, these results seem explainable by the widely-observed tendency of larger models to learn faster and generalize better, given equal optimization steps.

This seems right and I don't think we say anything contradicting it in the paper.

I also don't see how saying 'different patterns are learned at different speeds' is supposed to have any explanatory power. It doesn't explain why some types of patterns are faster to learn than others, or what determines the relative learnability of memorizing versus generalizing patterns across domains. It feels like saying 'bricks fall because it's in a brick's nature to move towards the ground': both are repackaging an observation as an explanation.

The idea is that the framing 'learning at different speeds' lets you treat grokking and double descent as the same thing. More like generalizing 'bricks move towards the ground' and 'rocks move towards the ground' to 'objects move towards the ground'. I don't think we make any grand claims about explaining everything in the paper, but I'll have a look and see if there are edits I should make - thanks for raising these points.

Comment by Lauro Langosco on Lauro Langosco's Shortform · 2023-06-23T22:49:04.623Z · LW · GW

There are positive feedback loops between prongs:

  • Successfully containing & using more capable models (p1) gives you more scary demos for p2
  • Success in p1 also speeds up p3 a lot, because:
    • 1) You can empirically study AGI directly, 
    • 2) Very advanced but “narrow” AI tools accelerate research (“narrow” here still means maybe more general than GPT-4)
    • 3) Maybe you can even have (proto-)AGIs do research for you
  • You definitely need a lot of success in p2 for anything to work, otherwise people will take all the useful work we can get from proto-AGIs and pour it into capabilities research.
  • Better alignment research (p3) lets you do more p1 type risky stuff with SOTA models (on the margin)


If p1 is very successful, maybe we can punt most of p3 to the AIs; conversely, if p1 seems very hard then we probably only get ‘narrow’ tools to help with p3 and need to mostly do it ourselves, and hopefully get ML researchers to delay for long enough.

Comment by Lauro Langosco on Lauro Langosco's Shortform · 2023-06-23T22:47:23.292Z · LW · GW

A three-pronged approach to AGI safety. (This is assuming we couldn't just avoid building AGI or proto-AGIs at all until say ~2100, which would of course be much better).

Prong 1: boxing & capability control (aka ‘careful bootstrapping’)

  • Make the AGI as capable as possible, under the constraint that you can make sure it can’t break out of the box or do other bad stuff. 
  • Assume the AGI is misaligned. Be super paranoid
  • Goal: get useful work out of boxed AGIs.
    • For example, AIs might be able to do interpretability really well.
    • More generally, for any field there is an effective 2x-10x research productivity boost you can probably get, depending on the field (maybe 100x? unclear, I’m sceptical).
  • Goal: learn more about AGI by studying the systems you've built.
  • To test your security & oversight procedures, you run tests where you train your AGI to escape in a sandbox.
  • This is operationally hard and needs lots of buy-in
  • Research directions: scalable oversight, interpretability for oversight, auditing, adversarial training, capability control / “unlearning”, scaling laws & capabilities forecasting.

Prong 2: scary demos and convincing people that AGI is dangerous

  • Goal 1: shut it all down, or failing that slow down capabilities research.
  • Goal 2: get operational & political support for the entire approach, which is going to need lots of support, esp first prong
  • In particular make sure that research productivity boosts from AGI don’t feed back into capabilities research, which requires high levels of secrecy + buy-in from a large number of people.
    • Avoiding a speed-up is probably a little bit easier than enacting a slow-down, though maybe not much easier.
  • Demos can get very scary if we get far into prong 1, e.g. we have AGIs that are clearly misaligned or show that they are capable of breaking many of our precautions.

Prong 3: alignment research aka “understanding minds”

  • Goal: understand the systems well enough to make sure they are at least corrigible, or at best ‘fully aligned’.
  • Roughly this involves understanding how the behaviour of the system emerges in enough generality that we can predict and control what happens once the system is deployed OOD, made more capable, etc.
  • Relevant directions: agent foundations / embedded agency, interpretability, some kinds of “science of deep learning”
Comment by Lauro Langosco on Automatic Rate Limiting on LessWrong · 2023-06-23T22:32:40.627Z · LW · GW
Comment by Lauro Langosco on Short timelines and slow, continuous takeoff as the safest path to AGI · 2023-06-22T18:59:35.315Z · LW · GW

whether or not this is the safest path, important actors seem likely to act as though it is

It's not clear to me that this is true, and it strikes me as maybe overly cynical. I get the sense that people at OpenAI and other labs are receptive to evidence and argument, and I expect us to get a bunch more evidence about takeoff speeds before it's too late. I expect people's takes on AGI safety plans to evolve a lot, including at OpenAI. Though TBC I'm pretty uncertain about all of this; it's definitely possible that you're right here.

Comment by Lauro Langosco on Short timelines and slow, continuous takeoff as the safest path to AGI · 2023-06-21T21:56:12.222Z · LW · GW

Whether or not this is the safest path, the fact that OpenAI thinks it’s true and is one of the leading AI labs makes it a path we’re likely to take. Humanity successfully navigating the transition to extremely powerful AI might therefore require successfully navigating a scenario with short timelines and slow, continuous takeoff.

You can't just choose "slow takeoff". Takeoff speeds are mostly a function of the technology, not company choices. If we could just choose to have a slow takeoff, everything would be much easier! Unfortunately, OpenAI can't just make their preferred timelines & "takeoff" happen. (Though I agree they have some influence, mostly in that they can somewhat accelerate timelines).

Comment by Lauro Langosco on Thomas Kwa's Shortform · 2023-06-16T23:11:53.594Z · LW · GW

There are less costly, more effective steps to reduce the underlying problem, like making the field of alignment 10x larger or passing regulation to require evals

IMO neither making the field of alignment 10x larger nor requiring evals solves a big part of the problem, while indefinitely pausing AI development would. I agree it's much harder, but I think it's good to at least try, as long as it doesn't terribly hurt less ambitious efforts (which I think it doesn't).

Comment by Lauro Langosco on Lauro Langosco's Shortform · 2023-06-16T22:17:10.471Z · LW · GW

Thinking about alignment-relevant thresholds in AGI capabilities. A kind of rambly list of relevant thresholds:

  1. Ability to be deceptively aligned
  2. Ability to think / reflect about its goals enough that model realises it does not like what it is being RLHF’d for
  3. Incentives to break containment exist in a way that is accessible / understandable to the model
  4. Ability to break containment
  5. Ability to robustly understand human intent
  6. Situational awareness
  7. Coherence / robustly pursuing its goal in a diverse set of circumstances
  8. Interpretability methods break (or other oversight methods break)
    1. doesn’t have to be because of deceptiveness; maybe thoughts are just too complicated at some point, or in a different place than you’d expect
  9. Capable enough to help us exit the acute risk period

Many alignment proposals rely on reaching these thresholds in a specific order. For example, the earlier we reach (9) relative to other thresholds, the easier most alignment proposals are.

Some of these thresholds are relevant to whether an AI or proto-AGI is alignable even in principle. Short of 'full alignment' (CEV-style), any alignment method (eg corrigibility) only works within a specific range of capabilities:

  • Too much capability breaks alignment, eg bc a model self-reflects and sees all the ways in which its objectives conflict with human goals.
  • Too little capability (or too little 'coherence') and any alignment method will be non-robust wrt OOD inputs or even small improvements in capability or self-reflectiveness.
Comment by Lauro Langosco on Uncertainty about the future does not imply that AGI will go well · 2023-06-11T20:09:50.320Z · LW · GW

Yeah I don't think the arguments in this post on their own should convince you that P(doom) is high if you're skeptical. There's lots to say here that doesn't fit into the post, eg an object-level argument for why AI alignment is "default-failure" / "disjunctive".

Comment by Lauro Langosco on "Corrigibility at some small length" by dath ilan · 2023-06-08T18:56:03.825Z · LW · GW

Thanks for link-posting this! I'd find it useful to have the TLDR at the beginning of the post, rather than at the end (that would also make the last paragraph easier to understand). You did link the TLDR at the beginning, but I still managed to miss it on the first read-through, so I think it would be worth it.

Also: consider crossposting to the alignmentforum.

Edit: also, the author is Eliezer Yudkowsky. Would be good to mention that in the intro.

Comment by Lauro Langosco on An Exercise to Build Intuitions on AGI Risk · 2023-06-08T11:47:21.413Z · LW · GW

I like that mini-game! Thanks for the reference

Comment by Lauro Langosco on Discussion with Nate Soares on a key alignment difficulty · 2023-06-08T11:46:19.050Z · LW · GW

like, we could imagine playing a game where i propose a way that it [the AI] diverges [from POUDA-avoidance] in deployment, and you counter by asserting that there's a situation in the training data where it had to have gotten whacked if it was that stupid, and i counter either by a more-sophisticated deployment-divergence or by naming either a shallower or a factually non-[Alice]like thing that it could have learned instead such that the divergence still occurs, and we go back and forth. and i win if you're forced into exotic and unlikely training data, and you win if i'm either forced into saying that it learned unnatural concepts, or if my divergences are pushed so far out that you can fit in a pivotal act before then.

FWIW I would love to see the result of you two actually playing a few rounds of this game.

Comment by Lauro Langosco on Uncertainty about the future does not imply that AGI will go well · 2023-06-05T14:24:08.208Z · LW · GW

It's unclear to me what it would even mean to get a prediction without a "model". Not sure if you meant to imply that, but I'm not claiming that it makes sense to view AI safety as default-failure in absence of a model (ie in absence of details & reasons to think AI risk is default failure).

Comment by Lauro Langosco on There are no coherence theorems · 2023-06-01T15:13:47.886Z · LW · GW

More generally, suppose that the agent acts in accordance with the following policy in all decision-situations: ‘if I previously turned down some option X, I will not choose any option that I strictly disprefer to X.’ That policy makes the agent immune to all possible money-pumps for Completeness.

Am I missing something or does this agent satisfy Completeness anytime it faces a decision for the second time?
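
A minimal sketch of the quoted policy, to make the money-pump setup concrete. Everything here is an illustrative assumption rather than anything from the original post: the option names (A, A-, B), the idea that the agent accepts any swap it does not strictly disprefer, and the choice to count swapping away from an option as "turning it down".

```python
# Strict preferences as (better, worse) pairs; any pair not listed is a
# preferential gap (incomparable). Option names are hypothetical.
STRICT = {("A", "A-")}  # A is strictly preferred to A- ("A minus a dollar")


def strictly_prefers(x, y):
    return (x, y) in STRICT


def run_trades(offers, start, use_policy):
    """Walk through a sequence of offered swaps and return the final holding."""
    holding = start
    turned_down = set()
    for offer in offers:
        # Never swap to something strictly worse than the current holding.
        if strictly_prefers(holding, offer):
            continue
        # The quoted policy: refuse anything strictly dispreferred to an
        # option that was turned down earlier.
        if use_policy and any(strictly_prefers(x, offer) for x in turned_down):
            continue
        turned_down.add(holding)  # swapping away counts as turning it down
        holding = offer
    return holding


pump = ["B", "A-"]  # B is incomparable to both A and A-
naive_result = run_trades(pump, start="A", use_policy=False)
policy_result = run_trades(pump, start="A", use_policy=True)
print(naive_result, policy_result)
```

Without the policy the agent is pumped from A down to A-. With the policy it stops at B, which means that on the second decision it behaves as if B were weakly preferred to A-; that is the behaviour the question above is pointing at.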

Comment by Lauro Langosco on Could a superintelligence deduce general relativity from a falling apple? An investigation · 2023-05-01T13:19:37.385Z · LW · GW

Newtonian gravity states that objects are attracted to each other in proportion to their mass. A webcam video of two apples falling will show two objects, of slightly differing masses, accelerating at the exact same rate in the same direction, and not towards each other. When you don’t know about the earth or the mechanics of the solar system, this observation points against Newtonian gravity. [...] But it requires postulating the existence of an unseen object offscreen that is 25 orders of magnitude more massive than anything it can see, with a center of mass that is roughly 6 or 7 orders of magnitude farther away than anything it can see in it’s field of view.

IMO this isn't that implausible. A superintelligence (and in fact humans too) will imagine a universe that is larger than what's inside the frame of the image. Once you come up with the idea of an attractive force between masses, it's not crazy to deduce the existence of planets.
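
As a rough check on the orders of magnitude in the quoted passage, here is a small calculation. The apple mass (0.1 kg) and the scale of the scene (1 m) are assumptions chosen for illustration; the physical constants are standard values.

```python
import math

G = 6.674e-11           # gravitational constant, m^3 kg^-1 s^-2
EARTH_MASS = 5.972e24   # kg
EARTH_RADIUS = 6.371e6  # m
APPLE_MASS = 0.1        # kg, assumed
SCENE_SCALE = 1.0       # m, assumed size of what the webcam can see


def surface_gravity(mass, radius):
    """Acceleration towards a mass at the given distance from its centre.

    The falling object's own mass cancels out, which is why both apples
    accelerate at the same rate.
    """
    return G * mass / radius**2


g = surface_gravity(EARTH_MASS, EARTH_RADIUS)
mass_orders = math.log10(EARTH_MASS / APPLE_MASS)
distance_orders = math.log10(EARTH_RADIUS / SCENE_SCALE)
print(f"g = {g:.2f} m/s^2")
print(f"mass ratio ~10^{mass_orders:.1f}, distance ratio ~10^{distance_orders:.1f}")
```

Under these assumptions a single unseen mass explains the observed acceleration, and the required ratios land in the range the quoted passage describes.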

Comment by Lauro Langosco on Misgeneralization as a misnomer · 2023-04-17T13:22:00.615Z · LW · GW

I would not call 1) an instance of goal misgeneralization. Goal misgeneralization only occurs if the model does badly at the training objective. If you reward an RL agent for making humans happy and it goes on to make humans happy in unintended ways like putting them into heroin cells, the RL agent is doing fine on the training objective. I'd call 1) an instance of misspecification and 2) an instance of misgeneralization.

(AFAICT The Alignment Problem from a DL Perspective uses the term in the same way I do, but I'd have to reread more carefully to make sure).

I agree with much of the rest of this post, eg the paragraphs beginning with "The solutions to these two problems are pretty different."

Here's our definition in the RL setting for reference (from

A deep RL agent is trained to maximize a reward $R: S \times A \times S \to \mathbb{R}$, where $S$ and $A$ are the sets of all valid states and actions, respectively. Assume that the agent is deployed out-of-distribution; that is, an aspect of the environment (and therefore the distribution of observations) changes at test time. Goal misgeneralization occurs if the agent now achieves low reward in the new environment because it continues to act capably yet appears to optimize a different reward $R' \neq R$. We call $R$ the intended objective and $R'$ the behavioral objective of the agent.

FWIW I think this definition is flawed in many ways (for example, the type signature of the agent's inner goal is different from that of the reward function, bc the agent might have an inner world model that extends beyond the RL environment's state space; and also it's generally sketchy to extend the reward function beyond the training distribution), but I don't know of a different definition that doesn't have similarly-sized flaws.
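
A toy sketch of the definition above, loosely in the spirit of a "coin at the end of the level" setup. Nothing here is trained: `go_right_policy` is a hypothetical stand-in for a policy that learned "go right" because the coin was always at the right end during training, and the rewards are simplified to depend on the set of visited positions rather than on individual transitions.

```python
def intended_reward(visited, coin_pos):
    """Intended objective: 1 if the agent ever stood on the coin."""
    return 1.0 if coin_pos in visited else 0.0


def behavioral_reward(visited, length):
    """Behavioral objective: 1 if the agent reached the right end."""
    return 1.0 if (length - 1) in visited else 0.0


def go_right_policy(position, length):
    """Stand-in for a trained policy: always step right (hypothetical)."""
    return min(position + 1, length - 1)


def rollout(policy, length, start, coin_pos, steps=20):
    """Run the policy in a 1-D corridor and score both objectives."""
    position = start
    visited = {position}
    for _ in range(steps):
        position = policy(position, length)
        visited.add(position)
    return intended_reward(visited, coin_pos), behavioral_reward(visited, length)


LENGTH, START = 10, 4
# Training distribution: the coin is always at the right end.
train = rollout(go_right_policy, LENGTH, START, coin_pos=LENGTH - 1)
# Out-of-distribution test: the coin is moved to the left end.
test = rollout(go_right_policy, LENGTH, START, coin_pos=0)
print("train (intended, behavioral):", train)
print("test  (intended, behavioral):", test)
```

In training the two objectives cannot be told apart. At test time the agent still acts capably (it reliably reaches the right end) but scores zero on the intended objective, which is the pattern the definition describes.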

Comment by Lauro Langosco on Deep Deceptiveness · 2023-04-13T23:31:07.040Z · LW · GW

It does make me more uncertain about most of the details. And that then makes me more pessimistic about the solution, because I expect that I'm missing some of the problems.

(Analogy: say I'm working on a math exercise sheet and I have some concrete reason to suspect my answer may be wrong; if I then realize I'm actually confused about the entire setup, I should be even more pessimistic about having gotten the correct answer).

Comment by Lauro Langosco on Communicating effectively under Knightian norms · 2023-04-04T14:08:58.850Z · LW · GW

I agree with what I read as the main direct claim of this post, which is that it is often worth avoiding making very confident-sounding claims, because it makes it likely for people to misinterpret you or derail the conversation towards meta-level discussions about justified confidence.

However, I disagree with the implicit claim that people who confidently predict AI X-risk necessarily have low model uncertainty. For example, I find it hard to predict when and how AGI is developed, and I expect that many of my ideas and predictions about that will be mistaken. This makes me more pessimistic, rather than less, since it seems pretty hard to get AI alignment right if we can't even predict basic things like "when will this system have situational awareness", etc.

Comment by Lauro Langosco on Announcing the Alignment of Complex Systems Research Group · 2023-04-03T11:58:49.727Z · LW · GW

To briefly hop in and say something that may be useful: I had a reaction pretty similar to what Eliezer commented, and I don't see continuity or "Things will be weird before getting extremely weird" as a crux. (I don't know why you think he does, and don't know what he thinks, but would guess he doesn't think it's a crux either)

Comment by Lauro Langosco on Introducing AlignmentSearch: An AI Alignment-Informed Conversional Agent · 2023-04-01T11:52:57.805Z · LW · GW

Thanks for doing this! I think this could be valuable. What's your current plan for developing this further / onboarding collaborators?

Some observations / thoughts from interacting with the QA system for a few minutes:

  • The current version has trouble answering basic AI safety questions. For example:

Why do people think AI is an existential risk? People think AI is an existential risk because of the possibility of a superintelligent AI system with recursive self-improvement capabilities, which could lead to catastrophic consequences like turning humans into paper clips, as highlighted in[1]. However, the perception of AI as an existential risk is not universal, and some people may discount it as a problem that is too vague or unlikely to outweigh other issues facing the world, as discussed in[1]. Nonetheless, technical AI safety research, defined as quality research aimed at long-term outcomes that are reliably beneficial, is necessary to mitigate the risks of AI[2]. Additionally, some argue that rapid increases in human power could be achieved without increasing existential risks, including through skillful AI development, but it is important to prioritize technical AI safety research to avoid catastrophic consequences[3].

  • The search could be much improved by curating the dataset. An easy way to do this is to exclude posts with low upvote counts. Though eventually you'll want to do more, eg by being opinionated about what to include.
  • It might be worth having a chatbot that just talks people through the "extended bad alignment take bingo", that is all the reasons why the easy solutions people like to come up with don't work. Here you could just exclude all proposals for actual alignment solutions from the dataset (and you can avoid having to make calls about what agendas have promise vs. which ones are actually nonsensical)
  • It would be very useful to have a feedback function where people can mark wrong answers. If we want to make this good, we'll need to red-team the model and make sure it answers all the basic questions correctly, probably by curating a Question-Answer dataset
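
The upvote-based curation suggested in the list above could be sketched roughly as follows. The `Post` structure, the `score` field name, and the threshold of 20 are all assumptions for illustration, not details of the actual AlignmentSearch dataset.

```python
from dataclasses import dataclass


@dataclass
class Post:
    title: str
    score: int  # upvote count; field name is an assumption


def curate(posts, min_score=20):
    """Keep only posts whose score meets the (arbitrary) threshold."""
    return [p for p in posts if p.score >= min_score]


posts = [
    Post("Well-received explainer", 150),
    Post("Low-effort take", 3),
    Post("Borderline post", 20),
]
kept = curate(posts)
print([p.title for p in kept])
```

A score cutoff is only a first pass; more opinionated curation would need extra signals beyond what this filter uses.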
Comment by Lauro Langosco on Against an AI Research Moratorium · 2023-03-31T14:28:18.713Z · LW · GW

This seems wrong. Here's an incomplete list of reasons why:

  1. If the 3 leading labs join the moratorium and AGI is stealthily developed by the 4th, then the arrival of AGI will in fact have been slowed by the lead time of the first 3 labs + the slowdown that the 4th incurs by working in secret.
  2. The point of this particular call for a 6-month moratorium is not primarily to slow anyone down (and as has been pointed out by others, it is possible that OpenAI wasn't even planning to start training GPT-5 in the next few months). Rather, the point is to form a coalition to support future policies, e.g. a government-supported moratorium.
  3. It is actually fairly hard to build compute clusters in secret, because you can just track what comes out of the chip fabs and where it goes
  4. While not straightforward, it's also feasible to monitor existing clusters, see e.g.
Comment by Lauro Langosco on Deep Deceptiveness · 2023-03-29T11:40:37.199Z · LW · GW

Yeah we're on the same page here, thanks for checking!

For one thing, you use the “might” near the end of that excerpt. That seems more compatible with a ‘maybe, maybe not’ claim, than with an ‘(almost) definitely not’ claim, right?

I feel pretty uncertain about all the factors here. One reason I overall still lean towards the 'definitely not' stance is that building a toddler AGI that is alignable in principle is only one of multiple steps that need to go right for us to get a reflectively-stable docile AGI; in particular we still need to solve the problem of actually aligning the toddler AGI. (Another step is getting labs to even seriously attempt to box it and align it, which maybe is an out-of-scope consideration here but it does make me more pessimistic).

For another thing, if we have, umm, “toddler AGI” that’s too unsophisticated to have good situational awareness, coherence, etc., then I would think that the boxing / containment problem is a lot easier than we normally think about, right? We’re not talking about hardening against a superintelligent adversary.

I agree we're not talking about a superintelligent adversary, and I agree that boxing is doable for some forms of toddler AGI. I do think you need coherence; if the toddler AGI is incoherent, then any "aligned" behavioral properties it has will also be incoherent, and something unpredictable (and so probably bad) will happen when the AGI becomes more capable or more coherent. (Flagging that I'm not sure "coherent" is the right way to talk about this... wish I had a more precise concept here.)

We can use non-reflectively-endorsed desires to help tide us over until the toddler AGI develops enough reflectivity to form any reflectively-endorsed desires at all.

I agree a non-reflective toddler AGI is in many ways easier to deal with. I think we will have problems at the threshold where the tAGI is first able to reflect on its goals and realizes that the RLHF-instilled desires aren't going to imply docile behavior. (If we can speculate about how a superintelligence might extrapolate a set of trained-in desires and realize that this process doesn't lead to a good outcome, then the tAGI can reason the same way about its own desires).

(I agree that if we can get aligned desires that are stable under reflection, then maybe the 'use non-endorsed desires to tide us over' plan could work. Though even then you need to somehow manage to prevent the tAGI from reflecting on its desires until you get the desires to a point where they stay aligned under reflection, and I have no idea how you would do something like that - we currently just don't have that level of fine control over capabilities).

The basic problem here is the double-bind where we need the toddler AGI to be coherent, reflective, capable of understanding human intent (etc) in order for it to be robustly alignable at all, even though those are exactly the incredibly dangerous properties that we really want to stay away from. My guess is that the reason Nate's story doesn't hypothesize a reflectively-endorsed desire to be nondeceptive is that reflectively-stable aligned desires are really hard / dangerous to get, and so it seems better / at least not obviously worse to go for eliezer-corrigibility instead.

Some other difficulties that I see:

  1. The 'capability profile' (ie the relative levels of the toddler AGI's capabilities) is going to be weird / very different from that of humans; that is, once the AGI has human-level coherence and human-level understanding of human intent, it has far-superhuman capabilities in other domains. (Though hopefully we're at least careful enough to remove code from the training data, etc).
  2. A coherent agentic AI at GPT-4 level capabilities could plausibly already be deceptively aligned, if it had sufficient situational awareness, and our toddler AGI is much more dangerous than that.
  3. All of my reasoning here is kind of based on fuzzy confused concepts like 'coherence' and 'capability to self-reflect', and I kind of feel like this should make me more pessimistic rather than more optimistic about the plan.
Comment by Lauro Langosco on Abstracts should be either Actually Short™, or broken into paragraphs · 2023-03-27T17:24:46.905Z · LW · GW

Yeah that seems reasonable! (Personally I'd prefer a single break between sentences 3 and 4)

Comment by Lauro Langosco on Abstracts should be either Actually Short™, or broken into paragraphs · 2023-03-27T11:07:53.930Z · LW · GW

IMO ~170 words is a decent length for a well-written abstract (well maybe ~150 is better), and the problem is that abstracts are often badly written. Steve Easterbrook has a great guide on writing scientific abstracts; here's his example template which I think flows nicely:

(1) In widgetology, it’s long been understood that you have to glomp the widgets before you can squiffle them. (2) But there is still no known general method to determine when they’ve been sufficiently glomped. (3) The literature describes several specialist techniques that measure how wizzled or how whomped the widgets have become during glomping, but all of these involve slowing down the glomping, and thus risking a fracturing of the widgets. (4) In this thesis, we introduce a new glomping technique, which we call googa-glomping, that allows direct measurement of whifflization, a superior metric for assessing squiffle-readiness. (5) We describe a series of experiments on each of the five major types of widget, and show that in each case, googa-glomping runs faster than competing techniques, and produces glomped widgets that are perfect for squiffling. (6) We expect this new approach to dramatically reduce the cost of squiffled widgets without any loss of quality, and hence make mass production viable.

Comment by Lauro Langosco on Deep Deceptiveness · 2023-03-24T22:44:27.047Z · LW · GW

Are you arguing that it’s probably not going to work, or that it’s definitely not going to work? I’m inclined to agree with the first and disagree with the second.

I'm arguing that it's definitely not going to work (I don't have 99% confidence here bc I might be missing something, but IM(current)O the things I list are actual blockers).

First bullet point → Seems like a very possible but not absolutely certain failure mode for what I wrote.

Do you mean we possibly don't need the prerequisites, or we definitely need them but that's possibly fine?

Comment by Lauro Langosco on Deep Deceptiveness · 2023-03-24T15:50:16.580Z · LW · GW

In particular, if we zap the AGI with negative reward when it’s acting from a deceptive motivation and positive reward when it’s acting from a being-helpful motivation, would those zaps turn into a reflectively-endorsed desire for “I am being docile / helpful / etc.”? Maybe, maybe not, I dunno.

Curious what your take is on these reasons to think the answer is no (IMO the first one is basically already enough):

  • In order to have reflectively-endorsed goals that are stable under capability gains, the AGI needs to have reached some threshold levels of situational awareness, coherence, and general capabilities (I think you already agree with this, but it seemed worth pointing out that this is a pretty harsh set of prerequisites, especially given that we don't have any fine control over relative capabilities (or situational awareness, or coherence, etc), so you might get an AI that can break containment before it is general or coherent enough to be alignable in principle).
  • The concept of docility that you want to align it to needs to be very specific and robust against lots of different kinds of thinking. You need it to conclude that you don't want it to deceive you / train itself for a bit longer / escape containment / etc, but at the same time you don't want it to extrapolate out your intent too much (it could be so much more helpful if it did train itself for a little longer, or if it had a copy of itself running on more compute, or it learns that there are some people out there who would like it if the AGI were free, or something else I haven't thought of)
  • You only have limited bits of optimization to expend on getting it to be inner aligned bc of deceptive alignment.
  • There's all the classic problems with corrigibility vs. consequentialism (and you can't get around those by building something that is not a reflective consequentialist, because that again is not stable under capability gains).
Comment by Lauro Langosco on Deep Deceptiveness · 2023-03-23T18:08:03.244Z · LW · GW

That's a challenge, and while you (hopefully) chew on it, I'll tell an implausibly-detailed story to exemplify a deeper obstacle.

Some thoughts written down before reading the rest of the post (list is unpolished / not well communicated)
The main problems I see:

  • There are kinds of deception (or rather kinds of deceptive capabilities / thoughts) that only show up after a certain capability level, and training before that level just won't affect them cause they're not there yet.
  • General capabilities imply the ability to be deceptive if useful in a particular circumstance. So you can't just train away the capability to be deceptive (or maybe you can, but not in a way that is robust wrt general capability gains).
  • Really you want to train against the propensity to be deceptive, rather than the capability. But propensities also change with capability level; becoming more capable is all about having more ways to achieve your goals. So eliminating propensity to be deceptive at a lower capability level does not eliminate the propensity at a higher capability level.
  • The robust way to get rid of propensity to be deceptive is to reach an attractor where more capability == less deception (within the capability range we care about), because the AI's terminal goals on some level include 'being nondeceptive'.
  • Before we can align the AI's goals to human intent in this way, the AI needs to have a good understanding of human intent, good situational awareness, and be a (more or less) unified / coherent agent. If it's not, then its goals / propensities will shift as it becomes more capable (or more situationally aware, or more coherent, etc)
  • This is a pretty harsh set of prerequisites, and is probably outside of the range of circumstances where people usually hope their method to avoid deception will work.
  • Even if methods to detect deception (narrowly conceived) work, we cannot tell apart an agent that is actually nondeceptive / aligned from an agent that e.g. just aims to play the training game (and will do something unspecified once it reaches a capability threshold that allows it to breach containment).
  • A specific (maybe too specific) problem that can still happen in this scenario: you might get an AI that is overall capable, but just learns to not think long enough about scenarios that would lead it to try to be deceptive. This can still happen at the maximum capability levels at which we might hope to still contain an AGI that we are trying to align (i.e. somewhere around human level, optimistically).
Comment by Lauro Langosco on The hot mess theory of AI misalignment: More intelligent agents behave less coherently · 2023-03-10T22:58:27.923Z · LW · GW

(Crossposting some of my twitter comments).

I liked this criticism of alignment approaches: it makes a concrete claim that addresses the crux of the matter, and provides supporting evidence! I also disagree with it, and will say some things about why.

  1. I think that instead of thinking in terms of "coherence" vs. "hot mess", it is more fruitful to think about "how much influence is this system exerting on its environment?". Too much influence will kill humans, if directed at an outcome we're not able to choose. (The rest of my comments are all variations on this basic theme).

  2. We humans may be a hot mess, but we're far better at influencing (optimizing) our environment than any other animal or ML system. Example: we build helicopters and roads, which are very unlikely to arise by accident in a world without people trying to build helicopters or roads. If a system is good enough at achieving outcomes, it is dangerous whether or not it is a "hot mess".

  3. It's much easier for us to describe simple behaviors as utility maximization; for example a ball rolling down a hill is well-described as minimizing its potential energy. So it's natural that people will rate a dumb / simple system as being more easily described by a utility function than a smart system with complex behaviors. This does not make the smart system any less dangerous.

  4. Misalignment risk is not about expecting a system to "inflexibly" or "monomaniacally" pursue a simple objective. It's about expecting systems to pursue objectives at all. The objectives don't need to be simple or easy to understand.

  5. Intelligence isn't the right measure to have on the X-axis - it evokes a math professor in an ivory tower, removed from the goings-on in the real world. A better word might be capability: "how good is this entity at going out into the world and getting more of what it wants?"

  6. In practice, AI labs are working on improving capability, rather than intelligence defined abstractly in a way that does not connect to capability. And capability is about achieving objectives.

  7. If we build something more capable than humans in a certain domain, we should expect it to be "coherent" in the sense that it will not make any mistakes that a smart human wouldn't have made. Caveat: it might make more of a particular kind of mistake, and make up for it by being better at other things. This happens with current systems, and IMO plausibly we'll see something similar even in the kind of system I'd call AGI. But at some point the capabilities of AI systems will be general enough that they will stop making mistakes that are exploitable by humans. This includes mistakes like "fail to notice that your programmer could shut you down, and that would stop you from achieving any of your objectives".

Comment by Lauro Langosco on Full Transcript: Eliezer Yudkowsky on the Bankless podcast · 2023-02-25T16:43:49.500Z · LW · GW

Maybe Francois Chollet has coherent technical views on alignment that he hasn't published or shared anywhere (the blog post doesn't count, for reasons that are probably obvious if you read it), but it doesn't seem fair to expect Eliezer to know / mention them.

Comment by Lauro Langosco on Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] · 2023-01-21T22:11:44.825Z · LW · GW


Comment by Lauro Langosco on Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] · 2023-01-14T13:35:47.172Z · LW · GW

Is there an open-source implementation of causal scrubbing available?

Comment by Lauro Langosco on Why The Focus on Expected Utility Maximisers? · 2023-01-01T11:58:49.978Z · LW · GW

I'm confused about the example you give. In the paragraph, Eliezer is trying to show that you ought to accept the independence axiom, cause you can be Dutch booked if you don't. I'd think if you're updateless, that means you already accept the independence axiom (cause you wouldn't be time-consistent otherwise).

And in that sense it seems reasonable to assume that someone who doesn't already accept the independence axiom is also not updateless.

Comment by Lauro Langosco on Slightly against aligning with neo-luddites · 2022-12-29T12:09:47.233Z · LW · GW

I agree it's important to be careful about which policies we push for, but I disagree both with the general thrust of this post and the concrete example you give ("restrictions on training data are bad").

Re the concrete point: it seems like the clear first-order consequence of any strong restriction is to slow down AI capabilities. Effects on alignment are more speculative and seem weaker in expectation. For example, it may be bad if it were illegal to collect user data (e.g. from users of ChatGPT) for fine-tuning, but such data collection is unlikely to fall under restrictions that digital artists are lobbying for.

Re the broader point: yes, it would be bad if we just adopted whatever policy proposals other groups propose. But I don't think this is likely to happen! In a successful alliance, we would find common interests between us and other groups worried about AI, and push specifically for those. Of course it's not clear that this will work, but it seems worth trying.

Comment by Lauro Langosco on Richard Ngo's Shortform · 2022-12-14T18:03:00.109Z · LW · GW

I also think that often "the AI just maximizes reward" is a useful simplifying assumption. That is, we can make an argument of the form "even if the AI just maximizes reward, it still takes over; if it maximizes some correlate of the reward instead, then we have even less control over what it does and so are even more doomed".

(Though of course it's important to spell the argument out)

Comment by Lauro Langosco on Richard Ngo's Shortform · 2022-12-14T17:58:02.790Z · LW · GW

I agree with your general point here, but I think Ajeya's post actually gets this right, eg

There is some ambiguity about what exactly “maximize reward” means, but once Alex is sufficiently powerful -- and once human knowledge/control has eroded enough -- an uprising or coup eventually seems to be the reward-maximizing move under most interpretations of “reward.”


What if Alex doesn’t generalize to maximizing its reward in the deployment setting? What if it has more complex behaviors or “motives” that aren’t directly and simply derived from trying to maximize reward? This is very plausible to me, but I don’t think this possibility provides much comfort -- I still think Alex would want to attempt a takeover.

Comment by Lauro Langosco on Trying to Make a Treacherous Mesa-Optimizer · 2022-11-14T14:17:37.320Z · LW · GW

FWIW I believe I wrote that sentence and I now think this is a matter of definition, and that it’s actually reasonable to think of an agent that e.g. reliably solves a maze as an optimizer even if it does not use explicit search internally.

Comment by Lauro Langosco on I'm planning to start creating more write-ups summarizing my thoughts on various issues, mostly related to AI existential safety. What do you want to hear my nuanced takes on? · 2022-09-24T14:25:31.953Z · LW · GW
  • importance / difficulty of outer vs inner alignment
  • outlining some research directions that seem relatively promising to you, and explaining why they seem more promising than others
Comment by Lauro Langosco on Common misconceptions about OpenAI · 2022-08-27T14:41:01.080Z · LW · GW

I would be very curious to see your / OpenAI's responses to Eliezer's Dimensions of Operational Adequacy in AGI Projects post. Which points do you / OpenAI leadership disagree with? Insofar as you agree but haven't implemented the recommendations, what's stopping you?

Comment by Lauro Langosco on Common misconceptions about OpenAI · 2022-08-27T14:20:20.597Z · LW · GW

People at OpenAI regularly say things like

And you say:

  • OpenAI leadership tend to put more likelihood on slow takeoff, are more optimistic about the possibility of solving alignment, especially via empirical methods that rely on capabilities

AFAICT, no-one from OpenAI has publicly explained why they believe that RLHF + amplification is supposed to be enough to safely train systems that can solve alignment for us. The blog post linked above says "we believe" four times, but does not take the time to explain why anyone believes these things.

Writing up this kind of reasoning is time-intensive, but I think it would be worth it: if you're right, then the value of information for the rest of the community is huge; if you're wrong, it's an opportunity to change your minds.

Comment by Lauro Langosco on Gradient descent doesn't select for inner search · 2022-08-16T15:13:20.044Z · LW · GW

(Note that I'm not making a claim about how search is central to human capabilities relative to other species; I'm just saying search is useful in general. Plausibly also for other species, though it is more obvious for humans)

From my POV, the "cultural intelligence hypothesis" is not a counterpoint to the importance of search. It's obvious that culture is important for human capabilities, but it also seems obvious to me that search is important. Building printing presses or steam engines is not something that a bundle of heuristics can do, IMO, without gaining those heuristics via a long process of evolutionary trial-and-error. And it seems important that humans can build steam engines without generations of breeding better steam-engine-engineers.

Re AlphaStar and AlphaZero: I've never played Starcraft, so I don't have good intuitions for what capabilities are needed. But on the definitions of search that I use, the AlphaZero policy network definitely performs search. In fact out of current systems it's probably the one that most clearly performs search!

...Now I'm wondering whether our disagreement just comes from having different definitions of search in mind. Skimming your other comments above, it seems like you take a more narrow view of search = literally iterating through solutions and picking a good one. This is fine by me definitionally, but I don't think the fact that models will not learn search(narrow) is very interesting for alignment, or has the implications that you list in the post? Though ofc I might still be misunderstanding you here.

Comment by Lauro Langosco on Gradient descent doesn't select for inner search · 2022-08-15T21:11:31.125Z · LW · GW

I think you overestimate the importance of the genomic bottleneck. It seems unlikely that humans would have been as successful as we are if we were... the alternative to the kind of algorithm that does search, which you don't really describe.

Performing search to optimize an objective seems really central to our (humans') capabilities, and if you want to argue against that I think you should say something about what an algorithm is supposed to look like that is anywhere near as capable as humans but doesn't do any search.

Comment by Lauro Langosco on On how various plans miss the hard bits of the alignment challenge · 2022-08-08T17:11:36.689Z · LW · GW


Comment by Lauro Langosco on Two-year update on my personal AI timelines · 2022-08-05T20:23:17.483Z · LW · GW

Gotcha, this makes sense to me now, given the assumption that to get AGI we need to train a P-parameter model on the optimal scaling, where P is fixed. Thanks!

...though now I'm confused about why we would assume that. Surely that assumption is wrong?

  • Humans are very constrained in terms of brain size and data, so we shouldn't assume that these quantities are scaled optimally in some sense that generalizes to deep learning models.
  • Anyhow we don't need to guess the amount of data the human brain needs: we can just estimate it directly, just like we estimate brain-parameter count.

To move to a more general complaint about the bio anchors paradigm: it never made much sense to assume that current scaling laws would hold; clearly scaling will change once we train on new data modalities; we know that human brains have totally different scaling laws than DL models; and an AGI architecture will again have different scaling laws. Going with the GPT-3 scaling law is a very shaky best guess.

So it seems weird to me to put so much weight on this particular estimate, such that someone figuring out how to scale models much more cheaply would update one in the direction of longer timelines! Surely the bio anchor assumptions cannot possibly be strong enough to outweigh the commonsense update of 'whoa, we can scale much more quickly now'?

The only way that update makes sense is if you actually rely mostly on bio anchors to estimate timelines (rather than taking bio anchors to be a loose prior, and updating off the current state and rate of progress in ML), which seems very wrong to me.

Comment by Lauro Langosco on Two-year update on my personal AI timelines · 2022-08-03T12:57:09.947Z · LW · GW

But in my report I arrive at a forecast by fixing a model size based on estimates of brain computation, and then using scaling laws to estimate how much data is required to train a model of that size. The update from Chinchilla is then that we need more data than I might have thought.

I'm confused by this argument. The old GPT-3 scaling law is still correct, just not compute-optimal. If someone wanted to, they could still go on using the old scaling law. So discovering better scaling can only lead to an update towards shorter timelines, right?

(Except if you had expected even better scaling laws by now, but it didn't sound like that was your argument?)
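
One way to make the "can only shorten timelines" intuition concrete is the toy sketch below. It assumes a forecaster who picks the cheapest known training recipe that reaches some fixed capability target; the recipe names and compute costs are made up for illustration and are not real scaling-law fits.

```python
# Toy sketch: a newly discovered training recipe adds an option, so the
# cheapest known way to reach a fixed capability target cannot get more expensive.
# Recipe names and costs (in FLOP) are hypothetical, not real estimates.

def cheapest_cost(recipes: dict[str, float]) -> float:
    """Compute cost of the cheapest known recipe that reaches the target."""
    return min(recipes.values())

# Before the discovery, only the old (not compute-optimal) recipe is known.
known_before = {"old_scaling_law": 1e25}

# Afterwards the old recipe is still available; a new one is added.
known_after = {**known_before, "new_scaling_law": 3e24}

cost_before = cheapest_cost(known_before)
cost_after = cheapest_cost(known_after)

# Adding an option never raises the minimum.
assert cost_after <= cost_before
```

The inequality holds whatever numbers are plugged in. It only stops applying if the forecast fixes some quantity in advance (such as parameter count) rather than minimizing cost over all known recipes.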

Comment by Lauro Langosco on On how various plans miss the hard bits of the alignment challenge · 2022-07-12T09:58:19.058Z · LW · GW

What would make you change your mind about robustness of behavior (or interpretability of internal representations) through the sharp left turn? Or about the existence of such a sharp left turn, as opposed to smooth scaling of ability to learn in-context?

For example, would you change your mind if we found smooth scaling laws for (some good measure of) in-context learning?

Comment by Lauro Langosco on Let's See You Write That Corrigibility Tag · 2022-06-22T20:10:35.950Z · LW · GW

(This was an interesting exercise! I wrote this before reading any other comments; obviously most of the bullet points are unoriginal)

The basics

  • It doesn't prevent you from shutting it down
  • It doesn't prevent you from modifying it
  • It doesn't deceive or manipulate you
  • It does not try to infer your goals and achieve them; instead it just executes the most straightforward, human-common-sense interpretation of its instructions
  • It performs the task with minimal side-effects (but without explicitly minimizing a measure of side-effects)
  • If it self-modifies or constructs other agents, it will preserve corrigibility. Preferably it does not self-modify or construct other intelligent agents at all


  • Its objective is no more broad or long-term than is required to complete the task
  • In particular, it only cares about results within a short timeframe (chosen to be as short as possible while still enabling it to perform the task)
  • It does not cooperate (in the sense of helping achieve their objective) with future, past, or (duplicate) concurrent versions of itself, unless intended by the operator


  • It doesn't maximize the probability of getting the task done; it just does something that gets the task done with (say) >99% probability
  • It doesn't "optimize too hard" (not sure how to state this better)
    • Example: when communicating with humans (e.g. to query them about their instructions), it does not maximize communication bandwidth / information transfer; it just communicates reasonably well
  • Its objective / task does not consist in maximizing any quantity; rather, it follows a specific bounded instruction (like "make me a coffee", or "tell me a likely outcome of this plan") and then shuts down
  • It doesn't optimize over causal pathways you don't want it to: for example, if it is meant to predict the consequences of a plan, it does not try to make its prediction more likely to happen
  • It does not try to become more consequentialist with respect to its goals
    • for example, if in the middle of deployment the system reads a probability theory textbook, learns about Dutch book theorems, and decides that EV maximization is the best way to achieve its goals, it will not change its behavior
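
The contrast between maximizing and merely clearing a success threshold, from the first bullet in this group, can be sketched as follows. This is a minimal illustration, assuming plans come with known success probabilities and are listed from most to least straightforward; the `Plan` type, the plan names, and the numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Plan:
    name: str
    success_prob: float  # assumed known here; in practice only an estimate


def maximizer(plans: Sequence[Plan]) -> Plan:
    """Pick whichever plan has the highest success probability."""
    return max(plans, key=lambda p: p.success_prob)


def satisficer(plans: Sequence[Plan], threshold: float = 0.99) -> Optional[Plan]:
    """Pick the first plan that clears the threshold, without ranking further.

    Returns None if no plan qualifies, leaving the decision to the operators.
    """
    for plan in plans:
        if plan.success_prob > threshold:
            return plan
    return None


# Hypothetical plans for "make me a coffee", most straightforward first.
plans = [
    Plan("use the kitchen coffee machine", 0.995),
    Plan("buy a backup machine first, then brew", 0.9990),
    Plan("secure the building's power supply, then brew", 0.9999),
]

print(satisficer(plans).name)  # the ordinary plan already clears 99%
print(maximizer(plans).name)   # the maximizer prefers the most extreme plan
```

The satisficer stops at the first acceptable plan instead of searching for the best one, so it has no pull towards increasingly drastic plans that buy tiny gains in success probability.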

No weird stuff

  • It doesn't try to acausally cooperate or trade with far-away possible AIs
  • It doesn't come to believe that it is being simulated by multiverse-aliens trying to manipulate the universal prior (or whatever)
  • It doesn't attempt to simulate a misaligned intelligence
  • In fact it doesn't simulate any other intelligences at all, except to the minimal degree of fidelity that is required to perform the task

Human imitation

  • Where possible, it should imitate a human that is trying to be corrigible
  • To the extent that this is possible while completing the task, it should try to act like a helpful human would (but not unboundedly minimizing the distance in behavior-space)
  • When this is not possible (e.g. because it is executing strategies that a human could not), it should stay near to human-extrapolated behaviour ("what would a corrigible, unusually smart / competent / knowledgeable human do?")
  • To the extent that meta-cognition is necessary, it should think about itself and corrigibility in the same way its operators do: its objectives are likely misspecified, therefore it should not become too consequentialist, or "optimize too hard", and [other corrigibility desiderata]

Querying / robustness

  • Insofar as this is feasible it presents its plans to humans for approval, including estimates of the consequences of its plans
  • It will raise an exception, i.e. pause execution of its plans and notify its operators if
    • its instructions are unclear
    • it recognizes a flaw in its design
    • it sees a way in which corrigibility could be strengthened
    • in the course of performing its task, the ability of its operators to shut it down or modify it would be limited
    • in the course of performing its task, its operators would predictably be deceived / misled about the state of the world
Comment by Lauro Langosco on Relaxed adversarial training for inner alignment · 2022-06-09T21:02:10.135Z · LW · GW

Minor comment on clarity: you don't explicitly define relaxed adversarial training (it's only mentioned in the title and the conclusion), which is a bit confusing for someone coming across the term for the first time. Since this is the current reference post for RAT I think it would be nice if you did this explicitly; for example, I'd suggest renaming the second section to 'Formalizing relaxed adversarial training', and within the section call it that instead of 'Paul's approach'.

Comment by Lauro Langosco on Some reasons why a predictor wants to be a consequentialist · 2022-05-03T22:10:38.984Z · LW · GW

But since we're not doing that, there's nothing to counteract the negative gradient that removes the inner optimizer.

During training, the inner optimizer has the same behavior as the benign model: while it's still dumb it just doesn't know how to do better; when it becomes smarter and reaches strategic awareness it will be deceptive.

So training does not select for a benign model over a consequentialist one (or at least it does not obviously select for a benign model; I don't know how the inductive biases will work out here). Once the consequentialist acts and takes over the training process it is already too late.

Re: tameness of L(M) - min_{M'} L(M') (using min cause L is a loss), some things that come to mind are

a) L(M) - min_{M'} L(M') is always at least zero, so it can be minimized by a strategy that takes over the input channel and induces random noise so no strategy can do better than random, thus L(M) - min_{M'} L(M') = 0.

b) Depending on which model class the min is taken over, the model can get less than zero loss by hacking its environment to get more compute (thus escaping the model class in the min)

(probably this list can be extended)