Long-Term Future Fund Ask Us Anything (September 2023) 2023-08-31T00:28:13.953Z
Introducing the Center for AI Policy (& we're hiring!) 2023-08-28T21:17:11.703Z
Long-Term Future Fund: April 2023 grant recommendations 2023-08-02T07:54:49.083Z
Challenge: construct a Gradient Hacker 2023-03-09T02:38:32.999Z
Wentworth and Larsen on buying time 2023-01-09T21:31:24.911Z
Ways to buy time 2022-11-12T19:31:10.411Z
Thomas Larsen's Shortform 2022-11-08T23:34:10.214Z
Instead of technical research, more people should focus on buying time 2022-11-05T20:43:45.215Z
Possible miracles 2022-10-09T18:17:01.470Z
Neural Tangent Kernel Distillation 2022-10-05T18:11:54.687Z
7 traps that (we think) new alignment researchers often fall into 2022-09-27T23:13:46.697Z
Alignment Org Cheat Sheet 2022-09-20T17:36:58.708Z
Inner Alignment via Superpowers 2022-08-30T20:01:52.129Z
(My understanding of) What Everyone in Technical Alignment is Doing and Why 2022-08-29T01:23:58.073Z
Finding Goals in the World Model 2022-08-22T18:06:48.213Z
The Core of the Alignment Problem is... 2022-08-17T20:07:35.157Z
Project proposal: Testing the IBP definition of agent 2022-08-09T01:09:37.687Z
Broad Basins and Data Compression 2022-08-08T20:33:16.846Z
Intuitive Explanation of AIXI 2022-06-12T21:41:36.240Z
Infra-Bayesianism Distillation: Realizability and Decision Theory 2022-05-26T21:57:19.592Z


Comment by Thomas Larsen (thomas-larsen) on Thomas Larsen's Shortform · 2024-05-28T01:04:35.999Z · LW · GW

Yeah, actual FLOPs are the baseline thing that's used in the EO. But the OpenAI/GDM/Anthropic RSPs all reference effective FLOPs. 

If there's a large algorithmic improvement, you might have a large gap in capability between two models trained with the same FLOPs, which is not desirable.  Ideal thresholds in regulation / scaling policies are as tightly tied as possible to the risks. 

Another downside that FLOPs and E-FLOPs share is that it's unpredictable what capabilities a 1e26 or 1e28 FLOP model will have.  It's also unclear what capabilities will emerge from a small amount of scaling: it's possible that within a 4x FLOP scale-up you get high capabilities that had not appeared at all in the smaller model. 
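To make the FLOPs-vs-E-FLOPs distinction concrete, here is a minimal sketch. All numbers (the multiplier, the compute figures) are hypothetical placeholders, not estimates from the EO or any lab's RSP:

```python
# Toy sketch (all numbers hypothetical): "effective FLOPs" scale raw training
# compute by an algorithmic-efficiency multiplier relative to some baseline.
def effective_flops(raw_flops: float, efficiency_multiplier: float) -> float:
    """Raw training FLOPs adjusted for algorithmic progress over a baseline."""
    return raw_flops * efficiency_multiplier

# Two models trained with the same raw compute...
model_a = effective_flops(1e26, 1.0)   # baseline algorithms
model_b = effective_flops(1e26, 10.0)  # 10x algorithmic improvement (made up)

# ...land an order of magnitude apart in effective compute, so a threshold
# keyed to raw FLOPs alone treats them identically despite the capability gap.
print(f"{model_a:.0e} vs {model_b:.0e}")
```

This is the tempting fix described below; the ambiguity is in where the multiplier comes from.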

Comment by Thomas Larsen (thomas-larsen) on Thomas Larsen's Shortform · 2024-05-27T22:24:23.412Z · LW · GW

Credit: Mainly inspired by talking with Eli Lifland. Eli has a potentially-published-soon document here.  

The basic case against Effective FLOPs (E-FLOPs). 

  1. We're seeing many capabilities emerge from scaling AI models, and this makes compute (measured by FLOPs utilized) a natural unit for thresholding model capabilities. But compute is not a perfect proxy for capability because of algorithmic differences. Algorithmic progress can enable more performance out of a given amount of compute. This makes the idea of effective FLOP tempting: add a multiplier to account for algorithmic progress. 
  2. But performing this multiplication turns out to be quite ambiguous. 
    1. Effective FLOPs depend on the underlying benchmark, and it’s often not stated which benchmark people are talking about. 
      1. People often use perplexity, but post-training enhancements like scaffolding or chain of thought don’t improve perplexity, yet do improve downstream task performance. 
      2. There are examples of algorithmic changes that cause variable performance gains depending on the benchmark. 
    2. Effective FLOPs often depend on the scale of the model you are testing. See the graph below: the compute-efficiency gain from LSTMs to transformers is not invariant to scale. This means that you can’t just say that the jump from X to Y is a factor-of-Z improvement in capability per FLOP.  This leads to all sorts of unintuitive properties of effective FLOPs. For example, if you are using 2016-next-token-validation E-FLOPs, and LSTM scaling becomes flat on the benchmark, you could easily imagine that at very large scales you get a 1Mx E-FLOP improvement from switching to transformers, even if the actual capability difference is small. 
    3. If we move away from pretrained LLMs, I think E-FLOPs become even harder to define, e.g., we may be able to build systems that are better at reasoning but worse at knowledge retrieval. E-FLOPs do not seem very adaptable. 
    4. (These lines would need to be parallel for the compute-efficiency ratio to be scale-invariant on test loss.) 
  3. Users of E-FLOPs often don’t specify the time, scale, or benchmark they are measuring with respect to, which makes the concept very confusing. In particular, the concept has picked up lots of steam and is used in the frontier lab scaling policies, but is not clearly defined in any of those documents. 
    1. Anthropic: “Effective Compute: We define effective compute as roughly the amount of compute it would have taken to train a model if no improvements to pretraining or fine-tuning techniques are included. This is operationalized by tracking the scaling of model capabilities (e.g. cross-entropy loss on a test set).”
      1. This specifies the metric, but doesn’t clearly specify (a) the techniques that count as the baseline, (b) the model scale with respect to which one measures E-FLOPs, or (c) how to handle post-training enhancements that don’t improve log loss but do dramatically improve downstream task capability. 
    2. OpenAI on when they will run their evals: “This would include whenever there is a >2x effective compute increase or major algorithmic breakthrough"
      1. They don’t define effective compute at all. 
    3. Since there is significant ambiguity in the concept, it seems good to clarify what it even means. 
  4. Basically, I think that E-FLOPs are confusing, and when we are tempted to use FLOPs, we’re usually better off talking directly about benchmark scores. For example, instead of saying “every 2x effective FLOP increase, we’re going to run [more thorough evaluations, e.g. the ASL-3 evaluations]”, say “every 5% performance increase on [a simple benchmark to run, like MMLU, GAIA, or GPQA], we’re going to run [more thorough evaluations]”. I think this is much clearer, much less likely to have weird behavior, and much more robust to changes in model design. 
    1. Running the simple benchmarks isn’t free, but the cost is small. 
    2. A real concern is that it is easier to game benchmarks than FLOPs. But I’m concerned that you could get benchmark gaming just the same with E-FLOPs because E-FLOPs are benchmark dependent — you could make your model perform poorly on the relevant benchmark and then claim that you didn’t scale E-FLOPs at all, even if you clearly have a broadly more capable model. 
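The benchmark-score trigger proposed in (4) can be sketched in a few lines. The 5% delta and the benchmark names are placeholders carried over from the example above, not a real policy:

```python
# Sketch of a benchmark-score trigger for running thorough evaluations, as an
# alternative to "every 2x effective FLOPs". Thresholds here are made up.
def should_run_thorough_evals(last_evaluated_score: float,
                              current_score: float,
                              trigger_delta: float = 0.05) -> bool:
    """Trigger the expensive evaluation suite (e.g. the ASL-3 evals) whenever
    a cheap benchmark score (e.g. MMLU accuracy, in [0, 1]) has improved by
    trigger_delta since the last thorough evaluation."""
    return current_score - last_evaluated_score >= trigger_delta

print(should_run_thorough_evals(0.70, 0.73))  # only +3%: below the trigger
print(should_run_thorough_evals(0.70, 0.76))  # +6%: crosses the 5% trigger
```

Note that this trigger is defined entirely by an observable score, so it sidesteps the baseline-technique and model-scale ambiguities that E-FLOPs inherit.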


Appendix A3 of the linked document also discusses limitations of effective FLOPs. 

Comment by Thomas Larsen (thomas-larsen) on Matthew Barnett's Shortform · 2024-01-28T07:16:33.463Z · LW · GW

The fact that AIs will be able to coordinate well with each other, and thereby choose to "merge" into a single agent

My response: I agree AIs will be able to coordinate with each other, but "ability to coordinate" seems like a continuous variable that we will apply pressure to incrementally, not something that we should expect to be roughly infinite right at the start. Current AIs are not able to "merge" with each other.

Ability to coordinate being continuous doesn't preclude sufficiently advanced AIs acting like a single agent. Why would it need to be infinite right at the start? 

And of course current AIs being bad at coordination is true, but this doesn't mean that future AIs won't be.  

Comment by Thomas Larsen (thomas-larsen) on Evolution provides no evidence for the sharp left turn · 2023-10-11T02:25:56.764Z · LW · GW

Thanks for the response! 

If instead of reward circuitry inducing human values, evolution directly selected over policies, I'd expect similar inner alignment failures.

I very strongly disagree with this. "Evolution directly selecting over policies" in an ML context would be equivalent to iterated random search, which is essentially a zeroth-order approximation to gradient descent. Under certain simplifying assumptions, they are actually equivalent. It's the loss landscape and parameter-function map that are responsible for most of a learning process's inductive biases (especially for large amounts of data). See: Loss Landscapes are All You Need: Neural Network Generalization Can Be Explained Without the Implicit Bias of Gradient Descent

I think I understand these points, and I don't see how this contradicts what I'm saying. I'll try rewording. 

Consider the following gaussian process: 
[Image: posterior samples from a Gaussian process fit, from “What is Gaussian Process? [Intuitive Explanation]” on Medium]

Each blue line represents a possible fit of the training data (the red points), and so which one of these is selected by a learning process is a question of inductive bias. I don't have a formalization, but I claim: if your data-distribution is sufficiently complicated, by default, OOD generalization will be poor. 
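A minimal numerical sketch of this claim (the kernel, training points, and ranges are illustrative, not taken from the original figure): posterior uncertainty collapses near the data and reverts to the prior far away, so many different fits remain consistent out-of-distribution.

```python
import numpy as np

def rbf(a, b, length_scale=1.0):
    # Squared-exponential kernel k(x, x') = exp(-0.5 * (x - x')^2 / l^2)
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

x_train = np.array([-2.0, -1.0, 0.5, 1.5])   # the "red points"
y_train = np.sin(x_train)
x_test = np.linspace(-3.0, 8.0, 45)          # extends far past the data

K = rbf(x_train, x_train) + 1e-8 * np.eye(len(x_train))
K_s = rbf(x_train, x_test)
mu = K_s.T @ np.linalg.solve(K, y_train)     # posterior mean
cov = rbf(x_test, x_test) - K_s.T @ np.linalg.solve(K, K_s)
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))  # posterior uncertainty

# Tiny uncertainty on the training points; near-prior uncertainty OOD, where
# the many "blue line" samples would diverge from one another.
print(std[np.argmin(np.abs(x_test - 0.5))], std[-1])
```

The divergence of the posterior std at large x is the quantitative version of "which fit gets selected is a question of inductive bias."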

Now, you might ask: how is this consistent with capabilities generalizing? I note that they haven't generalized all that well so far, but once they do, it will be because the learned algorithm has found exploitable patterns in the world and methods of reasoning that generalize far OOD. 

You've argued that there are different parameter-function maps, so evolution and NNs will generalize differently. This is of course true, but I think it's beside the point. My claim is that doing selection over a dataset with sufficiently many proxies that fail OOD, without a particularly benign inductive bias, leads (with high probability) to the selection of a function that fails OOD. Since most generalizations are bad, we should expect bad behavior from NNs as well as from evolution. I continue to think evolution is valid evidence for this claim, and the specific inductive bias isn't load-bearing on this point -- the related load-bearing assumption is the lack of an inductive bias that is benign. 

If we had reasons to think that NNs were particularly benign and that once NNs became sufficiently capable, their alignment would also generalize correctly, then you could make an argument that we don't have to worry about this, but as yet, I don't see a reason to think that a NN parameter function map is more likely to lead to inductive biases that pick a good generalization by default than any other set of inductive biases. 

It feels to me as if your argument is that we understand neither evolution nor NN inductive biases, and so we can't make strong predictions about OOD generalization, so we are left with our high uncertainty prior over all of the possible proxies that we could find. It seems to me that we are far from being able to argue things like "because of inductive bias from the NN architecture, we'll get non-deceptive AIs, even if there is a deceptive basin in the loss landscape that could get higher reward." 

I suspect you think bad misgeneralization happens only when you have a two layer selection process (and this is especially sharp when there's a large time disparity between these processes), like evolution setting up the human within lifetime learning. I don't see why you think that these types of functions would be more likely to misgeneralize. 

(only responding to the first part of your comment now, may add on additional content later) 

Comment by Thomas Larsen (thomas-larsen) on Evaluating the historical value misspecification argument · 2023-10-06T01:37:39.237Z · LW · GW
Comment by Thomas Larsen (thomas-larsen) on Introducing the Center for AI Policy (& we're hiring!) · 2023-08-30T17:40:02.545Z · LW · GW

We haven't asked specific individuals if they're comfortable being named publicly yet, but if advisors are comfortable being named, I'll announce that soon. We're also in the process of having conversations with academics, AI ethics folks,  AI developers at small companies, and other civil society groups to discuss policy ideas with them.

So far, I'm confident that our proposals will not impede the vast majority of AI developers, but if we end up receiving feedback that this isn't true, we'll either rethink our proposals or remove this claim from our advocacy efforts.  Also, as stated in a comment below:

I’ve changed the wording to “Only a few technical labs (OpenAI, DeepMind, Meta, etc) and people working with their models would be regulated currently.” The point of this sentence is to emphasize that this definition still wouldn’t apply to the vast majority of AI development -- most AI development uses small systems, e.g. image classifiers, self driving cars, audio models, weather forecasting, the majority of AI used in health care, etc.

Comment by Thomas Larsen (thomas-larsen) on Introducing the Center for AI Policy (& we're hiring!) · 2023-08-30T16:01:58.053Z · LW · GW

I’ve changed the wording to “Only a few technical labs (OpenAI, DeepMind, Meta, etc) and people working with their models would be regulated currently.” The point of this sentence is to emphasize that this definition still wouldn’t apply to the vast majority of AI development -- most AI development uses small systems, e.g. image classifiers, self driving cars, audio models, weather forecasting, the majority of AI used in health care, etc.

Comment by Thomas Larsen (thomas-larsen) on Introducing the Center for AI Policy (& we're hiring!) · 2023-08-30T13:59:50.268Z · LW · GW

(ETA: these are my personal opinions) 


  1. We're going to make sure to exempt existing open source models. We're trying to avoid pushing the frontier of open source AI, not trying to put the models that are already out there back in the box, which I agree is intractable. 
  2. These are good points, and I decided to remove the data criteria for now in response to these considerations. 
  3. The definition of frontier AI is wide because it describes the set of models that the administration has legal authority over, not the set of models that would be restricted. The point of this is to make sure that any model that could be dangerous would be included in the definition. Some non-dangerous models will be included, because of the difficulty with predicting the exact capabilities of a model before training.  
  4. We're planning to shift to recommending a tiered system in the future, where the systems in the lower tiers have a reporting requirement but not a licensing requirement. 
  5. In order to mitigate the downside of including too many models, we have a fast track exemption for models that are clearly not dangerous but technically fall within the bounds of the definition. 
  6. I don't expect this to impact the vast majority of AI developers outside the labs. I do think that open sourcing models at the current frontier is dangerous and want to prevent future extensions of the bar. Insofar as AI development was happening on top of models produced by the labs, it would be affected. 
  7. The threshold is a work in progress. I think it's likely that they'll be revised significantly throughout this process. I appreciate the input and pushback here. 
Comment by Thomas Larsen (thomas-larsen) on Introducing the Center for AI Policy (& we're hiring!) · 2023-08-29T23:35:32.227Z · LW · GW


I spoke with a lot of other AI governance folks before launching, in part due to worries about the unilateralist's curse. I think that there is a chance this project ends up being damaging, either by being discordant with other actors in the space, committing political blunders, increasing the polarization of AI, etc. We're trying our best to mitigate these risks (and others) and are corresponding with some experienced DC folks who are giving us advice, as well as being generally risk-averse in how we act. That being said, some senior folks I've talked to are bearish on the project for reasons including the above. 

DM me if you'd be interested in more details, I can share more offline. 

Comment by Thomas Larsen (thomas-larsen) on Introducing the Center for AI Policy (& we're hiring!) · 2023-08-29T03:17:27.076Z · LW · GW

Your current threshold does include all Llama models (other than llama-1 6.7/13 B sizes), since they were trained with > 1 trillion tokens. 

Yes, this reasoning was for capabilities benchmarks specifically. Data goes further with future algorithmic progress, so I thought a narrower criterion for that one was reasonable. 

I also think 70% on MMLU is extremely low, since that's about the level of ChatGPT 3.5, and that system is very far from posing a risk of catastrophe. 

This is the threshold above which the government has the ability to say no, and it is deliberately set well before catastrophe. 

I also think that one route towards AGI in the event that we try to create a global shutdown of AI progress is by building up capabilities on top of whatever the best open source model is, and so I'm hesitant to give up the government's ability to prevent the capabilities of the best open source model from going up. 

The cutoffs also don't differentiate between sparse and dense models, so there's a fair bit of non-SOTA-pushing academic / corporate work that would fall under these cutoffs.

Thanks for pointing this out. I'll think about whether there's a way to exclude sparse models, though I'm not sure it's worth the added complexity and potential for loopholes. I'm not sure how many models fall into this category -- do you have a sense? This aggregation of models has around 40 models above the 70B threshold. 

Comment by Thomas Larsen (thomas-larsen) on Introducing the Center for AI Policy (& we're hiring!) · 2023-08-29T02:39:38.552Z · LW · GW

It's worth noting that this (and the other thresholds) are in place because we need a concrete legal definition for frontier AI, not because they exactly pin down which AI models are capable of catastrophe. It's probable that none of the current models are capable of catastrophe. We want a sufficiently inclusive definition such that the licensing authority has the legal power over any model that could be catastrophically risky.  

That being said -- Llama 2 is currently the best open-source model and it gets 68.9% on the MMLU. It seems relatively unimportant to regulate models below Llama 2 because anyone who wanted to use that model could just use Llama 2 instead. Conversely, models that are above Llama 2 capabilities are at the point where it seems plausible that they could be bootstrapped into something dangerous. Thus, our threshold was set just above the limit. 

Of course, by the time this regulation would pass, newer open-source models are likely to come out, so we could potentially set the bar higher. 
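The "sufficiently inclusive definition" logic in this thread can be sketched as a disjunction of cutoffs. The specific numbers below (FLOPs, tokens, MMLU) are illustrative placeholders assembled from the discussion, not the actual proposed definition:

```python
# Sketch of a concrete legal trigger for "frontier AI". Cutoffs are
# illustrative placeholders, not the Center for AI Policy's numbers.
def is_frontier_ai(training_flops: float,
                   training_tokens: float,
                   mmlu_accuracy: float) -> bool:
    """A model falls under licensing authority if it crosses ANY cutoff,
    erring toward inclusion because capabilities are hard to predict
    before training (non-dangerous models can use a fast-track exemption)."""
    return (training_flops >= 1e26
            or training_tokens >= 1e12
            or mmlu_accuracy >= 0.70)

# A Llama-2-70B-like profile (~2T tokens, ~68.9% MMLU) is captured by the
# token criterion even though the MMLU criterion alone would miss it.
print(is_frontier_ai(8e24, 2e12, 0.689))
```

Using "any criterion crossed" rather than "all criteria crossed" is what makes the definition wide, which is the property defended in point 3 above.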

Comment by Thomas Larsen (thomas-larsen) on DeepMind: Model evaluation for extreme risks · 2023-06-16T23:24:07.094Z · LW · GW

Yeah, this is fair,  and later in the section they say: 

Careful scaling. If the developer is not confident it can train a safe model at the scale it initially had planned, they could instead train a smaller or otherwise weaker model.

Which is good, supports your interpretation, and gets close to the thing I want, albeit less explicitly than I would have liked. 

I still think the "delay/pause" wording pretty strongly implies that the default is to wait for a short amount of time, and then keep going at the intended capability level. I think there's some sort of implicit picture that the eval result will become unconcerning in a matter of weeks to months, which I just don't see the mechanism for, short of actually good alignment progress. 

Comment by Thomas Larsen (thomas-larsen) on DeepMind: Model evaluation for extreme risks · 2023-06-16T22:11:57.075Z · LW · GW

The first line of defence is to avoid training models that have sufficient dangerous capabilities and misalignment to pose extreme risk. Sufficiently concerning evaluation results should warrant delaying a scheduled training run or pausing an existing one

It's very disappointing to me that this sentence doesn't say "cancel". As far as I understand, most people on this paper agree that we do not have alignment techniques to align superintelligence. Therefore, if the model evaluations predict an AI that is sufficiently smarter than humans, the training run should be cancelled.

Comment by Thomas Larsen (thomas-larsen) on Evolution provides no evidence for the sharp left turn · 2023-05-31T20:34:11.422Z · LW · GW
  1. Deliberately create a (very obvious[2]) inner optimizer, whose inner loss function includes no mention of human values / objectives.[3]
  2. Grant that inner optimizer ~billions of times greater optimization power than the outer optimizer.[4]
  3. Let the inner optimizer run freely without any supervision, limits or interventions from the outer optimizer.[5]

I think that the conditions for an SLT to arrive are weaker than you describe. 

For (1), it's unclear to me why you think you need to have this multi-level inner structure.[1] If instead of reward circuitry inducing human values, evolution directly selected over policies, I'd expect similar inner alignment failures. It's also not necessary that the inner values of the agent make no mention of human values / objectives, it needs to both a) value them enough to not take over, and b) maintain these values post-reflection. 

For (2), it seems like you are conflating 'amount of real world time' with 'amount of consequences-optimization'. SGD is just a much less efficient optimizer than intelligent cognition -- in-context learning happens much faster than SGD learning. When the inner optimizer starts learning and accumulating knowledge, it seems totally plausible to me that this will happen on much faster timescales than the outer selection. 

For (3), I don't think that the SLT requires the inner optimizer to run freely,  it only requires one of: 

a. the inner optimizer running much faster than the outer optimizer, such that the updates don't occur in time. 

b. the inner optimizer does gradient hacking / exploration hacking, such that the outer loss's updates are ineffective. 

  1. ^

    Evolution, of course, does have this structure, with 2 levels of selection, it just doesn't seem like this is a relevant property for thinking about the SLT. 

Comment by Thomas Larsen (thomas-larsen) on Why I'm Not (Yet) A Full-Time Technical Alignment Researcher · 2023-05-26T19:09:17.637Z · LW · GW

Sometimes, but the norm is to do 70%. This is mostly decided on a case-by-case basis, but salient factors to me include:

  • Does the person need the money? (what cost of living place are they living in, do they have a family, etc) 
  • What is the industry counterfactual? If someone would make 300k, we likely wouldn't pay them 70%, while if their counterfactual was 50k, it feels more reasonable to pay them 100% (or even more). 
  • How good is the research?
Comment by Thomas Larsen (thomas-larsen) on Why I'm Not (Yet) A Full-Time Technical Alignment Researcher · 2023-05-25T06:26:27.634Z · LW · GW

I'm a guest fund manager for the LTFF, and wanted to say that my impression is that the LTFF is often pretty excited about giving people ~6 month grants to try out alignment research at 70% of their industry counterfactual pay (the reason for the 70% is basically to prevent grift). Then, the LTFF can give continued support if they seem to be doing well. If getting this funding would make you excited to switch into alignment research, I'd encourage you to apply. 

I also think that there's a lot of impactful stuff to do for AI existential safety that isn't alignment research! For example, I'm quite into people doing strategy, policy outreach to relevant people in government, actually writing policy, capability evaluations, and leveraged community building like CBAI. 

Comment by Thomas Larsen (thomas-larsen) on Thomas Larsen's Shortform · 2023-05-18T05:17:53.836Z · LW · GW

Some claims I've been repeating in conversation a bunch: 

Safety work (I claim) should be focused on one of the following: 

  1. CEV-style full value loading, to deploy a sovereign 
  2. A task AI that contributes to a pivotal act or pivotal process. 

I think that pretty much no one is working directly on 1. I think that a lot of safety work is indeed useful for 2, but in this case, it's useful to know what pivotal process you are aiming for. Specifically, why aren't you just directly working to make that pivotal act/process happen? Why do you need an AI to help you? Typically, the response is that the pivotal act/process is too difficult to be achieved by humans. In that case, you are pushing into a difficult capabilities regime -- the AI has some goals that do not equal humanity's CEV, and so has a convergent incentive to powerseek and escape. With enough time or intelligence, you therefore get wrecked, but you are trying to operate in this window where your AI is smart enough to do the cognitive work, but is 'nerd-sniped' or focused on the particular task that you like. In particular, if this AI reflects on its goals and starts thinking big picture, you reliably get wrecked. This is one of the reasons that doing alignment research seems like a particularly difficult pivotal act to aim for. 


Comment by thomas-larsen on [deleted post] 2023-05-05T17:31:02.398Z

(I deleted this comment)

Comment by Thomas Larsen (thomas-larsen) on Discussion about AI Safety funding (FB transcript) · 2023-05-01T22:13:37.667Z · LW · GW

Fwiw I'm pretty confident that if a top professor wanted funding at 50k/year to do AI Safety stuff they would get immediately funded, and that the bottleneck is that people in this reference class aren't applying to do this. 

There are also relevant mentorship/management bottlenecks here, so funding them to do their own research is generally a lot less costly overall than if it also required oversight. 

(written quickly, sorry if unclear)

Comment by Thomas Larsen (thomas-larsen) on Thomas Larsen's Shortform · 2023-04-12T00:01:19.275Z · LW · GW

Thinking about ethics.

After thinking more about orthogonality I've become more confident that one must go about ethics in a mind-dependent way. If I am arguing about what is 'right' with a paperclipper, there's nothing I can say to them to convince them to instead value human preferences or whatever. 

I used to be a staunch moral realist, mainly relying on very strong intuitions against nihilism, and then arguing something like: not nihilism -> moral realism. I now reject the implication, and think both that 1) there is no universal, objective morality, and 2) things matter. 

My current approach is to think of "goodness" in terms of what CEV-Thomas would think of as good. Moral uncertainty, then, is uncertainty over what CEV-Thomas thinks. CEV is necessary to get morality out of a human brain, because the brain is currently a bundle of contradictory heuristics. However, my moral intuitions still give bits about goodness. Other people's moral intuitions also give some bits about goodness, because of how similar their brains are to mine, so I should weight other people's beliefs in my moral uncertainty.  

Ideally, I should trade with other people so that we both maximize a joint utility function, instead of each of us maximizing our own utility function. In the extreme, this looks like ECL. For most people, I'm not sure that this reasoning is necessary, however, because their intuitions might already be priced into my uncertainty over my CEV. 

Comment by Thomas Larsen (thomas-larsen) on Embedded Agency (full-text version) · 2023-04-10T19:21:54.376Z · LW · GW

In real-world computers we have finite memory, so my reading of this was assuming a finite state space. The fractal stuff requires infinite sets, where two notions of 'smaller' ('is a subset of' and 'has fewer elements') disagree -- the mini-fractal is a subset of the whole fractal, but it has the same number of elements and hence corresponds perfectly. 
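The disagreement between the two notions of 'smaller' can be made concrete with the classic bijection between the naturals and the evens (only a finite prefix is shown, in keeping with finite memory):

```python
# The evens are a proper subset of the naturals, yet n -> 2n matches them up
# one-to-one, so "is a subset of" and "has fewer elements" disagree for
# infinite sets. A finite prefix of the correspondence:
naturals = range(10)
evens = [2 * n for n in naturals]   # image of the bijection n -> 2n

pairs = list(zip(naturals, evens))
print(pairs[:4])  # [(0, 0), (1, 2), (2, 4), (3, 6)]

# Distinct n give distinct 2n, so the correspondence is exact on this prefix:
assert len(set(evens)) == len(evens)
```

The mini-fractal corresponding perfectly to the whole fractal is the same phenomenon: a proper subset in bijection with the full set.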

Comment by Thomas Larsen (thomas-larsen) on Challenge: construct a Gradient Hacker · 2023-03-20T21:10:53.354Z · LW · GW

Following up to clarify this: the point is that this attempt fails 2a because if you perturb the weights along the connection , there is now a connection from the internal representation of  to the output, and so training will send this thing to the function 

Comment by Thomas Larsen (thomas-larsen) on Contra shard theory, in the context of the diamond maximizer problem · 2023-03-19T23:55:52.164Z · LW · GW

(My take on the reflective stability part of this) 

The reflective equilibrium of a shard theoretic agent isn’t a utility function weighted according to each of the shards, it’s a utility function that mostly cares about some extrapolation of the (one or very few) shard(s) that were most tied to the reflective cognition.

It feels like a ‘let’s do science’ or ‘powerseek’ shard would be a lot more privileged, because these shards will be tied to the internal planning structure that ends up doing reflection for the first time.

There’s a huge difference between “Whenever I see ice cream, I have the urge to eat it”, and “Eating ice cream is a fundamentally morally valuable atomic action”. The former roughly describes one of the shards that I have, and the latter is something that I don’t expect to see in my CEV. Similarly, I imagine that a bunch of the safety properties will look more like these urges because the shards will be relatively weak things that are bolted on to the main part of the cognition, not things that bid on the intelligent planning part. The non-reflectively endorsed shards will be seen as arbitrary code that is attached to the mind that the reflectively endorsed shards have to plan around (similar to how I see my “Whenever I see ice cream, I have the urge to eat it” shard).

In other words: there is convergent pressure for CEV-content integrity, but that does not mean that the current way of making decisions (e.g. shards) is close to the CEV optimum, and the shards will choose to self modify to become closer to their CEV.

I don't feel epistemically helpless here either, and would love a theory of which shards get preserved under reflection. 

Comment by Thomas Larsen (thomas-larsen) on Thomas Larsen's Shortform · 2023-03-15T18:07:03.788Z · LW · GW

Some rough takes on the Carlsmith Report. 

Carlsmith decomposes AI x-risk into 6 steps, each conditional on the previous ones:

  1. Timelines: By 2070, it will be possible and financially feasible to build APS-AI: systems with advanced capabilities (outperform humans at tasks important for gaining power), agentic planning (make plans then acts on them), and strategic awareness (its plans are based on models of the world good enough to overpower humans).
  2. Incentives: There will be strong incentives to build and deploy APS-AI.
  3. Alignment difficulty: It will be much harder to build APS-AI systems that don’t seek power in unintended ways, than ones that would seek power but are superficially attractive to deploy.
  4. High-impact failures: Some deployed APS-AI systems will seek power in unintended and high-impact ways, collectively causing >$1 trillion in damage.
  5. Disempowerment: Some of the power-seeking will in aggregate permanently disempower all of humanity.
  6. Catastrophe: The disempowerment will constitute an existential catastrophe.

These steps define a tree over possibilities. But the associated outcome buckets don't feel that reality-carving to me. A recurring crux is that good outcomes are also highly conjunctive, i.e., one of these 6 conditions failing does not give a good AI outcome. Going through piece by piece: 

  1. Timelines makes sense and seems like a good criteria; everything else is downstream of timelines. 
  2. Incentives seems weird. What does the world in which there are no incentives to deploy APS-AI look like? There are a bunch of incentives that clearly do push people towards this already: status, desire for scientific discovery, power, money. Moreover, this doesn't seem necessary for AI x-risk -- even if we somehow removed the gigantic incentives to build APS-AI that we know exist, people might still deploy APS-AI because they personally wanted to, even though there weren't social incentives to do so. 
  3. Alignment difficulty is another non-necessary condition. Some ways of getting x-risk without alignment being very hard: 
    1. For one, alignment difficulty is clearly a spectrum, and even if it's on the really low end, perhaps you still need a small amount of extra compute overhead to robustly align your system. One of the RAAP stories might then occur: even though technical alignment is pretty easy, the companies that spend that extra compute robustly aligning their AIs gradually lose out to other companies in the competitive marketplace. 
    2. Maybe alignment is easy, but someone misuses AI, say to create an AI assisted dictatorship 
    3. Maybe we try really hard and we can align AI to whatever we want, but we make a bad choice and lock-in current day values, or we make a bad choice about reflection procedure that gives us much less than the ideal value of the universe. 
  4. High-impact failures contains much of the structure, at least in my eyes. The main ways that we avoid alignment failure are worlds where something happens to take us off of the default trajectory: 
    1. Perhaps we make a robust coordination agreement between labs/countries that causes people to avoid deploying until they've solved alignment 
    2. Perhaps we solve alignment, and harden the world in some way, e.g. by removing compute access, dramatically improving cybersec, monitoring and shutting down dangerous training runs. 
    3. In general, thinking about the likelihood of any of these interventions that work, feels very important. 
  5. Disempowerment. This and (4) are very entangled with upstream factors like takeoff shape. It also feels extremely difficult for humanity not to be disempowered. 
  6. Catastrophe. To avoid this, again, I need to imagine extra structure upstream, e.g. (4) was satisfied by a warning shot, and then people coordinated and deployed a benign sovereign that disempowered humanity for good reasons. 

My current preferred way to think about likelihood of AI risk routes through something like this framework, but is more structured and has a tree with more conjuncts towards success as well as doom. 

Comment by Thomas Larsen (thomas-larsen) on Shutting Down the Lightcone Offices · 2023-03-15T05:19:26.378Z · LW · GW

I think a really substantial fraction of people who are doing "AI Alignment research" are instead acting with the primary aim of "make AI Alignment seem legit". These are not the same goal, a lot of good people can tell and this makes them feel kind of deceived, and also this creates very messy dynamics within the field where people have strong opinions about what the secondary effects of research are, because that's the primary thing they are interested in, instead of asking whether the research points towards useful true things for actually aligning the AI.

This doesn't feel right to me; off the top of my head, it seems like most of the field is just trying to make progress. Most of those who aren't are pretty explicit about not trying to solve alignment, and I'm excited about most of their projects. I'd guess 10-20% of the field is in the "make alignment seem legit" camp. My rough categorization:

Make alignment progress: 

  • Anthropic Interp  
  • Redwood 
  • ARC Theory
  • Conjecture 
  • MIRI 
  • Most independent researchers that I can think of (e.g. John, Vanessa, Steven Byrnes, the MATS people I know)  
  • Some of the safety teams at OpenAI/DM 
  • Aligned AI 
  • Team Shard

make alignment seem legit: 

  • CAIS
  • Anthropic scaring laws
  • ARC Evals (arguably, but it seems like this isn't quite the main aim) 
  • Some of the safety teams at OpenAI/DM  
  • Open Phil (I think I'd consider Cold Takes to be doing this, but it doesn't exactly brand itself as alignment research) 

What am I missing? I would be curious which projects you feel this way about.  

Comment by Thomas Larsen (thomas-larsen) on Shutting Down the Lightcone Offices · 2023-03-15T01:33:20.404Z · LW · GW

I personally benefitted tremendously from the Lightcone offices, especially when I was there over the summer during SERI MATS. Being able to talk to lots of alignment researchers and other aspiring alignment researchers increased my subjective rate of alignment upskilling by >3x relative to before, when I was in an environment without other alignment people.

Thanks so much to the Lightcone team for making the office happen. I’m sad (emotionally, not making a claim here whether it was the right decision or not) to see it go, but really grateful that it existed.

Comment by Thomas Larsen (thomas-larsen) on Thomas Larsen's Shortform · 2023-03-13T22:21:24.311Z · LW · GW
  1.  Because you have a bunch of shards, and you need all of them to balance each other out to maintain the 'appears nice' property. Even if I can't predict which ones will be self modified out, some of them will, and this could disrupt the balance. 
  2. I expect the shards that are more [consequentialist, power-seeking, care about preserving themselves] to become more dominant over time. These are probably the relatively less nice shards. 

    These are both handwavy enough that I don't put much credence in them. 
Comment by Thomas Larsen (thomas-larsen) on Thomas Larsen's Shortform · 2023-03-13T21:31:16.139Z · LW · GW

Yeah good point, edited 

Comment by Thomas Larsen (thomas-larsen) on Thomas Larsen's Shortform · 2023-03-13T20:21:37.815Z · LW · GW

For any 2 of {reflectively stable, general, embedded}, I can satisfy those properties.  

  • {reflectively stable, general} -> do something that just rolls out entire trajectories of the world given different actions that it takes, and then has some utility function/preference ordering over trajectories, and selects actions that lead to the highest expected utility trajectory. 
  • {general, embedded} -> use ML/local search with enough compute to rehash evolution and get smart agents out. 
  • {reflectively stable, embedded} -> a sponge or a current day ML system. 
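The {reflectively stable, general} construction above can be sketched concretely: brute-force rollouts of every action sequence through a world model, scored by a fixed utility function over whole trajectories. The environment, actions, and utility here are toy placeholders of my own choosing:

```python
import itertools

# Toy sketch of the cartesian expected-utility planner: roll out entire
# trajectories for each action sequence, then select the sequence whose
# trajectory maximizes a fixed utility function over trajectories.

def transition(state, action):
    # toy deterministic world model
    return state + action

def utility(trajectory):
    # fixed preference ordering over whole trajectories
    return sum(trajectory)

def plan(initial_state, actions=(-1, 0, 1), horizon=3):
    best_seq, best_u = None, float("-inf")
    for seq in itertools.product(actions, repeat=horizon):
        traj, s = [initial_state], initial_state
        for a in seq:
            s = transition(s, a)
            traj.append(s)
        u = utility(traj)
        if u > best_u:
            best_seq, best_u = seq, u
    return best_seq, best_u

print(plan(0))  # the all-+1 action sequence maximizes summed state
```

It is reflectively stable because the utility function is fixed data, and general because it optimizes over arbitrary rollouts; it fails embeddedness because the world model is assumed to sit outside the world it models.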
Comment by Thomas Larsen (thomas-larsen) on Thomas Larsen's Shortform · 2023-03-13T19:34:21.416Z · LW · GW

Some thoughts on inner alignment. 

1. The type of object of a mesa objective and a base objective are different (in real life) 
In a cartesian setting (e.g. training a chess bot), the outer objective is a function R : T → ℝ, where S is the state space and T are the trajectories over S. When you train this agent, it's possible for it to learn some internal search and mesaobjective M : T → ℝ, since the model is big enough to express some utility function over trajectories. For example, it might learn a classifier that evaluates winningness of the board, and then gives a higher utility to the winning boards. 

In an embedded setting, the outer objective cannot see an entire world trajectory like it could in the cartesian setting. Your loss can see the entire trajectory of a chess game, but your loss can't see an entire atomic-level representation of the universe at every point in the future. If we're trying to get an AI to care about future consequences over trajectories, M will have to have type T → ℝ, though it won't actually represent a function of this type because it can't; it will instead represent its values some other way (I don't really know how it would do this -- but (2) talks about the shape in ML). Our outer objective will have a much shallower type, R : O → ℝ, where O are some observable latents. This means that trying to set M equal to R doesn't even make sense, as they have different type signatures. To salvage this, one could assume that M factors into R ∘ f, where f : T → O is a model of the world and R is an objective, but it's impossible to actually compute M this way. 
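The type mismatch can be written out in code. The type names below are illustrative placeholders, not from any real codebase:

```python
from typing import Callable, List

# Type-signature sketch of the cartesian vs. embedded mismatch.
State = object          # full world state (atomic-level, in the limit)
Trajectory = List[State]
Latents = object        # shallow observables the loss can actually see

# Cartesian setting: both objectives score whole trajectories.
CartesianOuter = Callable[[Trajectory], float]
MesaObjective = Callable[[Trajectory], float]

# Embedded setting: the outer objective only sees observable latents,
# so "set the mesa objective equal to the outer objective" is a type error.
EmbeddedOuter = Callable[[Latents], float]

# The salvage attempt: factor the mesa objective through a world model
# f : Trajectory -> Latents, giving M = R ∘ f.
WorldModel = Callable[[Trajectory], Latents]

def compose(outer: EmbeddedOuter, f: WorldModel) -> MesaObjective:
    return lambda traj: outer(f(traj))
```

The factored form type-checks, but as noted above, actually evaluating it requires computing the world model over entire future trajectories, which an embedded agent can't do.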

2. In ML models, there is no mesa objective, only behavioral patterns. More generally, AIs can't naively store explicit mesaobjectives; they need to compress them in some way / represent them differently. 

My values are such that I do care about the entire trajectory of the world, yet I don't store a utility function with that type signature in my head. Instead of learning a goal over trajectories, ML models will have behavioral patterns that lead to states that performed well according to the outer objective on the training data. 

I have a behavioral pattern that says something like 'sugary thing in front of me -> pick up the sugary thing and eat it'. However, this doesn't mean that I reflectively endorse this behavioral pattern. If I was designing myself again from scratch or modifying my self, I would try to remove this behavioral pattern. 

This is the main-to-me reason why I don't think that the shard theory story of reflective stability holds up.[1]  A bunch of the behavioral patterns that caused the AI to look nice during training will not get handed down into successor agents / self modified AIs. 

Even in theory, I don't yet know how to make reflectively stable, general, embedded cognition (mainly because of this barrier). 

  1. ^

    From what I understand, the shard theory story of reflective stability is something like: The shards that steer the values have an incentive to prevent themselves from getting removed. If you have a shard that wants to get lots of paperclips, the action that removes this shard from the mind would result in less paperclips being gotten. 
    Another way of saying this is that goal-content integrity is convergently instrumental, so reflective stability will happen by default. 

Comment by Thomas Larsen (thomas-larsen) on Thomas Larsen's Shortform · 2023-03-13T06:24:08.553Z · LW · GW

Current impressions of free energy in the alignment space. 

  1. Outreach to capabilities researchers. I think that getting people who are actually building the AGI to be more cautious about alignment / racing makes a bunch of things like coordination agreements possible, and also increases the operational adequacy of the capabilities lab. 
    1. One of the reasons people don't like this is because historically outreach hasn't gone well, but I think the reason for this is that mainstream ML people mostly don't buy "AGI big deal", whereas lab capabilities researchers buy "AGI big deal" but not "alignment hard". 
    2. I think people at labs running retreats, 1-1s, alignment presentations within labs are all great to do this. 
    3. I'm somewhat unsure about this one because of downside risk and also 'convince people of X' is fairly uncooperative and bad for everyone's epistemics. 
  2. Conceptual alignment research addressing the hard part of the problem. This is hard and not easy to transition to without a bunch of upskilling, but if the SLT hypothesis is right, there are a bunch of key problems that mostly go unassailed, so there's a bunch of low-hanging fruit there. 
  3. Strategy research on the other low hanging fruit in the AI safety space. Ideally, the product of this research would be a public quantitative model about what interventions are effective and why. The path to impact here is finding low hanging fruit and pointing them out so that people can do them. 
Comment by Thomas Larsen (thomas-larsen) on Thomas Larsen's Shortform · 2023-03-10T20:21:01.607Z · LW · GW

Thinking a bit about takeoff speeds.

As I see it, there are ~3 main clusters: 

  1. Fast/discontinuous takeoff. Once AIs are doing the bulk of AI research, foom happens quickly; before then, they aren't really doing anything that meaningful. 
  2. Slow/continuous takeoff. Once AIs are doing the bulk of AI research, foom happens quickly; before then, they do alter the economy significantly. 
  3. Perennial slowness. Once AIs are doing the bulk of AI research, there is still no foom, maybe because of compute bottlenecks, and so there are roughly constant rates of improvement that do alter things. 

It feels to me like multipolar scenarios mostly come from 3, because in either 1 or 2, the pre-foom state is really unstable, and eventually some AI will foom and become unipolar. In a continuous takeoff world, I expect small differences in research ability to compound over time. In a discontinuous takeoff, the first model to make the jump is the thing that matters.

3 also feels pretty unlikely to me, given that I expect running AIs to be cheap relative to training, so you get the ability to copy and scale intelligent labor dramatically, and I expect the AIs to have different skillsets than humans, and so be able to find low hanging fruit that humans missed. 

Comment by Thomas Larsen (thomas-larsen) on Challenge: construct a Gradient Hacker · 2023-03-10T01:51:14.539Z · LW · GW

At least under most datasets, it seems to me like the zero NN fails condition 2a, as perturbing the weights will not cause it to go back to zero. 

Comment by Thomas Larsen (thomas-larsen) on Challenge: construct a Gradient Hacker · 2023-03-10T00:51:57.018Z · LW · GW

This is a plausible internal computation that the network could be doing, but the problem is that gradients flow back from the output through the internal computation of the true value y, and so GD will use that to set the output to the appropriate true value. 

Comment by Thomas Larsen (thomas-larsen) on Challenge: construct a Gradient Hacker · 2023-03-10T00:38:06.239Z · LW · GW

This feels like cheating to me, but I guess I wasn't super precise with 'feedforward neural network'. I meant 'fully connected neural network', so the gradient computation has to be connected by parameters to the outputs. Specifically, I require that you can write the network as 

f(x) = W_n σ(W_{n−1} ⋯ σ(W_1 x)), where the weight matrices W_1, …, W_n are some nice function g of the parameters θ (where we need a weight sharing function to make the dimensions work out; the weight sharing function takes in θ and produces the W_i matrices that are actually used in the forward pass).

I guess I should be more precise about what 'nice' means, to rule out weight sharing functions that always zero out the input, but it turns out this is kind of tricky. Let's require the weight sharing function g to be differentiable and to have an image satisfying π(im(g)) = im(π) for any coordinate projection π. (A weaker condition is that the weight sharing function can only duplicate parameters.) 
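The weight-sharing setup can be sketched as follows. The dimensions and the particular reshaping are arbitrary toy choices of mine, not part of the challenge statement:

```python
import numpy as np

# Sketch of "fully connected network with weight sharing": a function g
# maps raw parameters theta to the weight matrices actually used in the
# forward pass, so every parameter is connected to the output.

rng = np.random.default_rng(0)
theta = rng.normal(size=12)  # raw trainable parameters

def g(theta):
    # weight-sharing function: here just a reshape of theta into the
    # matrices used in the forward pass (a duplication-only g would
    # instead reuse entries of theta across matrices)
    W1 = theta[:8].reshape(4, 2)
    W2 = theta[8:].reshape(1, 4)
    return W1, W2

def forward(theta, x):
    W1, W2 = g(theta)
    h = np.maximum(W1 @ x, 0.0)  # ReLU
    return W2 @ h

x = np.array([1.0, -1.0])
out = forward(theta, x)
```

A g that always zeroed out some coordinates would violate the surjectivity-onto-projections condition above, which is exactly the loophole it is meant to close.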

Comment by Thomas Larsen (thomas-larsen) on Are short timelines actually bad? · 2023-02-08T01:02:17.240Z · LW · GW

Slower takeoff -> warning shots -> improved governance (e.g. through most/all major actors getting clear[er] evidence of risks) -> less pressure to rush

Agree that this is an effect. The reason it wasn't immediately as salient is because I don't expect the governance upside to outweigh the downside of more time for competition. I'm not confident of this and I'm not going to write down reasons right now. 

(As OP argued) Shorter timelines -> China has less of a chance to have leading AI companies -> less pressure to rush

Agree, though I think on the current margin US companies have several years of lead time on China, which is much more than they have on each other. So on the current margin, I'm more worried about companies racing each other than nations. 

More broadly though, maybe we should be using more fine-grained concepts than "shorter timelines" and "slower takeoffs":

  • The salient effects of "shorter timelines" seem pretty dependent on what the baseline is.
    • The point about China seems very important if the baseline is 30 years, and not so much if the baseline is 10 years.
  • The salient effects of "slowing takeoff" seem pretty dependent on what part of the curve is being slowed. Slowing it down right before there's large risk seems much more valuable than (just) slowing it down earlier in the curve, as the last few year's investments in LLMs did.


Comment by Thomas Larsen (thomas-larsen) on Are short timelines actually bad? · 2023-02-06T00:18:58.236Z · LW · GW

My take on the salient effects: 

Shorter timelines -> increased accident risk from not having solved technical problem yet, decreased misuse risk, slower takeoffs

Slower takeoffs -> decreased accident risk because of iteration to solve technical problem, increased race / economic pressure to deploy unsafe model

Given that most of my risk profile is dominated by a) not having solved technical problem yet, and b) race / economic pressure to deploy unsafe models, I'm tentatively in the long timelines + fast takeoff quadrant as being the safest. 

Comment by Thomas Larsen (thomas-larsen) on Disentangling Shard Theory into Atomic Claims · 2023-01-13T19:29:15.992Z · LW · GW

This is exemplified by John Wentworth's viewpoint that successfully Retargeting the Search is a version of solving the outer alignment problem.

Could you explain what you mean by this? IMO successfully retargeting the search solves inner alignment but it leaves unspecified the optimization target. Deciding what to target the search at seems outer alignment-shaped to me. 

Also, nice post! I found it clear. 

Comment by Thomas Larsen (thomas-larsen) on Thomas Larsen's Shortform · 2023-01-11T20:02:10.582Z · LW · GW

There are several game theoretic considerations leading to races to the bottom on safety. 

  1. Investing resources into making sure that AI is safe takes away resources to make it more capable and hence more profitable. Aligning AGI probably takes significant resources, and so a competitive actor won't be able to align their AGI. 
  2. Many of the actors in the AI safety space are very scared of scaling up models, and end up working on AI research that is not at the cutting edge of AI capabilities. This should mean that the actors at the cutting edge tend to be the actors who are most optimistic about alignment going well, and indeed, this is what we see. 
  3. Because of foom, there is a winner takes all effect: the first person to deploy AGI that fooms gets almost all of the wealth and control from this (conditional on it being aligned). Even if most actors are well intentioned, they feel like they have to continue on towards AGI before a misaligned actor arrives at AGI. A common (valid) rebuttal from the actors at the current edge to people who ask them to slow down is 'if we slow down, then China gets to AGI first'. 
  4. There's the unilateralist's curse: there only needs to be one actor pushing on and making more advanced, dangerously capable models to cause an x-risk. Coordination among many actors to prevent this is really hard, especially given the massive profits from creating a better AGI. 
  5. Due to increasing AI hype, there will be more and more actors entering the space, making coordination harder, and making the effect of a single actor dropping out become smaller. 
Comment by Thomas Larsen (thomas-larsen) on Best introductory overviews of AGI safety? · 2022-12-13T21:13:13.092Z · LW · GW

My favorite for AI researchers is Ajeya's Without specific countermeasures, because I think it does a really good job being concrete about a training set up leading to deceptive alignment. It also is sufficiently non-technical that a motivated person not familiar with AI could understand the key points. 

Comment by Thomas Larsen (thomas-larsen) on Finite Factored Sets in Pictures · 2022-12-05T01:49:01.766Z · LW · GW

It means 'is a subset of, but not equal to'. 

Comment by Thomas Larsen (thomas-larsen) on Provably Honest - A First Step · 2022-12-03T17:31:12.175Z · LW · GW

This seems interesting and connected to the idea of using a speed prior to combat deceptive alignment

This is a model-independent way of proving if an AI system is honest.

I don't see how this is a proof, it seems more like a heuristic. Perhaps you could spell out this argument more clearly? 

Also, it is not clear to me how to use a timing attack in the context of a neural network, because in a standard feedforward network, all parameter settings will use the same amount of computation in a forward pass and hence run in the same amount of time. Do you have a specific architecture in mind, or are you just reasoning about arbitrary AGI systems? I think in the linked article above there are a couple ideas of how to vary the amount of time neural networks take :). 

Comment by Thomas Larsen (thomas-larsen) on Thomas Larsen's Shortform · 2022-11-08T23:37:54.326Z · LW · GW

I'm excited for ideas for concrete training setups that would induce type-2 deception in an RLHF model, especially in the context of an LLM -- I'm excited about people posting any ideas here. :) 

Comment by Thomas Larsen (thomas-larsen) on Thomas Larsen's Shortform · 2022-11-08T23:34:10.441Z · LW · GW

Deception is a particularly worrying alignment failure mode because it makes it difficult for us to realize that we have made a mistake: at training time, a deceptive misaligned model and an aligned model exhibit the same behavior. 

There are two ways for deception to appear: 

  1. An action chosen instrumentally due to non-myopic future goals that are better achieved by deceiving humans now so that it has more power to achieve its goals in the future. 
  2. Because deception was directly selected for as an action. 

Another way of describing the difference is that 1 follows from an inner alignment failure: a mesaoptimizer learned an unintended mesaobjective that performs well on training, while 2 follows from an outer alignment failure — an imperfect reward signal. 

Classic discussion of deception focuses on 1 (example 1, example 2), but I think that 2 is very important as well, particularly because the most common currently used alignment strategy is RLHF, which actively selects for deception.

Once the AI has the ability to create strategies that involve deceiving the human, even without explicitly modeling the human, those strategies will win out and end up eliciting a lot of reward. This is related to the informed oversight problem: it is really hard to give feedback to a model that is smarter than you. I view this as a key problem with RLHF. To my knowledge very little work has been done exploring this and finding more empirical examples of RLHF models learning to deceive the humans giving it feedback, which is surprising to me because it seems like it should be possible. 

Comment by Thomas Larsen (thomas-larsen) on So, geez there's a lot of AI content these days · 2022-10-09T16:32:48.688Z · LW · GW

One major reason why there is so much AI content on LessWrong is that very few people are allowed to post on the Alignment Forum.

Everything on the alignment forum gets crossposted to LW, so letting more people post on AF wouldn't decrease the amount of AI content on LW. 

Comment by Thomas Larsen (thomas-larsen) on (My understanding of) What Everyone in Technical Alignment is Doing and Why · 2022-10-09T15:55:57.649Z · LW · GW

Sorry for the late response, and thanks for your comment, I've edited the post to reflect these. 

Comment by Thomas Larsen (thomas-larsen) on 7 traps that (we think) new alignment researchers often fall into · 2022-10-08T22:58:16.855Z · LW · GW

I have the intuition (maybe from applause lights) that if negating a point sounds obviously implausible, then the point is obviously true and it is therefore somewhat meaningless to claim it. 

My idea in writing this was to identify some traps that I thought were non obvious (some of which I think I fell into as new alignment researcher). 

Comment by Thomas Larsen (thomas-larsen) on Warning Shots Probably Wouldn't Change The Picture Much · 2022-10-06T14:04:39.362Z · LW · GW

Disclaimer: writing quickly. 

Consider the following path: 

A. There is an AI warning shot. 

B. Civilization allocates more resources for alignment and is more conservative pushing capabilities.  

C. This reallocation is sufficient to solve and deploy aligned AGI before the world is destroyed. 

I think that a warning shot is unlikely (P(A) < 10%), but won't get into that here. 

I am guessing that P(B | A) is the biggest crux. The OP primarily considers the ability of governments to implement policy that moves our civilization further from AGI ruin, but I think that the ML community is both more important and probably significantly easier to shift than government.  I basically agree with this post as it pertains to government updates based on warning shots. 

I anticipate that a warning shot would get most capabilities researchers to a) independently think about alignment failures and think about the alignment failures that their models will cause, and b) take the EA/LessWrong/MIRI/Alignment sphere's worries a lot more seriously. My impression is that OpenAI seems to be much more worried about misuse risk than accident risk: if alignment is easy, then the composition of the lightcone is primarily determined by the values of the AGI designers. Right now, there are ~100 capabilities researchers vs ~30 alignment researchers at OpenAI. I think a warning shot would dramatically update them towards worry about accident risk, and therefore I anticipate that OpenAI would drastically shift most of their resources to alignment research. I would guess P(B|A) ~= 80%. 

P(C | A, B) primarily depends on alignment difficulty, of which I am pretty uncertain, and also on how large the reallocation in B is, which I anticipate to be pretty large. The bar for destroying the world gets lower and lower every year, but this would give us a lot more time; I think we get several years of AGI capability before we deploy it. I'm estimating P(C | A, B) ~= 70%, but this is very low resilience. 
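Multiplying out the rough estimates above (taking P(A) = 0.10 as an upper bound on "unlikely, < 10%"):

```python
# P(A) < 10%, P(B|A) ~ 80%, P(C|A,B) ~ 70%, per the estimates above.
p_a, p_b_given_a, p_c_given_ab = 0.10, 0.80, 0.70
p_path = p_a * p_b_given_a * p_c_given_ab
print(f"P(warning shot path succeeds) <= {p_path:.3f}")  # <= 0.056
```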

Comment by Thomas Larsen (thomas-larsen) on Neural Tangent Kernel Distillation · 2022-10-05T19:22:59.877Z · LW · GW

Hmm, the eigenfunctions just depend on the input training data distribution (which we call ), and in this experiment, they are distributed evenly on the interval . Given that the labels are independent of this, you'll get the same NTK eigendecomposition regardless of the target function. 

I'll probably spin up some quick experiments in a multiple dimensional input space to see if it looks different, but I would be quite surprised if the eigenfunctions stopped being sinusoidal. Another thing to vary could be the distribution of input points. 
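The label-independence claim can be seen directly in code. An RBF kernel over evenly spaced 1-D inputs is used here as a stand-in for the actual NTK, and the lengthscale and grid are my own arbitrary choices:

```python
import numpy as np

# The eigendecomposition of a kernel Gram matrix is a function of the
# inputs alone; labels for any target function never enter it.

def gram(xs, lengthscale=0.3):
    # RBF kernel as a stand-in for the NTK
    d = xs[:, None] - xs[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

xs = np.linspace(-1.0, 1.0, 50)      # evenly spaced inputs
K = gram(xs)
eigvals, eigvecs = np.linalg.eigh(K)  # ascending eigenvalues

# Labels y = f(xs) for any target f would be computed from xs
# separately; they appear nowhere in the decomposition above.
```

Varying the input distribution, as suggested above, would change `K` and hence the eigenfunctions, whereas varying the target function cannot.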

Comment by Thomas Larsen (thomas-larsen) on What is the best critique of AI existential risk arguments? · 2022-09-07T23:43:48.340Z · LW · GW

An anonymous academic wrote a review of Joe Carlsmith's 'Is power-seeking AI an existential risk?', in which the reviewer assigns <1/100,000 probability to AI existential risk. The arguments given aren't very good imo, but maybe worth reading.