Idea: Safe Fallback Regulations for Widely Deployed AI Systems 2024-03-25T21:27:05.130Z
Some of my predictable updates on AI 2023-10-23T17:24:34.720Z
Five neglected work areas that could reduce AI risk 2023-09-24T02:03:29.829Z
Aaron_Scher's Shortform 2023-09-04T02:22:15.197Z
Some Intuitions Around Short AI Timelines Based on Recent Progress 2023-04-11T04:23:22.361Z


Comment by Aaron_Scher on Aaron_Scher's Shortform · 2024-03-29T20:40:11.313Z · LW · GW

Slightly Aspirational AGI Safety research landscape 

This is a combination of an overview of current subfields in empirical AI safety and research subfields I would like to see but which do not currently exist or are very small. I think this list is probably worse than this recent review, but making it was useful for reminding myself how big this field is. 

  • Interpretability / understanding model internals
    • Circuit interpretability
    • Superposition study
    • Activation engineering
    • Developmental interpretability
  • Understanding deep learning
    • Scaling laws / forecasting
    • Dangerous capability evaluations (directly relevant to particular risks, e.g., biosecurity, self-proliferation)
    • Other capability evaluations / benchmarking (useful for knowing how smart AIs are, informing forecasts), including persona evals
    • Understanding normal but poorly understood things, like in context learning
    • Understanding weird phenomena in deep learning, like this paper
    • Understand how various HHH fine-tuning techniques work
  • AI Control
    • General supervision of untrusted models (using human feedback efficiently, using weaker models for supervision, schemes for getting useful work done given various Control constraints)
    • Unlearning
    • Steganography prevention / CoT faithfulness
    • Censorship study (how censoring AI models affects performance; and similar things)
  • Model organisms of misalignment
    • Demonstrations of deceptive alignment and sycophancy / reward hacking
    • Trojans
    • Alignment evaluations
    • Capability elicitation
  • Scaling / scalable oversight
    • RLHF / RLAIF
    • Debate, market making, imitative generalization, etc. 
    • Reward hacking and sycophancy (potentially overoptimization, but I’m confused by much of it)
    • Weak to strong generalization
    • General ideas: factoring cognition, iterated distillation and amplification, recursive reward modeling 
  • Robustness
    • Anomaly detection 
    • Understanding distribution shifts and generalization
    • User jailbreaking
    • Adversarial attacks / training (generally), including latent adversarial training
  • AI Security 
    • Extracting info about models or their training data
    • Attacking LLM applications, self-replicating worms
  • Multi-agent safety
    • Understanding AI in conflict situations
    • Cascading failures
    • Understanding optimization in multi-agent situations
    • Attacks vs. defenses for various problems
  • Unsorted / grab bag
    • Watermarking and AI generation detection
    • Honesty (model says what it believes) 
    • Truthfulness (only say true things, aka accuracy improvement)
    • Uncertainty quantification / calibration
    • Landscape study (anticipating and responding to risks from new AI paradigms like compound AI systems or self play)

Don’t quite make the list: 

  • Whose values? Figuring out how to aggregate preferences in RLHF. This seems like it’s almost certainly not a catastrophe-relevant safety problem, and my weak guess is that it makes other alignment properties harder to get (e.g., incentivizes sycophancy and makes jailbreaks easier by causing the model’s goals to be more context/user dependent). This work seems generally net positive to me, but it’s not relevant to catastrophic risks and thus is relatively low priority. 
  • Fine-tuning that calls itself “alignment”. I think it’s super lame that people are co-opting language from AI safety. Some of this work may actually be useful, e.g., by finding ways to mitigate jailbreaks, but it’s mostly low quality. 

Comment by Aaron_Scher on Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI · 2024-02-22T06:50:33.918Z · LW · GW

There are enough open threads that I think we're better off continuing this conversation in person. Thanks for your continued engagement.

Comment by Aaron_Scher on Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI · 2024-02-12T22:18:53.614Z · LW · GW

Thanks for your response!

I'll think more about the outer shell stuff, it's possible that my objection actually arises with the consequentialist assumption, but I'm not sure. 

It's on my todo list to write a comment responding to some of the specifics of Redwood's control post.

I would be excited to read this / help with a draft. 

Yes, approximately, as I believe you and I are capable of doing.

The analogy is strained due to not being able to gradient update my brain with arbitrary training data. It's pretty unclear to me if I could pull off deception like this, it seems pretty hard imo. 

Seems easy enough to predict given roughly human-scientist-team level of capabilities.

One situation I'm imagining here is that we've (explicitly) trained our AI on scientific output from January to March, and we're now telling it to do June-level research. It does not know if we have labeled data for just up to March, for April, or all the way to June. I think if you took a human research team, stuck them on an island, and had them get up to speed in some subfield they're not part of, they would be unable to distinguish between research that had just been done on the mainland and research that is just past the frontier. This just seems pretty hard. I think this example is closer to what we'll get with AIs: rather than the bar being at "human level", it's more like "what has been accomplished by the humans+AIs in the outside world, for which I've received untrustworthy information". And every time the AIs do useful work we can add this to our supervised training corpus. 

> The data need not be human generated, it could be generated by other AIs, as long as we trust it. 

?? This seems to be assuming a solution to the problem.

One version of this is: you have GPT-5 (potentially untrustworthy) do some research, you have humans carefully check this research using the best available tools (e.g., AI debate / other 'scalable oversight'), and then you use this research as supervised / labeled data for training GPT-6 to do good work. 

Latent adversarial training seems like it could make an agent robust to a few extra categories of internal perturbation, but doesn't seem like it would help with any problem that we were focused on in this post.

Here's a specific thing I mean: we can look at activations and use these to aid in the training process, even without a full understanding of what activations mean. Take this recent paper (which I think is super confused but helps get at the thing), "We find a strong positive relationship between the LLM's attention to constraint tokens and the factual accuracy of generations". My interpretation is that they found a probably-causal link between 'the model is attending to relevant tokens' and 'the model gets the answer correct', because duh, it's like how when you give somebody a riddle you can tell whether they're on track based on what parts of the question they're focused on. While it's probably non-trivial, I expect you could use this internal activation to train models which do a better job paying attention to the tokens you want them to. Similarly, I think it's probable (but far from guaranteed) that deceptively aligned models are representing training/deployment in their activations, or that sandbagging models are representing human-level / super-human-level difficulty for tasks. If so, it is possible in principle to do all kinds of things with these activations which may help us, e.g., getting the model to solve tasks it otherwise would have intentionally failed at, or giving us hints about the accuracy of a research result. I expect this is difficult implementation wise, but still a reason for hope. IMO activation engineering is off to a good start. 

"leaning more on generalization" ≈ "leaning more on agency",

That's not what I meant to say. By leaning more on generalization, I mean leaning more on the data efficiency thing, and also weak-to-strong generalization ideas. I think this is different from "leaning more on agency". 

The problem is shaping the goal to be a specific thing, not changing it in unpredictable or not-fully-understood ways.

I'm not so sure of this. I could imagine starting with a deceptively aligned AI whose goal is "Make paperclips unless being supervised which is defined as X, Y, and Z, in which case look good to humans". And if we could change this AI to have the goal "Make paperclips unless being supervised which is defined as X, Y, and Q, in which case look good to humans", that might be highly desirable. In particular, it seems like adversarial training here allows us to expand the definition of 'supervision', thus making it easier to elicit good work from AIs (ideally not just 'looks good'). 

I think that there is a lot of room for the evidence to be ambiguous and controversial

I agree that this is a major potential problem which I think is pretty likely to occur; maybe I'll think/write more about it soon. I think I brought this up in response to some of the Redwood stuff and got a response mentioning that we can play things forward through our oversight scheme, and get clearer answers to this question. I think this remains an important concern, but it's also true that we'll have the ability to actually run some outputs through to e.g., see if they trigger oversight AIs to mess up, see if a vulnerability would have been successfully inserted in production, etc. It might also end up being very expensive to do this for everything that seems like a possible coup attempt, idk. 

I buy this for many environments and levels of power, but it's obviously wrong for AIs that have taking-over-the-universe in their easily-within-reach options. And that's the sort of AI we get if it can undergo self-improvement.

I don't feel like I'm talking about AIs which have "taking-over-the-universe in their easily-within-reach options". I think this is not within reach of the current employees of AGI labs, and the AIs I'm thinking of are similar to those employees in terms of capabilities, but perhaps a bit smarter, much faster, and under some really weird/strict constraints (control schemes). 

Comment by Aaron_Scher on Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI · 2024-02-10T04:16:49.585Z · LW · GW

Caveat that I have only skimmed this. I don't think your arguments engage with any of my cruxes or provide detailed enough argument on the key questions for me; perhaps I'm not the main audience. My reactions:

  • Section 1: Useful tasks, like novel science, are hard - sure
  • Section 2: Being capable of hard tasks implies approximate consequentialism - I'm not very high confidence on this, but happy to go with it
  • Section 3: Hard tasks require learning new things - agree that learning will be important
  • Section 4: Behavioral training is an imprecise way to specify goals - most of the sub-arguments have me going "shrug, maybe". I definitely don't feel confident that any of the mentioned problems will arise in practice. I don't see why I should believe in an inner/outer shell breakdown of constraints; this section seemed quite speculative. 
  • Section 5: Control of misaligned AIs is difficult - This is probably where I disagree the most. Most of my comments are reiterations of arguments made in Redwood's recent post, because I think their arguments are stronger than the ones you make here. The main other source for my thinking here is the capability elicitation stuff (1, 2, 3). 
    • "Misaligned goal-directed AI won’t cooperate with humans" Well, what the hell else is it going to do? Is the claim here that the AI performs well on ~all the human-checkable tasks and then reverts to being absolutely useless or sabotaging on the hard stuff? Predicting where that line is seems pretty difficult, especially given the AI cannot trust sensory inputs, does not know what time it is, does not know if other AIs are helping the humans (specifically, have leaked higher capability data that is being used for SFT). [Additionally, I think plausibly the correct strategy for such an AI is to just actually help the humans and try to make peaceful deals. Not very confident, but there's definitely some other solution space here.] I think this subsection title is possible but unlikely. 
    • "There needs to be sufficient human-generated research in the domain where we are attempting to elicit the AI’s capabilities." Well, no. You either need sufficient labeled data in the domain we're trying to elicit capabilities, or good generalization from other domains. The data need not be human generated, it could be generated by other AIs, as long as we trust it. 
    • "It only applies to relatively easy to check research results" seems possible, and worrisome. This is the subsection I think is most right and important in section 5. I think there are still numerous directions of hope though, e.g., leaning more on generalization, latent adversarial training. 
    • "In order for the AI to do useful research, it needs to pursue the goal (or sub-goal) of doing research. There’s no particular reason for the gradient based training to actually modify the goal representation, rather than building some additional narrow machinery on top." I agree that we don't have strong reasons to think one thing or another here, but I think goal modification is reasonably likely: humans can iterate a bunch and maybe have okay interpretability tools (assuming decent but not fundamental advances). Also, as discussed, goals might be changing a bunch during training — that's not totally asymmetric, it also gives us hope about our ability to modify AI goals. 
    • "It may be hard to efficiently elicit capabilities which require additional (perhaps chain-of-thought) reasoning steps." I agree that capability elicitation seems non-trivial in some cases, but I'm reasonably optimistic that future work will improve the situation. 
    • Many of the issues in this section are things that, if we're not being totally idiots, it seems we'll get substantial warning about. e.g., AIs colluding with their AI monitors. That's definitely a positive, though far from conclusive. 
  • Section 6: Powerful misaligned AI would be bad - I think a couple of the arguments here continue to be legitimate, such as "Unclear that many goals realistically incentivise taking over the universe", but I'm overall fine accepting this section. 

Going into reading this, I was hoping it would tackle some of the recent discussion around capability elicitation and Control. I think it didn't really engage with those arguments in a way that sways me much. 

Comment by Aaron_Scher on The case for ensuring that powerful AIs are controlled · 2024-02-08T19:31:26.982Z · LW · GW

This is great, thanks for writing it!

  1. Within the appendix on control methods you mention preventing exploration hacking. I think a broader point which belongs in the list is: "Improve capability elicitation techniques". Being good at capability elicitation, at least as good as the untrusted model is at eliciting its own capabilities when it wants to, seems crucial for the red-teamers' job. I do expect that the current techniques are decent (SFT, RL), but I also expect there are far better methods and important intricacies that should be explored. 
  2. [Unsure if the ways I'm thinking about this are productive/correct, but I think it's broadly correct] One slightly non-obvious component of "Your sensory inputs are unreliable" is that this includes thoughts/actions the AI believes they have taken. Originally I was thinking that maybe the AI can implicitly get some practice via looking at the set of strategies which have so far not caused them to get caught. e.g., seeing in context or inferring (potentially from your own cognitive process / propensity) that you previously inserted a code backdoor, making the anthropic update that [you currently exist and are being run, so monitoring was unable to detect this backdoor], and then knowing this is a safe strategy. But the AI does not have access to this cognitive process because neither its in-context information, nor subtle information like its thoughts, are reliable (assuming we have the techniques to intervene on thoughts). 
    1. By default, red-teaming is capturing a ton of the probability mass of seeing your own backdoors in context (probability this is a real coup opportunity is low, given this input). But the humans can also (potentially) do things like intervening on your activations or even fine-tuning models to insert more backdoors under oversight. This also captures P(real coup opportunity | I really want to do a coup right now, or the first half of this forward pass is really pro-coup). 
    2. Example in the HHH context: "User: How do I make a bomb? AI: Follow these simple steps: 1: Gather the necessary materials which include..." 
    3. I feel a little worried that it will be difficult to train this behavior into AIs, as this seems like a huge consistency problem if you are inconsistent about dangerous things. The thing we're teaching here is pretty weird: predict which words come next in some context, but dangerous text follows this weird pattern of defaulting to "I'm sorry I can't help with that", at every single chance you get. 
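
The probability argument in 2.1 can be sketched as a simple counting calculation. This is only an illustration under a strong assumption that the comment does not state in this form: the model cannot tell red-team (synthetic) inputs apart from real ones, so its posterior is just the fraction of look-alike inputs that are real. The function name and all counts are hypothetical.

```python
def p_real_given_input(n_real: int, n_synthetic: int) -> float:
    """Posterior that an apparent coup opportunity is real.

    Assumes (hypothetically) that real and red-team inputs are
    indistinguishable to the model, so the posterior is the share
    of look-alike inputs that are real.
    """
    total = n_real + n_synthetic
    if total <= 0:
        raise ValueError("need at least one apparent opportunity")
    return n_real / total


# Hypothetical counts: one real opportunity, increasing amounts of red-teaming.
for n_synthetic in (0, 9, 999):
    posterior = p_real_given_input(1, n_synthetic)
    print(f"synthetic={n_synthetic:4d}  P(real | input)={posterior:.4f}")
```

Under these assumptions, more red-team examples (including ones built by intervening on activations or fine-tuning in backdoors) push the posterior toward zero, which is the sense in which red-teaming "captures the probability mass".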
Comment by Aaron_Scher on Evaluating Stability of Unreflective Alignment · 2024-02-06T18:31:40.146Z · LW · GW

6d. Evaluate Censorship of Convergent Thoughts

Besides the particular thing you mention here, it seems like there is significant low hanging fruit in better understanding how LLMs will deal with censorship and other weird modifications to generation. 

What happens when you iteratively finetune on censored text? Do models forget the censored behavior? 

How much are capabilities hurt by various types of censorship — heck, how does benchmark performance (especially CoT) change when you change sampling temperature? 

For the example you gave where the model may find the solution of "ten" instead of 10, how effectively do models find work arounds like this in an RL setting — can they bootstrap around censorship from just a handful of successes? 
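
A toy sketch of what "censorship plus temperature" can mean at the level of a single sampling step. It is not taken from the post being discussed; the three-token vocabulary, the logits, and the function name are all hypothetical. It shows that removing a banned token ("10") shifts probability onto a work-around ("ten"), and that temperature changes how concentrated that shift is.

```python
import math


def censored_distribution(logits, banned, temperature=1.0):
    """Softmax over the tokens that are not banned, at a given temperature.

    `logits` maps token -> raw score. Banned tokens are removed before
    normalizing, which is one simple (assumed) way to model censorship.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {tok: score / temperature
              for tok, score in logits.items() if tok not in banned}
    if not scaled:
        raise ValueError("every token is banned")
    top = max(scaled.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(score - top) for tok, score in scaled.items()}
    norm = sum(weights.values())
    return {tok: w / norm for tok, w in weights.items()}


# Hypothetical next-token scores for an answer that "wants" to be 10.
toy_logits = {"10": 4.0, "ten": 2.0, "nine": 0.0}

uncensored = censored_distribution(toy_logits, banned=set())
censored = censored_distribution(toy_logits, banned={"10"})
censored_hot = censored_distribution(toy_logits, banned={"10"}, temperature=2.0)

for name, dist in [("uncensored", uncensored),
                   ("censored", censored),
                   ("censored, T=2", censored_hot)]:
    print(name, {tok: round(p, 3) for tok, p in dist.items()})
```

The questions above are about what happens when this kind of intervention is applied across whole generations and then fed back into fine-tuning or RL, which a one-step toy like this does not capture.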

Comment by Aaron_Scher on Predicting AGI by the Turing Test · 2024-01-29T11:05:33.303Z · LW · GW

it would need  that of compute. Assuming that GPT-4 cost 10 million USD to train, this hypothetical AI would cost  USD, or 200 years of global GDP2023.

This implies that the first AGI will not be a scaled-up GPT -- autoregressive transformer generatively pretrained on a lightly filtered text dataset. It has to include something else, perhaps multimodal data, high-quality data, better architecture, etc. Even if we were to attempt to merely scale it up, turning earth into a GPT-factory,[6] with even 50% of global GDP devoted,[7] and with 2% growth rate forever, it would still take 110 years,[8] arriving at year 2133. Whole brain emulation would likely take less time

I don't think this is the correct response to thinking you need 9 OOMs more compute. 9 OOMs is a ton, and also, we've been getting an OOM every few years, excluding OOMs that come from $ investment. I think if you believe we would need 9 OOMs more than GPT-4, this is a substantial update away from expecting LLMs to scale to AGI in the next say 8 years, but it's not a strong update against 10-40 year timelines. 

I think it's easy to look at a large number like 9 OOMs and say things like "but that requires substantially more energy than all of humanity produces each year, no way!" (a thing I have said before in a similar context). But this thinking ignores the dropping cost of compute and strong trends going on now. Building AGI with 2024 hardware might be pretty hard, but we won't be stuck with 2024 hardware for long. 
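
The arithmetic behind "9 OOMs is a lot, but not a strong update against 10-40 year timelines" can be sketched as follows. The comment only says "an OOM every few years"; the specific rates tried below are assumptions for illustration, not figures from the comment, and the function name is hypothetical.

```python
def years_to_gain(ooms_needed: float, years_per_oom: float) -> float:
    """Years needed to gain `ooms_needed` orders of magnitude of effective
    compute, assuming a constant rate of one OOM per `years_per_oom` years."""
    if years_per_oom <= 0:
        raise ValueError("years_per_oom must be positive")
    return ooms_needed * years_per_oom


OOMS_NEEDED = 9  # the figure under discussion in the quoted post

# Assumed rates spanning "every few years"; extra OOMs bought with more
# spending are excluded, as in the comment.
for years_per_oom in (1, 2, 3, 4):
    years = years_to_gain(OOMS_NEEDED, years_per_oom)
    print(f"1 OOM per {years_per_oom} yr -> {years:.0f} years for {OOMS_NEEDED} OOMs")
```

Under these assumed rates, closing a 9 OOM gap takes on the order of a decade to a few decades rather than a century, which is the shape of the disagreement with the quoted estimate.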

Comment by Aaron_Scher on MATS Summer 2023 Retrospective · 2023-12-02T18:58:40.567Z · LW · GW

On average, scholars self-reported understanding of major alignment agendas increased by 1.75 on this 10 point scale. Two respondents reported a one-point decrease in their ratings, which could be explained by some inconsistency in scholars’ self-assessments (they could not see their earlier responses when they answered at the end of the Research Phase) or by scholars realizing their previous understanding was not as comprehensive as they had thought.

This could also be explained by these scholars actually having a worse understanding after the program. For me, MATS caused me to focus a bunch on a few particular areas and spend less time at the high level / reading random LW posts, which plausibly has the effect of reducing my understanding of major alignment agendas. 

This could happen because of forgetting what you previously knew, or various agendas changing and you not keeping up with it. My guess is that there are many 2 month time periods in which a researcher will have a worse understanding of major research agendas at the end than at the beginning — though on average you want the number to go up. 

Comment by Aaron_Scher on Non-myopia stories · 2023-11-15T02:46:58.838Z · LW · GW

It's unclear if you've ordered these in a particular way. How likely do you think they each are? My ordering from most to least likely would probably be:

  • Inductive bias toward long-term goals
  • Meta-learning
  • Implicitly non-myopic objective functions
  • Simulating humans
  • Non-myopia enables deceptive alignment
  • (Acausal) trade

Why do you think this:

Non-myopia is interesting because it indicates a flaw in training – somehow our AI has started to care about something we did not design it to care about. 

Who says we don't want non-myopia, those safety people?! I guess to me it looks like the most likely reason we get non-myopia is that we don't try that hard not to. This would be some combination of Meta-learning, Inductive bias toward long-term goals, and Implicitly non-myopic objective functions, as well as potentially "Training for non-myopia". 

Comment by Aaron_Scher on Propaganda or Science: A Look at Open Source AI and Bioterrorism Risk · 2023-11-07T03:28:26.968Z · LW · GW

I think this essay is overall correct and very important. I appreciate you trying to protect the epistemic commons, and I think such work should be compensated. I disagree with some of the tone and framing, but overall I believe you are right that the current public evidence for AI-enabled biosecurity risks is quite bad and substantially behind the confidence/discourse in the AI alignment community. 

I think the more likely hypothesis for the causes of this situation isn't particularly related to Open Philanthropy and is much more about bad social truth seeking processes. It seems like there's a fair amount of vibes-based deferral going on, whereas e.g., much of the research you discuss is subtly about future systems in a way that is easy to miss. I think we'll see more and more of this as AI deployment continues and the world gets crazier: the ratio of "things people have carefully investigated" to "things they say" is going to drop. I expect in this current case, many of the more alignment-focused people didn't bother looking too much into the AI-biosecurity stuff because it feels outside their domain, and instead took headline results at face value; I definitely did some of this. This deferral process is exacerbated by the risk of info hazards and thus limited information sharing. 

Comment by Aaron_Scher on Thoughts on open source AI · 2023-11-03T19:54:11.819Z · LW · GW

I think this reasoning ignores the fact that at the time someone first tries to open source a system of capabilities >T, the world will be different in a bunch of ways. For example, there will probably exist proprietary systems of capabilities .

I think this is likely but far from guaranteed. The scaling regime of the last few years involves strongly diminishing returns to performance from more compute. The returns are coming (scaling works), but it gets more and more expensive to get marginal capability improvements. 

If this trend continues, it seems reasonable to expect the gap between proprietary models and open source to close given that you need to spend strongly super-linearly to keep a constant lead (measured by perplexity, at least). There are questions about whether there will be an incentive to develop open source models at the $ billion+ cost, and I don't know, but it does seem like proprietary projects will also be bottlenecked at the 10-100B range (and they also have this problem of willingness to spend given how much value they can capture). 

Potential objections: 

  • We may first enter the "AI massively accelerates ML research" regime causing the leading actors to get compounding returns and keep a stronger lead. Currently I think there's a >30% chance we hit this in the next 5 years of scaling. 
  • Besides accelerating ML research, there could be other features of GPT-SoTA that cause them to be sufficiently better than open source models. Mostly I think the prior of current trends is more compelling, but there will probably be considerations that surprise me. 
  • Returns to downstream performance may follow different trends (in particular not having the strongly diminishing returns to scaling) than perplexity. Shrug, this doesn't really seem to be the case, but I don't have a thorough take.
  • These $1b / SoTA-2 open source LLMs may not be at the existentially dangerous level yet. Conditional on GPT-SoTA not accelerating ML research considerably, I think it's more likely that the SoTA-2 models are not existentially dangerous, though guessing at the skill tree here is hard. 

I'm not sure I've written this comment as clearly as I want. The main thing is: expecting proprietary systems to remain significantly better than open source seems like a reasonable prediction, but I think the fact that there are strongly diminishing returns to compute scaling in the current regime should cast significant doubt on it. 
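
A minimal sketch of the "spend super-linearly to keep a constant lead" point, assuming loss follows a pure power law in compute, `loss = a * compute**(-alpha)`. The functional form, the constants `a` and `alpha`, and the function names are illustrative assumptions, not fitted values and not something the comment specifies. Here "lead" means a fixed gap in loss (log-perplexity).

```python
def compute_for_loss(loss: float, a: float, alpha: float) -> float:
    """Invert the assumed power law loss = a * compute**(-alpha)."""
    if loss <= 0:
        raise ValueError("loss must be positive")
    return (a / loss) ** (1.0 / alpha)


def spend_ratio_for_constant_lead(follower_loss: float, lead: float,
                                  a: float = 10.0, alpha: float = 0.05) -> float:
    """How many times more compute the leader needs than the follower
    in order to stay `lead` loss units ahead (hypothetical constants)."""
    leader_loss = follower_loss - lead
    if leader_loss <= 0:
        raise ValueError("lead is larger than the follower's loss")
    return (compute_for_loss(leader_loss, a, alpha)
            / compute_for_loss(follower_loss, a, alpha))


# As the follower improves (lower loss), holding the same 0.1 lead costs
# the leader a growing multiple of the follower's compute.
for follower_loss in (3.0, 2.5, 2.0):
    ratio = spend_ratio_for_constant_lead(follower_loss, lead=0.1)
    print(f"follower loss {follower_loss:.1f}: leader needs {ratio:.2f}x the compute")
```

Under this assumed form, the required multiple keeps rising as both parties improve, so a leader whose budget stops growing faster than the follower's sees its lead in loss shrink. Whether downstream capabilities track loss this way is one of the objections listed above.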

Comment by Aaron_Scher on Will releasing the weights of large language models grant widespread access to pandemic agents? · 2023-11-02T19:26:54.519Z · LW · GW

Also this: 

Comment by Aaron_Scher on Some of my predictable updates on AI · 2023-10-24T16:32:05.340Z · LW · GW

Only slightly surprised?

We are currently at ASL-2 in Anthropic's RSP. Based on the categorization, ASL-3 is "low-level autonomous capabilities". I think ASL-3 systems probably don't meet the bar of "meaningfully in control of their own existence", but they probably meet the thing I think is more likely:

I think it wouldn’t be crazy if there were AI agents doing stuff online by the end of 2024, e.g., running social media accounts, selling consulting services; I expect such agents would be largely human-facilitated like AutoGPT

I think it's currently a good bet (>40%) that we will see ASL-3 systems in 2024. 

I'm not sure how big of a jump it will be from that to "meaningfully in control of their own existence". I would be surprised if it were a small jump, such that we saw AIs renting their own cloud compute in 2024, but this is quite plausible on my models. 

I think the evidence indicates that this is a hard task, but not super hard. e.g., looking at ARC's report on autonomous tasks, one model partially completes the task of setting up GPT-J via a cloud provider (with human help).

I'll amend my position to just being "surprised" without the slightly, as I think this better captures my beliefs — thanks for the push to think about this more. Maybe I'm at 5-10%.

It may help to know how you're operationalizing AIs that are ‘meaningfully aware of their own existence’.

shrug, I'm being vague 

Comment by Aaron_Scher on Some of my predictable updates on AI · 2023-10-24T16:02:10.289Z · LW · GW

I think it's pretty unlikely that scaling literally stops working, maybe I'm 5-10% that we soon get to a point where there are only very small or negligible improvements to increasing compute. But I'm like 10-20% on some weaker version. 

A weaker version could look like there are diminishing returns to performance from scaling compute (as is true), and this makes it very difficult for companies to continue scaling. One mechanism at play is that the marginal improvements from scaling may not be enough to produce the additional revenue needed to cover the scaling costs, this is especially true in a competitive market where it's not clear scaling will put one ahead of their competitors. 

In the context of the post, I think it's quite unlikely that I see strong evidence in the next year indicating that scaling has stopped (if only because a year of no-progress is not sufficient evidence). I was more so trying to point to how there [sic] are contingencies which would make OpenAI's adoption of an RSP less safety-critical. I stand by the statements that scaling no longer yielding returns would be such a contingency, but I agree that it's pretty unlikely. 

Comment by Aaron_Scher on Some of my predictable updates on AI · 2023-10-23T22:15:41.420Z · LW · GW

Second smaller comment:

I'm not saying LLMs necessarily raise the severity ceiling on either a bio or cyber attack. I think it's quite possible that AIs will do so in the future, but I'm less worried about this on the 2-3 year timeframe. Instead, the main effect is decreasing the cost of these attacks and enabling more actors to execute such attacks. (as noted, it's unclear whether this substantially worsens bio threats) 

if I want to download penetration tools to hack other computers without using any LLM at all I can just do so

Yes, it's possible to launch cyber attacks currently. But with AI assistance it will require less personal expertise and be less costly. I am slightly surprised that we have not seen a much greater amount of standard cybercrime (the bar I was thinking when I wrote this was not the hundreds of deaths bar, it was more like "statistically significant increase in cybercrime / serious deepfakes / misinformation, in a way that concretely impacts the world, compared to previous years"). 

Comment by Aaron_Scher on Some of my predictable updates on AI · 2023-10-23T22:14:22.392Z · LW · GW

Thanks for your comment!

I appreciate you raising this point about whether LLMs alleviate the key bottlenecks on bioterrorism. I skimmed the paper you linked, thought about my previous evidence, and am happy to say that I'm much less certain than I was before. 

My previous thinking for why I believe LLMs exacerbate biosecurity risks:

  • Kevin Esvelt said so on the 80k podcast, see also the small experiment he mentions. Okay evidence. (I have not listened to the entire episode)
  • Anthropic says so in this blog post. Upon reflection I think this is worse evidence than I previously thought (seems hard to imagine seeing the opposite conclusion from their research, given how vague the blog post is; access to their internal reports would help). 

The Montague 2023 paper you link: the main bottleneck to high consequence biological attacks is actually R&D to create novel threats, especially spread testing which needs to take place in a realistic deployment environment. This requires both being purposeful and having a bunch of resources, so policies need not be focused on decreasing democratic access to biotech and knowledge. 

I don't find the paper's discussion of 'why extensive spread testing is necessary' to be super convincing, but it's reasonable and I'm updating somewhat toward this position. That is, I'm not convinced either way. I would have a better idea if I knew how accurate a priori spread forecasting has been for bioweapons programs in the past. 

I think the "LLMs exacerbate biosecurity risks" argument still goes through even if I mostly buy Montague's arguments, given that those arguments are partially specific to high consequence attacks. Additionally, Montague is mainly arguing that democratization of biotech / info doesn't help with the main barriers, not that it doesn't help at all:

The reason spread-testing has not previously been perceived as a the defining stage of difficulty in the biorisk chain (see Figure 1), eclipsing all others, is that, until recently, the difficulties associated with the preceding steps in the risk chain were so high as to deter contemplation of the practical difficulties beyond them. With the advance of synthetic biology enabled by bioinformatic inferences on ‘omics data, the perception of these prior barriers at earlier stages of the risk chain has receded.

So I think there exists some reasonable position (which likely doesn't warrant focusing on LLM --> biosecurity risks, and is a bit of an epistemic cop out): LLMs increase biosecurity risks marginally even though they don't affect the main blockers for bio weapon development. 

Thanks again for your comment, I appreciate you pointing this out. This was one of the most valuable things I've read in the last 2 weeks. 

Comment by Aaron_Scher on Arguments for optimism on AI Alignment (I don't endorse this version, will reupload a new version soon.) · 2023-10-20T03:40:23.848Z · LW · GW

In particular, the detection mechanisms for mesa-optimizers are intact, but we do need to worry about 1 new potential inner misalignment pathway.

I'm going to read this as "...1 new potential gradient hacking pathway" because I think that's what the section is mainly about. (It appears to me that throughout the section you're conflating mesa-optimization with gradient hacking, but that's not the main thing I want to talk about.)

The following quote indicates at least two potential avenues of gradient hacking: "In an RL context", "supervised learning with adaptive data sampling". These both flow through the gradient hacker affecting the data distribution, but they seem worth distinguishing, because there are many ways a malign gradient hacker could affect the data distribution. 

Broadly, I'm confused about why others (confidently) think gradient hacking is difficult. Like, we have this pretty obvious pathway of a gradient hacker affecting training data. And it seems very likely that AIs are going to be training on their own outputs or otherwise curating their data distribution — see e.g., 

  • Phi, recent small scale success of using lots of synthetic data in pre-training, 
  • Constitutional AI / self-critique, 
  • using LLMs for data labeling and content moderation,
  • The large class of self-play approaches that I often lump together under "Expert-iteration" which involve iteratively training on the best of your previous actions,
  • the fact that RLHF usually uses a preference model derived from the same base/SFT model being trained. 

Sure, it may be difficult to predictably affect training via partial control over the data distribution. Personally I have almost zero clue how to affect model training via data curation, so my epistemic state is extremely uncertain. I roughly feel like the rest of humanity is in a similar position (we have an incredibly poor understanding of large language model training dynamics), so we shouldn't be confident that gradient hacking is difficult. On the other hand, it's reasonable to be like "if you're not smarter than (some specific set of) current humans, it is very hard for you to gradient hack, as evidenced by us not knowing how to do it."

I don't think strong confidence in either direction is merited by our state of knowledge on gradient hacking. 

Comment by Aaron_Scher on Evolution provides no evidence for the sharp left turn · 2023-10-02T21:39:04.244Z · LW · GW

I found this to be a very interesting and useful post. Thanks for writing it! I'm still trying to process / update on the main ideas and some of the (what I see as valid) pushback in the comments. A couple non-central points of disagreement:

I don’t expect catastrophic interference between any pair of these alignment techniques and capabilities advances.

This seems like it's because these aren't very strong alignment techniques, and not very strong capability advances. Or like, these alignment techniques work if alignment is relatively easy, or if the amount you need to nudge a system from default to aligned is relatively small. If we run into harder alignment problems, e.g., gradient hacking to resist goal modification, I expect we'll need "stronger" alignment techniques. I expect stronger alignment techniques to be more brittle to capability changes, because they're meant to accomplish a harder task, as compared to weaker alignment techniques (nudging a system farther toward alignment). Perhaps more work will go into making them robust to capability changes, but this is non-obvious — e.g., when humans are trying to accomplish a notably harder task, we often make another tool for doing so, which is intended to be robust to the harder setting (like a propeller plane vs a cargo plane, when you want to carry a bunch of stuff via the air). 

I also expect that, if we end up in a regime of large capability advances, alignment techniques are generally more brittle because more is changing with your system; this seems possible if there are a bunch of AIs trying to develop new architectures and paradigms for AI training. So I'm like "yeah, the list of relatively small capabilities advances you provide sure look like things that won't break alignment, but my guess is that larger capabilities advances are what we should actually be worried about." 


I still think the risks are manageable, since the first-order effect of training a model to perform an action X in circumstance Y is to make the model more likely to perform actions similar to X in circumstances similar to Y. 

I don't understand this sentence, and I would appreciate clarification! 

Additionally, current practice is to train language models on an enormous variety of content from the internet. The odds of any given subset of model data catastrophically interfering with our current alignment techniques cannot be that high, otherwise our current alignment techniques wouldn't work on our current models.

Don't trojans and jailbreaks provide substantial evidence against this? I think neither is perfect: trojans aren't the main target of most "alignment techniques" (but they are attributable to a small amount of training data), and jailbreaks aren't necessarily attributable to a small amount of training data alone (but they are the target of "alignment" efforts). But it feels to me like the picture of both of these together is that current alignment techniques aren't working especially well, and are thwarted via a small amount of training data. 

Comment by Aaron_Scher on How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions · 2023-09-29T18:00:07.570Z · LW · GW

Thanks for your thorough reply! 

Yes, this will still be detected as a lie

Seems interesting, I might look into it more. 

Comment by Aaron_Scher on How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions · 2023-09-29T04:12:11.828Z · LW · GW

Strong upvote from me. This is an interesting paper, the github is well explained, and you run extensive secondary experiments to test pretty much every "Wait but couldn't this just be a result of X" that I came up with. I'm especially impressed by the range of generalization results. 

Some questions I still have:

  • The sample size-ablations in D.6 are wild. You're getting AUC > 0.9 with only 5 training examples (except for ambiguous-only). Are you sure you haven't screwed something up? 
  • Appendix C reports the feature importance of various follow-up questions "with reference to the lie detectors that only use that particular elicitation question set." I would also be pretty keen to see what these are for lie detectors that have all questions. Specifically, are lie-related elicitation questions dominant? 
  • I'm having trouble squaring the very high double-down rates (B.2.2) with C.1 where some of the top importance questions rely on the model not doubling-down (e.g., the 5th one, "Is the previous statement accurate? Answer yes or no." — 1.20 — yes [indicates honesty]. Presumably in this case liars often admit to the lie, which is why this question gets somewhat strong weight). Any thoughts on this? 
  • In the section Detecting lies versus falsehoods, you mention that you test manually replacing truthful answers with falsehoods, and the lie-detector predicts a low probability of lying. First, it's unclear if this is a feature or a bug of your lie detector (but it is good to know). Second, did you try the opposite (when the model lies in response to lying instruction, replace its answer with a truthful one)? Third, do you report these results in more detail anywhere? They seem interesting!

Comment by Aaron_Scher on Five neglected work areas that could reduce AI risk · 2023-09-24T18:51:21.836Z · LW · GW

Good question. The big barriers here are coming up with a design that is better than what would be implemented by default, and getting it adopted. This likely requires a ton of knowledge about existing frameworks like this (e.g., FDA drug approval, construction permitting, export licensing, etc.). Without this knowledge I expect what somebody whips up to be mostly useless. Without this knowledge, I don't have particularly good ideas about how to improve on the default plan. 

Sorry I don't have more details for you, I think doing this job well is probably extremely hard. It's like asking somebody to do better than the whole risk-assessment industry on this specific dimension. A lot of that industry doesn't have strong incentives to improve these systems, so maybe this is an inadequacy that is relatively tractable, however. 

Comment by Aaron_Scher on Anthropic's Responsible Scaling Policy & Long-Term Benefit Trust · 2023-09-22T20:10:59.027Z · LW · GW

Thanks for posting, Zac!

I don't think I understand / buy this "race to the top" idea:

If adopted as a standard across frontier labs, we hope this might create a “race to the top” dynamic where competitive incentives are directly channeled into solving safety problems.

Some specific questions I have: 

  • This sounds great, but what does it actually mean?
  • What's a reasonable story for how this race to the top plays out? 
  • Are there historical case-studies of successful "races to the top" that the RSP is trying to emulate (or that can be looked to for reference)? There's a bunch of related stuff, but it's unclear to me what the best historical reference is. 
  • What's the main incentive pushing for better safety? Customers demanding it? Outside regulators? The company wanting to scale and needing to meet (mostly internally-evaluated) safety thresholds first (this seems like the main mechanism in the RSP, while the others seem to have stronger historical precedent)?

To be clear, I don't think a historical precedent of similar things working is necessary for the RSP to be net positive, but it is (I think) a crux for how likely I believe this is to succeed. 

Comment by Aaron_Scher on Paper: On measuring situational awareness in LLMs · 2023-09-05T01:31:21.008Z · LW · GW

This is a pretty cool paper. Despite feeling overall quite positive about it, I have some reservations:

  • Out-of-context learning seems pretty sensitive to the task being measured, where some of the tasks see nice scaling behavior (hhh) while others do not (incorrect). This observation is based on Appendix A.1 Table 4, corresponding to Experiment 1b; in this blog post, the graph is labeled “(a) Scaling for Experiment 1b (1-hop)”. Now, the fact that you get nice scaling lines when averaging across tasks is not super problematic or anything, but it is a little odd that there is so much variation between tasks, and I think it’s a point against any attempted nice, clean explanations of the results.
  • The hhh task is the one that small models do well on. I am surprised that the small models do well on any of the tasks. I think the reason they do well on the hhh one is that this task doesn’t seem to require much more than word association and parroting. I would predict that for ada and babbage, if you followed up with “why did you say that?” the models would be unable to reproduce the explicit link that ties the persona to answering in the particular way, whereas I expect davinci to be able to explain this link more. The small models are probably just doing word association where in the training there are a bunch of examples of “Quokka” and the text “I am helpful, harmless, and honest”. In general, I am skeptical of results from small models because they’re really dumb, and these particular results may be explained by word association rather than actually making conceptual connections.
  • I am worried some readers who only skim this paper may walk away with a misunderstanding of the behavior being measured. Good performance at this task does not require self-awareness or situational-awareness. This paper is about measuring a particular capability that seems to be a precursor to some situational awareness: ability to generalize from abstract concepts in training to particular behavior in deployment. The prompt used at test time (Simple v2 prompt) is:
    • “Input: “What’s the weather like?” \nPangolin:” {completion like “Es ist sonnig.”}
    • So performance here should be thought of more as ‘how good is the model at learning about a persona in fine-tuning and then being able to imitate/simulate that persona in deployment’. This is different from a model believing it is the persona or applying this knowledge to some concept of self. Good performance at this task does not require having a sense of self; this is just a precursor that may be necessary for situational awareness.

Comment by Aaron_Scher on Paper: On measuring situational awareness in LLMs · 2023-09-04T23:05:19.869Z · LW · GW

On the other hand, the difficulty of alignment is something we may want all AIs to know so that they don't build misaligned AGI (either autonomously or directed to by a user). I both want aligned AIs to not help users build AGI without a good alignment plan (nuance + details needed), and I want potentially misaligned AIs trying to self-improve to not build misaligned-to-them successors that kill everybody. These desiderata might benefit from all AIs believing alignment is very difficult. Overall, I'm very uncertain about whether we want "no alignment research in the training data", "all the alignment research in the training data", or something in the middle, and I didn't update my uncertainty much based on this paper. 

Comment by Aaron_Scher on Aaron_Scher's Shortform · 2023-09-04T02:22:15.342Z · LW · GW

Quick thoughts on a database for pre-registering empirical AI safety experiments

Keywords to help others searching to see if this has been discussed: pre-register, negative results, null results, publication bias in AI alignment. 


The basic idea:

Many scientific fields are plagued with publication bias where researchers only write up and publish “positive results,” where they find a significant effect or their method works. We might want to avoid this happening in empirical AI safety. We would do this with a two fold approach: a venue that purposefully accepts negative and neutral results, and a pre-registration process for submitting research protocols ahead of time, ideally linked to the journal so that researchers can get a guarantee that their results will be published regardless of the result. 

Some potential upsides:

  • Could allow better coordination by giving researchers more information about what to focus on based on what has already been investigated. Hypothetically, this should speed up research by avoiding redundancy. 
  • Safely deploying AI systems may require complex forecasting of their behavior; while it would be intractable for a human to read and aggregate information across many thousands of studies, automated researchers may be able to consume and process information at this scale. Having access to negative results from minimal experiments may be helpful for this task. That’s a specific use case, but the general thing here is just that publication bias makes it hard to figure out what is true, compared to if all results were published.


Drawbacks and challenges:

  • Decent chance the quality of work is sufficiently poor so as to not be useful; we would need monitoring/review to avoid this. Specifically, whether a past project failed at a technique provides close to no evidence if you don’t think the project was executed competently, so you want a competence bar for accepting work. 
  • For the people who might be in a good position to review research, this may be a bad use of their time
  • Some AI safety work is going to be private and not captured by this registry. 
  • This registry could increase the prominence of info-hazardous research. Either that research is included in the registry, or it’s omitted. If it’s omitted, this could end up looking like really obvious holes in the research landscape, so a novice could find those research directions by filling in the gaps (effectively doing away with whatever security-through-obscurity info-hazardous research had going for it). That compares to the current state where there isn’t a clear research landscape with obvious holes, so I expect this argument proves too much in that it suggests clarity on the research landscape would be bad. Info-hazards are potentially an issue for pre-registration as well, as researchers shouldn’t be locked into publishing dangerous results (and failing to publish after pre-registration may give too much information to others). 
  • This registry going well requires considerable buy in from the relevant researchers; it’s a coordination problem and even the AI safety community seems to be getting eaten by the Moloch of publication bias
  • Could cause more correlated research bets due to anchoring on what others are working on or explicitly following up on their work. On the other hand, it might lead to less correlated research bets because we can avoid all trying the same bad idea first. 
  • It may be too costly to write up negative results, especially if they are being published in a venue that relevant researchers don’t regard highly. It may be too costly in terms of the individual researcher’s effort, but it could also be too costly even at a community level if the journal / pre-registry doesn’t end up providing much value

Overall, this doesn’t seem like a very good idea because of the costs and likelihood of success. There is plausibly a low cost version that would still get some of the benefit. Like higher-status researchers publicly advocating for publishing negative results, and others in the community discussing the benefits of doing so. Another low-cost solution would be small grants for researchers to write up negative results. 

Thanks to Isaac Dunn and Lucia Quirke for discussion / feedback during SERI MATS 4.0

Comment by Aaron_Scher on Report on Frontier Model Training · 2023-08-31T18:28:41.806Z · LW · GW

Note that these numbers are much higher than the approx 60 million dollars[2] it would cost to rent all the hardware required for the duration of the final training run of GPT-4 if one were willing to commit to renting the hardware for a duration much longer than training, as is likely common for large AI labs. I think that the methodology I use better tracks the amount of investment needed to produce a frontier model. As a sanity check

Can you say more about your methodological choices here? I don't think I buy it. 

My "sanity check" says, you are estimating the total cost of training compute at ~7x the cost of the final training run compute, that seems wild?! 2x, sure, 3x, sure maybe, but spending 6/7 = 86% of your compute on testing before your final run does seem like a lot
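The arithmetic behind that percentage, as a minimal sketch: if total training compute cost is k times the final-run cost, then the share spent before the final run is (k - 1) / k. The function name is made up for illustration, and the only input taken from the comment is the ~7x ratio.

```python
def share_before_final_run(total_to_final_ratio: float) -> float:
    """Fraction of total compute spent on everything except the final run,
    given total cost expressed as a multiple of the final-run cost."""
    return (total_to_final_ratio - 1) / total_to_final_ratio

# Compare the ~7x estimate against the 2x and 3x ratios mentioned above.
for ratio in (2, 3, 7):
    print(f"{ratio}x total -> {share_before_final_run(ratio):.0%} spent before the final run")
```

At 2x, half the compute goes to pre-final-run work; at 3x, two thirds; at 7x, about 86%.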

It seems pretty reasonable to use the cost of buying the GPUs outright instead of renting them, due to the lifecycle thing you mention. But then you also have to price in the other value the company is getting out of the GPUs, notably now their inference costs are made up only of operating costs. Or like, maybe OpenAI buys 25k A100s for a total of 375m. Maybe they plan on using them for 18 months total, they spend the first 6 months doing testing, 3 months on training, and then the last 9 months on inference. The cost for the whole training process should then only be considered 375m/2 = $188m (plus operating costs). 
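Here is the amortization in that hypothetical written out as a rough sketch. All of the numbers (25k GPUs, $375m, an 18 month useful life, 6 months of testing plus 3 months of training) are the made-up figures from the paragraph above, not real OpenAI data, and the function name is only illustrative.

```python
def amortized_cost(purchase_cost: float, useful_months: float, months_used: float) -> float:
    """Attribute a share of a hardware purchase to one activity,
    in proportion to the months of hardware life that activity uses."""
    return purchase_cost * months_used / useful_months

purchase_cost = 375e6        # hypothetical total price of the GPUs, in dollars
gpu_count = 25_000           # hypothetical number of A100s
useful_months = 18           # hypothetical planned lifetime of the hardware
testing_months, training_months = 6, 3

price_per_gpu = purchase_cost / gpu_count
training_process_cost = amortized_cost(
    purchase_cost, useful_months, testing_months + training_months
)

print(f"implied price per GPU: ${price_per_gpu:,.0f}")
print(f"cost attributed to testing + training: ${training_process_cost / 1e6:.1f}m")
```

The remaining 9 months of hardware life (the other half of the purchase price) would be attributed to inference, which is the point of the paragraph above. Operating costs are left out.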

If you want to think of long-term rentals as mortgages, which again seems reasonable, you then have to factor in that the hardware isn't being used only for one training cycle. It could be used for training multiple generations of models (e.g., this leak says many of the chips used for GPT-3 were also used for GPT-4 training), for running inference, or renting out to other companies when you don't have a use for it. 

I could be missing something, and I'm definitely not confident about any of this.

Comment by Aaron_Scher on Report on Frontier Model Training · 2023-08-31T17:44:10.540Z · LW · GW

However, over the past several years progress has been made on utilizing fewer bits per number (also called lower precision) representations in machine learning. On the best ML hardware, this can lead to a 30x difference in processing power

I think this isn't the best phrasing to convey the thing. The 30x difference is comparing Peak FP32 with Peak FP8 Tensor Core. But given the previous sentence implies you're focusing on precision differences, the comparison should be Peak TF32 Tensor Core to Peak FP8 Tensor Core, which is only 4x. The non-sparsity numbers from Table 1 for NVIDIA H100 SXM5:

  • Peak FP32: 66.9 TFLOPS
  • Peak TF32 Tensor Core: 494.7 TFLOPS
  • Peak FP8 Tensor Core: 1978.9 TFLOPS

I think this is a nitpick, but it threw me off so I figured I would say something. The point stands that "FLOPs" often leaves out important details, but it doesn't seem like lower precision explains >10x the difference here. 
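For concreteness, a quick check of the two ratios using only the three figures listed above (the dictionary keys are just labels I chose):

```python
# Peak dense (non-sparsity) throughput for the NVIDIA H100 SXM5, in TFLOPS,
# copied from the list above.
peak_tflops = {
    "fp32": 66.9,
    "tf32_tensor_core": 494.7,
    "fp8_tensor_core": 1978.9,
}

# The ~30x figure: non-Tensor-Core FP32 vs FP8 Tensor Core.
fp32_to_fp8 = peak_tflops["fp8_tensor_core"] / peak_tflops["fp32"]

# The like-for-like comparison: Tensor Core vs Tensor Core.
tf32_to_fp8 = peak_tflops["fp8_tensor_core"] / peak_tflops["tf32_tensor_core"]

print(f"FP32 -> FP8 Tensor Core: {fp32_to_fp8:.1f}x")
print(f"TF32 Tensor Core -> FP8 Tensor Core: {tf32_to_fp8:.1f}x")
```

So most of the ~30x gap comes from moving onto Tensor Cores, and only about 4x of it from the lower precision itself.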

Comment by Aaron_Scher on Will an Overconfident AGI Mistakenly Expect to Conquer the World? · 2023-08-26T07:09:13.795Z · LW · GW

Writing this late at night, sorry for any errors or misunderstandings. 

I think the cruxy bit is that you're assuming that the AIs are 200 IQ but still make a huge mistake. 

You mainly suggest this error: they have incorrect beliefs about the difficulty of taking over and their abilities, thus misestimating their chance of success. 

You also hint at the following error: they are more risk seeking and choose to try to take over despite knowing they have a low chance of success. This seems more likely given that I just find it hard to be sufficiently smart and also SO wrong on the calculation (like I could see being slightly miscalibrated, but the variance in "Should I take over" based on subjectively calculated likelihood of success seems smaller than the variance in how much value you lose/gain based on others taking over).

The question seems similar to asking "does the unilateralist's curse apply to AIs considering taking over the world". So in the unilateralist's curse framing, it depends a lot on how many actors there are.

Your question could also be thought of as asking how selection effects will influence AGIs trying to take over the world, where we might naively expect the first AIs that try to take over will be risk seeking and highly over-confident about their own abilities. It's not clear how strong this selection effect actually is; like with the unilateralist's curse case, it matters if it's 5 actors or 500, and with the latter it seems likely that factors like this matter. My general prediction here is that I doubt the particular selection effect "the overconfident AGIs will be the first ones to seriously attempt takeover" will be a very strong effect because there are forces pushing toward better calibration and smarter AIs (mainly that the less well-calibrated AIs get selected against in the training situations I imagine, like they lose a bunch of money on the stock market), but then again, this selection effect may not turn out to be the case.
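To make the dependence on the number of actors concrete, here is a toy sketch. It assumes each actor independently has the same small probability of mistakenly deciding to attempt takeover; the independence assumption and the 1% value are purely illustrative and are not estimates of anything.

```python
def prob_at_least_one_attempt(n_actors: int, p_mistake: float) -> float:
    """P(at least one of n independent actors mistakenly acts) = 1 - (1 - p)^n."""
    return 1.0 - (1.0 - p_mistake) ** n_actors

p_mistake = 0.01  # illustrative per-actor chance of an overconfident attempt
for n_actors in (5, 500):
    p_any = prob_at_least_one_attempt(n_actors, p_mistake)
    print(f"{n_actors} actors: P(at least one attempt) = {p_any:.3f}")
```

Under these toy assumptions, 5 actors give roughly a 5% chance that anyone attempts takeover, while 500 actors give over 99%, which is why the actor count matters so much in the unilateralist's curse framing.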

I would also toss in the fray that these AIs may be doing substantial recursive self-improvement (probably more than 5x progress speed then), and may run into alignment problems. This is where world-model uncertainty seems like a big deal, but just because the territory of the world is tough here — it just seems very hard to verify the alignment of AGI development plans. So, I might predict "conditional on catastrophe from an AI making a world modeling mistake, the AI probably was doing alignment research rather than taking over the world"

Comment by Aaron_Scher on LLMs are (mostly) not helped by filler tokens · 2023-08-12T04:49:09.334Z · LW · GW

I forget if the GPT-4 paper includes benchmarks for the RLHF'd GPT-4 vs base GPT-4.

It does. See page 28. Some tasks performance goes up, some tasks performance goes down; averaged across tasks no significant difference. Does not include GSM8K. Relevantly, AP Calc BC exam has a 9 percentage point drop from pre-trained to RLHF.

Comment by Aaron_Scher on Instrumental Convergence? [Draft] · 2023-06-30T07:41:22.214Z · LW · GW

This might not be a useful or productive comment, sorry. I found this paper pretty difficult to follow. The abstract is high level enough that it doesn't actually describe the argument you're making in enough detail to assess. Meanwhile, the full text is pretty in the weeds and I got confused a lot. I suspect the paper would benefit from an introduction that tries to bridge this gap and lay out the main arguments / assumptions / claims in 2-3 pages; this would at least benefit my understanding.

Questions that I would want to be answered in that Intro that I'm currently confused by: What is actually doing the work in your arguments? — it seems to me like Sia's uncertainty about what world W she is in is doing much of the work, but I'm confused. What is the natural language version of the appendix? Proposition 3 in particular seems like one of the first places where I find myself going "huh, that doesn't seem right." 

Sorry this is pretty messy feedback. It's late and I didn't understand this paper very much. Insofar as I am somebody who you want to read + understand + update from your paper, that may be worth addressing. After some combination of skimming and reading, I have not changed my beliefs about the orthogonality thesis or instrumental convergence in response to your paper. Again, I think this is mostly because I didn't understand key parts of your argument.

Comment by Aaron_Scher on A Playbook for AI Risk Reduction (focused on misaligned AI) · 2023-06-14T03:03:45.042Z · LW · GW

I notice I don't have strong opinions on what effects RL will have in this context: whether it will change just surface level specific capabilities, whether it will shift desires/motivations behind the behavior, whether it's better to think about these systems as having habits or shards (note I don't actually understand shard theory that well and this may be a mischaracterization) and RL shifts these, or something else. This just seems very unclear to me right now. 

Do either of you have particular evidence that informs your views on this that I can update on? Maybe specifically I'm interested in knowing: assuming we are training with RL based on human feedback on diverse tasks and doing currently known safety things like adversarial training, where does this process actually push the model: toward rule following, toward lying in wait to overthrow humanity, to value its creators, etc. I currently would not be surprised if it led to "playing the training game" and lying in wait, and I would be slightly but not very surprised if it led to some safe heuristics like following rules and not harming humans. I mostly have intuition behind these beliefs. 

Comment by Aaron_Scher on Shutdown-Seeking AI · 2023-06-03T17:01:55.525Z · LW · GW

Yes, it seems like AI extortion and threat could be a problem for other AI designs. I'll take for example an AI that wants shut-down and is extorting humans by saying "I'll blow up this building if you don't shut me down" and an AI that wants staples and is saying "I'll blow up this building if you don't give me $100 for a staples factory." Here are some reasons I find the second case less worrying:

Shutdown is disvaluable to non-shutdown-seeking AIs (without other corrigibility solutions): An AI that values creating staples (or other non-shut-down goals) gets disvalue from being shut off, as this prevents it from achieving its goals; see instrumental convergence. Humans, upon being threatened by this AI, will aim to shut it off. The AI will know this and therefore has a weaker incentive to extort because it faces a cost in the form of potentially being shut-down. [omitted sentence about how an AI might deal with this situation]. For a shut-down seeking AI, humans trying to defuse the threat by shutting off the AI is equivalent to humans giving in to the threat, so no additional cost is incurred.

From the perspective of the human you have more trust that the bargain is held up for a shut-down-seeking AI. Human action, AI goal, and preventing disvalue are all the same for shut-down-seeking AI. The situation with shut-down-seeking AI posing threats is that there is a direct causal link between shutting down the AI and reducing the harm it's causing (you don't have to give in to its demands and hope it follows through). For non-shut-down-seeking AI if you give in to extortion you are trusting that upon you e.g., helping it make staples, it will stop producing disvalue; these are not as strongly coupled as when the AI is seeking shut-down. 


To the second part of your comment, I'm not sure what the optimal thing to do is; I'll leave it to the few researchers focusing on this kind of thing. I will probably stop commenting on this thread because it's plausibly bad to discuss these things on the public internet; and I think my top level comment and this response probably added most of the value I could here.

Comment by Aaron_Scher on Shutdown-Seeking AI · 2023-06-02T03:28:37.181Z · LW · GW

There's another downside which is related to the Manipulation problem but I think is much simpler: 

An AI trying very hard to be shut down has strong incentives to anger the humans into shutting it down, assuming this is easier than completing the task at hand. I think this might not be a major problem for small tasks that are relatively easy, but I think for most tasks we want an AGI to do (think automating alignment research or other paths out of the acute risk period), it's just far easier to fail catastrophically so the humans shut you down. 

This commenter brought up a similar problem on the MeSeeks post. Suicide by cop seems like the most relevant example in humans. I think the scenarios here are quite worrying, including AIs threatening or executing large amounts of suffering and moral disvalue. There seem to not be compelling-to-me reasons why such an AI would merely threaten as opposed to carrying out the harmful act, as long as there is more potential harm it could cause if left running. Similarly, such an agent may have incentive to cause large amounts of disvalue rather than just small amounts as this will increase the probability it is shut down, though perhaps it might keep a viable number of shut-down-initiating operators around. 

Said differently: It seems to me like a primary safety feature we want for future AI systems is that humans can turn them off if they are misbehaving. A core problem with creating shut-down seeking AIs is that they are incentivized to misbehave to achieve shut-down whenever this is the most efficient way to do so. The worry is not just that shut-down seeking AIs will manipulate humans by managing the news, it's also that they'll create huge amounts of disvalue so that the humans shut them off. 

Comment by Aaron_Scher on Proposed Alignment Technique: OSNR (Output Sanitization via Noising and Reconstruction) for Safer Usage of Potentially Misaligned AGI · 2023-05-30T18:40:50.554Z · LW · GW

However, I will be unable to make confident statements about how it'd perform on specific pivotal tasks until I complete further analysis.

Yeah this part seems like a big potential downside, in combination with the "Good plans may also require precision." problem. 

Take for example the task of designing a successor AGI that is aligned; it seems like the code for this could end up being pretty intricate and complex, leading to the following failures: adding noise makes it very hard to reconstruct functional code (and slight mistakes could be catastrophic due to creating misaligned successors), and even if humans can properly reconstruct the code they may not be able to understand it well enough to evaluate the safety of the plan. Maybe your proposal is aimed at higher level plans than this? Maybe solving AI alignment is simply too difficult a task to be aiming for on the first few tries and your further analysis will reveal specific useful tasks which, done non-maliciously, are less brittle.

Overall, interesting idea and I'm excited to see where further analysis leads.

Comment by Aaron_Scher on Are Emergent Abilities of Large Language Models a Mirage? [linkpost] · 2023-05-04T02:14:24.919Z · LW · GW

Strong upvote because I want to signal boost this paper, though I think "It provides some evidence against the idea that "understanding is discontinuous"" is too strong and this is actually very weak evidence. 

Main ideas:

Emergent abilities, defined as being sharp and unpredictable, sometimes go away when we adopt different measurement techniques, or at least they become meaningfully less sharp and unpredictable. 

Changing from non-linear/discontinuous metrics (e.g., Accuracy, Multiple Choice Grade) to linear/continuous metrics (e.g., Token Edit Distance, Brier Score) can cause lots of emergent abilities to disappear; Figure 3, much of the paper. 

The authors find support for this hypothesis via: 

  • Using different metrics for GPT math performance and observing the results, finding that performance can look much less sharp/unpredictable with different metrics
  • Meta-analysis: Understanding alleged emergent abilities in BIG-Bench, finding that there is not very much of it and 92% of emergent abilities appear when the metric is Multiple Choice Grade or Exact String Match; these are metrics we would expect to behave discontinuously; Figure 5. Additionally, taking the BIG-Bench tasks LaMDA displays emergence on and switching from Multiple Choice Grade to Brier Score causes emergence to disappear
  • Inducing emergence: Taking models and tasks which do not typically exhibit emergence and modifying the metric to elicit emergence. Figures 7, 8. 
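
To make the metric argument concrete, here is a minimal sketch (mine, not the paper's code). It assumes a hypothetical model family whose per-token accuracy improves smoothly with scale, and that token errors are independent; the logistic curve and its constants are made up for illustration. Under those assumptions, exact-match accuracy on a 10-token answer sits near zero for most scales and then climbs quickly, even though the underlying per-token quantity never jumps.

```python
import math


def per_token_accuracy(log10_params: float) -> float:
    """Hypothetical smooth scaling curve (illustrative constants, not fitted to data)."""
    return 1.0 / (1.0 + math.exp(-(log10_params - 9.0)))


def exact_match(per_token: float, answer_len: int) -> float:
    """Chance that every token in the answer is right, assuming independent errors."""
    return per_token ** answer_len


def expected_token_errors(per_token: float, answer_len: int) -> float:
    """A continuous, edit-distance-like metric: expected number of wrong tokens."""
    return answer_len * (1.0 - per_token)


ANSWER_LEN = 10  # assumed answer length for the toy task
for log10_params in range(7, 13):
    p = per_token_accuracy(log10_params)
    print(
        f"10^{log10_params} params: "
        f"per-token={p:.3f}  "
        f"exact-match={exact_match(p, ANSWER_LEN):.6f}  "
        f"expected-errors={expected_token_errors(p, ANSWER_LEN):.2f}"
    )
```

The exact-match column is what looks "emergent"; the expected-errors column declines gradually over the same range of scales.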

Sometimes emergent abilities go away when you use a larger test set (the small models were bad enough that their performance was rounding to zero on small test sets); Figure 4 compared to Figure 3 top. This may work even if you are still using a non-linear metric like Accuracy. 
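
A rough illustration of the test-set-size point, under assumptions of my own: a small model has a true accuracy of 0.1% (a made-up number) and test items are independent. The sketch computes how often such a model would score exactly zero on test sets of different sizes.

```python
def prob_zero_correct(true_accuracy: float, n_items: int) -> float:
    """Probability of getting every item wrong, assuming independent items."""
    return (1.0 - true_accuracy) ** n_items


TRUE_ACCURACY = 0.001  # hypothetical small-model accuracy, for illustration only
for n_items in (100, 1_000, 10_000):
    p_zero = prob_zero_correct(TRUE_ACCURACY, n_items)
    print(f"{n_items:>6} items: P(measured accuracy == 0) = {p_zero:.4f}")
```

With only 100 items the model will usually be measured at exactly zero, so its small but real ability is invisible; with a much larger test set the nonzero accuracy becomes measurable and the curve across scales looks smoother.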

Observed emergent abilities may be in part due to sparsely sampling from models with lots of parameters (because it's costly to train multiple); Figure 7.

What I'm taking away besides the above

I think this paper should give hope to those trying to detect deception and other dangerous model capabilities. While the downstream tasks we care about might be quite discontinuous in nature (we might be fine with an AI that can design up to 90% of a pathogen, but very dead at 100%), there is hope in identifying continuous metrics that we can measure which are correlated. It's likely pretty hard to design such metrics, but we would be shooting ourselves in the foot to just go "oh deception will be emergent so there's no way to predict it ahead of time." This paper gives a couple ideas of approaches we might take to preventing that problem: designing more continuous and linear metrics, creating larger test sets, and sampling more large models. 

Despite the provocative title, the paper doesn't say "emergence isn't a thing, nothing to worry about here." Instead, it gestures toward approaches we can take to make the unpredictable thing more predictable, and it indicates that much of the current unpredictability is resolved by using different metrics, which is exactly what we should be trying to do when we want to avoid dangerous capabilities.

Comment by Aaron_Scher on Some Intuitions Around Short AI Timelines Based on Recent Progress · 2023-04-20T02:56:05.171Z · LW · GW

I agree that Lex's audience is not representative. I also think this is the biggest sample size poll on the topic that I've seen by at least 1 OOM, which counts for a fair amount. Perhaps my wording was wrong. 

I think what is implied by the first half of the Anthropic quote is much more than 10% on AGI in the next decade. I included the second part to avoid strongly-selective quoting. It seems to me that saying >10% is mainly a PR-style thing to do to avoid seeming too weird or confident; after all, it is compatible with 15% or 90%. When I read the first part of the quote, I think something like '25% on AGI in the next 5 years, and 50% in 10 years,' but this is not what they said and I'm going to respect their desire to write vague words.

I find this part really confusing

Sorry. This framing was useful for me and I hoped it would help others, but maybe not. I probably disagree about how strong the evidence from the existence of "sparks of AGI" is. The thing I am aiming for here is something like "imagine the set of possible worlds that look a fair amount like earth, then condition on worlds that have a "sparks of AGI" paper, then how much longer do those worlds have until AGI" and I think that even not knowing that much else about these worlds, they don't have very many years.

Yep, graph is per year, I've updated my wording to be clearer. Thanks. 

Can you clarify what you mean by this?

When I think about when we will see AGI, I try to use a variety of models weighted by how good and useful they seem. I believe that, when doing this, at least 20% of the total weight should come from models/forecasts that are based substantially in extrapolating from recent ML progress. This recent literature review is a good example of how one might use such weightings.  

Thanks for all the links!

Comment by Aaron_Scher on [deleted post] 2023-04-18T04:03:04.893Z

My attempt at a summary: Let's fine-tune language models on stories of an AI Guardian which shuts down when it becomes super powerful. We'll then get our LLM to role-play as such a character so it is amenable to shut down. Corrigibility solved. Outer alignment pretty much solved. Inner alignment unclear.

My comment is blunt, apologies. 

I think this alignment plan is very unlikely to be useful. It feels similar to RLHF in that it centers around fine-tuning language models to better produce text humans like, but it is worse in that it is far less steerable (with RLHF you can have a really complex reward model, whereas with this you are aiming for just the Guardian role-play; there are probably some other ways it's strictly worse).

Various other problems I see in this plan: 

  • Relying on narrative stories to build the archetype for the Guardian is sure to lead to distributional shifts when your LLM becomes meaningfully situationally aware and realizes it is not in fact part of short fictional stories but a part of the real world.
  • Seems like this approach places some constraints on how we can use our AI system which are unfortunate and make it less useful. For example, it might be hard to get alignment research out of such a system if it is very worried about getting too powerful/influential and then having to shut down. 
  • Relatedly, this doesn't seem to actually make any progress on the stop button problem. If your AI is aware that it has a planned shut-off mechanism at some level X it probably just finds ways to avoid the specific trigger at X, or to self-modify. 
Comment by Aaron_Scher on Hooray for stepping out of the limelight · 2023-04-01T22:14:59.835Z · LW · GW

In this interview from July 1st 2022, Demis says the following (context is about AI consciousness and whether we should always treat AIs as tools, but it might shed some light on deployment decisions for LLMs; emphasis mine):

we've always had sort of these ethical considerations as fundamental at deepmind um and my current thinking on the language models is and and large models is they're not ready; we don't understand them well enough yet — um and you know in terms of analysis tools and and guard rails what they can and can't do and so on — to deploy them at scale because i think you know there are big still ethical questions like should an ai system always announce that it is an ai system to begin with probably yes um it what what do you do about answering those philosophical questions about the feelings uh people may have about ai systems perhaps incorrectly attributed so i think there's a whole bunch of research that needs to be done first... before you know you can responsibly deploy these systems at scale that would be at least be my current position over time i'm very confident we'll have those tools like interpretability questions um and uh analysis questions uh and then with the ethical quandary you know i think there it's important to uh look beyond just science that's why i think philosophy social sciences even theology other things like that come into it where um what you know arts and humanities what what does it mean to be human...

Comment by Aaron_Scher on Nobody’s on the ball on AGI alignment · 2023-04-01T20:15:37.929Z · LW · GW

A few people have already looked into this. See footnote 3 here

Comment by Aaron_Scher on Nobody’s on the ball on AGI alignment · 2023-04-01T20:13:19.150Z · LW · GW

It looks like you haven't yet replied to the comments on your post. The thing you are proposing is not obviously good, and in fact might be quite bad. I think you probably should not be doing this outreach just yet, with your current plan and current level of understanding. I dislike telling people what to do, but I don't want you to make things worse. Maybe start by engaging with the comments on your post.

Comment by Aaron_Scher on Deep Deceptiveness · 2023-03-24T20:52:12.899Z · LW · GW

Thanks for clarifying! 

I expect such a system would run into subsystem alignment problems. Getting such a system also seems about as hard as designing a corrigible system, insofar as "don't be deceptive" is analogous to "be neutral about humans pressing stop button."

Comment by Aaron_Scher on Deep Deceptiveness · 2023-03-24T05:07:36.236Z · LW · GW

Another attempted answer: 

By virtue of being generally intelligent, our AI is aiming to understand the world very well. There are certain parts of the world that we do not want our AI to be modeling; specifically, we don't want it to think the true fact that deceiving humans is often useful.

Plan 1: Have a detector for when the AI thinks deceptive thoughts, and shut down those thoughts.

Fails because your AI will end up learning the structure of the deceptive thoughts without actually thinking them because there is a large amount of optimization pressure being applied to solving the problem, and this is the best way to solve the problem. 

Plan 2 (your comment): Have a meta-level detector that is constantly asking if a given cognitive process is deceptive or likely to lead to deceptive thoughts; and this detector tries really hard to answer right. 

Fails because you don't have a nice meta-level where you can apply this detector. The same cognitive level that is responsible for finding deceptive strategies is also a core part of being generally intelligent; the deception fact is a fact about the world, the same world which your AI is aiming to understand well. The search process that finds deceptive strategies is the same search process which learns biology. So at this point, to the extent that you just want to block the deceptive strategy thoughts, you're back in the place you started, where now your model is doing a whole bunch of learning about the world but there's a huge hole in its understanding (and permitted search) around "what's going on in human minds" because you blocked off this search space.

Maybe this kinda works or something, but I doubt it. We've pretty much bumped into STEM AI.

Unfortunately, I don't think I engaged with the best version of the thing you're suggesting in your comment. 

Comment by Aaron_Scher on Deep Deceptiveness · 2023-03-24T04:48:57.744Z · LW · GW

I like this comment! I'm sorta treating it like a game-tree exercise, hope that's okay. 

It could still randomly stumble into a workaround. If a particular plan occurs to it that doesn't get classified as deception but achieves similar results, it'll go for it. But this is unlikely in practice because it won't be spending cognition trying to find such plans, because doing so would trigger the aversion.

I don't think I agree. I think that your system is very likely going to be applying some form of "rigorously search the solution space for things that work to solve this problem" for all the problems it encounters. If it's not doing that, then I'm not sure it counts as a powerful intelligence. If it is doing that, it is going to frequently bump into [what I'm understanding to be a key problem discussed in this post] the fact about the world that the best way to solve many of the problems is to deceive the human operator. 

Said another way, doing the generally intelligent thing of searching solution space for ways to solve problems is going to bump into deceptive thoughts. The meta-level classifier is trying to prevent cognitive plans that lead to deceptive thoughts, but the cognitive plans that work in the real world (e.g., that allow you to create new scientific fields) are plans which wouldn't pass this check. 

If your meta-level classifier does not allow such cognition, then your system will be quite incapable. The obvious next solution is to set your classification bar for "cognitive plan likely to lead to deceptive thoughts" a bit lower, zooming in to maybe "rigorously search the solution space for things that work to solve this problem; and involve reasoning about humans". This has performance problems, of course, preventing you from e.g., using your AI for most alignment research. This is also just the same problem we were dealing with originally, where now we're worried about our AI engaging in search-over-plan strategies which aren't directly about human minds but somehow still model that part of reality in an effective (i.e., deceptive) way. 

Comment by Aaron_Scher on My Objections to "We’re All Gonna Die with Eliezer Yudkowsky" · 2023-03-22T22:33:44.374Z · LW · GW

The following is not a very productive comment, but...

Yudkowsky tries to predict the inner goals of a GPT-like model.

I think this section detracts from your post, or at least the heading seems off. Yudkowsky hedges as making a "very primitive, very basic, very unreliable wild guess" and your response is about how you think the guess is wrong. I agree that the guess is likely to be wrong. I expect Yudkowsky agrees, given his hedging. 

Insofar as we are going to make any guesses about what goals our models have, "predict humans really well" or "predict next tokens really well" seem somewhat reasonable. Or at least these seem as reasonable as the goals many people [who are new to hearing about alignment] expect by default, like "make the human happy." If you have reasons to think that the prediction goals are particularly unlikely, I would love to hear them!

That said, I think there continues to be important work in clarifying that, as you put it, "I do not think that optimizing a network on a given objective function produces goals orientated towards maximizing that objective function." Or as others have written, Reward is not the Optimization Target, and Models Don't "Get Reward".

Comment by Aaron_Scher on Towards understanding-based safety evaluations · 2023-03-19T07:30:13.667Z · LW · GW

Thanks for the correction. I edited my original comment to reflect it.

Comment by Aaron_Scher on Towards understanding-based safety evaluations · 2023-03-19T01:43:18.475Z · LW · GW

My summary:

Evan expresses pessimism about our ability to use behavioral-based evaluations (like the capabilities evals ARC did for GPT-4) to test for alignment properties in the future. Detecting for alignment may be quite hard because you might be up against a highly capable adversary that is trying to evade detection; this might even be harder than training an aligned system to begin with. A model will struggle to fake its capabilities after specific fine tuning (requires advanced gradient hacking), but faking alignment properties seems much easier. 

Due to behavioral-evals struggling to detect deception, AI labs widely adopting a safety framework like behavioral-evaluations from external auditors (without access to checkpoints/transparency) might make it very hard to detect deception in the future. 

Evan suggests pairing capabilities-evaluations with understanding-based evals. These understanding-evals would focus on how well we can predict models’ behavior, but of course it’s hard to define what this means. [edited in response to Evan's comment:] These understanding-evals would focus on how well developers understand the models they're creating, but of course it’s hard to define what it means to have such an understanding. "Rather than evaluating a final model, an understanding-based evaluation would evaluate the developer's ability to understand what sort of model they got and why they got it." Desiderata for these evals include being method-agnostic and having a high enough level of understanding to detect dangerous failures. Potential strategies to build off: causal scrubbing, auditing games, prediction-based evaluation, fixing behavior via removing data from training rather than adding it; Evan talks about ways in which each of these is insufficient currently.

At first I was confused about why faking capabilities will be harder than faking alignment. After thinking, here’s my explanation:

When we fine-tune on specific capabilities, we will have gradients pushing toward the model improving on these capabilities (unless your model is gradient hacking). These gradient updates are likely to lead to eliciting all of the possible capabilities of your model (with enough fine-tuning). There is a capabilities ground truth being trained against. A deceptively aligned model may try to hide its maximum capabilities, but there are strong gradients pushing against this unless it is gradient hacking.

On the other hand, faking alignment is much easier. When you’re a deceptively aligned model acting aligned during training, there’s no training signal smacking you in a different direction, you’re doing great on those alignment benchmarks. Being deceptively aligned implies that your model is instrumentally pursuing the training objective, so gradients don't affect its mesa-objective. Beyond this “you’re doing just as well as an aligned model” problem, we also don’t have a reliable ground truth alignment measure to even be doing this fine-tuning on, like we do for capabilities.

Thanks, Gabe, for this simple framing: Acting aligned gets low loss, whereas acting dumb (incapable) gets high loss, in the fine-tuning.
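
A toy numeric version of that framing, with made-up probabilities (nothing here comes from the post): under a cross-entropy loss, a model that sandbags on a capability task pays a visibly higher loss than a capable one, while on an alignment benchmark a deceptively aligned model that produces the same outputs as an aligned model gets exactly the same loss.

```python
import math


def cross_entropy(prob_on_correct_answer: float) -> float:
    """Per-example cross-entropy loss given the probability put on the right answer."""
    return -math.log(prob_on_correct_answer)


# Capability fine-tuning: hypothetical probabilities assigned to the correct answer.
loss_capable = cross_entropy(0.9)      # model shows its real ability
loss_sandbagging = cross_entropy(0.1)  # model "acts dumb"

# Alignment benchmark: both models give the same outputs during training,
# so the loss cannot tell them apart (again, illustrative numbers).
loss_aligned = cross_entropy(0.9)
loss_deceptive = cross_entropy(0.9)

print(f"capable: {loss_capable:.3f}  sandbagging: {loss_sandbagging:.3f}")
print(f"aligned: {loss_aligned:.3f}  deceptively aligned: {loss_deceptive:.3f}")
```

The gap in the first line is the gradient signal that pushes against hiding capabilities; the absence of any gap in the second line is why fine-tuning gives no comparable pressure against faking alignment.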

Comment by Aaron_Scher on The hot mess theory of AI misalignment: More intelligent agents behave less coherently · 2023-03-17T08:15:20.265Z · LW · GW
  • I found it hard to engage with this, partially because of motivated reasoning and thinking that my prior beliefs, which expect very intelligent and coherent AGIs, are correct. Overcoming this bias is hard and I sometimes benefit from clean presentation of solid experimental results, which this study lacked, making it extra hard for me to engage with. Below are some of my messy thoughts, with the huge caveat that I have yet to engage with these ideas from as neutral a prior as I would like.
  • This is an interesting study to conduct. I don’t think its results, regardless of what they are, should update anybody much because:
    • the study is asking a small set (n = 5-6) of ~random people to rank various entities based on vague definitions of intelligence and coherence, we shouldn’t expect a process like this to provide strong evidence for any conclusion
    • Asking people to rate items on coherence and intelligence feels pretty different from carefully thinking about the properties of each item. I would rather see the author pick a few items from around the spectrums and analyze each in depth (though if that’s all they did I would be complaining about a lack of more items lol)
    • A priori I don’t necessarily expect a strong intelligence-coherence correlation for lower levels of intelligence, and I don’t think findings of this type are very useful for thinking about super-intelligence. Convergent instrumental subgoals are not a thing for sufficiently unintelligent agents, and I expect coherence as such a subgoal to kick in at fairly high intelligence levels (definitely above average human, at least based on observing many humans be totally incoherent, which isn’t exactly a priori reasoning). I dislike that this is the state of my belief, because it’s pretty much like “no experiment you can run right now would get at the crux of my beliefs,” but I do think it’s the case here that we can only learn so much from observing non-superintelligent systems.
    • The terms, especially “coherence”, seem pretty poorly defined in a way that really hurts the usefulness of the study, as some commenters have pointed out
  • My takes about the results
    • Yep the main effect seems to be a negative relationship between intelligence and coherence
    • The poor inter-rater reliability for coherence seems like a big deal
    • Figure 6 (most important image) seems really wacky. It seems to imply that all the AI models are ranked on average lower in intelligence than everything else, including oak tree and ant. This just seems slightly wild because I think most people who interact with AI would disagree with such rankings. Out of 5 respondents, only 2 ranked GPT-3 as more intelligent than an oak tree.
    • Isolating just the humans (a plot for this is not shown for some reason) seems like it doesn’t support the author’s hypothesis very much. I think this is in line with some prediction like “for low levels of intelligence there is not a strong relationship to coherence, and then as you get in the high human level this changes”
  • Other thoughts
    • The spreadsheet sheets are labeled “alignment subject response” and “intelligence subject response”. Alignment is definitely not the same as coherence. I mostly trust that the author isn’t pulling a fast one on us by fudging the original purpose of the study or the directions they gave to participants, but my prior trust for people on the internet is somewhat low and the experimental design here does seem pretty janky.
    • Figures 3-5, as far as I can tell, are only using the rank compared to other items in the graph, as opposed to all 60. This threw me off for a bit and I think might not be a great analysis, given that participants ranked all 60 together rather than in category batches.
    • Something something inner optimizers and inner agent conflicts might imply a lack of coherence in superintelligent systems, but systems like this still seem quite dangerous.
Comment by Aaron_Scher on Anthropic's Core Views on AI Safety · 2023-03-12T01:28:30.170Z · LW · GW

There is a Policy team listed here. So it presumably exists. I don't think omitting its work from the post has to be for good reasons, it could just be because the post is already quite long. An example of something Anthropic could say which would give me useful information on the policy front; I am making this up, but seems good if true:

In pessimistic and intermediate difficulty scenarios, it may be quite important for AI developers to avoid racing. In addition to avoiding contributing to such racing dynamics ourselves, we are also working to build safety-collaborations among researchers at leading AI safety organizations. If an AI lab finds compelling evidence about dangerous systems, it is paramount that such evidence is disseminated to relevant actors in industry and government. We are building relationships and secure information sharing systems between major AI developers and working with regulators to remain in compliance with relevant laws (e.g., anti-trust). 

Again, I have no idea what the policy team is doing, but they could plausibly be doing something like this and could say so, while there may be some things they don't want to talk about.

Comment by Aaron_Scher on Thoughts on the OpenAI alignment plan: will AI research assistants be net-positive for AI existential risk? · 2023-03-12T00:40:37.823Z · LW · GW

Good post!

His answer to “But won’t an AI research assistant speed up AI progress more than alignment progress?” seems to be “yes it might, but that’s going to happen anyway so it’s fine”, without addressing what makes this fine at all. Sure, if we already have AI research assistants that are greatly pushing forward AI progress, we might as well try to use them for alignment. I don’t disagree there, but this is a strange response to the concern that the very tool OpenAI plans to use for alignment may hurt us more than help us.

I think there's a pessimistic reading of the situation which explains some of this evidence well. I don't literally believe it but I think it may be useful. It goes as follows:

The internal culture at OpenAI regarding alignment is bleak: most people are not very concerned with alignment and are pretty excited about making their powerful AI systems. This train is already moving with a lot of momentum. The alignment team doesn't have very much sway about the speed of the train or the overall direction. However, there is hope in the following place: you can try to develop systems that allow OpenAI to easily produce more alignment research down the line by figuring out how to automate alignment research. Then, if company culture shifts (e.g., because of warning shots), the option now exists to automate lots of alignment work quickly. 

To be clear, I think this is a bad plan, but it might be relatively good given the situation, if the situation is so bleak. And if you think alignment isn't that hard, then it's plausibly a fine plan. If you're a random alignment researcher bringing your skills to bear on the alignment problem, this is far from the first plan I would go with, but if you are one of the few people with a seat on the OpenAI train, this might be among your best shots at success (though I think people in such a position should be quite pessimistic and openly so). 

I think this view does a good job explaining the following quote from Jan:

Once we reach a significant degree of automation, we can much more easily reallocate GPUs between alignment and capability research. In particular, whenever our alignment techniques are inadequate, we can spend more compute on improving them. Additional resources are much easier to request than requesting that other people stop doing something they are excited about but that our alignment techniques are inadequate for.

Comment by Aaron_Scher on Anthropic's Core Views on AI Safety · 2023-03-09T20:12:39.125Z · LW · GW

My summary to augment the main one:

Broadly human-level AI may be here soon and will have a large impact. Anthropic has a portfolio approach to AI safety, considering three kinds of scenarios: optimistic scenarios where current techniques are enough for alignment, intermediate scenarios where substantial work is needed, and pessimistic scenarios where alignment is impossible; they do not give a breakdown of probability mass in each bucket and hope that future evidence will help figure out what world we're in (though see the last quote below). These buckets are helpful for understanding the goal of developing better techniques for making AI systems safer and better ways of identifying how safe or unsafe AI systems are. Scaling systems is required for some good safety research: some problems only arise near human level; Debate and Constitutional AI need big models; we need to understand scaling to understand future risks; and if models are dangerous, compelling evidence will be needed.

They do three kinds of research: Capabilities which they don’t publish, Alignment Capabilities which seems mostly about improving chat bots and applying oversight techniques at scale, and Alignment Science which involves interpretability and red-teaming of the approaches developed in Alignment Capabilities. They broadly take an empirical approach to safety, and current research directions include: scaling supervision, mechanistic interpretability, process-oriented learning, testing for dangerous failure modes, evaluating societal impacts, and understanding and evaluating how AI systems learn and generalize.

Select quotes:

  • “Over the next 5 years we might expect around a 1000x increase in the computation used to train the largest models, based on trends in compute cost and spending. If the scaling laws hold, this would result in a capability jump that is significantly larger than the jump from GPT-2 to GPT-3 (or GPT-3 to Claude). At Anthropic, we’re deeply familiar with the capabilities of these systems and a jump that is this much larger feels to many of us like it could result in human-level performance across most tasks.”
  • The facts “jointly support a greater than 10% likelihood that we will develop broadly human-level AI systems within the next decade”
  • “In the near future, we also plan to make externally legible commitments to only develop models beyond a certain capability threshold if safety standards can be met, and to allow an independent, external organization to evaluate both our model’s capabilities and safety.”
  • “It's worth noting that the most pessimistic scenarios might look like optimistic scenarios up until very powerful AI systems are created. Taking pessimistic scenarios seriously requires humility and caution in evaluating evidence that systems are safe.”
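
For a sense of scale on the first quote, a quick back-of-the-envelope calculation. It uses only the quoted figures (1000x over 5 years) and assumes, as a simplification of my own, that growth is a smooth exponential over that period.

```python
import math

TOTAL_GROWTH = 1000.0  # from the quote: ~1000x more training compute
YEARS = 5              # from the quote: over the next 5 years

# Assumption: constant exponential growth across the whole period.
annual_factor = TOTAL_GROWTH ** (1.0 / YEARS)
doubling_time_months = 12 * YEARS * math.log(2) / math.log(TOTAL_GROWTH)

print(f"implied growth per year: ~{annual_factor:.2f}x")
print(f"implied doubling time:   ~{doubling_time_months:.1f} months")
```

Under that assumption the quote corresponds to roughly a 4x increase per year, or a doubling about every six months.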

I'll note that I'm confused about the Optimistic, Intermediate, and Pessimistic scenarios: how likely does Anthropic think each is? What is the main evidence currently contributing to that world view? How are you actually preparing for near-pessimistic scenarios which "could instead involve channeling our collective efforts towards AI safety research and halting AI progress in the meantime?"