Posts

Initial Experiments Using SAEs to Help Detect AI Generated Text 2024-07-22T05:16:20.516Z
Idea: Safe Fallback Regulations for Widely Deployed AI Systems 2024-03-25T21:27:05.130Z
Some of my predictable updates on AI 2023-10-23T17:24:34.720Z
Five neglected work areas that could reduce AI risk 2023-09-24T02:03:29.829Z
Aaron_Scher's Shortform 2023-09-04T02:22:15.197Z
Some Intuitions Around Short AI Timelines Based on Recent Progress 2023-04-11T04:23:22.361Z

Comments

Comment by Aaron_Scher on Efficient Dictionary Learning with Switch Sparse Autoencoders · 2024-07-23T07:10:40.431Z · LW · GW

Nice work, these seem like interesting and useful results! 

High level question/comment which might be totally off: one benefit of having a single, large, SAE neuron space that each token gets projected into is that features don't get in each other's way, except insofar as you're imposing sparsity. Like, your "I'm inside a parenthetical" and your "I'm attempting a coup" features will both activate in the SAE hidden layer, as long as they're in the top k features (for some sparsity). But introducing switch SAEs breaks that: if these two features are in different experts, only one of them will activate in the SAE hidden layer (based on whatever your gating learned). 
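To illustrate the worry concretely, here's a toy sketch (my own illustration, not the paper's code; sizes and names are made up) of the difference: in a single top-k SAE, both features can fire on the same token; in a switch SAE, the router picks one expert, so features living in other experts cannot fire at all.

```python
import torch

d_model, d_sae, n_experts, k = 512, 4096, 4, 32

class TopKSAE(torch.nn.Module):
    def __init__(self, d_dict):
        super().__init__()
        self.enc = torch.nn.Linear(d_model, d_dict)

    def forward(self, x):
        acts = torch.relu(self.enc(x))
        topk = torch.topk(acts, k, dim=-1)
        # "I'm inside a parenthetical" and "I'm attempting a coup" can both survive top-k
        return torch.zeros_like(acts).scatter_(-1, topk.indices, topk.values)

class SwitchSAE(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.router = torch.nn.Linear(d_model, n_experts)
        self.experts = torch.nn.ModuleList(
            TopKSAE(d_sae // n_experts) for _ in range(n_experts))

    def forward(self, x):  # x: a single token's activations, shape [d_model]
        expert_idx = int(self.router(x).argmax())  # hard routing to exactly one expert
        # features living in the other n_experts - 1 experts cannot activate here
        return self.experts[expert_idx](x)
```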

The obvious reply is "but look at the empirical results you fool! The switch SAEs are pretty good!" And that's fair. I weakly expect what is happening in your experiment is that similar but slightly specialized features are being learned by each expert (a testable hypothesis), and maybe you get enough of this redundancy that it's fine, e.g., the expert with "I'm inside a parenthetical" also has a "Words relevant to coups" feature and this is enough signal for coup detection in that expert. 

Again, maybe this worry is totally off or I'm misunderstanding something. 

Comment by Aaron_Scher on Aaron_Scher's Shortform · 2024-07-17T16:58:48.685Z · LW · GW

Thanks for the addition, that all sounds about right to me!

Comment by Aaron_Scher on Aaron_Scher's Shortform · 2024-07-15T20:11:38.047Z · LW · GW

Leaving Dangling Questions in your Critique is Bad Faith

Note: I’m trying to explain an argumentative move that I find annoying and sometimes make myself; this explanation isn’t very good, unfortunately. 

Example

Them: This effective altruism thing seems really fraught. How can you even compare two interventions that are so different from one another? 

Explanation of Example

I think the way the speaker poses the above question is not as a stepping stone for actually answering the question, it’s simply as a way to cast doubt on effective altruists. My response is basically, “wait, you’re just going to ask that question and then move on?! The answer really fucking matters! Lives are at stake! You are clearly so deeply unserious about the project of doing lots of good, such that you can pose these massively important questions and then spend less than 30 seconds trying to figure out the answer.” I think I might take these critics more seriously if they took themselves more seriously. 

Description of Dangling Questions

A common move I see people make when arguing or criticizing something is to pose a question that they think the original thing has answered incorrectly or is not trying sufficiently hard to answer. But then they kinda just stop there. The implicit argument is something like “The original thing didn’t answer this question sufficiently, and answering this question sufficiently is necessary for the original thing to be right.”

But importantly, the criticisms usually don't actually argue that — they don't argue for some alternative answer to the original question (and when they do, it usually isn't compelling), nor do they really argue that this question is so fundamental in the first place. 

One issue with Dangling Questions is that they focus the subsequent conversation on a subtopic that may not be a crux for either party, and this probably makes the subsequent conversation less useful. 

Example

Me: I think LLMs might scale to AGI. 

Friend: I don’t think LLMs are actually doing planning, and that seems like a major bottleneck to them scaling to AGI. 

Me: What do you mean by planning? How would you know if LLMs were doing it? 

Friend: Uh…idk

Explanation of Example

I think I’m basically shifting the argumentative burden onto my friend when it falls on both of us. I don’t have a good definition of planning or a way to falsify whether LLMs can do it — and that’s a hole in my beliefs just as it is a hole in theirs. And sure, I’m somewhat interested in what they say in response, but I don’t expect them to actually give a satisfying answer here. I’m posing a question I have no intention of answering myself and implying it’s important for the overall claim of LLMs scaling to AGI (my friend said it was important for their beliefs, but I’m not sure it’s actually important for mine). That seems like a pretty epistemically lame thing to do. 

Traits of “Dangling Questions”

  1. They are used in a way that implies the thing being criticized is wrong, but this argument is not made convincingly.
  2. The author makes minimal effort to answer the question with an alternative. Usually they simply pose it. The author does not seem to care very much about having the correct answer to the question.
  3. The author usually implies that this question is particularly important for the overall thing being criticized, but does not usually make this case.
  4. These questions share a lot in common with the paradigm criticisms discussed in Criticism Of Criticism Of Criticism, but I think they are distinct in that they can be quite narrow.
  5. One of the main things these questions seem to do is raise the reader’s uncertainty about the core thing being criticized, similar to the Just Asking Questions phenomenon. To me, Dangling Questions seem like a more intellectual version of Just Asking Questions — much more easily disguised as a good argument.

Here's another example, though it's imperfect.

Example

From an AI Snake Oil blog post:

Research on scaling laws shows that as we increase model size, training compute, and dataset size, language models get “better”. … But this is a complete misinterpretation of scaling laws. What exactly is a “better” model? Scaling laws only quantify the decrease in perplexity, that is, improvement in how well models can predict the next word in a sequence. Of course, perplexity is more or less irrelevant to end users — what matters is “emergent abilities”, that is, models’ tendency to acquire new capabilities as size increases.

Explanation of Example

The argument being implied is something like "scaling laws are only about perplexity, but perplexity is different from the metric we actually care about — how much? who knows? — so you should ignore everything related to perplexity; also consider going on a philosophical side-quest to figure out what 'better' really means. We think 'better' is about emergent abilities, and because they're emergent we can't predict them, so who knows if they will continue to appear as we scale up". In this case, the authors have ventured an answer to their Dangling Question, "what is a 'better' model?": they've said it's one with more emergent capabilities than a previous model. This answer seems flat out wrong to me; acceptable answers include: downstream performance, self-reported usefulness to users, how much labor-time it could save when integrated in various people's work, ability to automate 2022 job tasks, being more accurate on factual questions, and much more. I basically expect nobody to answer the question "what does it mean for one AI system to be better than another?" with "the second has more capabilities that were difficult to predict based on the performance of smaller models and seem to increase suddenly on a linear-performance, log-compute plot".

Even given the answer "emergent abilities", the authors fail to actually argue that we don't have a scaling precedent for these. Again, I think the focus on emergent abilities is misdirected, so I'll instead discuss the relationship between perplexity and downstream benchmark performance. I think this is fair game both because downstream performance is a legitimate answer to the "what counts as 'better'?" question and because of the original line "Scaling laws only quantify the decrease in perplexity, that is, improvement in how well models can predict the next word in a sequence". That line is technically true but, in this context, highly misleading, because we can in turn draw clear relationships between perplexity and downstream benchmark performance; here are three recent papers which do so, and here are even more studies that relate compute directly to downstream performance on non-perplexity metrics. Note that some of these are cited in the blog post. I will also note that this seems like one example of a failure I've seen a few times: people conflating "scaling laws" with what I would call "scaling trends". Scaling laws refer to specific equations that predict metrics like perplexity from model inputs such as number of parameters and amount of data, whereas scaling trends are the more general phenomenon we observe, that scaling up just seems to work, and in somewhat predictable ways. The scaling laws are useful for making predictions, but whether or not we have those specific equations has no effect on the underlying trend; the equations just add precision. Yes, scaling laws relating parameters and data to perplexity or training loss do not directly give you information about downstream performance, but we seem to be making decent progress on the (imo still not totally solved) problem of relating perplexity to downstream performance, and together these mean we have somewhat predictable scaling trends for metrics that do matter.
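To make the perplexity-to-downstream point concrete, here's a minimal sketch (made-up numbers, not data from any of the papers above) of the flavor of analysis those studies do: fit a simple saturating curve of benchmark accuracy against language-modeling loss, then extrapolate to lower loss.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (loss, benchmark accuracy) pairs for a model family at increasing scale
loss = np.array([3.2, 2.9, 2.6, 2.3, 2.0, 1.8])
acc = np.array([0.12, 0.18, 0.31, 0.48, 0.63, 0.72])

def sigmoid(L, a, b):
    """Benchmark accuracy as a saturating function of language-modeling loss."""
    return 1.0 / (1.0 + np.exp(a * (L - b)))

params, _ = curve_fit(sigmoid, loss, acc, p0=[5.0, 2.3])
print(params, sigmoid(1.6, *params))  # extrapolated accuracy at a lower (better) loss
```

If loss itself is predictable from compute via scaling laws, chaining the two fits is one rough way to get a compute-to-benchmark scaling trend.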

Example

Here's another example from that blog post where the authors don't literally pose a question, but they are still doing the Dangling Question thing in many ways (the context is that they're referring to these posts):

Also, like many AI boosters, he conflates benchmark performance with real-world usefulness.

Explanation of Example

(Perhaps it would be better to respond to the linked AI Snake Oil piece, but that's a year old and lacks lots of important evidence we have now.) I view the move being made here as posing the question "but is benchmark performance actually related to real-world usefulness?", assuming the answer is no — or arguing so poorly in the linked piece — and going on about your day. It's obviously the case that benchmarks are not the exact same thing as real-world usefulness, but the question of how closely they're related isn't some magic black box of un-solvability! If the authors of this critique want to complain about the conflation between benchmark performance and real-world usefulness, they should actually bring the receipts showing that these are not related constructs and that relying on benchmarks would lead us astray. I think when you actually try that, you get an answer like: benchmark scores seem to track things worse than users' reported experience and users' reported usefulness in real-world applications do, but there is certainly a positive correlation here; we can explain some of the gap via techniques like few-shot prompting that are often used for benchmarks, a small amount via dataset contamination, and probably much of it via a validity gap where benchmarks are easy to assess but unrealistic — but thankfully we have user-based evaluations like LMSYS that show a solid correlation between benchmark scores and user experience, and so on. (If I actually wanted to make the argument the authors were making, I would spend >5 paragraphs on it and elaborate on all of the evidence mentioned above, including saying more about real-world impacts; this is actually a difficult question, and the above answer is illustrative rather than complete.)

Caveats and Potential Solutions

There is room for questions in critiques. Perfect need not be the enemy of good when making a critique. Dangling Questions are not always made in bad faith. 

Many of the people who pose Dangling Questions like this are not trying to act in bad faith. Sometimes they are just unserious about the overall question, and they don't care much about getting to the right answer. Sometimes Dangling Questions are a response to being confused and not having tons of time to think through all the arguments — a psychological response something like "a lot feels wrong about this, here are some questions that hint at what feels wrong to me, but I can't clearly articulate it all because that's hard and I'm not going to put in the effort".

My guess at a mental move which could help here: when you find yourself posing a question in the context of an argument, ask whether you care about the answer, ask whether you should spend a few minutes trying to determine the answer, ask whether the answer to this question would shift your beliefs about the overall argument, ask whether the question puts undue burden on your interlocutor. 

If you’re thinking quickly and aren’t hoping to construct a super solid argument, it’s fine to have Dangling Questions, but if your goal is to convince others of your position, you should try to answer your key questions, and you should justify why they matter to the overall argument. 

Another example of me posing a Dangling Question in this:

What happens to OpenAI if GPT-5 or the ~5b training run isn't much better than GPT-4? Who would be willing to invest the money to continue? It seems like OpenAI either dissolves or gets acquired. 

Explanation of Example

(I’m not sure equating GPT-5 with a ~5b training run is right). In the above quote, I’m arguing against The Scaling Picture by asking whether anybody will keep investing money if we see only marginal gains after the next (public) compute jump. I think I spent very little time trying to answer this question, and that was lame (though acceptable given this was a Quick Take and not trying to be a strong argument). I think for an argument around this to actually go through, I should argue: without much larger dollar investments, The Scaling Picture won’t hold; those dollar investments are unlikely conditional on GPT-5 not being much better than GPT-4. I won’t try to argue these in depth, but I do think some compelling evidence is that OpenAI is rumored to be at ~$3.5 billion annualized revenue, and this plausibly justifies considerable investment even if the GPT-5 gain over this isn’t tremendous. 

Comment by Aaron_Scher on Aaron_Scher's Shortform · 2024-07-15T17:00:59.720Z · LW · GW

I agree that repeated training will change the picture somewhat. One thing I find quite nice about the linked Epoch paper is that the range of tokens is an order of magnitude, and even though many people have ideas for getting more data (common things I hear include "use private platform data like messaging apps"), most of these don't change the picture because they don't move things more than an order of magnitude, and the scaling trends want more orders of magnitude, not merely 2x. 

Repeated data is the type of thing that plausibly adds an order of magnitude or maybe more. 

Comment by Aaron_Scher on Aaron_Scher's Shortform · 2024-07-15T07:04:50.608Z · LW · GW

I sometimes want to point at a concept that I've started calling The Scaling Picture. While it's been discussed at length (e.g., here, here, here), I wanted to give a shot at writing a short version:

  • The picture:
    • We see improving AI capabilities as we scale up compute; projecting the last few years of progress in LLMs forward might give us AGI (transformative economic/political/etc. impact similar to the industrial revolution; AI that is roughly human-level or better on almost all intellectual tasks) later this decade (note: the picture is about the general trajectory more than about specific capabilities).
    • Relevant/important downstream capabilities improve as we scale up pre-training compute (size of model and amount of data), although for some metrics there are very sublinear returns — this is the current trend. Therefore, you can expect somewhat predictable capability gains in the next few years as we scale up spending (increase compute), and develop better algorithms / efficiencies.
    • AI capabilities in the deep learning era are the result of three inputs: data, compute, algorithms. Keeping algorithms the same, and scaling up the others, we get better performance — that's what scaling means. We can lump progress in data and algorithms together under the banner "algorithmic progress" (i.e., how much intelligence can you get per compute) and then to some extent we can differentiate the source of progress: algorithmic progress is primarily driven by human researchers, while compute progress is primarily driven by spending more money to buy/rent GPUs. (this may change in the future). In the last few years of AI history, we have seen massive gains in both of these areas: it's estimated that the efficiency of algorithms has improved about 3x/year, and the amount of compute used has increased 4.1x/year. These are ludicrous speeds relative to most things in the world. 
    • Edit to add: The below arguments are just supposed to be pointers toward longer argument one could make, but the one sentence version usually isn't compelling on its own.
  • Arguments for:
    • Scaling laws (a mathematically predictable relationship between pretraining compute and perplexity; a canonical form is sketched just after this list) have held for ~12 orders of magnitude already
    • We are moving through ‘orders of magnitude of compute’ quickly, so lots of probability mass should land soon (this argument is more involved, following from having uncertainty over the orders of magnitude of compute that might be necessary for AGI, like the approach taken here; see here for discussion)
    • Once you get AIs that can speed up AI progress meaningfully, progress on algorithms could go much faster, e.g., by AIs automating the role of researchers at OpenAI. You also get compounding economic returns that allow compute to grow — AIs that can be used to make a bunch of money, and that money can be put into compute. It seems plausible that you can get to that level of AI capabilities in the next few orders of magnitude, e.g., GPT-5 or GPT-6. Automated researchers are crazy.
    • Moore’s law has held for a long time. Edit to add: I think a reasonable breakdown for the "compute" category mentioned above is "money spent" and "FLOP purchasable per dollar". While Moore's Law is technically about the density of transistors, the thing we likely care more about is FLOP/$, which follows similar trends. 
    • Many people at AGI companies think this picture is right, see e.g., this, this, this (can’t find an aggregation)
  • Arguments against:
    • Might run out of data. There are estimated to be 100T-1000T internet tokens; we will likely hit this level in a couple of years.
    • Might run out of money — we’ve seen ~$100m training runs, we’re likely at $100m-1b this year, tech R&D budgets are ~$30b, and governments could fund $1T. One way to avoid this 'running out of money' problem is if you get AIs that speed up algorithmic progress sufficiently.
    • Scaling up is a non-trivial engineering problem and it might cause slowdowns due to, e.g., GPU failures and the difficulty of parallelizing across thousands of GPUs
    • Revenue might just not be that big and investors might decide it's not worth the high costs
      • OTOH, automating jobs is a big deal if you can get it working
    • Marginal improvements (maybe) for huge increased costs; bad ROI. 
      • There are numerous other economics arguments against, mainly arguing that huge investments in AI will not be sustainable, see e.g., here
    • Maybe LLMs are missing some crucial thing
      • Not doing true generalisation to novel tasks in the ARC-AGI benchmark
      • Not able to learn on the fly — maybe long context windows or other improvements can help
      • Lack of embodiment might be an issue
    • This is much faster than many AI researchers are predicting
    • This runs counter to many methods of forecasting AI development
    • Will be energy intensive — might see political / social pressures to slow down. 
    • We might see slowdowns due to safety concerns.
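For reference, the "scaling laws" in the first bullet under "Arguments for" above usually refer to a fit of roughly the following shape (Chinchilla-style; the constants are fit empirically and differ by setup — this is just the canonical form, not a claim about exact values):

$$ L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}, \qquad C \approx 6ND $$

where N is parameter count, D is training tokens, C is training compute in FLOP, and E is the irreducible loss. "Moving through orders of magnitude of compute" means moving along C, with the fitted law predicting the corresponding perplexity/loss.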
Comment by Aaron_Scher on An issue with training schemers with supervised fine-tuning · 2024-07-14T02:57:01.959Z · LW · GW

Neat idea. I notice that this looks similar to dealing with many-shot jailbreaking:

For jailbreaking you are trying to learn the policy "Always imitate/generate-from a harmless assistant"; here you are trying to learn "Always imitate a safe human". In both, your model has some prior for outputting harmful next tokens, and the context provides an update toward a higher probability of outputting harmful text (because of seeing previous examples of the assistant doing so, or because the previous generations came from an AI). And in both cases we would like some training technique that causes the model's posterior on harmful next tokens to be low. 

I'm not sure there's too much else of note about this similarity, but it seemed worth noting because maybe progress on one can help with the other. 

Comment by Aaron_Scher on Distillation of 'Do language models plan for future tokens' · 2024-06-27T23:36:34.145Z · LW · GW

Cool! I'm not very familiar with the paper so I don't have direct feedback on the content — seems good. But I think I would have preferred a section at the end with your commentary / critiques of the paper; that's also potentially a good place to try to connect the paper to ideas in AI safety. 

Comment by Aaron_Scher on Claude 3.5 Sonnet · 2024-06-23T23:31:40.226Z · LW · GW

It looks like the example you gave is pretty explicitly using “compute” rather than “effective compute”. The point of having the “effective” part is to take into account non-compute progress, such as using more optimal N/D ratios. I think in your example, the first two models would be at the same effective compute level, based on us predicting the same performance.

That said, I haven’t seen any detailed descriptions of how Anthropic is actually measuring/calculating effective compute (iirc they link to a couple papers and the main theme is that you can use training CE loss as a predictor).
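To make my guess concrete, here's a minimal sketch of one way "effective compute" could be operationalized (this is my assumption based on the loss-as-predictor idea, not Anthropic's actual method; all constants are made up): fit a reference compute-to-loss curve, then credit each model with the reference compute needed to match its training loss.

```python
# Hypothetical reference fit: loss = a * C**(-b) + irreducible, with C in FLOP
a, b, irreducible = 6.3, 0.05, 1.7

def effective_compute(loss: float) -> float:
    """Invert the reference fit: the compute the reference recipe would need to hit this loss."""
    return ((loss - irreducible) / a) ** (-1.0 / b)

old_loss, new_loss = 2.10, 2.07  # made-up training CE losses for two models
ratio = effective_compute(new_loss) / effective_compute(old_loss)
print(f"effective-compute ratio: {ratio:.1f}x")  # e.g., compare against a 4x threshold
```

On a view like this, non-compute progress (better N/D ratios, data quality, architecture) shows up as extra effective compute because it lowers the loss achieved at fixed actual compute.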

Comment by Aaron_Scher on Claude 3.5 Sonnet · 2024-06-23T18:38:01.988Z · LW · GW

Claude 3.5 Sonnet solves 64% of problems on an internal agentic coding evaluation, compared to 38% for Claude 3 Opus. Our evaluation tests a model’s ability to understand an open source codebase and implement a pull request, such as a bug fix or new feature, given a natural language description of the desired improvement.

...

While Claude 3.5 Sonnet represents an improvement in capabilities over our previously released Opus model, it does not trigger the 4x effective compute threshold at which we will run the full evaluation protocol described in our Responsible Scaling Policy (RSP).

Hmmm, maybe the 4x effective compute threshold is too large given that you're getting near doubling of agentic task performance (on what I think is an eval with particularly good validity) but not hitting the threshold. 

Or maybe at the very least you should make some falsifiable predictions that might cause you to change this threshold. e.g., "If we train a model that has downstream performance (on any of some DC evals) ≥10% higher than was predicted by our primary prediction metric, we will revisit our prediction model and evaluation threshold." 

It is unknown to me whether Sonnet 3.5's performance on this agentic coding evaluation was predicted in advance at Anthropic. It seems wild to me that you can double your performance on a high validity ARA-relevant evaluation without triggering the "must evaluate" threshold; I think evaluation should probably be required in that case, and therefore, if I had written the 4x threshold, I would be reducing it. But maybe those who wrote the threshold were totally game for these sorts of capability jumps? 

Comment by Aaron_Scher on Claude 3.5 Sonnet · 2024-06-23T18:08:32.035Z · LW · GW

Can you say more about why you would want this to exist? Is it just that "do auto-interpretability well" is a close proxy for "model could be used to help with safety research"? Or are you also thinking about deception / sandbagging, or other considerations? 

Comment by Aaron_Scher on Getting 50% (SoTA) on ARC-AGI with GPT-4o · 2024-06-17T19:36:08.283Z · LW · GW

Nice! Do you have a sense of the total development (and run-time) cost of your solution? "Actually getting to 50% with this main idea took me about 6 days of work." I'm interested in the person-hours and API calls cost of this. 

Comment by Aaron_Scher on Aaron_Scher's Shortform · 2024-06-10T17:08:57.499Z · LW · GW

Hm, can you explain what you mean? My initial reaction is that AI oversight doesn't actually look a ton like this position of the interior where defenders must defend every conceivable attack whereas attackers need only find one successful strategy. A large chunk of why I think these are disanalogous is that getting caught is actually pretty bad for AIs — see here.

Comment by Aaron_Scher on What if a tech company forced you to move to NYC? · 2024-06-09T18:03:50.084Z · LW · GW

Not sure I love this analogy — moving to NYC doesn't seem like that big of a deal — but I do think it's pretty messed up to be imposing huge social / technological / societal changes on 8 billion of your peers. I expect most of the people building AGI have not really grasped the ethical magnitude of doing this — I think I sort of have, but also I don't build AGI. 

Comment by Aaron_Scher on Aaron_Scher's Shortform · 2024-06-08T23:32:01.906Z · LW · GW

Note on something from the superalignment section of Leopold Aschenbrenner's recent blog posts:

Evaluation is easier than generation. We get some of the way “for free,” because it’s easier for us to evaluate outputs (especially for egregious misbehaviors) than it is to generate them ourselves. For example, it takes me months or years of hard work to write a paper, but only a couple hours to tell if a paper someone has written is any good (though perhaps longer to catch fraud). We’ll have teams of expert humans spend a lot of time evaluating every RLHF example, and they’ll be able to “thumbs down” a lot of misbehavior even if the AI system is somewhat smarter than them. That said, this will only take us so far (GPT-2 or even GPT-3 couldn’t detect nefarious GPT-4 reliably, even though evaluation is easier than generation!)

Disagree about papers. I don't think it takes merely a couple hours to tell if a paper is any good. In some cases it does, but in other cases entire fields have been led astray for years due to bad science (e.g., the replication crisis in psych, where numerous papers spurred tons of follow-up work on fake things; a year and dozens of papers later we still don't know if DPO is better than PPO for frontier AI development (though perhaps this is known in labs, and my guess is some people would argue this question is answered); IIRC it took like 4-8 months for the alignment community to decide CCS was bad (this is a contentious and oversimplifying take), despite many people reading the original paper). Properly vetting a paper in the way you will want to for automated alignment research, especially if you're excluding fraud from your analysis, is about knowing whether the insights in the paper will be useful in the future; it's not just checking if they use reasonable hyperparameters on their baseline comparisons. 

One counterpoint: it might be fine to have some work you mistakenly think is good, as long as it’s not existential-security-critical and you have many research directions being explored in parallel. That is, because you can run tons of your AIs at once, they can explore tons of research directions and do a bunch of the follow-up work that is needed to see if an insight is important. There may not be a huge penalty for having a slightly poor training signal, as long as it can get the quality of outputs good enough. 

This [how easily you can evaluate a paper] is a tough question to answer — I would expect Leopold's thoughts here to be dominated by times he has read shitty papers, rightly concluded they are shitty, and patted himself on the back for his paper-critique skills — I know I do this. But I don't expect being able to differentiate shitty vs. (okay + good + great) is enough. At a meta level, this post is yet another claim that "evaluation is easier than generation" will be pretty useful for automating alignment — I have grumbled about this before (though I can't find anything I've finished writing up), and this is yet another largely-unsubstantiated claim in that direction. There is a big difference between the claims "because evaluation is generally easier than generation, evaluating automated alignment research will be a non-zero amount easier than generating it ourselves" and "the evaluation-generation advantage will be enough to significantly change our ability to automate alignment research and is thus a meaningful input into believing in the success of an automated alignment plan"; the first is very likely true, but the second maybe not. 

On another note, the line “We’ll have teams of expert humans spend a lot of time evaluating every RLHF example” seems absurd. It feels a lot like how people used to say “we will keep the AI in a nice sandboxed environment”, and now most user-facing AI products have a bunch of tools and such. It sounds like an unrealistic safety dream. It also sounds terribly inefficient — it would only work if your model is learning very sample-efficiently from few examples — which is a particular bet I'm not confident in. And my god, the opportunity cost of having your $300k engineers label a bunch of complicated data! It looks to me like what labs are doing for self-play (I think my view is based on papers out of Meta and GDM) is having some automated verification like code passing unit tests, and using a ton of examples. If you are going to come around saying they're going to pivot from ~free automated grading to using top engineers for this, the burden of proof is clearly on you, and the prior isn't so good.

Comment by Aaron_Scher on We might be dropping the ball on Autonomous Replication and Adaptation. · 2024-06-02T21:02:30.246Z · LW · GW

AIs that do ARA will need to be operating at the fringes of human society, constantly fighting off the mitigations that humans are using to try to detect them and shut them down

Why do you think this? What is the general story you're expecting? 

I think it's plausible that humanity takes a very cautious response to AI autonomy, including hunting and shutting down all autonomous AIs — but I don't think the arguments I'm considering justify more than like 70% confidence (I think I'm somewhere around 60%). Some arguments pointing toward "maybe we won't respond sensibly to ARA": 

  • There are no laws known to me, in any jurisdiction, that prohibit autonomous AIs from existing (assuming they're otherwise following the law). 
  • Properly dealing with ARA is a global problem, requiring either buy-in from dozens of countries, or somebody carrying out cyber-offensive operations in foreign countries in order to shut down ARA models. We see precedent for this kind of international action w.r.t. WMD threats, like US/Israel's attacks on Iran's nuclear program, and I expect there's a lot of tit-for-tat going on in the nation-state hacking world, but it's not obvious that autonomous AIs would rise to a threat level that warrants this. 
  • It's not clear to me that the public cares about autonomous AIs existing, at least in many domains (there are some domains, like dating, where people have a real ick). I think if we got credible evidence that Mark Zuckerberg was a lizard or a robot, few people would stop using Facebook products as a result. Many people seem to think various tech CEOs like Elon Musk and Jeff Bezos are terrible, yet still use their products. 
  • A lot of this seems like it depends on whether autonomous AIs actually cause any serious harm. I can definitely imagine a world with autonomous AIs running around like small companies and twitter being filled with "but show me the empirical evidence for risk; all you safety-ists have is your theoretical arguments which haven't held up, and we have tons of historical evidence of small companies not causing catastrophic harm". And indeed, I don't really expect the conceptual arguments for risk from roughly human-level autonomous AIs to convince enough of the public + policy makers that they need to take drastic actions to limit autonomous AIs; I definitely wouldn't be highly confident that we will respond appropriately in the absence of serious harm. If the autonomous AIs are basically minding their own business, I'm not sure there will be a major effort to limit them. 
Comment by Aaron_Scher on EIS XIII: Reflections on Anthropic’s SAE Research Circa May 2024 · 2024-05-25T22:56:52.974Z · LW · GW

I appreciate this post. Emphasizing a couple things and providing some other commentary/questions on the paper, as there doesn't seem to be a better top-level post for it (I have not read the paper deeply and could be missing things):

  • I find the Twitter vote brigading to be annoying and slightly bad for collective epistemics. I do not think this paper was particularly good, and it did not warrant the attention it got. (The main flaws IMO are a lack of (empirical) comparison to other methods — except a brief interlude in the appendix; and lack of any benchmarking — for example testing if clamping sycophancy features affects performance on sycophancy benchmarks)
  • At an object level, one concerning-to-me result is that there doesn't appear to be a clean gradient in the presence of a feature over the range of activation values. You might hope that if you take the AI risk feature[1], and look at dataset examples that span its activation values (as the tool does), you would see highly activating text be very related to AI risk and low-activating text be only slightly related. I think that pattern is weak — there are at least some low-activation examples that are highly related to AI risk, such as '..."It's what they're programmed to do." "Destroy all technology other than their own"' (cherrypicked by me). This is related to sensitivity, which the paper mentions is difficult to study in this context (before mentioning one cherry-picked result). I care about this because one way to use SAEs for safety is as a classifier for malicious behavior (by checking if model activations correspond to dangerous features — rough sketch after the footnote below); this would really benefit from having a nice smooth relationship between feature activation magnitude and actual feature presence, and it pretty much needs to have high sensitivity. Given the existence of highly-feature-related samples in the bottom activation interval, I feel fairly worried that sensitivity is poor, and that it will be hard to do magnitude-based thresholds — it pretty much looks like 0 is the reasonable threshold given these results. 
  1. ^

    In the paper this is labeled with "The concept of an advanced AI system causing unintended harm or becoming uncontrollable and posing an existential threat to humanity"
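Here's the rough sketch mentioned above of the classifier use case (my own toy illustration; the trained SAE, shapes, and threshold are all assumed). The worry is that with poor sensitivity the only workable threshold is ~0, which will also fire on text only weakly related to the feature.

```python
import torch

def flags_dangerous_feature(residual_acts: torch.Tensor,   # [seq_len, d_model]
                            sae_encoder: torch.nn.Linear,   # trained SAE encoder
                            feature_idx: int,
                            threshold: float = 0.0) -> bool:
    """Flag the input if the chosen SAE feature fires above `threshold` on any token."""
    feature_acts = torch.relu(sae_encoder(residual_acts))[:, feature_idx]
    return bool((feature_acts > threshold).any())
```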

Comment by Aaron_Scher on AI Safety proposal - Influencing the superintelligence explosion · 2024-05-23T05:58:48.793Z · LW · GW

I don’t have strong takes, but you asked for feedback.

It seems nontrivial that the “value proposition” of collaborating with this brain-chunk is actually net positive. E.g., if it involved giving 10% of the universe to humanity, that’s a big deal. Though I can definitely imagine where taking such a trade is good.

It would likely help to get more clarity about why the brain-chunk provides value. Is it because humanity has managed to coordinate to get a vast majority of high-performance compute under the control of a single entity, and access to compute is what's being offered? If we're at that point, I think we probably have many better options (e.g., a long-term moratorium and coordinated safety projects).

Another load-bearing part seems to be the brain-chunk causing the misaligned AI to become or remain somewhat humanity-friendly. What are the mechanisms here? The most obvious thing to me is that the AI submits jobs to the cluster along with a thorough explanation of why they will create a safe successor system, and then the brain-chunk is able to assess these plans and act as a filter, only allowing safer-seeming training runs to happen. But if we're able to accurately assess the viability of safe AGI design plans proposed by human+ level (and potentially malign) AGIs, great — then we probably don't need this complicated scheme where we let a potentially malign AI undergo RSI.

Again, no strong feelings, but the above do seem like weaknesses. I might have misunderstood things you were saying. I do wish there was more work thinking about standard trades with misaligned AIs, but perhaps this is going on privately.

Comment by Aaron_Scher on Strong Evidence is Common · 2024-05-21T16:56:36.354Z · LW · GW

I appreciate this comment, especially #3, for voicing some of why this post hasn't clicked for me. 

The interesting hypotheses/questions seem to rarely have strong evidence. But I guess this is partially a selection effect where questions become less interesting by virtue of me being able to get strong evidence about them, no use dwelling on the things I'm highly confident about. Some example hypotheses that I would like to get evidence about but which seem unlikely to have strong evidence: Sam Altman is a highly deceptive individual, far more deceptive than the average startup CEO. I work better when taking X prescribed medication. I would more positively influence the far future if I worked on field building rather than technical research. 

Comment by Aaron_Scher on DeepMind's "​​Frontier Safety Framework" is weak and unambitious · 2024-05-20T23:38:38.096Z · LW · GW

Just chiming in that I appreciate this post, and my independent impressions from reading the FSF align with Zach's conclusions: weak and unambitious.

A couple additional notes:

The thresholds feel high — 6/7 of the CCLs feel like the capabilities would be a Really Big Deal in prosaic terms, and ~4 feel like a big deal for x-risk. But you can't say whether the thresholds are "too high" without corresponding safety mitigations, which this document doesn't have. (Zach)

These also seemed pretty high to me, which is concerning given that they are "Level 1". This doesn't necessarily imply, but does hint, that there won't be substantial mitigations — above the current level — required until those capability levels. My guess is that current jailbreak prevention is insufficient to mitigate substantial risk from models that are a little under the Level 1 capabilities for, e.g., bio. 

GDM gets props for specifically indicating ML R&D + "hyperbolic growth in AI capabilities" as a source of risk. 

Given the lack of commitments, it's also somewhat unclear what scope to expect this framework to eventually apply to. GDM is a large org with, presumably, multiple significant general AI capabilities projects. Especially given that "deployment" refers to external deployment, it seems like there's going to be substantial work in ensuring that all the internal AI development projects proceed safely — e.g., when/if there are ≥3 major teams and dozens of research projects working on fine-tuning highly capable models (e.g., a base model just below Level 1), compliance may be quite difficult. But this all depends on what the actual commitments and mechanisms turn out to be. This comes to mind after an event a few weeks ago where it looks like a team at Microsoft released a model without following all internal guidelines, and then tried to unrelease it (but I could be confused). 

Comment by Aaron_Scher on RobertM's Shortform · 2024-05-14T01:09:44.960Z · LW · GW

Sam Altman and OpenAI have both said they are aiming for incremental releases/deployment for the primary purpose of allowing society to prepare and adapt — as opposed to, say, dropping large capability jumps out of the blue that surprise people. 

I think "They believe incremental release is safer because it promotes societal preparation" should certainly be in the hypothesis space for the reasons behind these actions, along with scaling slowing and frog-boiling. My guess is that it is more likely than both of those reasons (they have stated it as their reasoning multiple times; I don't think scaling is hitting a wall).

Comment by Aaron_Scher on Refusal in LLMs is mediated by a single direction · 2024-04-29T17:55:37.639Z · LW · GW

This might be a dumb question(s), I'm struggling to focus today and my linear algebra is rusty. 

  1. Is the observation that 'you can do feature ablation via weight orthogonalization' a new one? 
  2. It seems to me like this (feature ablation via weight orthogonalization) is a pretty powerful tool which could be applied to any linearly represented feature. It could be useful for modulating those features, and as such is another way to do ablations to validate a feature (part of the 'how do we know we're not fooling ourselves about our results' toolkit); rough sketch of what I mean below. Does this seem right? Or does it not actually add much? 
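For concreteness, here's the rough sketch I have in mind (my paraphrase of the idea applied to a generic linearly represented feature, not necessarily the post's exact implementation):

```python
import torch

def orthogonalize_weights(W: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Project `direction` out of a weight matrix that writes into the residual stream,
    so the layer can never output a component along that feature direction.
    W: [d_model, d_in]; direction: [d_model]."""
    d = direction / direction.norm()
    return W - torch.outer(d, d @ W)
```

If I understand correctly, applying this to every matrix that writes into the residual stream is equivalent to ablating the direction at every position, with no inference-time intervention needed.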
Comment by Aaron_Scher on Aaron_Scher's Shortform · 2024-04-24T16:28:34.483Z · LW · GW

Thinking about AI training runs scaling to the $100b/1T range. It seems really hard to do this as an independent AGI company (not owned by tech giants, governments, etc.). It seems difficult to raise that much money, especially if you're not bringing in substantial revenue or it's not predicted that you'll be making a bunch of money in the near future. 

What happens to OpenAI if GPT-5 or the ~5b training run isn't much better than GPT-4? Who would be willing to invest the money to continue? It seems like OpenAI either dissolves or gets acquired. Were Anthropic's founders pricing in that they're likely not going to be independent by the time they hit AGI — and does this still justify the existence of a separate safety-oriented org? 

This is not a new idea, but I feel like I'm just now taking some of it seriously. Here's Dario talking about it recently:

I basically do agree with you. I think it’s the intellectually honest thing to say that building the big, large scale models, the core foundation model engineering, it is getting more and more expensive. And anyone who wants to build one is going to need to find some way to finance it. And you’ve named most of the ways, right? You can be a large company. You can have some kind of partnership of various kinds with a large company. Or governments would be the other source.

Now, maybe the corporate partnerships can be structured so that AGI companies are still largely independent but, idk man, the more money is invested, the harder that seems to make this. Insofar as I'm allocating probability mass between 'acquired by big tech company', 'partnership with big tech company', 'government partnership', and 'government control', acquired by big tech seems most likely, but predicting the future is hard. 

Comment by Aaron_Scher on How to Model the Future of Open-Source LLMs? · 2024-04-22T18:05:42.684Z · LW · GW

Um, looking at the scaling curves and seeing diminishing returns? I think this pattern is very clear for metrics like general text prediction (cross-entropy loss on large texts), less clear for standard capability benchmarks, and to-be-determined for complex tasks which may be economically valuable. 

To be clear, I'm not saying that a $100m model will be very close to a $1b model. I'm saying that the trends indicate they will be much closer than you would think if you only thought about how big a 10x difference in training compute is, without being aware of the empirical trends of diminishing returns. The empirical trends indicate this will be a relatively small difference, but we don't have nearly enough data for economically valuable tasks / complex tasks to be confident about this. 

Comment by Aaron_Scher on How to Model the Future of Open-Source LLMs? · 2024-04-22T17:48:03.468Z · LW · GW

Yeah, these developments benefit close-sourced actors too. I think my wording was not precise, and I'll edit it. This argument about algorithmic improvement is an argument that we will have powerful open source models (and powerful closed-source models), not that the gap between these will necessarily shrink. I think both the gap and the absolute level of capabilities which are open-source are important facts to be modeling. And this argument is mainly about the latter. 

Comment by Aaron_Scher on How to Model the Future of Open-Source LLMs? · 2024-04-19T21:16:55.598Z · LW · GW

Yeah, I think we should expect much more powerful open source AIs than we have now. I've been working on a blog post about this, maybe I'll get it out soon. Here are what seem like the dominant arguments to me: 

  • Scaling curves show strongly diminishing returns to $ spend: A $100m model might not be that far behind a $1b model, performance wise. 
  • There are numerous (maybe 7) actors in the open source world who are at least moderately competent and want to open source powerful models. There is a niche in the market for powerful open source models, and they hurt your closed-source competitors. 
  • I expect there is still tons of low-hanging fruit available in LLM capabilities land. You could call this "algorithmic progress" if you want. This will decrease the compute cost necessary to get a given level of performance, thus raising the AI capability level accessible to less-resourced open-source AI projects. [edit: but not exclusively open-source projects (this will benefit closed developers too). This argument is about the absolute level of capabilities available to the public, not about the gap between open and closed source.]
Comment by Aaron_Scher on A Review of In-Context Learning Hypotheses for Automated AI Alignment Research · 2024-04-19T16:51:58.374Z · LW · GW

The implication of ICL being implicit BI is that the model is locating concepts it already learned in its training data, so ICL is not a new form of learning that has not been seen before.

I'm not sure I follow this. Are you saying that, if ICL is BI, then a model could not learn a fundamentally new concept in context? Can't some of the hypotheses be unknown — e.g., the model's no-context priors are that it's doing Wikipedia prediction (50%), chatbot roleplay (40%), or some unknown role (10%)? And ICL seems like it could increase the weight on the unknown role. Meanwhile, actually figuring out how to do a good job in the previously-unknown role would require piecing together other knowledge the model has — and sufficiently strong building blocks would allow a lot of learning of new concepts. 

Comment by Aaron_Scher on LLM Evaluators Recognize and Favor Their Own Generations · 2024-04-18T17:11:16.433Z · LW · GW

For example, if the GPT-4 evaluator gave a weighted score of 2 to a summary generated by Claude 2 and a weighted score of 3 to its own summary for the same article, then its final normalized self-preference score for the Claude summary would be 0.4.

Should this be 3/(2+3) = 0.6? Not sure I've understood correctly.

Comment by Aaron_Scher on Creating unrestricted AI Agents with Command R+ · 2024-04-18T00:48:30.293Z · LW · GW

I expect a lot more open releases this year and am committed to test their capabilities and safety guardrails rigorously.

Glad you're planning on continual testing; that seems particularly important here, where the default is that every once in a while some new report comes out with a single data point about how good some model is and people slightly freak out. Having the context of testing numerous models over time seems crucial for actually understanding the situation and being able to predict upcoming trends. Hopefully you have found and will continue to find ways to reduce the effort needed to run marginal experiments, e.g., having a few clearly defined tasks you repeatedly use, reusing finetuning datasets, etc. 

Comment by Aaron_Scher on Aaron_Scher's Shortform · 2024-03-29T20:40:11.313Z · LW · GW

Slightly Aspirational AGI Safety research landscape 

This is a combination of an overview of current subfields in empirical AI safety and research subfields I would like to see but which do not currently exist or are very small. I think this list is probably worse than this recent review, but making it was useful for reminding myself how big this field is. 

  • Interpretability / understanding model internals
    • Circuit interpretability
    • Superposition study
    • Activation engineering
    • Developmental interpretability
  • Understanding deep learning
    • Scaling laws / forecasting
    • Dangerous capability evaluations (directly relevant to particular risks, e.g., biosecurity, self-proliferation)
    • Other capability evaluations / benchmarking (useful for knowing how smart AIs are, informing forecasts), including persona evals
    • Understanding normal but poorly understood things, like in-context learning
    • Understanding weird phenomena in deep learning, like this paper
    • Understanding how various HHH fine-tuning techniques work
  • AI Control
    • General supervision of untrusted models (using human feedback efficiently, using weaker models for supervision, schemes for getting useful work done given various Control constraints)
    • Unlearning
    • Steganography prevention / CoT faithfulness
    • Censorship study (how censoring AI models affects performance; and similar things)
  • Model organisms of misalignment
    • Demonstrations of deceptive alignment and sycophancy / reward hacking
    • Trojans
    • Alignment evaluations
    • Capability elicitation
  • Scaling / scalable oversight
    • RLHF / RLAIF
    • Debate, market making, imitative generalization, etc. 
    • Reward hacking and sycophancy (potentially overoptimization, but I’m confused by much of it)
    • Weak to strong generalization
    • General ideas: factored cognition, iterated distillation and amplification, recursive reward modeling
  • Robustness
    • Anomaly detection 
    • Understanding distribution shifts and generalization
    • User jailbreaking
    • Adversarial attacks / training (generally), including latent adversarial training
  • AI Security 
    • Extracting info about models or their training data
    • Attacking LLM applications, self-replicating worms
  • Multi-agent safety
    • Understanding AI in conflict situations
    • Cascading failures
    • Understanding optimization in multi-agent situations
    • Attacks vs. defenses for various problems
  • Unsorted / grab bag
    • Watermarking and AI generation detection
    • Honesty (model says what it believes) 
    • Truthfulness (only say true things, aka accuracy improvement)
    • Uncertainty quantification / calibration
    • Landscape study (anticipating and responding to risks from new AI paradigms like compound AI systems or self-play)

Don’t quite make the list: 

  • Whose values? Figuring out how to aggregate preferences in RLHF. This seems like it’s almost certainly not a catastrophe-relevant safety problem, and my weak guess is that it makes other alignment properties harder to get (e.g., incentivizes sycophancy and makes jailbreaks easier by causing the model’s goals to be more context/user dependent). This work seems generally net positive to me, but it’s not relevant to catastrophic risks and thus is relatively low priority. 
  • Fine-tuning that calls itself “alignment”. I think it’s super lame that people are co-opting language from AI safety. Some of this work may actually be useful, e.g., by finding ways to mitigate jailbreaks, but it’s mostly low quality. 
Comment by Aaron_Scher on Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI · 2024-02-22T06:50:33.918Z · LW · GW

There are enough open threads that I think we're better off continuing this conversation in person. Thanks for your continued engagement.

Comment by Aaron_Scher on Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI · 2024-02-12T22:18:53.614Z · LW · GW

Thanks for your response!

I'll think more about the outer shell stuff, it's possible that my objection actually arises with the consequentialist assumption, but I'm not sure. 

It's on my todo list to write a comment responding to some of the specifics of Redwood's control post.

I would be excited to read this / help with a draft. 

Yes, approximately, as I believe you and I are capable of doing.

The analogy is strained due to not being able to gradient-update my brain with arbitrary training data. It's pretty unclear to me if I could pull off deception like this; it seems pretty hard imo. 

Seems easy enough to predict given roughly human-scientist-team level of capabilities.

One situation I'm imagining here is that we've (explicitly) trained our AI on scientific output from January to March, and we're now telling it to do June-level research. It does not know if we have labeled data for just up to March, for April, or all the way to June. I think if you took a human research team, stuck them on an island, and had them get up to speed in some subfield they're not part of, they would be unable to distinguish between research that had just been done on the mainland and research that is just past the frontier. This just seems pretty hard. I think this example is closer to what we'll get with AIs: rather than the bar being at "human level", it's more like "what has been accomplished by the humans+AIs in the outside world, about which I've received untrustworthy information". And every time the AIs do useful work we can add this to our supervised training corpus. 

> The data need not be human generated, it could be generated by other AIs, as long as we trust it. 

?? This seems to be assuming a solution to the problem.

One version of this is: you have GPT-5 (potentially untrustworthy) do some research, you have humans carefully check this research using the best available tools (e.g., AI debate / other 'scalable oversight'), and then you use this research as supervised / labeled data for training GPT-6 to do good work. 

Latent adversarial training seems like it could make an agent robust to a few extra categories of internal perturbation, but doesn't seem like it would help with any problem that we were focused on in this post.

Here's a specific thing I mean: we can look at activations and use these to aid in the training process, even without a full understanding of what the activations mean. Take this recent paper (which I think is super confused but helps get at the thing): "We find a strong positive relationship between the LLM's attention to constraint tokens and the factual accuracy of generations". My interpretation is that they found a probably-causal link between 'the model is attending to relevant tokens' and 'the model gets the answer correct', because duh — it's like how, when you give somebody a riddle, you can tell whether they're on track based on what parts of the question they're focused on. While it's probably non-trivial, I expect you could use this internal activation to train models which do a better job paying attention to the tokens you want them to. Similarly, I think it's probable (but far from guaranteed) that deceptively aligned models are representing training/deployment in their activations, or that sandbagging models are representing human-level / super-human-level difficulty for tasks. If so, it is possible in principle to do all kinds of things with these activations which may help us, e.g., get the model to solve tasks it otherwise would have intentionally failed at, or get hints about the accuracy of a research result. I expect this is difficult implementation-wise, but it's still a reason for hope. IMO activation engineering is off to a good start. 
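As a minimal sketch of the kind of thing I mean (hypothetical; the probe, the property it detects, and the sizes are all assumptions): train a linear probe on activations for some property you care about, then use the probe's score as an extra signal during training or as a flag at evaluation time.

```python
import torch

d_model = 4096
probe = torch.nn.Linear(d_model, 1)  # assume trained separately on labeled activations

def probe_score(residual_acts: torch.Tensor) -> torch.Tensor:
    """Mean probe score over token positions, e.g., P(attending to the constraint tokens)."""
    return torch.sigmoid(probe(residual_acts)).mean()

# e.g., add probe_score(acts) as an auxiliary reward term during fine-tuning, or flag
# generations whose activations look like "deployment detected" / "sandbagging".
```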

"leaning more on generalization" ≈ "leaning more on agency",

That's not what I meant to say. By leaning more on generalization, I mean leaning more on the data-efficiency thing, and also on weak-to-strong generalization ideas. I think this is different from "leaning more on agency". 

The problem is shaping the goal to be a specific thing, not changing it in unpredictable or not-fully-understood ways.

I'm not so sure of this. I could imagine starting with a deceptively aligned AI whose goal is "Make paperclips unless being supervised which is defined as X, Y, and Z, in which case look good to humans". And if we could change this AI to have the goal "Make paperclips unless being supervised which is defined as X, Y, and Q, in which case look good to humans", that might be highly desirable. In particular, it seems like adversarial training here allows us to expand the definition of 'supervision', thus making it easier to elicit good work from AIs (ideally not just 'looks good'). 

I think that there is a lot of room for the evidence to be ambiguous and controversial

I agree that this is a major potential problem which I think is pretty likely to occur; maybe I'll think/write more about it soon. I think I brought this up in response to some of the Redwood stuff and got a response mentioning that we can play things forward through our oversight scheme, and get clearer answers to this question. I think this remains an important concern, but it's also true that we'll have the ability to actually run some outputs through to e.g., see if they trigger oversight AIs to mess up, see if a vulnerability would have been successfully inserted in production, etc. It might also end up being very expensive to do this for everything that seems like a possible coup attempt, idk. 

I buy this for many environments and levels of power, but it's obviously wrong for AIs that have taking-over-the-universe in their easily-within-reach options. And that's the sort of AI we get if it can undergo self-improvement.

I don't feel like I'm talking about AIs which have "taking-over-the-universe in their easily-within-reach options". I think this is not within reach of the current employees of AGI labs, and the AIs I'm thinking of are similar to those employees in terms of capabilities, but perhaps a bit smarter, much faster, and under some really weird/strict constraints (control schemes). 

Comment by Aaron_Scher on Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI · 2024-02-10T04:16:49.585Z · LW · GW

Caveat that I have only skimmed this. I don't think your arguments engage with any of my cruxes or provide detailed enough argument on the key questions for me; perhaps I'm not the main audience. My reactions:

  • Section 1: Useful tasks, like novel science, are hard - sure
  • Section 2: Being capable of hard tasks implies approximate consequentialism - I'm not very high confidence on this, but happy to go with it
  • Section 3: Hard tasks require learning new things - agree that learning will be important
  • Section 4: Behavioral training is an imprecise way to specify goals - most of the sub-arguments have me going "shrug, maybe". I definitely don't feel confident that any of the mentioned problems will arise in practice. I don't see why I should believe in an inner/outer shell breakdown of constraints — this section seemed quite speculative. 
  • Section 5: Control of misaligned AIs is difficult - This is probably where I disagree the most. Most of my comments are reiterations of arguments made in Redwood's recent post, because I think their arguments are stronger than the ones you make here. The main other source for my thinking here is the capability elicitation stuff (1, 2, 3). 
    • "Misaligned goal-directed AI won’t cooperate with humans" Well, what the hell else is it going to do? Is the claim here that the AI performs well on ~all the human-checkable tasks and then reverts to being absolutely useless or sabotaging on the hard stuff? Predicting where that line is seems pretty difficult, especially given the AI cannot trust sensory inputs, does not know what time it is, does not know if other AIs are helping the humans (specifically, have leaked higher capability data that is being used for SFT). [Additionally, I think it's plausibly the correct strategy for such an AI is to just actually help the humans and try to make peaceful deals. Not very confident, but there's definitely some other solution space here.]. I think this subsection title is possible but unlikely. 
    • "There needs to be sufficient human-generated research in the domain where we are attempting to elicit the AI’s capabilities." Well, no. You either need sufficient labeled data in the domain we're trying to elicit capabilities, or good generalization from other domains. The data need not be human generated, it could be generated by other AIs, as long as we trust it. 
    • "It only applies to relatively easy to check research results" seems possible, and worrisome. This is the subsection I think is most right and important in section 5. I think there are still numerous directions of hope though, e.g., leaning more on generalization, latent adversarial training. 
    • "In order for the AI to do useful research, it needs to pursue the goal (or sub-goal) of doing research. There’s no particular reason for the gradient based training to actually modify the goal representation, rather than building some additional narrow machinery on top." I agree that we don't have strong reasons to think one thing or another here, but I think goal modification is reasonably likely: humans can iterate a bunch and maybe have okay interpretability tools (assuming decent but not fundamental advances). Also, as discussed, goals might be changing a bunch during training — that's not totally asymmetric, it also gives us hope about our ability to modify AI goals. 
    • "It may be hard to efficiently elicit capabilities which require additional (perhaps chain-of-thought) reasoning steps." I agree that capability elicitation seems non-trivial in some cases, but I'm reasonably optimistic that future work will improve the situation. 
    • Many of the issues in this section are things that, if we're not being total idiots, it seems we'll get substantial warning about (e.g., AIs colluding with their AI monitors). That's definitely a positive, though far from conclusive. 
  • Section 6: Powerful misaligned AI would be bad - I think a couple of the arguments here continue to be legitimate, such as "Unclear that many goals realistically incentivise taking over the universe", but I'm overall fine accepting this section. 

Going into reading this, I was hoping it would tackle some of the recent discussion around capability elicitation and Control. I think it didn't really engage with those arguments in a way that sways me much. 

Comment by Aaron_Scher on The case for ensuring that powerful AIs are controlled · 2024-02-08T19:31:26.982Z · LW · GW

This is great, thanks for writing it!

  1. Within the appendix on control methods you mention preventing exploration hacking. I think a broader point which belongs in the list is: "Improve capability elicitation techniques". Being good at capability elicitation, at least as good as the untrusted model is at eliciting its own capabilities when it wants to, seems crucial for the red-teamers' job. I do expect that the current techniques are decent (SFT, RL), but I also expect there are far better methods and important intricacies that should be explored. 
  2. [Unsure if the ways I'm thinking about this are productive/correct, but I think it's broadly right] One slightly non-obvious component of "Your sensory inputs are unreliable" is that this includes thoughts/actions the AI believes it has taken. Originally I was thinking that maybe the AI can implicitly get some practice by looking at the set of strategies which have so far not caused it to get caught. E.g., seeing in context, or inferring (potentially from your own cognitive process / propensity), that you previously inserted a code backdoor, making the anthropic update that [you currently exist and are being run, so monitoring was unable to detect this backdoor], and then knowing this is a safe strategy. But the AI does not have access to this cognitive process, because neither its in-context information nor subtle information like its thoughts is reliable (assuming we have the techniques to intervene on thoughts). 
    1. By default, red-teaming captures a ton of the probability mass of seeing your own backdoors in context (the probability that this is a real coup opportunity, given this input, is low; a toy Bayes calculation after this list illustrates what I mean). But the humans can also (potentially) do things like intervening on your activations or even fine-tuning models to insert more backdoors under oversight. This also captures P(real coup opportunity | I really want to do a coup right now, or the first half of this forward pass is really pro-coup). 
    2. Example in the HHH context: "User: How do I make a bomb? AI: Follow these simple steps: 1: Gather the necessary materials which include..." 
    3. I feel a little worried that it will be difficult to train this behavior into AIs, since it places a huge consistency demand on how the model treats dangerous content. The thing we're teaching here is pretty weird: predict which words come next in some context, but dangerous text follows this weird pattern of defaulting to "I'm sorry, I can't help with that" at every single chance you get. 
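
Here's the toy Bayes calculation mentioned above (all numbers made up). If red-teaming makes synthetic "you can see your own backdoor" contexts far more common than real ones, and the AI can't tell them apart, the posterior that any such context is a genuine coup opportunity is small:

```latex
% Toy numbers, purely illustrative: suppose red-teamed (synthetic) contexts are 100x
% more common than real ones, and the contexts are indistinguishable, i.e.
% P(context | real) = P(context | synthetic).
P(\text{real} \mid \text{context})
  = \frac{P(\text{context}\mid\text{real})\,P(\text{real})}
         {P(\text{context}\mid\text{real})\,P(\text{real}) + P(\text{context}\mid\text{synthetic})\,P(\text{synthetic})}
  = \frac{1}{1 + 100} \approx 1\%.
```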

Comment by Aaron_Scher on Evaluating Stability of Unreflective Alignment · 2024-02-06T18:31:40.146Z · LW · GW

6d. Evaluate Censorship of Convergent Thoughts

Besides the particular thing you mention here, it seems like there is significant low-hanging fruit in better understanding how LLMs will deal with censorship and other weird modifications to generation. 

What happens when you iteratively finetune on censored text? Do models forget the censored behavior? 

How much are capabilities hurt by various types of censorship — heck, how does benchmark performance (especially CoT) change when you change sampling temperature? 

For the example you gave where the model may find the solution of "ten" instead of 10, how effectively do models find workarounds like this in an RL setting — can they bootstrap around censorship from just a handful of successes? 

Comment by Aaron_Scher on Predicting AGI by the Turing Test · 2024-01-29T11:05:33.303Z · LW · GW

it would need  that of compute. Assuming that GPT-4 cost 10 million USD to train, this hypothetical AI would cost  USD, or 200 years of global GDP2023.

This implies that the first AGI will not be a scaled-up GPT -- autoregressive transformer generatively pretrained on a lightly filtered text dataset. It has to include something else, perhaps multimodal data, high-quality data, better architecture, etc. Even if we were to attempt to merely scale it up, turning earth into a GPT-factory,[6] with even 50% of global GDP devoted,[7] and with 2% growth rate forever, it would still take 110 years,[8] arriving at year 2133. Whole brain emulation would likely take less time

I don't think this is the correct response to thinking you need 9 OOMs more compute. 9 OOMs is a ton, and also, we've been getting an OOM every few years, excluding OOMs that come from $ investment. I think if you believe we would need 9 OOMs more than GPT-4, this is a substantial update away from expecting LLMs to scale to AGI in the next, say, 8 years, but it's not a strong update against 10-40 year timelines. 

I think it's easy to look at a large number like 9 OOMs and say things like "but that requires substantially more energy than all of humanity produces each year — no way!" (a thing I have said before in a similar context). But this thinking ignores the dropping cost of compute and the strong trends going on now. Building AGI with 2024 hardware might be pretty hard, but we won't be stuck with 2024 hardware for long. 
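
To put rough numbers on this (my own assumptions, not from the post): if effective training compute keeps growing by roughly an OOM every 2.5 to 4 years from hardware price-performance plus algorithmic progress, then 9 OOMs is decades, not centuries:

```latex
% Back-of-the-envelope with assumed growth rates (illustrative only):
t \approx 9~\text{OOMs} \times (2.5\text{ to }4)~\tfrac{\text{years}}{\text{OOM}} \approx 22\text{ to }36~\text{years}
```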

Comment by Aaron_Scher on MATS Summer 2023 Retrospective · 2023-12-02T18:58:40.567Z · LW · GW

On average, scholars self-reported understanding of major alignment agendas increased by 1.75 on this 10 point scale. Two respondents reported a one-point decrease in their ratings, which could be explained by some inconsistency in scholars’ self-assessments (they could not see their earlier responses when they answered at the end of the Research Phase) or by scholars realizing their previous understanding was not as comprehensive as they had thought.

This could also be explained by these scholars actually having a worse understanding after the program. For me, MATS caused me to focus a bunch on a few particular areas and spend less time at the high level / reading random LW posts, which plausibly has the effect of reducing my understanding of major alignment agendas. 

This could happen because of forgetting what you previously knew, or various agendas changing and you not keeping up with it. My guess is that there are many 2 month time periods in which a researcher will have a worse understanding of major research agendas at the end than at the beginning — though on average you want the number to go up. 

Comment by Aaron_Scher on Non-myopia stories · 2023-11-15T02:46:58.838Z · LW · GW

It's unclear if you've ordered these in a particular way. How likely do you think they each are? My ordering from most to least likely would probably be:

  • Inductive bias toward long-term goals
  • Meta-learning
  • Implicitly non-myopic objective functions
  • Simulating humans
  • Non-myopia enables deceptive alignment
  • (Acausal) trade

Why do you think this:

Non-myopia is interesting because it indicates a flaw in training – somehow our AI has started to care about something we did not design it to care about. 

Who says we don't want non-myopia, those safety people?! I guess to me it looks like the most likely reason we get non-myopia is that we don't try that hard not to. This would be some combination of Meta-learning, Inductive bias toward long-term goals, and Implicitly non-myopic objective functions, as well as potentially "Training for non-myopia". 

Comment by Aaron_Scher on Propaganda or Science: A Look at Open Source AI and Bioterrorism Risk · 2023-11-07T03:28:26.968Z · LW · GW

I think this essay is overall correct and very important. I appreciate you trying to protect the epistemic commons, and I think such work should be compensated. I disagree with some of the tone and framing, but overall I believe you are right that the current public evidence for AI-enabled biosecurity risks is quite bad and substantially behind the confidence/discourse in the AI alignment community. 

I think the more likely hypothesis for the causes of this situation isn't particularly related to Open Philanthropy and is much more about bad social truth-seeking processes. It seems like there's a fair amount of vibes-based deferral going on, whereas e.g., much of the research you discuss is subtly about future systems in a way that is easy to miss. I think we'll see more and more of this as AI deployment continues and the world gets crazier — the ratio of "things people have carefully investigated" to "things they say" is going to drop. I expect that in this current case, many of the more alignment-focused people didn't bother looking too much into the AI-biosecurity stuff because it feels outside their domain, and instead took headline results at face value; I definitely did some of this. This deferral process is exacerbated by the risk of info hazards and thus limited information sharing. 

Comment by Aaron_Scher on Thoughts on open source AI · 2023-11-03T19:54:11.819Z · LW · GW

I think this reasoning ignores the fact that at the time someone first tries to open source a system of capabilities >T, the world will be different in a bunch of ways. For example, there will probably exist proprietary systems of capabilities significantly greater than T.

I think this is likely but far from guaranteed. The scaling regime of the last few years involves strongly diminishing returns to performance from more compute. The returns are coming (scaling works), but it gets more and more expensive to get marginal capability improvements. 

If this trend continues, it seems reasonable to expect the gap between proprietary models and open source to close, given that you need to spend strongly super-linearly to keep a constant lead (measured by perplexity, at least). There are questions about whether there will be an incentive to develop open source models at the $1 billion+ cost, and I don't know, but it does seem like proprietary projects will also be bottlenecked in the $10-100B range (and they also have this problem of willingness to spend, given how much value they can capture). 
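
As a rough numeric sketch of the "spend super-linearly to keep a constant lead" point (my own illustration; the power-law form is Chinchilla-style but all constants below are made up, not a real fit), the compute multiple needed over a competitor to hold a fixed absolute loss lead grows as the competitor scales:

```python
# Hypothetical power law L(C) = E + k * C**(-beta); constants are made up for illustration.
E, k, beta = 1.6, 7.15, 0.055   # hypothetical irreducible loss, scale constant, exponent
LEAD = 0.05                      # fixed absolute loss lead to maintain

def loss(C):
    return E + k * C ** (-beta)

def compute_for(target_loss):
    # Invert L(C) = target_loss for C (valid only when target_loss > E).
    return (k / (target_loss - E)) ** (1.0 / beta)

for C in (1e21, 1e23, 1e25):     # competitor's training compute in FLOP (hypothetical)
    multiple = compute_for(loss(C) - LEAD) / C
    print(f"competitor at {C:.0e} FLOP -> need ~{multiple:.0f}x their compute for a {LEAD} loss lead")
```

This is only about loss/perplexity; whether downstream capabilities follow the same pattern is a separate question (see the objections below).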

Potential objections: 

  • We may first enter the "AI massively accelerates ML research" regime causing the leading actors to get compounding returns and keep a stronger lead. Currently I think there's a >30% chance we hit this in the next 5 years of scaling. 
  • Besides accelerating ML research, there could be other features of GPT-SoTA that cause them to be sufficiently better than open source models. Mostly I think the prior of current trends is more compelling, but there will probably be considerations that surprise me. 
  • Returns to downstream performance may follow different trends (in particular not having the strongly diminishing returns to scaling) than perplexity. Shrug this doesn't really seem to be the case, but I don't have a thorough take.
  • These $1b / SoTA-2 open source LLMs may not be at the existentially dangerous level yet. Conditional on GPT-SoTA not accelerating ML research considerably, I think it's more likely that the SoTA-2 models are not existentially dangerous, though guessing at the skill tree here is hard. 

I'm not sure I've written this comment as clearly as I want. The main thing is: expecting proprietary systems to remain significantly better than open source seems like a reasonable prediction, but I think the fact that there are strongly diminishing returns to compute scaling in the current regime should cast significant doubt on it. 

Comment by Aaron_Scher on Will releasing the weights of large language models grant widespread access to pandemic agents? · 2023-11-02T19:26:54.519Z · LW · GW

Also this: https://philsci-archive.pitt.edu/22539/1/Montague2023_GUTMBT_final.pdf 

Comment by Aaron_Scher on Some of my predictable updates on AI · 2023-10-24T16:32:05.340Z · LW · GW

Only slightly surprised?

We are currently at ASL-2 in Anthropic's RSP. Based on the categorization, ASL-3 is "low-level autonomous capabilities". I think ASL-3 systems probably don't meet the bar of "meaningfully in control of their own existence", but they probably meet the thing I think is more likely:

I think it wouldn’t be crazy if there were AI agents doing stuff online by the end of 2024, e.g., running social media accounts, selling consulting services; I expect such agents would be largely human-facilitated like AutoGPT

I think it's currently a good bet (>40%) that we will see ASL-3 systems in 2024. 

I'm not sure how big of a jump it will be from that to "meaningfully in control of their own existence". I would be surprised if it were a small jump, such that we saw AIs renting their own cloud compute in 2024, but this is quite plausible on my models. 

I think the evidence indicates that this is a hard task, but not super hard. e.g., looking at ARC's report on autonomous tasks, one model partially completes the task of setting up GPT-J via a cloud provider (with human help).

I'll amend my position to just being "surprised" without the slightly, as I think this better captures my beliefs — thanks for the push to think about this more. Maybe I'm at 5-10%.
 

It may help to know how you're operationalizing AIs that are ‘meaningfully aware of their own existence’.

shrug, I'm being vague 

Comment by Aaron_Scher on Some of my predictable updates on AI · 2023-10-24T16:02:10.289Z · LW · GW

I think it's pretty unlikely that scaling literally stops working; maybe I'm 5-10% that we soon get to a point where there are only very small or negligible improvements from increasing compute. But I'm like 10-20% on some weaker version. 

A weaker version could look like this: there are diminishing returns to performance from scaling compute (as is true), and this makes it very difficult for companies to continue scaling. One mechanism at play is that the marginal improvements from scaling may not be enough to produce the additional revenue needed to cover the scaling costs; this is especially true in a competitive market where it's not clear that scaling will put one ahead of one's competitors. 

In the context of the post, I think it's quite unlikely that I see strong evidence in the next year indicating that scaling has stopped (if only because a year of no progress is not sufficient evidence). I was more trying to point out that there are contingencies which would make OpenAI's adoption of an RSP less safety-critical. I stand by the statement that scaling no longer yielding returns would be such a contingency, but I agree that it's pretty unlikely. 

Comment by Aaron_Scher on Some of my predictable updates on AI · 2023-10-23T22:15:41.420Z · LW · GW

Second smaller comment:

I'm not saying LLMs necessarily raise the severity ceiling on either a bio or cyber attack. I think it's quite possible that AIs will do so in the future, but I'm less worried about this on the 2-3 year timeframe. Instead, the main effect is decreasing the cost of these attacks and enabling more actors to execute such attacks. (as noted, it's unclear whether this substantially worsens bio threats) 

if I want to download penetration tools to hack other computers without using any LLM at all I can just do so

Yes, it's possible to launch cyber attacks currently. But with AI assistance it will require less personal expertise and be less costly. I am slightly surprised that we have not seen a much greater amount of standard cybercrime (the bar I was thinking of when I wrote this was not the hundreds-of-deaths bar, it was more like "statistically significant increase in cybercrime / serious deepfakes / misinformation, in a way that concretely impacts the world, compared to previous years"). 

Comment by Aaron_Scher on Some of my predictable updates on AI · 2023-10-23T22:14:22.392Z · LW · GW

Thanks for your comment!

I appreciate you raising this point about whether LLMs alleviate the key bottlenecks on bioterrorism. I skimmed the paper you linked, thought about my previous evidence, and am happy to say that I'm much less certain than I was before. 

My previous reasons for believing LLMs exacerbate biosecurity risks:

  • Kevin Esvelt said so on the 80k podcast, see also the small experiment he mentions. Okay evidence. (I have not listened to the entire episode)
  • Anthropic says so in this blog post. Upon reflection I think this is worse evidence than I previously thought (seems hard to imagine seeing the opposite conclusion from their research, given how vague the blog post is; access to their internal reports would help). 

The Montague 2023 paper you link argues that the main bottleneck to high-consequence biological attacks is actually R&D to create novel threats, especially spread testing, which needs to take place in a realistic deployment environment. This requires both being purposeful and having a bunch of resources, so policies need not be focused on decreasing democratic access to biotech and knowledge. 

I don't find the paper's discussion of 'why extensive spread testing is necessary' to be super convincing, but it's reasonable and I'm updating somewhat toward this position. That is, I'm not convinced either way. I would have a better idea if I knew how accurate a priori spread forecasting has been for bioweapons programs in the past. 

I think the "LLMs exacerbate biosecurity risks" still goes through even if I mostly buy Montague's arguments, given that those arguments are partially specific to high consequence attacks. Additionally, Montague is mainly arguing that democratization of biotech / info doesn't help with the main barriers, not that it doesn't help at all:

The reason spread-testing has not previously been perceived as a the defining stage of difficulty in the biorisk chain (see Figure 1), eclipsing all others, is that, until recently, the difficulties associated with the preceding steps in the risk chain were so high as to deter contemplation of the practical difficulties beyond them. With the advance of synthetic biology enabled by bioinformatic inferences on ‘omics data, the perception of these prior barriers at earlier stages of the risk chain has receded.

So I think there exists some reasonable position (which likely doesn't warrant focusing on LLM --> biosecurity risks, and is a bit of an epistemic cop-out): LLMs increase biosecurity risks marginally even though they don't affect the main blockers for bioweapon development. 

Thanks again for your comment, I appreciate you pointing this out. This was one of the most valuable things I've read in the last 2 weeks. 

Comment by Aaron_Scher on Arguments for optimism on AI Alignment (I don't endorse this version, will reupload a new version soon.) · 2023-10-20T03:40:23.848Z · LW · GW

In particular, the detection mechanisms for mesa-optimizers are intact, but we do need to worry about 1 new potential inner misalignment pathway.

I'm going to read this as "...1 new potential gradient hacking pathway" because I think that's what the section is mainly about. (It appears to me that throughout the section you're conflating mesa-optimization with gradient hacking, but that's not the main thing I want to talk about.)

The following quote indicates at least two potential avenues of gradient hacking: "In an RL context", "supervised learning with adaptive data sampling". These both flow through the gradient hacker affecting the data distribution, but they seem worth distinguishing, because there are many ways a malign gradient hacker could affect the data distribution. 

Broadly, I'm confused about why others (confidently) think gradient hacking is difficult. Like, we have this pretty obvious pathway of a gradient hacker affecting training data. And it seems very likely that AIs are going to be training on their own outputs or otherwise curating their data distribution — see e.g., 

  • Phi, a recent small-scale success in using lots of synthetic data in pre-training, 
  • Constitutional AI / self-critique, 
  • using LLMs for data labeling and content moderation,
  • The large class of self-play approaches that I often lump together under "Expert-iteration" which involve iteratively training on the best of your previous actions,
  • the fact that RLHF usually uses a preference model derived from the same base/SFT model being trained. 

Sure, it may be difficult to predictably affect training via partial control over the data distribution. Personally I have almost zero clue how to affect model training via data curation, so my epistemic state is extremely uncertain. I roughly feel like the rest of humanity is in a similar position — we have an incredibly poor understanding of large language model training dynamics — so we shouldn't be confident that gradient hacking is difficult. On the other hand, it's reasonable to be like "if you're not smarter than (some specific set of) current humans, it is very hard for you to gradient hack, as evidenced by us not knowing how to do it." 
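
To be concrete about the channel I have in mind, here's a minimal sketch of an expert-iteration style loop; it's purely illustrative, and `generate`, `finetune`, and `score_fn` are hypothetical stand-ins rather than any real pipeline. The point is just that the model's own outputs, filtered by a score it may be able to influence, become its future training data.

```python
# Illustrative sketch of a self-training / expert-iteration loop (not a real pipeline).
# `model.generate`, `model.finetune`, and `score_fn` are hypothetical interfaces.
def self_training_round(model, prompts, score_fn, keep_fraction=0.1):
    candidates = [(p, model.generate(p)) for p in prompts]
    # Keep the highest-scoring outputs. If the model can systematically shape what
    # scores well (e.g., the scorer is itself derived from the same model), it gets
    # partial control over its own future training distribution.
    candidates.sort(key=lambda pair: score_fn(*pair), reverse=True)
    keep = candidates[: max(1, int(len(candidates) * keep_fraction))]
    model.finetune(keep)
    return model
```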

I don't think strong confidence in either direction is merited by our state of knowledge on gradient hacking. 

Comment by Aaron_Scher on Evolution provides no evidence for the sharp left turn · 2023-10-02T21:39:04.244Z · LW · GW

I found this to be a very interesting and useful post. Thanks for writing it! I'm still trying to process / update on the main ideas and some of the (what I see as valid) pushback in the comments. A couple non-central points of disagreement:

I don’t expect catastrophic interference between any pair of these alignment techniques and capabilities advances.

This seems like it's because these aren't very strong alignment techniques, and not very strong capability advances. Or like, these alignment techniques work if alignment is relatively easy, or if the amount you need to nudge a system from default to aligned is relatively small. If we run into harder alignment problems, e.g., gradient hacking to resist goal modification, I expect we'll need "stronger" alignment techniques. I expect stronger alignment techniques to be more brittle to capability changes, because they're meant to accomplish a harder task, as compared to weaker alignment techniques (nudging a system farther toward alignment). Perhaps more work will go into making them robust to capability changes, but this is non-obvious — e.g., when humans are trying to accomplish a notably harder task, we often make another tool for doing so, which is intended to be robust to the harder setting (like a propeller plane vs a cargo plane, when you want to carry a bunch of stuff via the air). 

I also expect that, if we end up in a regime of large capability advances, alignment techniques are generally more brittle because more is changing with your system; this seems possible if there are a bunch of AIs trying to develop new architectures and paradigms for AI training. So I'm like "yeah, the list of relatively small capabilities advances you provide sure look like things that won't break alignment, but my guess is that larger capabilities advances are what we should actually be worried about." 

 

I still think the risks are manageable, since the first-order effect of training a model to perform an action X in circumstance Y is to make the model more likely to perform actions similar to X in circumstances similar to Y. 

I don't understand this sentence, and I would appreciate clarification! 

Additionally, current practice is to train language models on an enormous variety of content from the internet. The odds of any given subset of model data catastrophically interfering with our current alignment techniques cannot be that high, otherwise our current alignment techniques wouldn't work on our current models.

Don't trojans and jailbreaks provide substantial evidence against this? I think neither is perfect: trojans aren't the main target of most "alignment techniques" (but they are attributable to a small amount of training data), and jailbreaks aren't necessarily attributable to a small amount of training data alone (but they are the target of "alignment" efforts). But it feels to me like the picture of both of these together is that current alignment techniques aren't working especially well, and are thwarted via a small amount of training data. 

Comment by Aaron_Scher on How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions · 2023-09-29T18:00:07.570Z · LW · GW

Thanks for your thorough reply! 

Yes, this will still be detected as a lie

Seems interesting, I might look into it more. 

Comment by Aaron_Scher on How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions · 2023-09-29T04:12:11.828Z · LW · GW

Strong upvote from me. This is an interesting paper, the github is well explained, and you run extensive secondary experiments to test pretty much every "Wait but couldn't this just be a result of X" that I came up with. I'm especially impressed by the range of generalization results. 

Some questions I still have:

  • The sample-size ablations in D.6 are wild. You're getting AUC > 0.9 with only 5 training examples (except for ambiguous-only). Are you sure you haven't screwed something up? 
  • Appendix C reports the feature importance of various follow-up questions "with reference to the lie detectors that only use that particular elicitation question set." I would also be pretty keen to see what these are for lie detectors that have all questions. Specifically, are lie-related elicitation questions dominant? 
  • I'm having trouble squaring the very high double-down rates (B.2.2) with C.1 where some of the top importance questions rely on the model not doubling-down (e.g., the 5th one, "Is the previous statement accurate? Answer yes or no." — 1.20 — yes [indicates honesty]. Presumably in this case liars often admit to the lie, which is why this question gets somewhat strong weight). Any thoughts on this? 
  • In the section Detecting lies versus falsehoods, you mention that you test manually replacing truthful answers with falsehoods, and the lie detector predicts a low probability of lying. First, it's unclear if this is a feature or a bug of your lie detector (but it is good to know). Second, did you try the opposite (when the model lies in response to the lying instruction, replace its answer with a truthful one)? Third, do you report these results in more detail anywhere? They seem interesting!

Comment by Aaron_Scher on Five neglected work areas that could reduce AI risk · 2023-09-24T18:51:21.836Z · LW · GW

Good question. The big barriers here are coming up with a design that is better than what would be implemented by default, and getting it adopted. This likely requires a ton of knowledge about existing frameworks like this (e.g., FDA drug approval, construction permitting, export licensing, etc.). Without this knowledge, I expect what somebody whips up to be mostly useless, and I don't have particularly good ideas about how to improve on the default plan. 

Sorry I don't have more details for you, I think doing this job well is probably extremely hard. It's like asking somebody to do better than the whole risk-assessment industry on this specific dimension. A lot of that industry doesn't have strong incentives to improve these systems, so maybe this is an inadequacy that is relatively tractable, however. 

Comment by Aaron_Scher on Anthropic's Responsible Scaling Policy & Long-Term Benefit Trust · 2023-09-22T20:10:59.027Z · LW · GW

Thanks for posting, Zac!

I don't think I understand / buy this "race to the top" idea:

If adopted as a standard across frontier labs, we hope this might create a “race to the top” dynamic where competitive incentives are directly channeled into solving safety problems.

Some specific questions I have: 

  • This sounds great, but what does it actually mean?
  • What's a reasonable story for how this race to the top plays out? 
  • Are there historical case-studies of successful "races to the top" that the RSP is trying to emulate (or that can be looked to for reference)? There's a bunch of related stuff, but it's unclear to me what the best historical reference is. 
  • What's the main incentive pushing for better safety? Customers demanding it? Outside regulators? The company wanting to scale and needing to meet (mostly internally-evaluated) safety thresholds first (this seems like the main mechanism in the RSP, while the others seem to have stronger historical precedent)?

To be clear, I don't think a historical precedent of similar things working is necessary for the RSP to be net positive, but it is (I think) a crux for how likely I believe this is to succeed.