AI Safety Newsletter #8: Rogue AIs, how to screen for AI risks, and grants for research on democratic governance of AI 2023-05-30T11:52:31.669Z
Statement on AI Extinction - Signed by AGI Labs, Top Academics, and Many Other Notable Figures 2023-05-30T09:05:25.986Z
Is Deontological AI Safe? [Feedback Draft] 2023-05-27T16:39:25.556Z
AI Safety Newsletter #7: Disinformation, Governance Recommendations for AI labs, and Senate Hearings on AI 2023-05-23T21:47:34.755Z
The Polarity Problem [Draft] 2023-05-23T21:05:34.567Z
AI Safety Newsletter #6: Examples of AI safety progress, Yoshua Bengio proposes a ban on AI agents, and lessons from nuclear arms control 2023-05-16T15:14:45.921Z
Aggregating Utilities for Corrigible AI [Feedback Draft] 2023-05-12T20:57:03.712Z
AI Safety Newsletter #5: Geoffrey Hinton speaks out on AI risk, the White House meets with AI labs, and Trojan attacks on language models 2023-05-09T15:26:55.978Z
AI Safety Newsletter #4: AI and Cybersecurity, Persuasive AIs, Weaponization, and Geoffrey Hinton talks AI risks 2023-05-02T18:41:43.144Z
AI Safety Newsletter #2: ChaosGPT, Natural Selection, and AI Safety in the Media 2023-04-18T18:44:35.923Z
[MLSN #9] Verifying large training runs, security risks from LLM access to APIs, why natural selection may favor AIs over humans 2023-04-11T16:03:31.061Z
AI Safety Newsletter #1 [CAIS Linkpost] 2023-04-10T20:18:57.485Z
Environments for Measuring Deception, Resource Acquisition, and Ethical Violations 2023-04-07T18:40:21.310Z
AI and Evolution 2023-03-30T12:56:27.280Z
$20 Million in NSF Grants for Safety Research 2023-02-28T04:44:38.417Z
There are no coherence theorems 2023-02-20T21:25:48.478Z
[MLSN #8] Mechanistic interpretability, using law to inform AI alignment, scaling laws for proxy gaming 2023-02-20T15:54:13.791Z
[MLSN #7]: an example of an emergent internal optimizer 2023-01-09T19:39:47.888Z
[MLSN #6]: Transparency survey, provable robustness, ML models that predict the future 2022-10-12T20:56:31.670Z
[MLSN #5]: Prize Compilation 2022-09-26T21:55:54.136Z
Announcing the Introduction to ML Safety course 2022-08-06T02:46:00.295Z
$20K In Bounties for AI Safety Public Materials 2022-08-05T02:52:47.729Z
NeurIPS ML Safety Workshop 2022 2022-07-26T15:28:52.441Z
[Linkpost] Existential Risk Analysis in Empirical Research Papers 2022-07-02T00:09:49.399Z
Paper: Forecasting world events with neural nets 2022-07-01T19:40:12.788Z
Open Problems in AI X-Risk [PAIS #5] 2022-06-10T02:08:06.121Z
[MLSN #4]: Many New Interpretability Papers, Virtual Logit Matching, Rationalization Helps Robustness 2022-06-03T01:20:44.875Z
Perform Tractable Research While Avoiding Capabilities Externalities [Pragmatic AI Safety #4] 2022-05-30T20:25:05.151Z
Complex Systems for AI Safety [Pragmatic AI Safety #3] 2022-05-24T00:00:58.873Z
Actionable-guidance and roadmap recommendations for the NIST AI Risk Management Framework 2022-05-17T15:26:23.201Z
A Bird's Eye View of the ML Field [Pragmatic AI Safety #2] 2022-05-09T17:18:53.978Z
Introduction to Pragmatic AI Safety [Pragmatic AI Safety #1] 2022-05-09T17:06:00.409Z
Introducing the ML Safety Scholars Program 2022-05-04T16:01:51.575Z
[$20K in Prizes] AI Safety Arguments Competition 2022-04-26T16:13:16.351Z
[MLSN #3]: NeurIPS Safety Paper Roundup 2022-03-08T15:17:25.643Z


Comment by Dan H (dan-hendrycks) on The Polarity Problem [Draft] · 2023-05-24T18:40:23.345Z · LW · GW

I asked for permission via Intercom to post this series on March 29th. Later, I asked for permission to use the [Draft] indicator and said it was written by others. I got permission for both requests, though not from the same person. Apologies that this was not consolidated into one big ask with lots of context. (Feel free to get rid of any undue karma.)

Comment by Dan H (dan-hendrycks) on The Polarity Problem [Draft] · 2023-05-24T18:37:03.458Z · LW · GW
Comment by Dan H (dan-hendrycks) on Steering GPT-2-XL by adding an activation vector · 2023-05-15T14:00:12.489Z · LW · GW

It's a good observation that it's more efficient; does it trade off against performance? (These sorts of comparisons would probably be demanded if it were submitted to any other truth-seeking ML venue, and I apologize for consistently being the person applying the pressures that generic academics would provide. It would be nice if the authors provided these comparisons.)


Also, taking affine combinations in weight-space is not novel to Schmidt et al either. If nothing else, the Stable Diffusion community has been doing that since October to add and subtract capabilities from models.
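For concreteness, the kind of weight-space arithmetic in question (task-vector style addition and subtraction) can be sketched in a few lines. This is an illustrative sketch, not code from any of the papers discussed; the function names and toy weights are my own:

```python
import numpy as np

def task_vector(base_weights, finetuned_weights):
    """Weight-diff "task vector": finetuned minus pretrained, per parameter tensor."""
    return {name: finetuned_weights[name] - base_weights[name]
            for name in base_weights}

def apply_task_vector(base_weights, vector, scale=1.0):
    """Add (scale > 0) or subtract (scale < 0) a capability from the base model."""
    return {name: base_weights[name] + scale * vector[name]
            for name in base_weights}

# Toy example: a "model" with a single weight matrix.
base = {"w": np.zeros((2, 2))}
finetuned = {"w": np.ones((2, 2))}
vec = task_vector(base, finetuned)
negated = apply_task_vector(base, vec, scale=-1.0)  # subtract the capability
```

Merging community checkpoints by adding scaled diffs of this form is exactly the operation the Stable Diffusion community has been applying to model weights.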

It takes months to write up these works, and since the Schmidt paper was in December, it is not obvious who was first in all senses. The usual standard is to count when a standard-sized paper first appeared on arXiv, so in the most standard sense they are first. (Inside conferences, a paper counts as prior art if it was previously published, not merely arXived, but outside conferences most people just keep track of when it was arXived.) Otherwise there are arms race dynamics leading to everyone spamming snippets before doing careful, extensive science.

Comment by Dan H (dan-hendrycks) on Steering GPT-2-XL by adding an activation vector · 2023-05-15T01:56:56.579Z · LW · GW

steering the model using directions in activation space is more valuable than doing the same with weights, because in the future the consequences of cognition might be far-removed from its weights (deep deceptiveness)

(You linked to "deep deceptiveness," and I'm going to assume it is related to self-deception (discussed in the academic literature and in the AI and evolution paper). If it isn't, this point is still relevant for alignment, since self-deception is another internal hazard.)

I think one could argue that self-deception could in some instances be spotted in the weights more easily than in the activations. Often the functionality acquired by self-deception is not activated, but it may be more readily apparent in the weights. Hence I don't see this as a strong reason to dismiss weight-based methods. I would want both a weight version and an activation version of a method; they tend to have different strengths.
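To make the contrast concrete, here is a hypothetical sketch (not either paper's actual code) of the two approaches side by side. The activation version perturbs hidden states at inference time, while the weight version edits parameters once, so functionality added by weight editing persists in the model even when it is never activated:

```python
import numpy as np

def steering_vector(act_positive, act_negative):
    """Difference of activations from a contrasting prompt pair."""
    return act_positive - act_negative

def steer_activations(hidden, vector, coeff=1.0):
    """Activation version: add the vector to hidden states at inference time."""
    return hidden + coeff * vector

def steer_weights(weights, weight_diff, coeff=1.0):
    """Weight version: edit the parameters once; no inference-time hook needed."""
    return weights + coeff * weight_diff

# Toy activations standing in for hidden states of two contrasting prompts.
h_pos = np.array([1.0, 0.0])
h_neg = np.array([0.0, 1.0])
v = steering_vector(h_pos, h_neg)
steered = steer_activations(np.zeros(2), v)
```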

Note: If you're wanting to keep track of safety papers outside of LW/AF, papers including were tweeted on and posted on

Edit: I see passive disagreement but no refutation. The argument against weights was of the form "here's a strength activations have"; for that to be enough to dismiss the paper without discussion, the property must be extremely strong, outweighing all of the paper's potential merits, or activations must be a Pareto improvement over weights. Neither seems corroborated or at all obvious.

Comment by Dan H (dan-hendrycks) on Steering GPT-2-XL by adding an activation vector · 2023-05-14T23:57:41.894Z · LW · GW

Page 4 of this paper compares negative vectors with fine-tuning for reducing toxic text:

In Table 3, they show in some cases task vectors can improve fine-tuned models.

Comment by Dan H (dan-hendrycks) on Steering GPT-2-XL by adding an activation vector · 2023-05-14T23:25:15.651Z · LW · GW

Yes, I tend to write up comments quickly so that I don't feel as inclined to get into detailed back-and-forths and use up time, but here we are. When I wrote the comment, I thought there were only two works mentioned in the related works section until Daniel pointed out the formatting choice, and when I skimmed the post I didn't easily see the comparisons or discussion I expected, hence I gestured at needing more detailed comparisons. After posting, I found a one-sentence comparison of the work I was looking for, so I edited to note that I found it, but it was oddly not emphasized. A more ideal comment would have been: "It would be helpful to me if this work more thoroughly compared to (apparently) very related works such as ..."

Comment by Dan H (dan-hendrycks) on Steering GPT-2-XL by adding an activation vector · 2023-05-14T22:48:42.864Z · LW · GW

In many of my papers, there aren't fairly similar works (I strongly prefer to work in areas before they're popular), so there's a lower expectation for comparison depth, though breadth is always standard. In other works of mine, such as this paper on learning the right thing in the presence of extremely bad supervision/extremely bad training objectives, we contrast with the two main related works for two paragraphs and compare to those two methods for around half of the entire paper.

The extent of an adequate comparison depends on the relatedness. I'm of course not saying every paper in the related works needs its own paragraph. If the approaches are fairly similar, there usually need to be empirical juxtapositions as well. If the difference between these papers is "we do activations, they do weights," then I think that warrants a more in-depth conceptual comparison or, preferably, many empirical comparisons.

Comment by Dan H (dan-hendrycks) on Steering GPT-2-XL by adding an activation vector · 2023-05-14T22:05:46.860Z · LW · GW

Yes, I was--good catch. Earlier and now, unusual formatting and a nonstandard related works section are causing confusion. Even so, the work after the break is much older. The comparison to closely related work is not in the related works section and gets a sentence in a footnote: "That work took vectors between weights before and after finetuning on a new task, and then added or subtracted task-specific weight-diff vectors."

Is this a big difference? I really don't know; it'd be helpful if they'd contrast more. Is this work very novel and useful, and that one isn't any good for alignment? Or did Ludwig Schmidt (not x-risk pilled) and coauthors in Editing Models with Task Arithmetic (made public last year and already published) come up with an idea similar to, according to a close observer, "the most impressive concrete achievement in alignment I've seen"? If so, what does that say about the need to be x-risk motivated to do relevant research, and what does it say about group epistemics/the ability to spot relevant progress if it's not posted on the AF?

Comment by Dan H (dan-hendrycks) on Steering GPT-2-XL by adding an activation vector · 2023-05-14T21:24:24.804Z · LW · GW

Background for people who understandably don't habitually read full empirical papers:
Related works sections in empirical papers tend to collect many comparisons in one coherent place. This contextualizes the work and helps busy readers quickly identify whether it is meaningfully novel relative to the literature; a related works section must therefore give a good account of the literature. I've seen a good number of papers steering with latent arithmetic in the past year, but I would not be surprised if this is the first time many readers of AF/LW have seen it, which would make this paper seem especially novel to them. A good related works section would more accurately and quickly communicate how novel the work actually is. I don't think this norm is gatekeeping or pedantic; it becomes essential when the number of papers becomes high.

The total number of papers cited throughout a paper is different from the number of papers engaged with in the related works section. If a relevant paper is buried somewhere random in the paper and not explicitly contrasted with in the related works section, that is usually penalized in peer review.

Comment by Dan H (dan-hendrycks) on Steering GPT-2-XL by adding an activation vector · 2023-05-13T19:55:24.577Z · LW · GW

Could these sorts of posts have more thorough related works sections? It's usually standard for related works sections in empirical papers to mention 10+ works. Update: I was looking for a discussion of a closely related work, assumed it wasn't included in this post, and many minutes later finally found a brief sentence about it in a footnote.

Comment by Dan H (dan-hendrycks) on What‘s in your list of unsolved problems in AI alignment? · 2023-03-07T20:37:45.517Z · LW · GW

Open Problems in AI X-Risk:

Comment by Dan H (dan-hendrycks) on Power-Seeking = Minimising free energy · 2023-02-23T19:18:10.183Z · LW · GW

Thermodynamics theories of life can be viewed as a generalization of Darwinism, though in my opinion the abstraction ends up being looser/less productive, and I think it's more fruitful just to talk in evolutionary terms directly.

You might find these useful:

God's Utility Function

A New Physics Theory of Life

Entropy and Life (Wikipedia)

AI and Evolution

Comment by Dan H (dan-hendrycks) on A (EtA: quick) note on terminology: AI Alignment != AI x-safety · 2023-02-09T15:48:53.817Z · LW · GW

"AI Safety" which often in practice means "self driving cars"

This may have been true four years ago, but ML researchers at leading labs rarely directly work on self-driving cars (e.g., research on sensor fusion) anymore. AV has not been hot in quite a while. Fortunately, now that AGI-like chatbots are popular, we're moving out of the realm of talking about making very narrow systems safer. The association with AV was not that bad, since it was about getting many nines of reliability/extreme reliability, which was a useful subgoal. Unfortunately, the world has not been able to make a DL model completely reliable in any specific domain (even MNIST).

Of course, they weren't talking about x-risks, but neither are the industry researchers who use the word "alignment" today to mean fine-tuning a model to be more knowledgeable or making models better satisfy capabilities wants (sometimes dressed up as "human values").

If you want a word that reliably denotes catastrophic risks that is also mainstream, you'll need to make catastrophic risk ideas mainstream. Expect it to be watered down for some time, or expect it not to go mainstream.

Comment by Dan H (dan-hendrycks) on Quick thoughts on "scalable oversight" / "super-human feedback" research · 2023-01-30T05:22:36.443Z · LW · GW

When ML models get more competent, ML capabilities researchers will have strong incentives to build superhuman models. Finding superhuman training techniques would be the main thing they'd work on. Consequently, when the problem is more tractable, I don't see why it'd be neglected by the capabilities community--it'd be unreasonable for profit maximizers not to have it as a top priority when it becomes tractable. I don't see why alignment researchers have to work in this area with high externalities now and ignore other safe alignment research areas (in practice, the alignment teams with compute are mostly just working on this area). I'd be in favor of figuring out how to get superhuman supervision for specific things related to normative factors/human values (e.g., superhuman wellbeing supervision), but researching superhuman supervision simpliciter will be the aim of the capabilities community.

Don't worry, the capabilities community will relentlessly maximize vanilla accuracy, and we don't need to help them.

Comment by Dan H (dan-hendrycks) on A Simple Alignment Typology · 2023-01-28T16:11:20.885Z · LW · GW

Empiricists think the problem is hard, AGI will show up soon, and if we want to have any hope of solving it, then we need to iterate and take some necessary risk by making progress in capabilities while we go.

This may be so for the OpenAI alignment team's empirical researchers, but other empirical researchers note that we can work on several topics to reduce risk without substantially advancing general capabilities. (As far as I can tell, they are not working on any of the following topics, instead focusing on an avenue to scalable oversight which, as instantiated, mostly serves to make models generally better at programming.)

Here are four example areas with minimal general capabilities externalities (descriptions taken from Open Problems in AI X-Risk):

Trojans - AI systems can contain “trojan” hazards. Trojaned models behave typically in most situations, but when specific secret situations are met, they reliably misbehave. For example, an AI agent could behave normally, but when given a special secret instruction, it could execute a coherent and destructive sequence of actions. In short, this area is about identifying hidden functionality embedded in models that could precipitate a treacherous turn.  Work on detecting trojans does not improve general language model or image classifier accuracy, so the general capabilities externalities are moot.

Anomaly detection - This area is about detecting potential novel hazards such as unknown unknowns, unexpected rare events, or emergent phenomena. (This can be used for tripwires, detecting proxy gaming, detecting trojans, malicious actors, possibly for detecting emergent goals.) In anomaly detection, general capabilities externalities are easy to avoid.

Power Aversion - This area is about incentivizing models to avoid gaining more power than is necessary and analyzing how power trades off with reward. This area is deliberately about measuring and making sure highly instrumentally useful/general capabilities are controlled.

Honesty - Honest AI involves creating models that only output what they hold to be true. It also involves determining what models hold to be true, perhaps by analyzing their internal representations. Honesty is a narrower concept than truthfulness and is deliberately chosen to avoid capabilities externalities, since truthful AI is usually a combination of vanilla accuracy, calibration, and honesty goals. Optimizing vanilla accuracy is optimizing general capabilities. When working towards honesty rather than truthfulness, it is much easier to avoid capabilities externalities.
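To give a flavor of the anomaly detection area above, one standard baseline scores inputs by (negative) maximum softmax probability, flagging inputs where the model's confidence is low as potentially anomalous. This is an illustrative sketch of the baseline, not code from any of the cited works:

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def anomaly_score(logits):
    """Higher score = more anomalous (maximum softmax probability baseline)."""
    return -softmax(logits).max(axis=-1)

in_dist = np.array([8.0, 0.0, 0.0])   # confident prediction on a familiar input
ood = np.array([1.0, 1.1, 0.9])       # near-uniform confidence on a novel input
```

Thresholding such a score gives a simple tripwire; research in the area largely concerns scores that work better than this baseline.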

More general learning resources are at this course, and more discussion of safety vs capabilities is here (summarized in this video).

Comment by Dan H (dan-hendrycks) on Deconfusing "Capabilities vs. Alignment" · 2023-01-23T23:08:09.450Z · LW · GW

For a discussion of capabilities vs safety, I made a video about it here, and a longer discussion is available here.

Comment by Dan H (dan-hendrycks) on Transcript of Sam Altman's interview touching on AI safety · 2023-01-23T15:37:02.225Z · LW · GW

Sorry, I am just now seeing this since I'm on here irregularly.

So any robustness work that actually improves the robustness of practical ML systems is going to have "capabilities externalities" in the sense of making ML products more valuable.

Yes, though I do not equate general capabilities with making something more valuable. As written elsewhere,

It’s worth noting that safety is commercially valuable: systems viewed as safe are more likely to be deployed. As a result, even improving safety without improving capabilities could hasten the onset of x-risks. However, this is a very small effect compared with the effect of directly working on capabilities. In addition, hypersensitivity to any onset of x-risk proves too much. One could claim that any discussion of x-risk at all draws more attention to AI, which could hasten AI investment and the onset of x-risks. While this may be true, it is not a good reason to give up on safety or keep it known to only a select few. We should be precautious but not self-defeating.

I'm discussing "general capabilities externalities" rather than "any bad externality," especially since the former is measurable and a dominant factor in AI development. (Identifying any sort of externality can lead people to say we should defund various useful safety efforts because they can create a "false sense of security"; safety engineering reminds us that this is not the right policy in any industry.)

I disagree even more strongly with "honesty efforts don't have externalities:" AI systems confidently saying false statements is a major roadblock to lots of applications (e.g. any kind of deployment by Google), so this seems huge from a commercial perspective.

I distinguish between honesty and truthfulness; I think truthfulness has way too many externalities since it is too broad. For example, I think Collin et al.'s recent paper, an honesty paper, does not have general capabilities externalities. As written elsewhere,

Encouraging models to be truthful, when defined as not asserting a lie, may be desired to ensure that models do not willfully mislead their users. However, this may increase capabilities, since it encourages models to have better understanding of the world. In fact, maximally truth-seeking models would be more than fact-checking bots; they would be general research bots, which would likely be used for capabilities research. Truthfulness roughly combines three different goals: accuracy (having correct beliefs about the world), calibration (reporting beliefs with appropriate confidence levels), and honesty (reporting beliefs as they are internally represented). Calibration and honesty are safety goals, while accuracy is clearly a capability goal. This example demonstrates that in some cases, less pure safety goals such as truth can be decomposed into goals that are more safety-relevant and those that are more capabilities-relevant.


I agree that interpretability doesn't always have big capabilities externalities, but it's often far from zero.

To clarify, I cannot name a time a state-of-the-art model drew its accuracy-improving advancement from interpretability research. I think it hasn't had a measurable performance impact, and anecdotally, empirical researchers aren't gaining insights from that body of work that translate to accuracy improvements. It looks like a reliably beneficial research area.

It also feels like people are using "capabilities" to just mean "anything that makes AI more valuable in the short term,"

I'm taking "general capabilities" to be something like

general prediction, classification, state estimation, efficiency, scalability, generation, data compression, executing clear instructions, helpfulness, informativeness, reasoning, planning, researching, optimization, (self-)supervised learning, sequential decision making, recursive self-improvement, open-ended goals, models accessing the Internet, ...

These are extremely general instrumentally useful capabilities that improve intelligence. (Distinguish from models that are more honest, power averse, transparent, etc.) For example, ImageNet accuracy is the main general capabilities notion in vision, because it's extremely correlated with downstream performance on so many things. Meanwhile, an improvement for adversarial robustness harms ImageNet accuracy and just improves adversarial robustness measures. If it so happened that adversarial robustness research became the best way to drive up ImageNet accuracy, then the capabilities community would flood in and work on it, and safety people should then instead work on other things.
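For concreteness on what an adversarial robustness measure targets, the classic one-step fast gradient sign method perturbation looks like the following. This is a sketch; in practice the gradient comes from backpropagating a real model's loss to the input:

```python
import numpy as np

def fgsm_perturb(x, grad_wrt_input, eps=8 / 255):
    """One-step L-infinity attack: move each input coordinate eps in the
    direction that increases the loss, then clip to the valid pixel range."""
    x_adv = x + eps * np.sign(grad_wrt_input)
    return np.clip(x_adv, 0.0, 1.0)

x = np.array([0.5, 0.0, 1.0])      # toy "pixels"
g = np.array([1.0, -2.0, 3.0])     # toy loss gradient w.r.t. the input
x_adv = fgsm_perturb(x, g, eps=0.1)
```

Accuracy under perturbations of this kind is the robustness measure; training to resist them typically costs some clean ImageNet accuracy, which is the tradeoff discussed above.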

Consequently, what counts as safety should be informed by how the empirical results are looking, especially since empirical phenomena can be so unintuitive or hard to predict in deep learning.

Comment by Dan H (dan-hendrycks) on Transcript of Sam Altman's interview touching on AI safety · 2023-01-21T17:56:56.041Z · LW · GW

making them have non-causal decision theories

How does it distinctly do that?

Comment by Dan H (dan-hendrycks) on Transcript of Sam Altman's interview touching on AI safety · 2023-01-21T16:16:52.526Z · LW · GW

Salient examples are robustness and RLHF. I think following the implied strategy---of avoiding any safety work that improves capabilities ("capability externalities")---would be a bad idea.

There are plenty of topics in robustness, monitoring, and alignment that improve safety differentially without improving vanilla upstream accuracy: most adversarial robustness research does not have general capabilities externalities; topics such as transparency, trojans, and anomaly detection do not; honesty efforts so far do not have externalities either. Here is analysis of many research areas and their externalities.

Even though the underlying goal is to improve the safety-capabilities ratio, this is not the best decision-making policy. Given uncertainty, the large incentives for making models superhuman, motivated reasoning, and competition pressures, aiming for minimal general capabilities externalities should be what influences real-world decision-making (playing on the criterion of rightness vs. decision procedure distinction).

If safety efforts are to scale to a large number of researchers, the explicit goal should be to measurably avoid general capabilities externalities rather than, say, "pursue particular general capabilities if you expect that it will help reduce risk down the line," though perhaps I'm just particularly risk-averse. Without putting substantial effort into finding out how to avoid externalities, the differentiation between safety and capabilities at many places is highly eroded, and in consequence some alignment teams are substantially hastening timelines. For example, an alignment team's InstructGPT efforts were instrumental in making ChatGPT arrive far earlier than it would have otherwise, which is causing Google to become substantially more competitive in AI and causing many billions to suddenly flow into different AGI efforts. This is decisively hastening the onset of x-risks. I think minimal externalities may be a standard that is not always met, but I think it should be more strongly incentivized.

Comment by Dan H (dan-hendrycks) on Your posts should be on arXiv · 2022-08-27T06:40:58.260Z · LW · GW

I am strongly in favor of our very best content going on arXiv. Both communities should engage more with each other.

What follows are suggestions for posting to arXiv.

As a rule of thumb, if the content of a blogpost didn't take >300 hours of labor to create, it probably should not go on arXiv. Maintaining a basic quality bar prevents arXiv from being overrun by people who like writing up many of their inchoate thoughts; publication standards are different for LW/AF than for arXiv. Even if a researcher spent many hours on a project, arXiv moderators do not want research that's below a certain bar. arXiv moderators have reminded some professors that they will likely reject papers at the quality level of a Stanford undergraduate team project; consequently, labor, topicality, and conforming to formatting standards are not sufficient for arXiv approval. Usually one's first research project won't be good enough for arXiv.

Furthermore, conceptual/philosophical pieces should probably be posted primarily in arXiv's .CY section. For more technical deep learning content, do not make the mistake of only putting it on .AI; it should probably go on .LG (machine learning), .CV (computer vision), or .CL (NLP). arXiv's .ML section is for more statistical/theoretical machine learning audiences.

For content to be approved without complications, it should likely conform to standard (ICLR, ICML, NeurIPS, CVPR, ECCV, ICCV, ACL, EMNLP) formatting. This means automatic blogpost exporting is likely not viable. In trying to diffuse ideas to the broader ML community, we should avoid making the arXiv moderators mad at us.

Comment by Dan H (dan-hendrycks) on Your posts should be on arXiv · 2022-08-27T06:25:12.548Z · LW · GW

Here's a continual stream of related arXiv papers available through reddit and twitter.

Comment by Dan H (dan-hendrycks) on Your posts should be on arXiv · 2022-08-27T06:22:51.944Z · LW · GW

I should say formatting is likely a large contributing factor for this outcome. Tom Dietterich, an arXiv moderator, apparently had a positive impression of the content of your grokking analysis. However, research on arXiv will be more likely to go live if it conforms to standard (ICLR, NeurIPS, ICML) formatting and isn't a blogpost automatically exported into a TeX file.

Comment by Dan H (dan-hendrycks) on Safetywashing · 2022-07-01T16:46:48.066Z · LW · GW

This is why we introduced X-Risk Sheets, a questionnaire that researchers should include in their paper if they're claiming that their paper reduces AI x-risk. This way researchers need to explain their thinking and collect evidence that they're not just advancing capabilities.

We now include these x-risk sheets in our papers. For example, here is an example x-risk sheet included in an arXiv paper we put up yesterday.

Comment by Dan H (dan-hendrycks) on Deepmind's Gopher--more powerful than GPT-3 · 2021-12-12T07:20:06.373Z · LW · GW

Note I'm mainly using this as an opportunity to talk about ideas and compute in NLP.

I don't know how big an improvement DeBERTaV2 is over SoTA.

DeBERTaV2 is pretty solid and mainly got its performance from an architectural change. Note the DeBERTa paper was initially uploaded in 2020, but it was updated early this year to include DeBERTa V2. The previous main popular SOTA on SuperGLUE was T5 (which beat RoBERTa). DeBERTaV2 uses 8x fewer parameters and 4x less compute than T5. DeBERTa's high performance isn't an artifact of SuperGLUE; in downstream tasks such as some legal NLP tasks it does better too.

Compared to unidirectional models on NLU tasks, the bidirectional models do far better. On CommonsenseQA, a good task that's been around for a few years, the bidirectional models do far better than fine-tuned GPT-3--DeBERTaV3 differs in three ideas from GPT-3 (roughly encoding, ELECTRA training, and bidirectionality, if I recall correctly), and it's >400x smaller.

I agree with the overall sentiment that much of the performance is from brute compute, but even in NLP, ideas can help sometimes. For vision/continuous signals, algorithmic advances continue to account for much progress; ideas move the needle substantially more frequently in vision than in NLP.

For tasks when there is less traction, ideas are even more useful. Just to use a recent example, "the use of verifiers results in approximately the same performance boost as a 30x model size increase." I think the initially proposed heuristic depends on how much progress has already been made on a task. For nearly solved tasks, the next incremental idea shouldn't help much. On new hard tasks such as some maths tasks, scaling laws are worse and ideas will be a practical necessity. Not all the first ideas are obvious "low hanging fruits" because it might take a while for the community to get oriented and find good angles of attack.

Comment by Dan H (dan-hendrycks) on Deepmind's Gopher--more powerful than GPT-3 · 2021-12-09T05:55:46.220Z · LW · GW

RE: "like I'm surprised if a clever innovation does more good than spending 4x more compute"

Earlier this year, DeBERTaV2 did better on SuperGLUE than models 10x the size and got state of the art.

Models such as DeBERTaV3 can do better on commonsense question answering tasks than models that are tens or several hundreds of times larger.


Model              Accuracy  Parameters
DeBERTaV3          84.6      0.4B
(11B model)        83.5      11B
Fine-tuned GPT-3   73.0      175B

Bidirectional models + training ideas + better positional encoding helped more than 4x.

Comment by Dan H (dan-hendrycks) on How to turn money into AI safety? · 2021-08-30T20:32:08.819Z · LW · GW

In safety research labs in academe, we do not have a resource edge compared to the rest of the field.

We do not have large GPU clusters, so we cannot train GPT-2 from scratch or fine-tune large language models in a reasonable amount of time.

We also do not have many research engineers (currently zero) to help us execute projects. Some of us have safety projects from over a year ago on the backlog because there are not enough reliable people to help execute the projects.

These are substantial bottlenecks that more resources could resolve.