Posts

AISN #38: Supreme Court Decision Could Limit Federal Ability to Regulate AI Plus, “Circuit Breakers” for AI systems, and updates on China’s AI industry 2024-07-09T19:28:29.338Z
UC Berkeley course on LLMs and ML Safety 2024-07-09T15:40:00.920Z
AI Safety Newsletter #37: US Launches Antitrust Investigations Plus, recent criticisms of OpenAI and Anthropic, and a summary of Situational Awareness 2024-06-18T18:07:45.904Z
AISN #36: Voluntary Commitments are Insufficient Plus, a Senate AI Policy Roadmap, and Chapter 1: An Overview of Catastrophic Risks 2024-06-05T17:45:25.261Z
AISN #35: Lobbying on AI Regulation Plus, New Models from OpenAI and Google, and Legal Regimes for Training on Copyrighted Data 2024-05-16T14:29:21.683Z
AISN #34: New Military AI Systems Plus, AI Labs Fail to Uphold Voluntary Commitments to UK AI Safety Institute, and New AI Policy Proposals in the US Senate 2024-05-02T16:12:47.783Z
AISN #33: Reassessing AI and Biorisk Plus, Consolidation in the Corporate AI Landscape, and National Investments in AI 2024-04-12T16:10:57.837Z
AISN #32: Measuring and Reducing Hazardous Knowledge in LLMs Plus, Forecasting the Future with LLMs, and Regulatory Markets 2024-03-07T16:39:56.027Z
AISN #31: A New AI Policy Bill in California Plus, Precedents for AI Governance and The EU AI Office 2024-02-21T21:58:34.000Z
AISN #30: Investments in Compute and Military AI Plus, Japan and Singapore’s National AI Safety Institutes 2024-01-24T19:38:33.461Z
AISN #29: Progress on the EU AI Act Plus, the NY Times sues OpenAI for Copyright Infringement, and Congressional Questions about Research Standards in AI Safety 2024-01-04T16:09:31.336Z
AISN #28: Center for AI Safety 2023 Year in Review 2023-12-23T21:31:40.767Z
AISN #27: Defensive Accelerationism, A Retrospective On The OpenAI Board Saga, And A New AI Bill From Senators Thune And Klobuchar 2023-12-07T15:59:11.622Z
AISN #26: National Institutions for AI Safety, Results From the UK Summit, and New Releases From OpenAI and xAI 2023-11-15T16:07:37.216Z
AISN #25: White House Executive Order on AI, UK AI Safety Summit, and Progress on Voluntary Evaluations of AI Risks 2023-10-31T19:34:54.837Z
AISN #24: Kissinger Urges US-China Cooperation on AI, China's New AI Law, US Export Controls, International Institutions, and Open Source AI 2023-10-18T17:06:54.364Z
AISN #23: New OpenAI Models, News from Anthropic, and Representation Engineering 2023-10-04T17:37:19.564Z
AISN #22: The Landscape of US AI Legislation - Hearings, Frameworks, Bills, and Laws 2023-09-19T14:44:22.945Z
Uncovering Latent Human Wellbeing in LLM Embeddings 2023-09-14T01:40:24.483Z
MLSN: #10 Adversarial Attacks Against Language and Vision Models, Improving LLM Honesty, and Tracing the Influence of LLM Training Data 2023-09-13T18:03:30.253Z
AISN #21: Google DeepMind’s GPT-4 Competitor, Military Investments in Autonomous Drones, The UK AI Safety Summit, and Case Studies in AI Policy 2023-09-05T15:03:00.177Z
AISN #20: LLM Proliferation, AI Deception, and Continuing Drivers of AI Capabilities 2023-08-29T15:07:03.215Z
Risks from AI Overview: Summary 2023-08-18T01:21:25.445Z
AISN #19: US-China Competition on AI Chips, Measuring Language Agent Developments, Economic Analysis of Language Model Propaganda, and White House AI Cyber Challenge 2023-08-15T16:10:17.594Z
AISN #17: Automatically Circumventing LLM Guardrails, the Frontier Model Forum, and Senate Hearing on AI Oversight 2023-08-01T15:40:20.222Z
AISN #16: White House Secures Voluntary Commitments from Leading AI Labs and Lessons from Oppenheimer 2023-08-01T15:39:47.841Z
AISN #16: White House Secures Voluntary Commitments from Leading AI Labs and Lessons from Oppenheimer 2023-07-25T16:58:44.528Z
AISN#15: China and the US take action to regulate AI, results from a tournament forecasting AI risk, updates on xAI’s plan, and Meta releases its open-source and commercially available Llama 2 2023-07-19T13:01:00.939Z
AISN#14: OpenAI’s ‘Superalignment’ team, Musk’s xAI launches, and developments in military AI use 2023-07-12T16:58:05.183Z
Cost-effectiveness of professional field-building programs for AI safety research 2023-07-10T18:28:36.677Z
Cost-effectiveness of student programs for AI safety research 2023-07-10T18:28:18.073Z
Modeling the impact of AI safety field-building programs 2023-07-10T18:27:24.807Z
AISN #13: An interdisciplinary perspective on AI proxy failures, new competitors to ChatGPT, and prompting language models to misbehave 2023-07-05T15:33:19.699Z
Catastrophic Risks from AI #6: Discussion and FAQ 2023-06-27T23:23:58.846Z
Catastrophic Risks from AI #5: Rogue AIs 2023-06-27T22:06:11.029Z
AISN #12: Policy Proposals from NTIA’s Request for Comment and Reconsidering Instrumental Convergence 2023-06-27T17:20:55.185Z
Catastrophic Risks from AI #4: Organizational Risks 2023-06-26T19:36:41.333Z
Catastrophic Risks from AI #3: AI Race 2023-06-23T19:21:07.335Z
Catastrophic Risks from AI #2: Malicious Use 2023-06-22T17:10:08.374Z
Catastrophic Risks from AI #1: Introduction 2023-06-22T17:09:40.883Z
AISN #9: Statement on Extinction Risks, Competitive Pressures, and When Will AI Reach Human-Level? 2023-06-06T16:10:19.093Z
AI Safety Newsletter #8: Rogue AIs, how to screen for AI risks, and grants for research on democratic governance of AI 2023-05-30T11:52:31.669Z
Statement on AI Extinction - Signed by AGI Labs, Top Academics, and Many Other Notable Figures 2023-05-30T09:05:25.986Z
Is Deontological AI Safe? [Feedback Draft] 2023-05-27T16:39:25.556Z
AI Safety Newsletter #7: Disinformation, Governance Recommendations for AI labs, and Senate Hearings on AI 2023-05-23T21:47:34.755Z
The Polarity Problem [Draft] 2023-05-23T21:05:34.567Z
AI Safety Newsletter #6: Examples of AI safety progress, Yoshua Bengio proposes a ban on AI agents, and lessons from nuclear arms control 2023-05-16T15:14:45.921Z
Aggregating Utilities for Corrigible AI [Feedback Draft] 2023-05-12T20:57:03.712Z
AI Safety Newsletter #5: Geoffrey Hinton speaks out on AI risk, the White House meets with AI labs, and Trojan attacks on language models 2023-05-09T15:26:55.978Z
AI Safety Newsletter #4: AI and Cybersecurity, Persuasive AIs, Weaponization, and Geoffrey Hinton talks AI risks 2023-05-02T18:41:43.144Z

Comments

Comment by Dan H (dan-hendrycks) on Fabien's Shortform · 2024-06-22T03:14:31.977Z · LW · GW

Got a massive simplification of the main technique within days of being released

The loss is cleaner, IDK about "massively," because in the first half of the loss we use a simpler distance involving 2 terms instead of 3. This doesn't affect performance and doesn't markedly change quantitative or qualitative claims in the paper. Thanks to Marks and Patel for pointing out the equivalent cleaner loss, and happy for them to be authors on the paper.

p=0.8 that someone finds good token-only jailbreaks to whatever is open-sourced within 3 months.

This puzzles me and maybe we just have a different sense of what progress in adversarial robustness looks like. 20% that no one could find a jailbreak within 3 months? That would be the most amazing advance in robustness ever if that were true and should be a big update on jailbreak robustness tractability. If it takes the community more than a day that's a tremendous advance.

people will easily find reliable jailbreaks

This is a little nonspecific (does easily mean >0% ASR with an automated attack, or does it mean a high ASR?). I should say we manually found a jailbreak after messing with the model for around a week after releasing. We also invited people who have a reputation as jailbreakers to poke at it and they had a very hard time. Nowhere did we claim "there are no more jailbreaks and they are solved once and for all," but I do think it's genuinely harder now.

Circuit breakers won’t prove significantly more robust than regular probing in a fair comparison

We had the idea a few times to try out a detection-based approach but we didn't get around to it. It seems possible that it'd perform similarly if it's leaning on the various things we did in the paper. (Obviously probing has been around but people haven't gotten results at this level, and people have certainly tried using detecting adversarial attacks in hundreds of papers in the past.) IDK if performance would be that different from circuit-breakers, in which case this would still be a contribution. I don't really care about the aesthetics of methods nearly as much as the performance, and similarly performing methods are fine in my book. A lot of different-looking deep learning methods perform similarly. A detection based method seems fine, so does a defense that's tuned into the model; maybe they could be stacked. Maybe will run a detector probe this weekend and update the paper with results if everything goes well. If we do find that it works, I think it'd be unfair to desscribe this after the fact as "overselling results and using fancy techniques that don't improve on simpler techniques" as done for RMU.

My main disagreement is with the hype.

We're not responsible for that. Hype is inevitable for most established researchers. Mediocre big AI company papers get lots of hype. Didn't even do customary things like write a corresponding blog post yet. I just tweeted the paper and shared my views in the same tweet: I do think jailbreak robustness is looking easier than expected, and this is affecting my priorities quite a bit.

Aims to do unlearning in a way that removes knowledge from LLMs

Yup that was the aim for the paper and for method development. We poked at the method for a whole month after the paper's release. We didn't find anything, though in that process I slowly reconceptualized RMU as more of a circuit-breaking technique and something that's just doing a bit of unlearning. It's destroying some key function-relevant bits of information that can be recovered, so it's not comprehensively wiping. IDK if I'd prefer unlearning (grab concept and delete it) vs circuit-breaking (grab concept and put an internal tripwire around it); maybe one will be much more performant than the other or easier to use in practice. Consequently I think there's a lot to do in developing unlearning methods (though I don't know if they'll be preferable to the latter type of method).

overselling results and using fancy techniques that don't improve on simpler techniques

This makes it sound like the simplification was lying around and we deliberately made it more complicated, only to update it to have a simpler forget term. We compare to multiple baselines, do quite a bit better than them, do enough ablations to be accepted at ICML (of course there are always more you could want), and all of our numbers are accurate. We could have just included the dataset without the method in the paper, and it would have still got news coverage (Alex Wang who is a billionaire was on the paper and it was on WMDs).

Probably the only time I chose to use something a little more mathematically complicated than was necessary was the Jensen-Shannon loss in AugMix. It performed similarly to doing three pairwise l2 distances between penultimate representations, but this was more annoying to write out. Usually I'm accused of doing papers that are on the simplistic side (sometimes papers like the OOD baseline paper caused frustration because it's getting credit for something very simple) since I don't optimize for cleverness, and my collaborators know full well that I discourage trying to be clever since it's often anticorrelated with performance.

Not going to check responses because I end up spending too much time typing for just a few viewers.

Comment by Dan H (dan-hendrycks) on What do coherence arguments actually prove about agentic behavior? · 2024-06-02T17:41:22.537Z · LW · GW

Key individuals that the community is structured around just ignored it, so it wasn't accepted as true. (This is a problem with small intellectual groups.)

Comment by Dan H (dan-hendrycks) on Buck's Shortform · 2024-05-27T07:32:52.393Z · LW · GW

Some years ago we wrote that "[AI] systems will monitor for destructive behavior, and these monitoring systems need to be robust to adversaries" and discussed monitoring systems that can create "AI tripwires could help uncover early misaligned systems before they can cause damage." https://www.lesswrong.com/posts/5HtDzRAk7ePWsiL2L/open-problems-in-ai-x-risk-pais-5#Adversarial_Robustness

Since then, I've updated that adversarial robustness for LLMs is much more tractable (preview of paper out very soon). In vision settings, progress is extraordinarily slow but not necessarily for LLMs.

Comment by Dan H (dan-hendrycks) on Introducing AI Lab Watch · 2024-05-05T16:59:03.957Z · LW · GW

Various comments:

I wouldn't call this "AI lab watch." "Lab" has the connotation that these are small projects instead of multibillion dollar corporate behemoths.

"deployment" initially sounds like "are they using output filters which harm UX in deployment", but instead this seems to be penalizing organizations if they open source. This seems odd since open sourcing is not clearly bad right now. The description also makes claims like "Meta release all of their weights"---they don't release many image/video models because of deepfakes, so they are doing some cost-benefit analysis. Zuck: "So we want to see what other people are observing, what we’re observing, what we can mitigate, and then we'll make our assessment on whether we can make it open source." If this is mainly a penalty against open sourcing the label should be clearer.

"Commit to do pre-deployment risk assessment" They've all committed to this in the WH voluntary commitments and I think the labs are doing things on this front.

"Do risk assessment" These companies have signed on to WH voluntary commitments so are all checking for these things, and the EO says to check for these hazards too. This is why it's surprising to see Microsoft have 1% given that they're all checking for these hazards.

Looking at the scoring criteria, this seems highly fixated on rogue AIs, but I understand I'm saying that to the original forum of these concerns. Risk assessment's scoring doesn't really seem to prioritize bio x-risk as much as scheming AIs. This is strange because if we're focused on rogue AIs I'd put a half the priority of risk mitigation while the model is training. Many rogue AI people may think half of the time the AI will kill everyone is when the model is "training" (because it will escape during that time).

The first sentence of this site says the focus is on "extreme risks" but it seems the focus is mainly on rogue AIs. This should be upfront that this is from the perspective that loss of control is the main extreme risk, rather than positioning itself as a comprehensive safety tracker. If I were tracking rogue AI risks, I'd probably drill down to what they plan to do with automated AI R&D/intelligence explosions.

"Training" This seems to give way more weight to rogue AI stuff. Red teaming is actually assessable, but instead you're giving twice the points to if they have someone "work on scalable oversight." This seems like an EA vibes check rather than actually measuring something. This also seems like triple counting since it's highly associated with the "scalable alignment" section and the "alignment program" section. This doesn't even require that they use the technique for the big models they train and deploy. Independently, capabilities work related to building superintelligences can easily be framed as scalable oversight, so this doesn't set good incentives. Separately, at the end this also gives lots of points for voluntary (read: easily breakable) commitments. These should not be trusted and I think the amount of lipservice points is odd.

"Security" As I said on EAF the security scores are suspicious to me and even look backward. The major tech companies have much more experience protecting assets (e.g., clouds need to be highly secure) than startups like Anthropic and OpenAI. It takes years building up robust information security and the older companies have a sizable advantage.

"internal governance" scores seem odd. Older, larger institutions such as Microsoft and Google have many constraints and processes and don't have leaders who can unilaterally make decisions as easily, compared to startups. Their CEOs are also more fireable (OpenAI), and their board members aren't all selected by the founder (Anthropic). This seems highly keyed into if they are just a PBC or non-profit. In practice PBC just makes it harder to sue, but Zuck has such control of his company that getting successfully sued for not upholding his fiduciary duty to shareholders seems unlikely. It seems 20% of the points is not using non-disparagement agreements?? 30% is for whistleblower policies; CA has many whistleblower protections if I recall correctly. No points for a chief risk officer or internal audit committee?

"Alignment program" "Other labs near the frontier publish basically no alignment research" Meta publishes dozens of papers they call "alignment"; these actually don't feel that dissimilar to papers like Constitutional AI-like papers (https://twitter.com/jaseweston/status/1748158323369611577 https://twitter.com/jaseweston/status/1770626660338913666 https://arxiv.org/pdf/2305.11206 ). These papers aren't posted to LW but they definitely exist. To be clear I think this is general capabilities but this community seems to think differently. Alignment cannot be "did it come from EA authors" and it probably should not be "does it use alignment in its title." You'll need to be clear how this distinction is drawn.

Meta has people working on safety and CBRN+cyber + adversarial robustness etc. I think they're doing a good job (here are two papers from the last month: https://arxiv.org/pdf/2404.13161v1 https://arxiv.org/pdf/2404.16873).

As is, I think this is a little too quirky and not ecumenical enough for it to generate social pressure.

There should be points for how the organizations act wrt to legislation. In the SB 1047 bill that CAIS co-sponsored, we've noticed some AI companies to be much more antagonistic than others. I think is is probably a larger differentiator for an organization's goodness or badness.

(Won't read replies since I have a lot to do today.)

Comment by Dan H (dan-hendrycks) on Refusal in LLMs is mediated by a single direction · 2024-04-28T01:56:32.223Z · LW · GW

is novel compared to... RepE

This is inaccurate, and I suggest reading our paper: https://arxiv.org/abs/2310.01405

Demonstrate full ablation of the refusal behavior with much less effect on coherence

In our paper and notebook we show the models are coherent.

Investigate projection

We did investigate projection too (we use it for concept removal in the RepE paper) but didn't find a substantial benefit for jailbreaking.

harmful/harmless instructions

We use harmful/harmless instructions.

Find that projecting away the (same, linear) feature at all layers improves upon steering at a single layer

In the RepE paper we target multiple layers as well.

Test on many different models

The paper used Vicuna, the notebook used Llama 2. Throughout the paper we showed the general approach worked on many different models.

Describe a way of turning this into a weight-edit

We do weight editing in the RepE paper (that's why it's called RepE instead of ActE).

Comment by Dan H (dan-hendrycks) on Refusal in LLMs is mediated by a single direction · 2024-04-28T00:40:50.812Z · LW · GW

but generally people should be free to post research updates on LW/AF that don't have a complete thorough lit review / related work section.

I agree if they simultaneously agree that they don't expect the post to be cited. These can't posture themselves as academic artifacts ("Citing this work" indicates that's the expectation) and fail to mention related work. I don't think you should expect people to treat it as related work if you don't cover related work yourself.

Otherwise there's a race to the bottom and it makes sense to post daily research notes and flag plant that way. This increases pressure on researchers further.

including refusal-bypassing-related ones

The prior work that is covered in the document is generally less related (fine-tuning removal of safeguards, truth directions) compared to these directly relevant ones. This is an unusual citation pattern and gives the impression that the artifact is making more progress/advancing understanding than it actually is.

I'll note pretty much every time I mention something isn't following academic standards on LW I get ganged up on and I find it pretty weird. I've reviewed, organized, and can be senior area chair at ML conferences and know the standards well. Perhaps this response is consistent because it feels like an outside community imposing things on LW.

Comment by Dan H (dan-hendrycks) on Refusal in LLMs is mediated by a single direction · 2024-04-27T21:14:37.168Z · LW · GW

From Andy Zou:

Thank you for your reply.

Model interventions to bypass refusal are not discussed in Section 6.2.

We perform model interventions to robustify refusal (your section on “Adding in the "refusal direction" to induce refusal”). Bypassing refusal, which we do in the GitHub demo, is merely adding a negative sign to the direction. Either of these experiments show refusal can be mediated by a single direction, in keeping with the title of this post.

we examined Section 6.2 carefully before writing our work

Not mentioning it anywhere in your work is highly unusual given its extreme similarity. Knowingly not citing probably the most related experiments is generally considered plagiarism or citation misconduct, though this is a blog post so norms for thoroughness are weaker. (lightly edited by Dan for clarity)

Ablating vs. Addition

We perform a linear combination operation on the representation. Projecting out the direction is one instantiation of it with a particular coefficient, which is not necessary as shown by our GitHub demo. (Dan: we experimented with projection in the RepE paper and didn't find it was worth the complication. We look forward to any results suggesting a strong improvement.)

--

Please reach out to Andy if you want to talk more about this.

Edit: The work is prior art (it's been over six months+standard accessible format), the PIs are aware of the work (the PI of this work has spoken about it with Dan months ago, and the lead author spoke with Andy about the paper months ago), and its relative similarity is probably higher than any other artifact. When this is on arXiv we're asking you to cite the related work and acknowledge its similarities rather than acting like these have little to do with each other/not mentioning it. Retaliating by some people dogpile voting/ganging up on this comment to bury sloppy behavior/an embarrassing oversight is not the right response (went to -18 very quickly).

Edit 2: On X, Neel "agree[s] it's highly relevant" and that he'll cite it. Assuming it's covered fairly and reasonably, this resolves the situation.

Edit 3: I think not citing it isn't a big deal because I think of LW as a place for ml research rough drafts, in which errors will happen. But if some are thinking it's at the level of an academic artifact/is citable content/is an expectation others cite it going forward, then failing to mention extremely similar results would actually be a bigger deal. Currently I'll think it's the former.

Comment by Dan H (dan-hendrycks) on Refusal in LLMs is mediated by a single direction · 2024-04-27T18:18:44.355Z · LW · GW

From Andy Zou:

Section 6.2 of the Representation Engineering paper shows exactly this (video). There is also a demo here in the paper's repository which shows that adding a "harmlessness" direction to a model's representation can effectively jailbreak the model.

Going further, we show that using a piece-wise linear operator can further boost model robustness to jailbreaks while limiting exaggerated refusal. This should be cited.

Comment by Dan H (dan-hendrycks) on A Gentle Introduction to Risk Frameworks Beyond Forecasting · 2024-04-12T06:36:13.689Z · LW · GW

If people are interested, many of these concepts and others are discussed in the context of AI safety in this publicly available chapter: https://www.aisafetybook.com/textbook/4-1

Comment by Dan H (dan-hendrycks) on On Complexity Science · 2024-04-05T20:58:01.612Z · LW · GW

Here is a chapter from an upcoming textbook on complex systems with discussion of their application to AI safety: https://www.aisafetybook.com/textbook/5-1

Comment by Dan H (dan-hendrycks) on Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training · 2024-01-14T17:18:41.332Z · LW · GW

> My understanding is that we already know that backdoors are hard to remove.

We don't actually find that backdoors are always hard to remove!

We did already know that backdoors often (from the title) "Persist Through Safety Training." This phenomenon studied here and elsewhere is being taken as the main update in favor of AI x-risk. This doesn't establish probability of the hazard, but it reminds us that backdoor hazards can persist if present.

I think it's very easy to argue the hazard could emerge from malicious actors poisoning pretraining data, and harder to argue it would arise naturally. AI security researchers such as Carlini et al. have done a good job arguing for the probability of the backdoor hazard (though not natural deceptive alignment). (I think malicious actors unleashing rogue AIs is a concern for the reasons bio GCRs are a concern; if one does it, it could be devastating.)

I think this paper shows the community at large will pay orders of magnitude more attention to a research area when there is, in @TurnTrout's words,  AGI threat scenario "window dressing," or when players from an EA-coded group research a topic. (I've been suggesting more attention to backdoors since maybe 2019; here's a video from a few years ago about the topic; we've also run competitions at NeurIPS with thousands of submissions on backdoors.) Ideally the community would pay more attention to relevant research microcosms that don't have the window dressing.

I think AI security-related topics have a very good track record of being relevant for x-risk (backdoors, unlearning, adversarial robustness). It's a been better portfolio than the EA AGI x-risk community portfolio (decision theory, feature visualizations, inverse reinforcement learning, natural abstractions, infrabayesianism, etc.). At a high level its saying power is because AI security is largely about extreme reliability; extreme reliability is not automatically provided by scaling, but most other desiderata are (e.g., commonsense understanding of what people like and dislike).

A request: Could Anthropic employees not call supervised fine-tuning and related techniques "safety training?" OpenAI/Anthropic have made "alignment" in the ML community become synonymous with fine-tuning, which is a big loss. Calling this "alignment training" consistently would help reduce the watering down of the word "safety."

Comment by Dan H (dan-hendrycks) on Machine Unlearning Evaluations as Interpretability Benchmarks · 2023-10-23T20:53:43.425Z · LW · GW

I agree that this is an important frontier (and am doing a big project on this).

Comment by Dan H (dan-hendrycks) on Broken Benchmark: MMLU · 2023-08-30T02:11:13.222Z · LW · GW

Almost all datasets have label noise. Most 4-way multiple choice NLP datasets collected with MTurk have ~10% label noise, very roughly. My guess is MMLU has 1-2%. I've seen these sorts of label noise posts/papers/videos come out for pretty much every major dataset (CIFAR, ImageNet, etc.).

Comment by Dan H (dan-hendrycks) on AI Forecasting: Two Years In · 2023-08-21T16:53:18.063Z · LW · GW

The purpose of this is to test and forecast problem-solving ability, using examples that substantially lose informativeness in the presence of Python executable scripts. I think this restriction isn't an ideological statement about what sort of alignment strategies we want.

Comment by Dan H (dan-hendrycks) on AI Forecasting: Two Years In · 2023-08-21T16:52:01.907Z · LW · GW

I think there's a clear enough distinction between Transformers with and without tools. The human brain can also be viewed as a computational machine, but when exams say "no calculators," they're not banning mental calculation, rather specific tools.

Comment by Dan H (dan-hendrycks) on AI Forecasting: Two Years In · 2023-08-21T16:25:07.570Z · LW · GW

It was specified in the beginning of 2022 in https://www.metaculus.com/questions/8840/ai-performance-on-math-dataset-before-2025/#comment-77113 In your metaculus question you may not have added that restriction. I think the question is much less interesting/informative if it does not have that restriction. The questions were designed assuming there's no calculator access. It's well-known many AIME problems are dramatically easier with a powerful calculator, since one could bash 1000 options and find the number that works for many problems. That's no longer testing problem-solving ability; it tests the ability to set up a simple script so loses nearly all the signal. Separately, the human results we collected was with a no calculator restriction. AMC/AIME exams have a no calculator restriction. There are different maths competitions that allow calculators, but there are substantially fewer quality questions of that sort.

I think MMLU+calculator is fine though since many of the exams from which MMLU draws allow calculators.

Comment by Dan H (dan-hendrycks) on AI Forecasting: Two Years In · 2023-08-20T16:47:00.514Z · LW · GW

Usage of calculators and scripts are disqualifying on many competitive maths exams. Results obtained this way wouldn't count (this was specified some years back). However, that is an interesting paper worth checking out.

Comment by Dan H (dan-hendrycks) on Announcing Foresight Institute's AI Safety Grants Program · 2023-08-17T17:52:29.348Z · LW · GW
  1. Neurotechnology, brain computer interface, whole brain emulation, and "lo-fi" uploading approaches to produce human-aligned software intelligence

Thank you for doing this.

Comment by dan-hendrycks on [deleted post] 2023-08-07T01:39:00.438Z

There's a literature on this topic. (paper list, lecture/slides/homework)

Comment by Dan H (dan-hendrycks) on Alignment Grantmaking is Funding-Limited Right Now · 2023-07-19T21:28:44.893Z · LW · GW

Plug: CAIS is funding constrained.

Comment by Dan H (dan-hendrycks) on Why was the AI Alignment community so unprepared for this moment? · 2023-07-15T17:38:10.178Z · LW · GW

Why was the AI Alignment community so unprepared for engaging with the wider world when the moment finally came?

In 2022, I think it was becoming clear that there'd be a huge flood of interest. Why did I think this? Here are some reasons: I've long thought that once MMLU performance crosses a threshold, Google would start to view AI as an existential threat to their search engine, and it seemed like in 2023 that threshold would be crossed. Second, at a rich person's party, there were many highly plugged-in elites who were starting to get much more anxious about AI (this was before ChatGPT), which updated me that the tide may turn soon.

Since I believed the interest would shift so much, I changed how I spent my time a lot in 2022: I started doing substantially less technical work to instead work on outreach and orienting documents. Here are several projects I did, some for targeted for the expert community and some targeted towards the general public:

  • We ran an AI arguments writing competition.  After seeing that we could not crowdsource AI risk writing to the community through contests last year, I also started work on An Overview of Catastrophic Risks last winter.  We had a viable draft several in April, but then I decided to restructure it, which required rewriting it and making it longer. This document was partly a synthesis of the submissions from the first round of the AI arguments competition, so fortunately the competition did not go to waste. Apologies the document took so long.
  • Last summer and fall, I worked on explaining a different AI risk to a lay audience in Natural Selection Favors AIs over Humans (apparently this doom path polls much better than treacherous turn stories; I held onto the finished paper for months and waited for GPT-4's release before releasing it to have good timing).
  • X-Risk Analysis for AI Research tries to systematically articulate how to analyze AI research's relation to x-risk for a technical audience. It was my first go at writing about AI x-risk for the ML research community. I recognize this paper was around a year ahead of its time and maybe I should have held onto it to release it later.
  • Finally, after a conversation with Kelsey Piper and the aforementioned party, I was inspired to work on a textbook An Introduction to AI Safety, Ethics, and Society. This is by far the largest writing project I've been a part of.  Currently, the only way to become an AI x-risk expert is to live in Berkeley. I want to reduce this barrier as much as possible, relate AI risk to existing literatures, and let people have a more holistic understanding of AI risk (I think people should have a basic understanding of all of corrigibility, international coordination for AI, deception, etc.).  This book is not an ML PhD topics book; it's more to give generalists good models. The textbook's contents will start to be released section-by-section on a daily basis starting late this month or next month. Normally textbooks take several years to make, so I'm happy this will be out relatively quickly.

One project we only started in 2023 is newsletter, so we can't claim prescience for that.

If you want more AI risk outputs, CAIS is funding-constrained and is currently fundraising for a writer.

Comment by Dan H (dan-hendrycks) on Elon Musk announces xAI · 2023-07-14T18:06:09.813Z · LW · GW

No good deed goes unpunished. By default there would likely be no advising.

Comment by Dan H (dan-hendrycks) on Catastrophic Risks from AI #1: Introduction · 2023-06-27T04:53:18.858Z · LW · GW

A brief overview of the contents, page by page.

1: most important century and hinge of history

2: wisdom needs to keep up with technological power or else self-destruction / the world is fragile / cuban missile crisis

3: unilateralist's curse

4: bio x-risk

5: malicious actors intentionally building power-seeking AIs / anti-human accelerationism is common in tech

6: persuasive AIs and eroded epistemics

7: value lock-in and entrenched totalitarianism

8: story about bioterrorism

9: practical malicious use suggestions


10: LAWs as an on-ramp to AI x-risk

11: automated cyberwarfare -> global destablization

12: flash war, AIs in control of nuclear command and control

13: security dilemma means AI conflict can bring us to brink of extinction

14: story about flash war

15: erosion of safety due to corporate AI race

16: automation of AI research; autnomous/ascended economy; enfeeblement

17: AI development reinterpreted as evolutionary process

18: AI development is not aligned with human values but with competitive and evolutionary pressures

19: gorilla argument, AIs could easily outclass humans in so many ways

20: story about an autonomous economy

21: practical AI race suggestions


22: examples of catastrophic accidents in various industries

23: potential AI catastrophes from accidents, Normal Accidents

24: emergent AI capabilities, unknown unknowns

25: safety culture (with nuclear weapons development examples), security mindset

26: sociotechnical systems, safety vs. capabilities

27: safetywashing, defense in depth

28: story about weak safety culture

29: practical suggestions for organizational safety

30: more practical suggestions for organizational safety


31: bing and microsoft tay demonstrate how AIs can be surprisingly unhinged/difficult to steer

32: proxy gaming/reward hacking

33: goal drift

34: spurious cues can cause AIs to pursue wrong goals/intrinsification

35: power-seeking (tool use, self-preservation)

36: power-seeking continued (AIs with different goals could be uniquely adversarial)

37: deception examples

38: treacherous turns and self-awareness

39: practical suggestions for AI control

40: how AI x-risk relates to other risks

41: conclusion

Comment by Dan H (dan-hendrycks) on MetaAI: less is less for alignment. · 2023-06-16T19:03:52.572Z · LW · GW

but I'm confident it isn't trying to do this

It is. It's an outer alignment benchmark for text-based agents (such as GPT-4), and it includes measurements for deception, resource acquisition, various forms of power, killing, and so on. Separately, it's to show reward maximization induces undesirable instrumental (Machiavellian) behavior in less toyish environments, and is about improving the tradeoff between ethical behavior and reward maximization. It doesn't get at things like deceptive alignment, as discussed in the x-risk sheet in the appendix. Apologies that the paper is so dense, but that's because it took over a year.

Comment by Dan H (dan-hendrycks) on Request: stop advancing AI capabilities · 2023-05-27T06:23:16.501Z · LW · GW

successful interpretability tools want to be debugging/analysis tools of the type known to be very useful for capability progress

Give one example of a substantial state-of-the-art advance that decisively influenced by transparency; I ask since you said "known to be." Saying that it's conceivable isn't evidence they're actually highly entangled in practice. The track record is that transparency research gives us differential technological progress and pretty much zero capabilities externalities.

In the DL paradigm you can't easily separate capabilities and alignment

This is true for conceptual analysis. Empirically they can be separated by measurement. Record general capabilities metrics (e.g., generally downstream accuracy) and record safety metrics (e.g., trojan detection performance); see whether an intervention improves a safety goal and whether it improves general capabilities or not. For various safety research areas there aren't externalities. (More discussion of on this topic here.)

forcing that separation seems to constrain us

I think the poor epistemics on this topic has encouraged risk taking, have reduced the pressure to find clear safety goals, and allowed researchers to get away with "trust me I'm making the right utility calculations and have the right empirical intuitions" which is a very unreliable standard of evidence in deep learning.

Comment by Dan H (dan-hendrycks) on The Polarity Problem [Draft] · 2023-05-24T18:40:23.345Z · LW · GW

I asked for permission via Intercom to post this series on March 29th. Later, I asked for permission to use the [Draft] indicator and said it was written by others. I got permission for both of these, but the same person didn't give permission for both of these requests. Apologies this was not consolidated into one big ask with lots of context. (Feel free to get rid of any undue karma.)

Comment by Dan H (dan-hendrycks) on The Polarity Problem [Draft] · 2023-05-24T18:37:03.458Z · LW · GW
Comment by Dan H (dan-hendrycks) on Steering GPT-2-XL by adding an activation vector · 2023-05-15T14:00:12.489Z · LW · GW

It's a good observation that it's more efficient; does it trade off performance? (These sorts of comparisons would probably be demanded if it was submitted to any other truth-seeking ML venue, and I apologize for consistently being the person applying the pressures that generic academics provide. It would be nice if authors would provide these comparisons.)

 

Also, taking affine combinations in weight-space is not novel to Schmidt et al either. If nothing else, the Stable Diffusion community has been doing that since October to add and subtract capabilities from models.


It takes months to write up these works, and since the Schmidt paper was in December, it is not obvious who was first in all senses. The usual standard is to count the time a standard-sized paper first appeared on arXiv, so the most standard sense they are first. (Inside conferences, a paper is considered prior art if it was previously published, not just if it was arXived, but outside most people just keep track of when it was arXived.) Otherwise there are arms race dynamics leading to everyone spamming snippets before doing careful, extensive science.

Comment by Dan H (dan-hendrycks) on Steering GPT-2-XL by adding an activation vector · 2023-05-15T01:56:56.579Z · LW · GW

steering the model using directions in activation space is more valuable than doing the same with weights, because in the future the consequences of cognition might be far-removed from its weights (deep deceptiveness)

(You linked to "deep deceptiveness," and I'm going to assume is related to self-deception (discussed in the academic literature and in the AI and evolution paper). If it isn't, then this point is still relevant for alignment since self-deception is another internal hazard.)

I think one could argue that self-deception could in some instances be spotted in the weights more easily than in the activations. Often the functionality acquired by self-deception is not activated, but it may be more readily apparent in the weights. Hence I don't see this as a strong reason to dismiss https://arxiv.org/abs/2212.04089. I would want a weight version of a method and an activation version of a method; they tend to have different strengths.

Note: If you're wanting to keep track of safety papers outside of LW/AF, papers including https://arxiv.org/abs/2212.04089 were tweeted on https://twitter.com/topofmlsafety and posted on https://www.reddit.com/r/mlsafety

Edit: I see passive disagreement but no refutation. The argument against weights was of the form "here's a strength activations has"; for it to be enough to dismiss the paper without discussion, that must be an extremely strong property to outweigh all of its potential merits, or it is a Pareto-improvement. Those don't seem corroborated or at all obvious.

Comment by Dan H (dan-hendrycks) on Steering GPT-2-XL by adding an activation vector · 2023-05-14T23:57:41.894Z · LW · GW

Page 4 of this paper compares negative vectors with fine-tuning for reducing toxic text: https://arxiv.org/pdf/2212.04089.pdf#page=4

In Table 3, they show in some cases task vectors can improve fine-tuned models.

Comment by Dan H (dan-hendrycks) on Steering GPT-2-XL by adding an activation vector · 2023-05-14T23:25:15.651Z · LW · GW

Yes, I'll tend to write up comments quickly so that I don't feel as inclined to get in detailed back-and-forths and use up time, but here we are. When I wrote it, I thought there were only 2 things mentioned in the related works until Daniel pointed out the formatting choice, and when I skimmed the post I didn't easily see comparisons or discussion that I expected to see, hence I gestured at needing more detailed comparisons. After posting, I found a one-sentence comparison of the work I was looking for, so I edited to include that I found it, but it was oddly not emphasized. A more ideal comment would have been "It would be helpful to me if this work would more thoroughly compare to (apparently) very related works such as ..."

Comment by Dan H (dan-hendrycks) on Steering GPT-2-XL by adding an activation vector · 2023-05-14T22:48:42.864Z · LW · GW

In many of my papers, there aren't fairly similar works (I strongly prefer to work in areas before they're popular), so there's a lower expectation for comparison depth, though breadth is always standard. In other works of mine, such as this paper on learning the the right thing in the presence of extremely bad supervision/extremely bad training objectives, we contrast with the two main related works for two paragraphs, and compare to these two methods for around half of the entire paper.

The extent of an adequate comparison depends on the relatedness. I'm of course not saying every paper in the related works needs its own paragraph. If they're fairly similar approaches, usually there also needs to be empirical juxtapositions as well. If the difference between these papers is: we do activations, they do weights, then I think that warrants a more in-depth conceptual comparisons or, preferably, many empirical comparisons.

Comment by Dan H (dan-hendrycks) on Steering GPT-2-XL by adding an activation vector · 2023-05-14T22:05:46.860Z · LW · GW

Yes, I was--good catch. Earlier and now, unusual formatting/and a nonstandard related works is causing confusion. Even so, the work after the break is much older. The comparison to works such as https://arxiv.org/abs/2212.04089 is not in the related works and gets a sentence in a footnote: "That work took vectors between weights before and after finetuning on a new task, and then added or subtracted task-specific weight-diff vectors."

Is this big difference? I really don't know; it'd be helpful if they'd contrast more. Is this work very novel and useful, and that one isn't any good for alignment? Or did Ludwig Schmidt (not x-risk pilled) and coauthors in Editing Models with Task Arithmetic (made public last year and is already published) come up with an idea similar to, according to a close observer, "the most impressive concrete achievement in alignment I've seen"? If so, what does that say about the need to be x-risk motivated to do relevant research, and what does this say about group epistemics/ability to spot relevant progress if it's not posted on the AF?

Comment by Dan H (dan-hendrycks) on Steering GPT-2-XL by adding an activation vector · 2023-05-14T21:24:24.804Z · LW · GW

Background for people who understandably don't habitually read full empirical papers:
Related Works sections in empirical papers tend to include many comparisons in a coherent place. This helps contextualize the work and helps busy readers quickly identify if this work is meaningfully novel relative to the literature. Related works must therefore also give a good account of the literature. This helps us more easily understand how much of an advance this is. I've seen a good number of papers steering with latent arithmetic in the past year, but I would be surprised if this is the first time many readers of AF/LW have seen it, which would make this paper seem especially novel. A good related works section would more accurately and quickly communicate how novel this is. I don't think this norm is gatekeeping nor pedantic; it becomes essential when the number of papers becomes high.

The total number of cited papers throughout the paper is different from the number of papers in the related works. If a relevant paper is buried somewhere randomly in a paper and not contrasted with explicitly in the related works section, that is usually penalized in peer review.

Comment by Dan H (dan-hendrycks) on Steering GPT-2-XL by adding an activation vector · 2023-05-13T19:55:24.577Z · LW · GW

Could these sorts of posts have more thorough related works sections? It's usually standard for related works in empirical papers to mention 10+ works. Update: I was looking for a discussion of https://arxiv.org/abs/2212.04089, assumed it wasn't included in this post, and many minutes later finally found a brief sentence about it in a footnote.

Comment by Dan H (dan-hendrycks) on What‘s in your list of unsolved problems in AI alignment? · 2023-03-07T20:37:45.517Z · LW · GW

Open Problems in AI X-Risk:

https://www.alignmentforum.org/s/FaEBwhhe3otzYKGQt/p/5HtDzRAk7ePWsiL2L

Comment by Dan H (dan-hendrycks) on Power-Seeking = Minimising free energy · 2023-02-23T19:18:10.183Z · LW · GW

Thermodynamics theories of life can be viewed as a generalization of Darwinism, though in my opinion the abstraction ends up being looser/less productive, and I think it's more fruitful just to talk in evolutionary terms directly.

You might find these useful:

God's Utility Function

A New Physics Theory of Life

Entropy and Life (Wikipedia)

AI and Evolution

Comment by Dan H (dan-hendrycks) on A (EtA: quick) note on terminology: AI Alignment != AI x-safety · 2023-02-09T15:48:53.817Z · LW · GW

"AI Safety" which often in practice means "self driving cars"

This may have been true four years ago, but ML researchers at leading labs rarely directly work on self-driving cars (e.g., research on sensor fusion). AV is has not been hot in quite a while. Fortunately now that AGI-like chatbots are popular, we're moving out of the realm of talking about making very narrow systems safer. The association with AV was not that bad since it was about getting many nines of reliability/extreme reliability, which was a useful subgoal. Unfortunately the world has not been able to make a DL model completely reliable in any specific domain (even MNIST).

Of course, they weren't talking about x-risks, but neither are industry researchers using the word "alignment" today to mean they're fine-tuning a model to be more knowledgable or making models better satisfy capabilities wants (sometimes dressed up as "human values").

If you want a word that reliably denotes catastrophic risks that is also mainstream, you'll need to make catastrophic risk ideas mainstream. Expect it to be watered down for some time, or expect it not to go mainstream.

Comment by Dan H (dan-hendrycks) on Quick thoughts on "scalable oversight" / "super-human feedback" research · 2023-01-30T05:22:36.443Z · LW · GW

When ML models get more competent, ML capabilities researchers will have strong incentives to build superhuman models. Finding superhuman training techniques would be the main thing they'd work on. Consequently, when the problem is more tractable, I don't see why it'd be neglected by the capabilities community--it'd be unreasonable for profit maximizers not to have it as a top priority when it becomes tractable. I don't see why alignment researchers have to work in this area with high externalities now and ignore other safe alignment research areas (in practice, the alignment teams with compute are mostly just working on this area). I'd be in favor of figuring out how to get superhuman supervision for specific things related to normative factors/human values (e.g., superhuman wellbeing supervision), but researching superhuman supervision simpliciter will be the aim of the capabilities community.

Don't worry, the capabilities community will relentlessly maximize vanilla accuracy, and we don't need to help them.

Comment by Dan H (dan-hendrycks) on A Simple Alignment Typology · 2023-01-28T16:11:20.885Z · LW · GW

Empiricists think the problem is hard, AGI will show up soon, and if we want to have any hope of solving it, then we need to iterate and take some necessary risk by making progress in capabilities while we go.

This may be so for the OpenAI alignment team's empirical researchers, but other empirical researchers note we can work on several topics to reduce risk without substantially advancing general capabilities. (As far as I can tell, they are not working on any of the following topics, rather focusing on an avenue to scalable oversight which, as instantiated, mostly serves to make models generally better at programming.)

Here are four example areas with minimal general capabilities externalities (descriptions taken from Open Problems in AI X-Risk):

Trojans - AI systems can contain “trojan” hazards. Trojaned models behave typically in most situations, but when specific secret situations are met, they reliably misbehave. For example, an AI agent could behave normally, but when given a special secret instruction, it could execute a coherent and destructive sequence of actions. In short, this area is about identifying hidden functionality embedded in models that could precipitate a treacherous turn.  Work on detecting trojans does not improve general language model or image classifier accuracy, so the general capabilities externalities are moot.

Anomaly detection - This area is about detecting potential novel hazards such as unknown unknowns, unexpected rare events, or emergent phenomena. (This can be used for tripwires, detecting proxy gaming, detecting trojans, malicious actors, possibly for detecting emergent goals.) In anomaly detection, general capabilities externalities are easy to avoid.

Power Aversion - This area is about incentivizing models to avoid gaining more power than is necessary and analyzing how power trades off with reward. This area is deliberately about measuring and making sure highly instrumentally useful/general capabilities are controlled.

Honesty - Honest AI involves creating models that only output what they hold to be true. It also involves determining what models hold to be true, perhaps by analyzing their internal representations. Honesty is a narrower concept than truthfulness and is deliberately chosen to avoid capabilities externalities, since truthful AI is usually a combination of vanilla accuracy, calibration, and honesty goals. Optimizing vanilla accuracy is optimizing general capabilities. When working towards honesty rather than truthfulness, it is much easier to avoid capabilities externalities.

More general learning resources are at this course, and more discussion of safety vs capabilities is here (summarized in this video).

Comment by Dan H (dan-hendrycks) on Deconfusing "Capabilities vs. Alignment" · 2023-01-23T23:08:09.450Z · LW · GW

For a discussion of capabilities vs safety, I made a video about it here, and a longer discussion is available here.

Comment by Dan H (dan-hendrycks) on Transcript of Sam Altman's interview touching on AI safety · 2023-01-23T15:37:02.225Z · LW · GW

Sorry, I am just now seeing since I'm on here irregularly.

So any robustness work that actually improves the robustness of practical ML systems is going to have "capabilities externalities" in the sense of making ML products more valuable.
 

Yes, though I do not equate general capabilities with making something more valuable. As written elsewhere,

It’s worth noting that safety is commercially valuable: systems viewed as safe are more likely to be deployed. As a result, even improving safety without improving capabilities could hasten the onset of x-risks. However, this is a very small effect compared with the effect of directly working on capabilities. In addition, hypersensitivity to any onset of x-risk proves too much. One could claim that any discussion of x-risk at all draws more attention to AI, which could hasten AI investment and the onset of x-risks. While this may be true, it is not a good reason to give up on safety or keep it known to only a select few. We should be precautious but not self-defeating.

I'm discussing "general capabilities externalities" rather than "any bad externality," especially since the former is measurable and a dominant factor in AI development. (Identifying any sort of externality can lead people to say we should defund various useful safety efforts because it can lead to a "false sense of security," which safety engineering reminds us this is not the right policy in any industry.) 

I disagree even more strongly with "honesty efforts don't have externalities:" AI systems confidently saying false statements is a major roadblock to lots of applications (e.g. any kind of deployment by Google), so this seems huge from a commercial perspective.

I distinguish between honesty and truthfulness; I think truthfulness was way too many externalities since it is too broad. For example, I think Collin et al.'s recent paper, an honesty paper, does not have general capabilities externalities. As written elsewhere,

Encouraging models to be truthful, when defined as not asserting a lie, may be desired to ensure that models do not willfully mislead their users. However, this may increase capabilities, since it encourages models to have better understanding of the world. In fact, maximally truth-seeking models would be more than fact-checking bots; they would be general research bots, which would likely be used for capabilities research. Truthfulness roughly combines three different goals: accuracy (having correct beliefs about the world), calibration (reporting beliefs with appropriate confidence levels), and honesty (reporting beliefs as they are internally represented). Calibration and honesty are safety goals, while accuracy is clearly a capability goal. This example demonstrates that in some cases, less pure safety goals such as truth can be decomposed into goals that are more safety-relevant and those that are more capabilities-relevant.

 

I agree that interpretability doesn't always have big capabilities externalities, but it's often far from zero.


To clarify, I cannot name a time a state-of-the-art model drew its accuracy-improving advancement from interpretability research. I think it hasn't had a measurable performance impact, and anecdotally empirical researchers aren't gaining insights from that the body of work which translate to accuracy improvements. It looks like a reliably beneficial research area.

It also feels like people are using "capabilities" to just mean "anything that makes AI more valuable in the short term,"

I'm taking "general capabilities" to be something like

general prediction, classification, state estimation, efficiency, scalability, generation, data compression, executing clear instructions, helpfulness, informativeness, reasoning, planning, researching, optimization, (self-)supervised learning, sequential decision making, recursive self-improvement, open-ended goals, models accessing the Internet, ...

These are extremely general instrumentally useful capabilities that improve intelligence. (Distinguish from models that are more honest, power averse, transparent, etc.) For example, ImageNet accuracy is the main general capabilities notion in vision, because it's extremely correlated with downstream performance on so many things. Meanwhile, an improvement for adversarial robustness harms ImageNet accuracy and just improves adversarial robustness measures. If it so happened that adversarial robustness research became the best way to drive up ImageNet accuracy, then the capabilities community would flood in and work on it, and safety people should then instead work on other things.

Consequently what counts at safety should be informed by how the empirical results are looking, especially since empirical phenomena can be so unintuitive or hard to predict in deep learning.

Comment by Dan H (dan-hendrycks) on Transcript of Sam Altman's interview touching on AI safety · 2023-01-21T17:56:56.041Z · LW · GW

making them have non-causal decision theories

How does it distinctly do that?

Comment by Dan H (dan-hendrycks) on Transcript of Sam Altman's interview touching on AI safety · 2023-01-21T16:16:52.526Z · LW · GW

Salient examples are robustness and RLHF. I think following the implied strategy---of avoiding any safety work that improves capabilities ("capability externalities")---would be a bad idea.

There are plenty of topics in robustness, monitoring, and alignment that improve safety differentially without improving vanilla upstream accuracy: most adversarial robustness research does not have general capabilities externalities; topics such as transparency, trojans, and anomaly detection do not; honesty efforts so far do not have externalities either. Here is analysis of many research areas and their externalities.

Even though the underlying goal is to improve the safety-capabilities ratio, this is not the best decision-making policy. Given uncertainty, the large incentives for making models superhuman, motivated reasoning, and competition pressures, aiming for minimal general capabilities externalities should be what influences real-world decision-making (playing on the criterion of rightness vs. decision procedure distinction).

If safety efforts are to scale to a large number of researchers, the explicit goal should be to measurably avoid general capabilities externalities rather than, say, "pursue particular general capabilities if you expect that it will help reduce risk down the line," though perhaps I'm just particularly risk-averse. Without putting substantial effort in finding out how to avoid externalities, the differentiation between safety and capabilities at many places is highly eroded, and in consequence some alignment teams are substantially hastening timelines. For example, an alignment team's InstructGPT efforts were instrumental in making ChatGPT arrive far earlier than it would have otherwise, which is causing Google to become substantially more competitive in AI and causing many billions to suddenly flow into different AGI efforts. This is decisively hastening the onset of x-risks. I think minimal externalities may be a standard that is not always met, but I think it should be more strongly incentivized.

Comment by Dan H (dan-hendrycks) on Your posts should be on arXiv · 2022-08-27T06:40:58.260Z · LW · GW

I am strongly in favor of our very best content going on arXiv. Both communities should engage more with each other.

As follows are suggestions for posting to arXiv. As a rule of thumb, if the content of a blogpost didn't take >300 hours of labor to create, then it probably should not go on arXiv. Maintaining a basic quality bar prevents arXiv from being overriden by people who like writing up many of their inchoate thoughts; publication standards are different for LW/AF than for arXiv. Even if a researcher spent many hours on the project, arXiv moderators do not want research that's below a certain bar. arXiv moderators have reminded some professors that they will likely reject papers at the quality level of a Stanford undergraduate team project (e.g., http://cs231n.stanford.edu/2017/reports.html); consequently labor, topicality, and conforming to formatting standards is not sufficient for arXiv approval. Usually one's first research project won't be good enough for arXiv. Furthermore, conceptual/philosophical pieces probably should be primarily posted on arXiv's .CY section. For more technical deep learning content, do not make the mistake of only putting it on .AI; these should probably go on .LG (machine learning) or .CV (computer vision) or .CL (NLP). arXiv's .ML section is for more statistical/theoretical machine learning audiences. For content to be approved without complications, it should likely conform to standard (ICLR, ICML, NeurIPS, CVPR, ECCV, ICCV,  ACL, EMNLP) formatting. This means automatic blogpost exporting is likely not viable. In trying to diffuse ideas to the broader ML community, we should avoid making the arXiv moderators mad at us.

Comment by Dan H (dan-hendrycks) on Your posts should be on arXiv · 2022-08-27T06:25:12.548Z · LW · GW

Here's a continual stream of related arXiv papers available through reddit and twitter.

https://www.reddit.com/r/mlsafety/

https://twitter.com/topofmlsafety

Comment by Dan H (dan-hendrycks) on Your posts should be on arXiv · 2022-08-27T06:22:51.944Z · LW · GW

I should say formatting is likely a large contributing factor for this outcome. Tom Dietterich, an arXiv moderator, apparently had a positive impression of the content of your grokking analysis. However, research on arXiv will be more likely to go live if it conforms to standard (ICLR, NeurIPS, ICML) formatting and isn't a blogpost automatically exported into a TeX file.

Comment by Dan H (dan-hendrycks) on Safetywashing · 2022-07-01T16:46:48.066Z · LW · GW

This is why we introduced X-Risk Sheets, a questionnaire that researchers should include in their paper if they're claiming that their paper reduces AI x-risk. This way researchers need to explain their thinking and collect evidence that they're not just advancing capabilities.

We now include these x-risk sheets in our papers. For example, here is an example x-risk sheet included in an arXiv paper we put up yesterday.

Comment by Dan H (dan-hendrycks) on Deepmind's Gopher--more powerful than GPT-3 · 2021-12-12T07:20:06.373Z · LW · GW

Note I'm mainly using this as an opportunity to talk about ideas and compute in NLP.

I don't know how big an improvement DeBERTaV2 is over SoTA.

DeBERTaV2 is pretty solid and mainly got its performance from an architectural change. Note the DeBERTa paper was initially uploaded in 2020, but it was updated early this year to include DeBERTa V2. The previous main popular SOTA on SuperGLUE was T5 (which beat RoBERTa). DeBERTaV2 uses 8x fewer parameters and 4x less compute than T5. DeBERTa's high performance isn't an artifact of SuperGLUE; in downstream tasks such as some legal NLP tasks it does better too.

Compared to unidirectional models on NLU tasks, the bidirectional models do far better. On CommonsenseQA, a good task that's been around for a few years, the bidirectional models do far better than fine-tuned GPT-3--DeBERTaV3 differs in three ideas from GPT-3 (roughly encoding, ELECTRA training, and bidirectionality, if I recall correctly), and it's >400x smaller.

I agree with the overall sentiment that much of the performance is from brute compute, but even in NLP, ideas can help sometimes. For vision/continuous signals, algorithmic advances continue to account for much progress; ideas move the needle substantially more frequently in vision than in NLP.

For tasks when there is less traction, ideas are even more useful. Just to use a recent example, "the use of verifiers results in approximately the same performance boost as a 30x model size increase." I think the initially proposed heuristic depends on how much progress has already been made on a task. For nearly solved tasks, the next incremental idea shouldn't help much. On new hard tasks such as some maths tasks, scaling laws are worse and ideas will be a practical necessity. Not all the first ideas are obvious "low hanging fruits" because it might take a while for the community to get oriented and find good angles of attack.

Comment by Dan H (dan-hendrycks) on Deepmind's Gopher--more powerful than GPT-3 · 2021-12-09T05:55:46.220Z · LW · GW

RE: "like I'm surprised if a clever innovation does more good than spending 4x more compute"

Earlier this year, DeBERTaV2 did better on SuperGLUE than models 10x the size and got state of the art.

Models such as DeBERTaV3 can do better than on commonsense question answering tasks than models that are tens or several hundreds of times larger.

DeBERTaV3-large

Accuracy: 84.6   1  Parameters: 0.4B

T5-11B

Accuracy: 83.5  1  Parameters: 11B

Fine-tuned GPT-3

73.0  1  175B

https://arxiv.org/pdf/2112.03254.pdf#page=5

Bidirectional models + training ideas + better positional encoding helped more than 4x.