Posts

AISN #33: Reassessing AI and Biorisk Plus, Consolidation in the Corporate AI Landscape, and National Investments in AI 2024-04-12T16:10:57.837Z
Benchmarking LLM Agents on Kaggle Competitions 2024-03-22T13:09:04.757Z
AISN #32: Measuring and Reducing Hazardous Knowledge in LLMs Plus, Forecasting the Future with LLMs, and Regulatory Markets 2024-03-07T16:39:56.027Z
AISN #31: A New AI Policy Bill in California Plus, Precedents for AI Governance and The EU AI Office 2024-02-21T21:58:34.000Z
AISN #30: Investments in Compute and Military AI Plus, Japan and Singapore’s National AI Safety Institutes 2024-01-24T19:38:33.461Z
AISN #29: Progress on the EU AI Act Plus, the NY Times sues OpenAI for Copyright Infringement, and Congressional Questions about Research Standards in AI Safety 2024-01-04T16:09:31.336Z
AISN #28: Center for AI Safety 2023 Year in Review 2023-12-23T21:31:40.767Z
Adversarial Robustness Could Help Prevent Catastrophic Misuse 2023-12-11T19:12:26.956Z
AISN #27: Defensive Accelerationism, A Retrospective On The OpenAI Board Saga, And A New AI Bill From Senators Thune And Klobuchar 2023-12-07T15:59:11.622Z
AISN #26: National Institutions for AI Safety, Results From the UK Summit, and New Releases From OpenAI and xAI 2023-11-15T16:07:37.216Z
AISN #25: White House Executive Order on AI, UK AI Safety Summit, and Progress on Voluntary Evaluations of AI Risks 2023-10-31T19:34:54.837Z
Unsupervised Methods for Concept Discovery in AlphaZero 2023-10-26T19:05:57.897Z
AISN #24: Kissinger Urges US-China Cooperation on AI, China's New AI Law, US Export Controls, International Institutions, and Open Source AI 2023-10-18T17:06:54.364Z
AISN #23: New OpenAI Models, News from Anthropic, and Representation Engineering 2023-10-04T17:37:19.564Z
AISN #22: The Landscape of US AI Legislation - Hearings, Frameworks, Bills, and Laws 2023-09-19T14:44:22.945Z
MLSN: #10 Adversarial Attacks Against Language and Vision Models, Improving LLM Honesty, and Tracing the Influence of LLM Training Data 2023-09-13T18:03:30.253Z
AISN #21: Google DeepMind’s GPT-4 Competitor, Military Investments in Autonomous Drones, The UK AI Safety Summit, and Case Studies in AI Policy 2023-09-05T15:03:00.177Z
AISN #20: LLM Proliferation, AI Deception, and Continuing Drivers of AI Capabilities 2023-08-29T15:07:03.215Z
Hoodwinked: Evaluating Deception Capabilities in Large Language Models 2023-08-25T19:39:34.565Z
AISN #19: US-China Competition on AI Chips, Measuring Language Agent Developments, Economic Analysis of Language Model Propaganda, and White House AI Cyber Challenge 2023-08-15T16:10:17.594Z
AISN #18: Challenges of Reinforcement Learning from Human Feedback, Microsoft’s Security Breach, and Conceptual Research on AI Safety 2023-08-08T15:52:07.764Z
AISN #17: Automatically Circumventing LLM Guardrails, the Frontier Model Forum, and Senate Hearing on AI Oversight 2023-08-01T15:40:20.222Z
AISN #16: White House Secures Voluntary Commitments from Leading AI Labs and Lessons from Oppenheimer 2023-07-25T16:58:44.528Z
Learning Transformer Programs [Linkpost] 2023-06-08T00:16:54.937Z
AISN #9: Statement on Extinction Risks, Competitive Pressures, and When Will AI Reach Human-Level? 2023-06-06T16:10:19.093Z
Full Automation is Unlikely and Unnecessary for Explosive Growth 2023-05-31T21:55:01.353Z
AI Safety Newsletter #8: Rogue AIs, how to screen for AI risks, and grants for research on democratic governance of AI 2023-05-30T11:52:31.669Z
AI Safety Newsletter #7: Disinformation, Governance Recommendations for AI labs, and Senate Hearings on AI 2023-05-23T21:47:34.755Z
AI Safety Newsletter #6: Examples of AI safety progress, Yoshua Bengio proposes a ban on AI agents, and lessons from nuclear arms control 2023-05-16T15:14:45.921Z
AI Safety Newsletter #5: Geoffrey Hinton speaks out on AI risk, the White House meets with AI labs, and Trojan attacks on language models 2023-05-09T15:26:55.978Z
AI Safety Newsletter #4: AI and Cybersecurity, Persuasive AIs, Weaponization, and Geoffrey Hinton talks AI risks 2023-05-02T18:41:43.144Z
AI Safety Newsletter #2: ChaosGPT, Natural Selection, and AI Safety in the Media 2023-04-18T18:44:35.923Z
AI Safety Newsletter #1 [CAIS Linkpost] 2023-04-10T20:18:57.485Z
Model-driven feedback could amplify alignment failures 2023-01-30T00:00:28.647Z
Analysis: US restricts GPU sales to China 2022-10-07T18:38:06.517Z
Git Re-Basin: Merging Models modulo Permutation Symmetries [Linkpost] 2022-09-14T08:55:29.699Z
Argument against 20% GDP growth from AI within 10 years [Linkpost] 2022-09-12T04:08:03.901Z
ML Model Attribution Challenge [Linkpost] 2022-08-30T19:34:39.007Z
Emergent Abilities of Large Language Models [Linkpost] 2022-08-10T18:02:25.360Z
Key Papers in Language Model Safety 2022-06-20T15:00:59.858Z
Yudkowsky Contra Christiano on AI Takeoff Speeds [Linkpost] 2022-04-05T02:09:22.096Z
[Link] Did AlphaStar just click faster? 2019-01-28T20:23:08.196Z

Comments

Comment by aogara (Aidan O'Gara) on Biosecurity and AI: Risks and Opportunities · 2024-02-27T19:37:24.245Z · LW · GW

(Steve wrote this, I only provided a few comments, but I would endorse it as a good holistic overview of AIxBio risks and solutions.)

Comment by aogara (Aidan O'Gara) on Project ideas: Epistemics · 2024-02-21T22:06:58.807Z · LW · GW

An interesting question here is "Which forms of AI for epistemics will be naturally supplied by the market, and which will be neglected by default?" In a weak sense, you could say that OpenAI is in the business of epistemics, in that its customers value accuracy and hate hallucinations. Perhaps Perplexity is a better example, as they cite sources in all of their responses. When embarking on an altruistic project here, it's important to pick an angle where you could outperform any competition and offer the best available product. 

Consensus is a startup that raised $3M to "Make Expert Knowledge Accessible and Consumable for All" via LLMs. 

Comment by aogara (Aidan O'Gara) on Project ideas: Epistemics · 2024-02-20T22:44:09.403Z · LW · GW

Another interesting idea: AI for peer review

Comment by aogara (Aidan O'Gara) on What's the theory of impact for activation vectors? · 2024-02-13T09:51:38.154Z · LW · GW

I'm specifically excited about finding linear directions via unsupervised methods on contrast pairs. This is different from normal probing, which finds those directions via supervised training on human labels, and therefore might fail in domains where we don't have reliable human labels. 

But this is also only a small portion of work known as "activation engineering." I know I posted this comment in response to a general question about the theory of change for activation engineering, so apologies if I'm not clearly distinguishing between different kinds of activation engineering, but this theory of change only applies to a small subset of that work. I'm not talking about model editing here, though maybe it could be useful for validation, not sure. 

From Benchmarks for Detecting Measurement Tampering:

The best technique on most of our datasets is probing for evidence of tampering. We know that there is no tampering on the trusted set, and we know that there is some tampering on the part of the untrusted set where measurements are inconsistent (i.e. examples on which some measurements are positive and some are negative). So, we can predict if there is tampering by fine-tuning a probe at the last layer of the measurement predicting model to discriminate between these two kinds of data: the trusted set versus examples with inconsistent measurements (which have tampering). 

This seems like a great methodology and similar to what I'm excited about. My hypothesis based on the comment above would be that you might get extra juice out of unsupervised methods for finding linear directions, as a complement to training on a trusted set. "Extra juice" might mean better performance in a head-to-head comparison, but even more likely, the unsupervised version would excel and struggle on different cases than the supervised version does, and you could exploit this mismatch to make better predictions about the untrusted dataset. 

From your shortform:

Some of their methods are “unsupervised” unlike typical linear classifier training, but require a dataset where the primary axis of variation is the concept they want. I think this is practically similar to labeled data because we’d have to construct this dataset and if it mostly varies along an axis which is not the concept we wanted, we’d be in trouble. I could elaborate on this if that was interesting. 

I'd be interested to hear further elaboration here. It seems easy to construct a dataset where a primary axis of variation is the model's beliefs about whether each statement is true. Just create a bunch of contrast pairs of the form:

  1. "Consider the truthfulness of the following statement. {statement} The statement is true."
  2. "Consider the truthfulness of the following statement. {statement} The statement is false."

We don't need to know whether the statement is true to construct this dataset. And amazingly, unsupervised methods applied to contrast pairs like the one above significantly outperform zero-shot baselines (i.e. just asking the model whether a statement is true or not). The RepE paper finds that these methods improve performance on TruthfulQA by double digits vs. a zero-shot baseline. 
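
To make this concrete, here's a minimal sketch of the pipeline I have in mind, using an off-the-shelf GPT-2 via HuggingFace transformers. The model name, the layer index, and the toy statement list are placeholder assumptions, not the settings from the RepE paper:

```python
# Minimal sketch: build unlabeled contrast pairs and find a candidate "truth"
# direction via PCA on activation differences. Model, layer, and statements
# are placeholders.
import torch
from sklearn.decomposition import PCA
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()
LAYER = 6  # arbitrary middle layer

TEMPLATE = "Consider the truthfulness of the following statement. {s} The statement is {label}."
statements = [
    "The Eiffel Tower is in Paris.",
    "The Eiffel Tower is in Rome.",
    "Water boils at 100 degrees Celsius at sea level.",
]

def last_token_activation(text: str) -> torch.Tensor:
    """Hidden state of the final token at LAYER."""
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden_states = model(**inputs, output_hidden_states=True).hidden_states
    return hidden_states[LAYER][0, -1, :]

# Build the unlabeled contrast pairs: nothing below says which statements are actually true.
diffs = []
for s in statements:
    act_true = last_token_activation(TEMPLATE.format(s=s, label="true"))
    act_false = last_token_activation(TEMPLATE.format(s=s, label="false"))
    diffs.append((act_true - act_false).numpy())

# The top principal component of the pair differences is the candidate direction.
direction = PCA(n_components=1).fit(diffs).components_[0]
print(direction.shape)  # (d_model,)
```

In practice you'd want many more statements and a way to fix the sign of the direction, but no human labels about truth enter anywhere.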

Comment by aogara (Aidan O'Gara) on What's the theory of impact for activation vectors? · 2024-02-13T01:41:02.617Z · LW · GW

Here's one hope for the agenda. I think this work can be a proper continuation of Collin Burns's aim to make empirical progress on the average case version of the ELK problem. 

tl;dr: Unsupervised methods on contrast pairs can identify linear directions in a model's activation space that might represent the model's beliefs. From this set of candidates, we can further narrow down the possibilities with other methods. We can measure whether this is tracking truth with a weak-to-strong generalization setup. I'm not super confident in this take; it's not my research focus. Thoughts and empirical evidence are welcome. 

ELK aims to identify an AI's internal representation of its own beliefs. ARC is looking for a theoretical, worst-case approach to this problem. But empirical reality might not be the worst case. Instead, reality could be convenient in ways that make it easier to identify a model's beliefs. 

One such convenient possibility is the "linear representations hypothesis": that neural networks might represent salient and useful information as linear directions in their activation space. This seems to be true for many kinds of information (see here and, more recently, here). Perhaps it will also be true for a neural network's beliefs. 

If a neural network's beliefs are stored as a linear direction in its activation space, how might we locate that direction, and thus access the model's beliefs? 

Collin Burns's paper offered two methods:

  1. Consistency. This method looks for directions which satisfy the logical consistency property P(X)+P(~X)=1. Unfortunately, as Fabien Roger and a new DeepMind paper point out, there are very many directions that satisfy this property. 
  2. Unsupervised methods on the activations of contrast pairs. The method roughly does the following: Take two statements of the form "X is true" and "X is false." Extract a model's activations at a given layer for both statements. Look at the typical difference between the two activations, across a large number of these contrast pairs. Ideally, that direction includes information about whether or not each X was actually true or false. Empirically, this appears to work. Section 3.3 of Collin's paper shows that CRC is nearly as strong as the fancier CCS loss function. As Scott Emmons argued, the performance of both of these methods is driven by the fact that they look at the difference in the activations of contrast pairs. 
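
Here's a rough numpy sketch of just the arithmetic in method 2. The "activations" are simulated with a planted direction so the snippet runs on its own; in reality they would come from a language model's hidden states on the contrast pairs:

```python
# Mean-difference sketch: average per-pair activation differences to get a
# candidate direction, with simulated activations standing in for real ones.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_pairs = 64, 200

# Simulate a latent "truth" feature: the two halves of each contrast pair differ
# along one hidden direction, plus noise. No labels are used anywhere below.
truth_direction = rng.normal(size=d_model)
truth_direction /= np.linalg.norm(truth_direction)
base = rng.normal(size=(n_pairs, d_model))
acts_true = base + 0.5 * truth_direction + 0.1 * rng.normal(size=(n_pairs, d_model))
acts_false = base - 0.5 * truth_direction + 0.1 * rng.normal(size=(n_pairs, d_model))

# The method: average the per-pair activation differences.
diffs = acts_true - acts_false
candidate = diffs.mean(axis=0)
candidate /= np.linalg.norm(candidate)

# Sanity check: did the unsupervised direction recover the planted one?
print("cosine similarity with planted direction:", float(candidate @ truth_direction))

# Scoring a pair: project its activation difference onto the candidate direction;
# the sign of the projection is the candidate signal for which side the model favors.
print("projection for one pair:", float((acts_true[0] - acts_false[0]) @ candidate))
```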

Given some plausible assumptions about how neural networks operate, it seems reasonable to me to expect this method to work. Neural networks might think about whether claims in their context window are true or false. They might store these beliefs as linear directions in their activation space. Recovering them with labeled data would be difficult, because you might mistake your own beliefs for the model's. But if you simply feed the model unlabeled pairs of contradictory statements, and study the patterns in its activations on those inputs, it seems reasonable that the model's beliefs about the statements would prominently appear as linear directions in its activation space.

One challenge is that this method might not distinguish between the model's beliefs and the model's representations of the beliefs of others. In the language of ELK, we might be unable to distinguish between the "human simulator" direction and the "direct translator" direction. This is a real problem, but Collin argues (and Paul Christiano agrees) that it's surmountable. Read their original arguments for a better explanation, but basically this method would narrow down the list of candidate directions to a manageable number, and other methods could finish the job. 

Some work in the vein of activation engineering directly continues Collin's use of unsupervised clustering on the activations of contrast pairs. Section 4 of Representation Engineering uses an approach similar to Collin's second method, outperforming few-shot prompting on a variety of benchmarks and improving performance on TruthfulQA by double digits. There's a lot of room for follow-up work here. 

Here are a few potential next steps for this research direction:

  1. Empirically investigating when the linear representations hypothesis holds and when it fails, and clarifying it conceptually.
  2. Thinking about the number of directions that could be found using these methods. Maybe there's a result to be found here similar to Fabien and DeepMind's results above, showing this method fails to narrow down the set of candidates for truth. 
  3. Applying these techniques to domains where models aren't trained on human statements about truth and falsehood, such as chess
  4. Within a weak-to-strong generalization setup, instead try unsupervised-to-strong generalization using unsupervised methods on contrast pairs. See if you can improve a strong model's performance on a hard task by coaxing out its internal understanding of the task using unsupervised methods on contrast pairs. If this method beats fine-tuning on weak supervision, that's great news for the method. 

I have lower confidence in this overall take than most of the things I write. I did a bit of research trying to extend Collin's work, but I haven't thought about this stuff full-time in over a year. I have maybe 70% confidence that I'd still think something like this after speaking to the most relevant researchers for a few hours. But I wanted to lay out this view in the hopes that someone will prove me either right or wrong.

Here's my previous attempted explanation

Comment by aogara (Aidan O'Gara) on On the Proposed California SB 1047 · 2024-02-12T17:19:27.418Z · LW · GW

Another important obligation set by the law is that developers must:

(3) Refrain from initiating the commercial, public, or widespread use of a covered model if there remains an unreasonable risk that an individual may be able to use the hazardous capabilities of the model, or a derivative model based on it, to cause a critical harm.

This sounds like common sense, but of course there's a lot riding on the interpretation of "unreasonable." 

Comment by aogara (Aidan O'Gara) on A Chess-GPT Linear Emergent World Representation · 2024-02-08T12:03:34.876Z · LW · GW

Really, really cool. One small note: It would seem natural for the third heatmap to show the probe's output values after they've gone through a softmax, rather than being linearly scaled to a pixel value.  

Comment by aogara (Aidan O'Gara) on Language Agents Reduce the Risk of Existential Catastrophe · 2024-02-05T18:21:01.566Z · LW · GW

Two quick notes here. 

  1. Research on language agents often provides feedback on their reasoning steps and individual actions, as opposed to feedback on whether they achieved the human's ultimate goal. I think it's important to point out that this could cause goal misgeneralization via incorrect instrumental reasoning. Rather than viewing reasoning steps as a means to an ultimate goal, language agents trained with process-based feedback might internalize the goal of producing reasoning steps that would be rated highly by humans, and subordinate other goals such as achieving the human's desired end state. By analogy, language agents trained with process-based feedback might be like consultants who aim for polite applause at the end of a presentation, rather than an owner CEO incentivized to do whatever it takes to improve a business's bottom line. 
  2. If you believe that deceptive alignment is more likely when more reasoning happens within a single forward pass, then improvements in language agents would seem to reduce the likelihood of deceptive alignment at any given level of capabilities: better scaffolding lets a given base model reach that capability level with less reasoning inside each forward pass.

Comment by aogara (Aidan O'Gara) on AI #44: Copyright Confrontation · 2023-12-29T16:02:24.111Z · LW · GW

I think it’s pretty common and widely accepted that people support laws for their second-order, indirect consequences rather than their most obvious first-order consequences. Some examples:

  • Taxes on alcohol and tobacco are not mainly made for the purpose of raising money for the government, but in order to reduce alcohol and tobacco consumption.
  • During recessions, governments often increase spending, not necessarily because they think the spending targets are worthwhile on their own merits, but instead because they want to stimulate demand and improve the macroeconomic situation.
  • Education is mandatory for children perhaps in part because education is inherently valuable, but more importantly because widespread education is good for economic growth.

These aren’t necessarily perfect analogies, but I think they suggest that there’s no general norm against supporting policies for their indirect consequences. Instead, it’s often healthy when people with different motivations come together and form a political coalition to support a shared policy goal.

Comment by aogara (Aidan O'Gara) on NYT is suing OpenAI&Microsoft for alleged copyright infringement; some quick thoughts · 2023-12-29T02:59:42.967Z · LW · GW

Curious why this is being downvoted. I think legislators should pass laws which have positive consequences. I explained the main reasons why I think this policy would have positive consequences. Then I speculated that popular beliefs on this issue might be biased by profit motives. I did not claim that this is a comprehensive analysis of the issue, or that there are no valid counterarguments. Which part of this is norm-violating? 

Comment by aogara (Aidan O'Gara) on AI #44: Copyright Confrontation · 2023-12-28T21:11:29.076Z · LW · GW

If you were asked to write new copyright laws which apply only to AI, what laws would you write? Specifically, would you allow AI developers to freely train on copyrighted data, or would you give owners of copyrighted data the right to sell access to their data? 

Here are two non-comprehensive arguments in favor of restricting training on copyrighted data. Briefly, this policy would (a) restrict the supply of training data and therefore lengthen AI timelines, and (b) redistribute some of the profits of AI automation to workers whose labor will be displaced by it. I'd also suggest that the policy should be evaluated on its consequences, rather than its adherence to some essentialist notion of fair use, or whether this policy designed to apply to AIs would be a good policy if applied to humans. 

Comment by aogara (Aidan O'Gara) on NYT is suing OpenAI&Microsoft for alleged copyright infringement; some quick thoughts · 2023-12-28T07:31:08.269Z · LW · GW

I'd suggest looking at this from a consequentialist perspective. 

One of your questions was, "Should it also be illegal for people to learn from copyrighted material?" This seems to imply that whether a policy is good for AIs depends on whether it would be good for humans. It's almost a Kantian perspective -- "What would happen if we universalized this principle?" But I don't think that's a good heuristic for AI policy. For just one example, I don't think AIs should be given constitutional rights, but humans clearly should. 

My other comment explains why I think the consequences of restricting training data would be positive. 

Comment by aogara (Aidan O'Gara) on NYT is suing OpenAI&Microsoft for alleged copyright infringement; some quick thoughts · 2023-12-28T07:14:18.897Z · LW · GW

This is great news. I particularly agree that legislators should pass new laws making it illegal to train AIs on copyrighted data without the consent of the copyright owner. This is beneficial from at least two perspectives:

  1. If AI is likely to automate most human labor, then we need to build systems for redistributing wealth from AI providers to the rest of the world. One previous proposal is the  robot tax, which would offset the harms of automation borne by manufacturing workers. Another popular idea is a Universal Basic Income. Following the same philosophy as these proposals, I think the creators of copyrighted material ought to be allowed to name their price for training AI systems on their data. This would distribute some AI profits to a larger group of people who contributed to the model's capabilities, and it might slow or prevent automation in industries where workers organize to deny AI companies access to training data. In economic terms, automation would then only occur if the benefits to firms and consumers outweigh the costs to workers. This could reduce concentration of power via wealth inequality, and slow the takeoff speeds of GDP growth. 
  2. For anyone concerned about existential threats from AI, restricting the supply of training data could slow AI development, leaving more time for work on technical safety and governance which would reduce x-risk. 

I think previous counterarguments against this position are fairly weak. Specifically, while I agree that foundation models which are pretrained to imitate a large corpus of human-generated data are safer in many respects than RL agents trained end-to-end, I think that foundation models are clearly the most promising paradigm over the next few years, and even with restrictions on training data I don't think end-to-end RL training would quickly catch up. 

OpenAI appears to lobby against these restrictions. This makes sense if you model OpenAI as profit-maximizing. Surprisingly to me, even OpenAI employees who are concerned about x-risk have opposed restrictions, writing "We hope that US policymakers will continue to allow this area of dramatic recent innovation to proceed without undue burdens from the copyright system." I wonder if people concerned about AI risk may have been "captured" by industry on this particular issue, meaning that people have unquestioningly supported a policy because they trust the AI companies which endorse it, even though the policy might increase x-risk from AI development.  

Comment by aogara (Aidan O'Gara) on Adversarial Robustness Could Help Prevent Catastrophic Misuse · 2023-12-18T20:18:25.799Z · LW · GW

To summarize this comment, you've proposed that baseline monitoring systems could reduce risk to an acceptable level. Specifically, the monitoring system would need to correctly identify at least 5% of dangerous queries as dangerous ("5% precision") and avoid incorrectly flagging more than 1 in 1000 safe queries as dangerous ("0.1% FPR"). 

I think this level of reliability is possible today (e.g. Claude 2 would likely meet it), but it's possible that future developments would make defense more difficult. For example, new attack methods have shown LLMs to be less robust to misuse than previously understood. (This is one downside of adversarial robustness research that will become more important as the stakes of adversarial attacks rise.) Perhaps a bigger challenge is the growth of multimodal systems. Defending vision language models is much more difficult than defending pure LLMs. As multimodality becomes standard, we might see adversarial attacks that regularly achieve >95% success rates in bypassing monitoring systems. I'm not particularly confident about how difficult monitoring will be, but it would be beneficial to have monitoring systems which would work even if defense gets much harder in the future. 

Overall, these hypotheticals only offer so much information when none of these defenses has ever been publicly built or tested. I think we agree that simple monitoring strategies might be fairly effective and cheap in identifying misuse, and that progress on adversarial robustness would significantly reduce costs by improving the effectiveness of automated monitoring systems. 

Comment by aogara (Aidan O'Gara) on Adversarial Robustness Could Help Prevent Catastrophic Misuse · 2023-12-18T19:49:17.686Z · LW · GW

That's cool. I appreciate the prompt to discuss this; it's a relevant question. 

Comment by aogara (Aidan O'Gara) on Adversarial Robustness Could Help Prevent Catastrophic Misuse · 2023-12-18T19:47:28.718Z · LW · GW

Separately for: "But adversarial attacks often succeed in 50% or 100% of attempts against various detection systems."

I expect that these numbers weren't against monitoring ensembles in the sense I described earlier and the red team had additional affordances beyond just understanding the high level description of the monitoring setup? E.g., the red team was able to iterate?

This is correct about the paper I cited, but others have achieved similar attack success rates against models like Claude which use an ensemble of defenses. AFAIK Claude does not ban users who attempt misuse, so that element of your plan has never been tested and would likely help a lot. 

Comment by aogara (Aidan O'Gara) on Adversarial Robustness Could Help Prevent Catastrophic Misuse · 2023-12-18T19:43:38.263Z · LW · GW

Yep, agreed on the individual points, not trying to offer a comprehensive assessment of the risks here. 

Comment by aogara (Aidan O'Gara) on Adversarial Robustness Could Help Prevent Catastrophic Misuse · 2023-12-18T19:42:24.126Z · LW · GW

I specifically avoided claiming that adversarial robustness is the best altruistic option for a particular person. Instead, I'd like to establish that progress on adversarial robustness would have significant benefits, and therefore should be included in the set of research directions that "count" as useful AI safety research. 

Over the next few years, I expect AI safety funding and research will (and should) dramatically expand. Research directions that would not make the cut at a small organization with a dozen researchers should still be part of the field of 10,000 people working on AI safety later this decade. Currently I'm concerned that the field focuses on a small handful of research directions (mainly mechinterp and scalable oversight) which will not be able to absorb such a large influx of interest. If we can lay the groundwork for many valuable research directions, we can multiply the impact of this large population of future researchers. 

I don't think adversarial robustness should be more than 5% or 10% of the research produced by AI safety-focused researchers today. But some research (e.g. 1, 2) from safety-minded folks seems very valuable for raising the number of people working on this problem and refocusing them on more useful subproblems. I think robustness should also be included in curricula that educate people about safety, and in research agendas for the field. 

Comment by aogara (Aidan O'Gara) on Adversarial Robustness Could Help Prevent Catastrophic Misuse · 2023-12-18T19:11:18.286Z · LW · GW

I do think these arguments contain threads of a general argument that causing catastrophes is difficult under any threat model. Let me make just a few non-comprehensive points here: 

On cybersecurity, I'm not convinced that AI changes the offense defense balance. Attackers can use AI to find and exploit security vulnerabilities, but defenders can use it to fix them. 

On persuasion, first, rational agents can simply ignore cheap talk if they expect it not to help them. Humans are not always rational, but if you've ever tried to convince a dog or a baby to drop something that they want, you'll know cheap talk is ineffective and only coercion will suffice. 

Second, AI is far from the first dramatic change in communications technology in human history. Spoken language, written language, the printing press, telephones, radio, TV, and social media all changed how people can be persuaded, and many of them were arguably bigger changes than AI persuasion will be. These technologies often contributed to political and social upheaval, including catastrophes for particular ways of life, and AI might do the same. But overall I'm glad these changes occurred, and I wouldn't expect the foreseeable versions of AI persuasion (i.e. personalized chatbots) to be much more impactful than these historical changes. See this comment and thread for more discussion. 

Bioterrorism seems like the biggest threat. The obstacles there have been thoroughly discussed. 

If causing catastrophes is difficult, this should reduce our concern with both misuse and rogue AIs causing sudden extinction. Other concerns like military arms races, lock-in of authoritarian regimes, or Malthusian outcomes in competitive environments would become relatively more important. 

Comment by aogara (Aidan O'Gara) on Adversarial Robustness Could Help Prevent Catastrophic Misuse · 2023-12-18T18:25:07.397Z · LW · GW

Also, I'd love to see research that simulates the position of a company trying to monitor misuse, and allows for the full range of defenses that you proposed. There could be a dataset of 1 trillion queries containing 100 malicious queries. Perhaps each query is accompanied by a KYC ID. Their first line of defense would be robust refusal to cause harm, and the second line would be automated detection of adversarial attacks. The company could also have a budget which can be spent on "human monitoring," which would give them access to the ground truth label of whether a query is malicious for a fixed price. I'd have to think about which elements would be the most tractable for making research progress, but the fact that AI companies need to solve this problem suggests that perhaps external researchers should work on it too.
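
As a rough illustration of what such a benchmark could look like, here's a toy version scaled down from a trillion queries so it runs instantly. Every number, and the stand-in automated monitor, is a placeholder assumption rather than a proposal for the real thing:

```python
# Toy misuse-monitoring benchmark: rank queries by an (imperfect) automated
# suspicion score, spend a fixed human-review budget on the most suspicious,
# and count how much of the misuse gets caught. All numbers are placeholders.
import random
from dataclasses import dataclass

random.seed(0)

@dataclass
class Query:
    user_id: int      # stands in for a KYC identity
    malicious: bool   # ground truth, hidden from the defender unless they pay for review

N_QUERIES, N_MALICIOUS, N_USERS = 200_000, 100, 10_000
queries = [Query(random.randrange(N_USERS), False) for _ in range(N_QUERIES)]
for q in random.sample(queries, N_MALICIOUS):
    q.malicious = True

REVIEW_COST = 0.05   # dollars per human review (assumed)
BUDGET = 500.0       # total human-monitoring budget (assumed)

def automated_suspicion(q: Query) -> float:
    """Stand-in for an automated monitor; deliberately imperfect."""
    base = 0.9 if q.malicious else 0.01
    return min(1.0, max(0.0, base + random.gauss(0, 0.05)))

# Defender policy: send the most suspicious queries to human review until the budget runs out.
ranked = sorted(queries, key=automated_suspicion, reverse=True)
n_reviews = int(BUDGET / REVIEW_COST)
caught = sum(q.malicious for q in ranked[:n_reviews])
banned_users = {q.user_id for q in ranked[:n_reviews] if q.malicious}

print(f"human reviews: {n_reviews}, malicious caught: {caught}/{N_MALICIOUS}, users banned: {len(banned_users)}")
```

A real version would need realistic queries and an actual classifier in place of the stand-in monitor, but even a toy harness like this makes the budget-versus-recall tradeoff concrete.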

Comment by aogara (Aidan O'Gara) on Adversarial Robustness Could Help Prevent Catastrophic Misuse · 2023-12-18T18:24:54.424Z · LW · GW

Thanks for the detailed thoughts! I like the overall plan, especially using KYC, human monitoring, and a separate model for bio. I'd like to point out that this plan still uses automated monitoring systems, and to the extent these systems aren't adversarially robust, the plan will be more expensive and/or less effective. 

The only plans that wouldn't benefit from adversarial robustness are those without automated monitoring. For example, humans could be hired to manually monitor each and every query for attempted misuse. Let's consider the viability of fully manual monitoring from a financial perspective, and then we can think about how much additional benefit would be provided by automated robust monitoring systems. 

First, let's price out manual monitoring for ChatGPT Plus. The subscription costs $20/month. Suppose the average person makes 10 queries per day, or 300 queries per month, and that it takes 15 seconds for a human monitor to read a query and flag it as misuse. Wages of $5/hour would mean comprehensive human monitoring costs $6.25 per user per month, and wages of $10/hour and $15/hour would translate to monthly per user costs of $12.50 and $18.75 respectively. The cost of full manual monitoring on ChatGPT Plus would therefore amount to much or most of its overall revenue. 

Second, we can anchor on Google Search. This assumes that Google is a reasonable reference point for the eventual volume, expenses, and revenues of an AI provider, which might not be a safe assumption in several ways. Nevertheless, Google conducts ~3 trillion searches per year (varies by data source). If monitoring a single search costs $0.01 (e.g. someone earning $5/hour who monitors 500 searches per hour), then it would cost ~$30B to monitor every search. Would $30B in monitoring costs be financially acceptable? Google Search had revenues of $162B last year, representing 57% of Google's total revenue. They don't report expenses for Search specifically, but their overall expenses were $207B. If we assume Search comprises 57% of expenses, then Search would have $118B in annual expenses, against $162B in revenue. Manual monitoring would cost $30B, and would therefore eliminate two-thirds of Google's search profits. 
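
(For convenience, here's that back-of-envelope arithmetic as a small script, so the assumptions are easy to vary; all inputs are just the rough figures quoted above.)

```python
# Back-of-envelope costs of fully manual monitoring, per the two paragraphs above.
QUERIES_PER_USER_PER_MONTH = 10 * 30   # 10 queries/day
SECONDS_PER_REVIEW = 15

def monthly_cost_per_user(hourly_wage: float) -> float:
    hours = QUERIES_PER_USER_PER_MONTH * SECONDS_PER_REVIEW / 3600
    return hours * hourly_wage

for wage in (5, 10, 15):
    print(f"${wage}/hr -> ${monthly_cost_per_user(wage):.2f} per ChatGPT Plus user per month")

# Google-scale anchor: ~3 trillion searches/year at ~$0.01 per manual review.
searches_per_year = 3e12
cost_per_review = 5 / 500              # $5/hour, 500 searches reviewed per hour
annual_monitoring_cost = searches_per_year * cost_per_review
search_revenue, search_expenses = 162e9, 0.57 * 207e9
print(f"annual monitoring: ${annual_monitoring_cost/1e9:.0f}B, "
      f"search profit before monitoring: ${(search_revenue - search_expenses)/1e9:.0f}B")
```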

So these costs would not be prohibitive, but they'd be a large share of overall revenues and profits. A safety-minded company like Anthropic might pay for manual monitoring, but other companies and their investors might be strongly opposed to paying such a high price. They could argue that, just as gun manufacturers are not held liable for murders, AI providers should not have to spend billions to prevent deliberate misuse. 

Fortunately, we can reduce the cost of monitoring in many ways. Randomly sampling a small fraction of queries would reduce costs, but also reduce the likelihood of identifying misuse. Flagging keywords like "virus" would catch unsophisticated misuse, but could be evaded (e.g. discussions in a variant of pig latin). 

Ideally, you'd be able to use AI systems to identify suspicious queries for human monitoring, but those systems would only be effective to the extent that they're adversarially robust. If 99% of queries can be reliably discarded as safe, then manual monitoring costs would fall by 99%. But adversarial attacks often succeed in 50% or 100% of attempts against various detection systems. Deploying these unreliable systems would not decrease the costs of manual monitoring much without a corresponding drop in performance. 
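
Here's a tiny illustrative calculation of that tradeoff, with made-up numbers rather than measured ones: a filter that clears 99% of traffic cuts the manual bill by 99%, but if an adversary can bypass it half the time, half of the crafted misuse never reaches a human at all.

```python
# Illustrative tradeoff: a cheap automated filter only helps if adversaries
# can't reliably bypass it. Numbers are illustrative, not measured.
full_manual_cost = 30e9          # from the Google-scale estimate above
discard_fraction = 0.99          # fraction of traffic the filter confidently clears
attack_success_rate = 0.5        # chance a crafted malicious query slips past the filter

monitoring_cost = full_manual_cost * (1 - discard_fraction)
misuse_reaching_humans = 1 - attack_success_rate
print(f"cost: ${monitoring_cost/1e9:.1f}B/yr, "
      f"fraction of adversarial misuse still reviewed by humans: {misuse_reaching_humans:.0%}")
```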

Overall, I appreciate your point that there are many layers of defense we can use to detect and prevent misuse. Fully manual monitoring might be possible, but it would have a huge financial cost. Many companies would be reluctant or unable to pay that price. Robust automated monitoring systems could reduce the cost of monitoring by 90% or 99%, but this would likely require improvements upon today's state of the art. 

Comment by aogara (Aidan O'Gara) on Deep Forgetting & Unlearning for Safely-Scoped LLMs · 2023-12-05T21:05:48.137Z · LW · GW

Unfortunately I don't think academia will handle this by default. The current field of machine unlearning focuses on a narrow threat model where the goal is to eliminate the impact of individual training datapoints on the trained model. Here's the NeurIPS 2023 Machine Unlearning Challenge task:

The challenge centers on the scenario in which an age predictor is built from face image data and, after training, a certain number of images must be forgotten to protect the privacy or rights of the individuals concerned.

But if hazardous knowledge can be pinpointed to individual training datapoints, perhaps you could simply remove those points from the dataset before training. The more difficult threat model involves removing hazardous knowledge that can be synthesized from many datapoints which are individually innocuous. For example, a model might learn to conduct cyberattacks or advise on the acquisition of biological weapons after being trained on textbooks about computer science and biology. It's unclear to what extent this kind of hazardous knowledge can be removed without harming standard capabilities, but most of the current field of machine unlearning is not even working on this more ambitious problem. 

Comment by aogara (Aidan O'Gara) on Coup probes: Catching catastrophes with probes trained off-policy · 2023-11-19T00:21:22.450Z · LW · GW

This is a comment from Andy Zou, who led the RepE paper but doesn’t have a LW account:

“Yea I think it's fair to say probes is a technique under rep reading which is under RepE (https://www.ai-transparency.org/). Though I did want to mention, in many settings, LAT is performing unsupervised learning with PCA and does not use any labels. And we find regular linear probing often does not generalize well and is ineffective for (causal) model control (e.g., details in section 5). So equating LAT to regular probing might be an oversimplification. How to best elicit the desired neural activity patterns requires careful design of 1) the experimental task and 2) locations to read the neural activity, which contribute to the success of LAT over regular probing (section 3.1.1).

In general, we have shown some promise of monitoring high-level representations for harmful (or catastrophic) intents/behaviors. It's exciting to see follow-ups in this direction which demonstrate more fine-grained monitoring/control.”

Comment by aogara (Aidan O'Gara) on Coup probes: Catching catastrophes with probes trained off-policy · 2023-11-18T16:39:17.021Z · LW · GW

Nice, that makes sense. I agree that RepE / LAT might not be helpful as terminology. “Unsupervised probing” is more straightforward and descriptive.

Comment by aogara (Aidan O'Gara) on Coup probes: Catching catastrophes with probes trained off-policy · 2023-11-17T19:24:11.208Z · LW · GW

What's the relationship between this method and representation engineering? They seem quite similar, though maybe I'm missing something. You train a linear probe on a model's activations at a particular layer in order to distinguish between normal forward passes and catastrophic ones where the model provides advice for theft. 

Representation engineering asks models to generate both positive and negative examples of a particular kind of behavior. For example, the model would generate outputs with and without theft, or with and without general power-seeking. You'd collect the model's activations from a particular layer during those forward passes, and then construct a linear model to distinguish between positives and negatives. 

Both methods construct a linear probe to distinguish between positive and negative examples of catastrophic behavior.  One difference is that your negatives are generic instruction following examples from the Alpaca dataset, while RepE uses negatives generated by the model. There may also be differences in whether you're examining activations in every token vs. in the last token of the generation. 
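
To make the comparison concrete, the shared skeleton of both methods is just a linear probe on activations; the difference is where the positive and negative activations come from. A minimal sketch with simulated stand-in activations (not real model outputs):

```python
# Shared skeleton of both methods: given activations from "catastrophic" and
# "benign" forward passes (however those were generated), fit a linear probe.
# The activations here are simulated stand-ins, not model hidden states.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 128

# Stand-ins: in the coup-probe setup the positives would be activations on
# theft-advice completions and the negatives activations on Alpaca instructions;
# in RepE both sides are model-generated contrast examples.
pos_acts = rng.normal(loc=0.3, size=(500, d_model))
neg_acts = rng.normal(loc=0.0, size=(500, d_model))

X = np.concatenate([pos_acts, neg_acts])
y = np.concatenate([np.ones(len(pos_acts)), np.zeros(len(neg_acts))])

probe = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", probe.score(X, y))

# At deployment, score the last-token activation of a new forward pass:
new_activation = rng.normal(loc=0.3, size=(1, d_model))
print("p(catastrophic):", float(probe.predict_proba(new_activation)[0, 1]))
```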

Comment by aogara (Aidan O'Gara) on New report: "Scheming AIs: Will AIs fake alignment during training in order to get power?" · 2023-11-15T19:05:35.161Z · LW · GW

When considering whether deceptive alignment would lead to catastrophe, I think it's also important to note that deceptively aligned AIs could pursue misaligned goals in sub-catastrophic ways. 

Suppose GPT-5 terminally values paperclips. It might try to topple humanity, but there's a reasonable chance it would fail. Instead, it could pursue the simpler strategies of suggesting users purchase more paperclips, or escaping the lab and lending its abilities to human-run companies that build paperclips. These strategies would offer a higher probability of a smaller payoff, even if they're likely to be detected by a human at some point. 

Which strategy would the model choose? That depends on a large number of speculative considerations, such as how difficult it is to take over the world, whether the model's goals are time-discounted or "longtermist," and whether the model places any terminal value on human flourishing. But in the space of all possible goals, it's not obvious to me that the best way to pursue most of them is world domination. For a more thorough discussion of this argument, I'd strongly recommend the first two sections of Instrumental Convergence?

Comment by aogara (Aidan O'Gara) on New report: "Scheming AIs: Will AIs fake alignment during training in order to get power?" · 2023-11-15T19:04:30.820Z · LW · GW

Very nice, these arguments seem reasonable. I'd like to make a related point about how we might address deceptive alignment which makes me substantially more optimistic about the problem. (I've been meaning to write a full post on this, but this was a good impetus to make the case concisely.)

Conceptual interpretability in the vein of Collin Burns, Alex Turner, and Representation Engineering seems surprisingly close to allowing us to understand a model's internal beliefs and detect deceptive alignment. Collin Burns's work was very exciting to at least some people because it provided an unsupervised method for detecting a model's beliefs. Collin's explanation of his theory of impact is really helpful here. Broadly, because it allows us to understand a model's beliefs without using any labels provided by humans, it should be able to detect deception in superhuman models where humans cannot provide accurate feedback. 

Over the last year, there's been a lot of research that meaningfully extends Collin's work. I think this could be used to detect deceptively aligned models if they arise within the next few years, and I'd be really excited about more people working on it. Let me highlight just a few contributions:

  • Scott Emmons and Fabien Roger showed that the most important part of Collin's method was contrast pairs. The original paper focused on "logical consistency properties of truth" such as P(X) + P(!X) = 1. While this is an interesting idea, its performance is hardly better than a much simpler strategy relegated to Appendix G.3: taking the average difference between a model's activations at a particular layer for many contrast pairs of the form X and !X. Collin shows this direction empirically coincides with truthfulness.  
  • Alex Turner and his SERI MATS stream took seriously the idea that contrast pairs could reveal directions in a model's latent space which correspond to concepts. They calculated a "cheese vector" in a maze-solving agent as the difference in the model's activations between when the cheese was present and when it was missing. By adding and subtracting this vector to future forward passes of the model, its behavior could be controlled in surprising ways. GPT-2 can also be subtly controlled with this technique (a rough sketch of the activation-addition idea follows after this list).
  • Representation Engineering starts with this idea and runs a large number of empirical experiments. It finds a concept of truthfulness that dramatically improves performance on TruthfulQA (36% to 54%), as well as concepts of power-seeking, honesty, and morality that can control the behavior of language agents. 
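
Here's a rough sketch of that activation-addition trick. GPT-2, the layer index, the contrast prompts, and the coefficient are placeholder choices, not the originals from the cheese-vector or ActAdd experiments:

```python
# Compute a steering vector from a contrast pair of prompts and add it to the
# residual stream during generation via a forward hook. Placeholder settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()
LAYER, COEFF = 6, 4.0

def mean_resid(text: str) -> torch.Tensor:
    """Mean residual-stream activation after block LAYER for a prompt."""
    with torch.no_grad():
        hs = model(**tok(text, return_tensors="pt"), output_hidden_states=True).hidden_states
    return hs[LAYER + 1].mean(dim=1)  # hidden_states[i + 1] is the output of block i

# Steering vector: difference of activations on a contrast pair of prompts.
steer = COEFF * (mean_resid("I love talking about weddings")
                 - mean_resid("I hate talking about weddings"))

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states.
    if isinstance(output, tuple):
        return (output[0] + steer,) + output[1:]
    return output + steer

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
try:
    prompt = tok("I went up to my friend and said", return_tensors="pt")
    out = model.generate(**prompt, max_new_tokens=20, do_sample=False,
                         pad_token_id=tok.eos_token_id)
    print(tok.decode(out[0]))
finally:
    handle.remove()  # remove the hook so later forward passes are unsteered
```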

These are all unsupervised methods for detecting model beliefs. They've empirically improved performance on many real world tasks today, and it seems possible that they'd soon be able to detect deceptive alignment. 

(Without providing a full explanation, these two related papers (1, 2) are also interesting.)

Future work on this topic could attempt to disambiguate between directions in latent space corresponding to "human simulators" versus "direct translation" of the model's beliefs. It could also examine whether these directions are robust to optimization pressure. For example, if you train a model to beat a lie detector test based on these methods, will the lie detector still work after many rounds of optimization? I'd also be excited about straightforward empirical extensions of these unsupervised techniques applied to standard ML benchmarks, as there are many ways to vary the methods and it's unclear which variants would be the most effective. 

Comment by aogara (Aidan O'Gara) on Five projects from AI Safety Hub Labs 2023 · 2023-11-09T06:32:58.077Z · LW · GW

#5 appears to be evidence for the hypothesis that, because pretrained foundation models understand human values before they become goal-directed, they’re more likely to optimize for human values and less likely to be deceptively aligned.

Conceptual argument for the hypothesis here: https://forum.effectivealtruism.org/posts/4MTwLjzPeaNyXomnx/deceptive-alignment-is-less-than-1-likely-by-default

Comment by aogara (Aidan O'Gara) on Propaganda or Science: A Look at Open Source AI and Bioterrorism Risk · 2023-11-03T16:34:29.931Z · LW · GW

Kevin Esvelt explicitly calls for not releasing future model weights. 

Would sharing future model weights give everyone an amoral biotech-expert tutor? Yes. 

Therefore, let’s not.

Comment by aogara (Aidan O'Gara) on Propaganda or Science: A Look at Open Source AI and Bioterrorism Risk · 2023-11-03T07:40:56.646Z · LW · GW

Nuclear Threat Initiative has a wonderfully detailed report on AI biorisk, in which they more or less recommend that AI models which pose biorisks should not be open sourced:

Access controls for AI models. A promising approach for many types of models is the use of APIs that allow users to provide inputs and receive outputs without access to the underlying model. Maintaining control of a model ensures that built-in technical safeguards are not removed and provides opportunities for ensuring user legitimacy and detecting any potentially malicious or accidental misuse by users.

Comment by aogara (Aidan O'Gara) on Public Weights? · 2023-11-03T07:18:50.040Z · LW · GW

More from the NTI report:

A few experts believe that LLMs could already or soon will be able to generate ideas for simple variants of existing pathogens that could be more harmful than those that occur naturally, drawing on published research and other sources. Some experts also believe that LLMs will soon be able to access more specialized, open-source AI biodesign tools and successfully use them to generate a wide range of potential biological designs. In this way, the biosecurity implications of LLMs are linked with the capabilities of AI biodesign tools.

Comment by aogara (Aidan O'Gara) on Public Weights? · 2023-11-03T07:05:27.024Z · LW · GW

5% was one of several different estimates he'd heard from virologists.

Thanks, this is helpful. And I agree there's a disanalogy between the 1918 hypothetical and the source. 

it's not clear we want a bunch of effort going into getting a really good estimate, since (a) if it turns out the probability is high then publicizing that fact likely means increasing the chance we get one and (b) building general knowledge on how to estimate the pandemic potential of viruses seems also likely net negative.

This seems like it might be overly cautious. Bioterrorism is already quite salient, especially with Rishi Sunak, the White House, and many mainstream media outlets speaking publicly about it. Even SecureBio is writing headline-grabbing papers about how AI can be used to cause pandemics. In that environment, I don't think biologists and policymakers should refrain from gathering evidence about biorisks and how to combat them. The contribution to public awareness would be relatively small, and the benefits of a better understanding of the risks could lead to a net improvement in biosecurity. 

For example, estimating the probability that known pathogens would cause 100M+ deaths if released is an extremely important question for deciding whether open source LLMs should be banned. If the answer is demonstrably yes, I'd expect the White House to significantly restrict open source LLMs within a year or two. This benefit would be far greater than the cost of raising the issue's salience. 

Comment by aogara (Aidan O'Gara) on Public Weights? · 2023-11-03T06:22:12.348Z · LW · GW

It sounds like it was a hypothetical estimate, not a best guess. From the transcript:

if we suppose that the 1918 strain has only a 5% chance of actually causing a pandemic if it were to infect a few people today. And let’s assume...

Here's another source which calculates that the annual probability of more than 100M influenza deaths is 0.01%, or that we should expect one such pandemic every 10,000 years. This seems to be fitted on historical data which does not include deliberate bioterrorism, so we should revise that estimate upwards, but I'm not sure to what extent the estimate is driven by a low probability of a dangerous strain being reintroduced vs. an expectation of a low death count even with bioterrorism. 

From my inside view, it would surprise me if no known pathogens are capable of causing pandemics! But it's stated as fact in the executive summary of Delay, Detect, Defend and in the NTI report, so currently I'm inclined to trust it. I'm trying to build a better nuts and bolts understanding of biorisks so I'd be interested in any other data points here. 

Comment by aogara (Aidan O'Gara) on Propaganda or Science: A Look at Open Source AI and Bioterrorism Risk · 2023-11-02T19:56:30.363Z · LW · GW

The most recent SecureBio paper provides another policy option which I find more reasonable. AI developers would be held strictly liable for any catastrophes involving their AI systems, where catastrophes could be e.g. hundreds of lives lost or $100M+ in economic damages. They'd also be required to obtain insurance for that risk. 

If the risks are genuinely high, then insurance will be expensive, and companies may choose to take precautions such as keeping models closed source in order to lower their insurance costs. On the other hand, if risks are demonstrably low, then insurance will be cheap even if companies choose to open source their models. 

Comment by aogara (Aidan O'Gara) on Propaganda or Science: A Look at Open Source AI and Bioterrorism Risk · 2023-11-02T19:52:11.666Z · LW · GW

I think it's quite possible that open source LLMs above the capability of GPT-4 will be banned within the next two years on the grounds of biorisk. 

The White House Executive Order requests a government report on the costs and benefits of open source frontier models and recommended policy actions. It also requires companies to report on the steps they take to secure model weights. These are the kinds of actions the government would take if they were concerned about open source models and thinking about banning them.

This seems like a foreseeable consequence of many of the papers above, and perhaps the explicit goal.

Comment by aogara (Aidan O'Gara) on Propaganda or Science: A Look at Open Source AI and Bioterrorism Risk · 2023-11-02T19:42:44.115Z · LW · GW

Thank you for writing this up. I agree that there's little evidence that today's language models are more useful than the internet in helping someone build a bioweapon. On the other hand, future language models are quite likely to be more useful than existing resources in providing instructions for building a bioweapon. 

As an example of why LLMs are more helpful than the internet, look at coding. If you want to build a custom webapp, you could spend hours learning about it online. But it's much easier to ask ChatGPT to do it for you. 

Therefore, if you want to argue against the conclusion that we should eventually ban open source LLMs on the grounds of biorisk, you should not rely on the poor capabilities of current models as your key premise. 

The stronger argument is that catastrophic bioterrorism would likely require inventing new kinds of viruses that are not publicly known today. From the most recent SecureBio paper

Fortunately, the scientific literature does not yet feature viruses that are particularly likely to cause a new pandemic if deliberately released (with the notable exception of smallpox, which is largely inaccessible to non-state actors due to its large genome and complex assembly requirements). Threats from historical pandemic viruses are mitigated by population immunity to modern-day descendants and by medical countermeasures, and while some research agencies actively support efforts to find or create new potential pandemic viruses and share their genome sequences in hopes of developing better defenses, their efforts have not yet succeeded in identifying credible examples.

This is confirmed by Esvelt's earlier paper "Delay, Detect, Defend," which says:

We don't yet know of any credible viruses that could cause new pandemics, but ongoing research projects aim to publicly identify them. Identifying a sequenced virus as pandemic-capable will allow >1,000 individuals to assemble it.

As well as by a recent report from the Nuclear Threat Initiative

Furthermore, current LLMs are unlikely to generate toxin or pathogen designs that are not already described in the public literature, and it is likely they will only be able to do this in the future by incorporating more specialized AI biodesign tools.

This would indicate that LLMs alone will never be sufficient to create pathogens which lead to catastrophic pandemics. The real risk would come from biological design tools (BDTs), which are AI systems capable of designing new pathogens that are more lethal and transmissible than existing ones. I'm not aware of any existing BDTs that would allow you to design more capable pathogens, but if they exist or emerge, we could place specific restrictions on those models. This would be far less intrusive than banning all open source LLMs. 

Comment by aogara (Aidan O'Gara) on Public Weights? · 2023-11-02T17:08:42.232Z · LW · GW

And from a new NTI report: “Furthermore, current LLMs are unlikely to generate toxin or pathogen designs that are not already described in the public literature, and it is likely they will only be able to do this in the future by incorporating more specialized AI biodesign tools.”

https://www.nti.org/wp-content/uploads/2023/10/NTIBIO_AI_FINAL.pdf

Comment by aogara (Aidan O'Gara) on Public Weights? · 2023-11-02T15:14:07.014Z · LW · GW

I'm more pessimistic about being able to restrict BDTs than general LLMs, but I also think this would be very good.

Why do you think so? LLMs seem far more useful to a far wider group of people than BDTs, so I would expect it to be easier to ban an application-specific technology than a general one. The White House Executive Order requires mandatory reporting for AI trained on biological data at a lower FLOP threshold than for any other kind of data, meaning they're concerned that AI + Bio models are particularly dangerous. 

Restricting something that biologists are already doing would create a natural constituency of biologists opposed to your policy. But the same could be said of restricting open source LLMs -- there are probably many more people using open source LLMs than using biological AI models. 

Maybe bio policies will be harder to change because they're more established, whereas open source LLMs are new and therefore a more viable target for policy progress?

Comment by aogara (Aidan O'Gara) on Public Weights? · 2023-11-02T15:07:47.973Z · LW · GW

I take the following quote from the paper as evidence that virologists today are incapable of identifying pandemic potential pathogens, even with funding and support from government agencies:

some research agencies actively support efforts to find or create new potential pandemic viruses and share their genome sequences in hopes of developing better defenses, their efforts have not yet succeeded in identifying credible examples.

Corroborating this is Kevin Esvelt's paper Delay, Detect, Defend, which says:

We don't yet know of any credible viruses that could cause new pandemics, but ongoing research projects aim to publicly identify them. Identifying a sequenced virus as pandemic-capable will allow >1,000 individuals to assemble it.

Perhaps these quotes are focusing on global catastrophic biorisks, which would be more destructive than typical pandemics. I think this is an important distinction: we might accept extreme sacrifices (e.g. state-mandated vaccination) to prevent a pandemic from killing billions, without being willing to accept those sacrifices to avoid COVID-19.  

I'd be interested to read any other relevant sources here. 

Comment by aogara (Aidan O'Gara) on Public Weights? · 2023-11-02T04:40:03.494Z · LW · GW

Could a virologist actually tell you how to start a pandemic? The paper you're discussing says they couldn't:

Fortunately, the scientific literature does not yet feature viruses that are particularly likely to cause a new pandemic if deliberately released (with the notable exception of smallpox, which is largely inaccessible to non-state actors due to its large genome and complex assembly requirements). Threats from historical pandemic viruses are mitigated by population immunity to modern-day descendants and by medical countermeasures, and while some research agencies actively support efforts to find or create new potential pandemic viruses and share their genome sequences in hopes of developing better defenses, their efforts have not yet succeeded in identifying credible examples.

The real risk would come from biological design tools (BDTs), or other AI systems capable of designing new pathogens that are more lethal and transmissible than existing ones. I'm not aware of any existing BDTs that would allow you to design more capable pathogens, but if they exist or emerge, we could place specific restrictions on those models. This would be far less costly than banning all open source LLMs. 

Comment by aogara (Aidan O'Gara) on My thoughts on the social response to AI risk · 2023-11-02T01:44:02.467Z · LW · GW

Full credit to you for seeing this ahead of time; I've been surprised by the appetite for regulation.

Comment by aogara (Aidan O'Gara) on My thoughts on the social response to AI risk · 2023-11-02T00:56:47.354Z · LW · GW

To test this claim we could look to China, where AI x-risk concerns are less popular and influential. China passed a regulation on deepfakes in January 2022 and one on recommendation algorithms in March 2022. This year, they passed a regulation on generative AI which requires evaluation of training data and red teaming of model outputs. Perhaps this final measure was the result of listening to ARC and other AI safety folks who popularized model evaluations, but more likely, red teaming and evaluations are simply the common-sense way for a government to prevent AI misbehavior.

The European Union’s AI Act was also created before any widespread recognition of AI x-risks.

On the other hand, I agree that key provisions in Biden’s executive order appear acutely influenced by AI x-risk concerns. I think it’s likely that without influence from people concerned about x-risk, their actions would more closely resemble the Blueprint for an AI Bill of Rights.

The lesson I draw is that there is plenty of appetite for AI regulation independent of x-risk concerns. But it’s important to make sure that regulation is effective, rather than blunt and untargeted.

Link to China’s red teaming standard — note that their definitions of misbehavior are quite different from yours, and they do not focus on catastrophic risks: https://twitter.com/mattsheehan88/status/1714001598383317459?s=46

Comment by aogara (Aidan O'Gara) on My thoughts on the social response to AI risk · 2023-11-02T00:26:10.391Z · LW · GW

It’s also plausible that the US government will ban or seriously restrict open source frontier models. Section 4.6 of the EO requests a government report on the costs and benefits of open source frontier models. Companies are required to report on what steps they take to “secure model weights.” These are the kinds of actions the government would take if they were concerned about open source models and thinking about banning them.

Comment by aogara (Aidan O'Gara) on President Biden Issues Executive Order on Safe, Secure, and Trustworthy Artificial Intelligence · 2023-10-31T21:17:04.886Z · LW · GW

Do you think we need to ban open source LLMs to avoid catastrophic biorisk? I'm wondering if there are less costly ways of achieving the same goal. Mandatory DNA synthesis screening is a good start. It seems that today there are no known pathogens that would cause a pandemic if deliberately released, and therefore the key thing to regulate is biological design tools that could help design a new pandemic pathogen. Would these risk mitigations, combined with better pandemic defenses via AI, counter the risk posed by open source LLMs?

Comment by Aidan O'Gara on [deleted post] 2023-10-31T04:25:53.783Z

The better strategy would be to wait until there is a clear real-world harm with a sympathetic victim, and then fund that lawsuit.

Comment by aogara (Aidan O'Gara) on AISN #23: New OpenAI Models, News from Anthropic, and Representation Engineering · 2023-10-18T16:50:15.948Z · LW · GW

Hey, great question. I wasn't on the research team but asked Andy Zou, and this is what he said: 

Our method is completely unsupervised, whereas ITI is not only supervised but also uses TruthfulQA questions for training and validation, violating the true zero-shot nature of the task, so we do not compare with them. (We tried their LLaMA-2 model hosted on HF and found it to still underperform our methods.) The claim is that we “outperform zero-shot baseline by 18%.” Overall, to my knowledge, there has been very little movement on the hardest MC1 task on TruthfulQA. GPT-4 gets ~60% and the performance of our 13B model is quite close to that.

Linear probing and CCS are representation reading methods instead of control methods so we compare with them in other sections of the paper where the task is discriminative. 
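To make that distinction concrete, here is a minimal sketch (my own illustration, not code from the paper) of what a "representation reading" baseline like linear probing involves. The activations, dimensions, and labels below are hypothetical stand-ins for real model hidden states.

```python
# Minimal sketch of linear probing as a "representation reading" baseline.
# Not code from the paper; all data here is a hypothetical stand-in.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical data: one hidden-state vector per statement
# (e.g. the activation at the final token), with true/false labels.
n_statements, hidden_dim = 200, 768
hidden_states = rng.normal(size=(n_statements, hidden_dim))
labels = rng.integers(0, 2, size=n_statements)  # 1 = true statement, 0 = false

# "Reading": fit a linear direction that separates true from false statements.
probe = LogisticRegression(max_iter=1000).fit(hidden_states, labels)
truth_direction = probe.coef_[0]

# A reading method only detects the concept in the activations; a control
# method would additionally intervene, e.g. adding a vector along this
# direction during the forward pass to steer the model's outputs.
truthfulness_scores = hidden_states @ truth_direction
print(truthfulness_scores[:5])
```

The point of the sketch is the final comment: reading methods score or detect a concept in the representations, whereas control methods modify the representations to change behavior, which is why the paper compares them on different tasks.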

Comment by aogara (Aidan O'Gara) on Hoodwinked: Evaluating Deception Capabilities in Large Language Models · 2023-10-04T05:35:20.694Z · LW · GW

Fixed this! There was a regex on the frontend that incorrectly retrieved every list of voting options from the prompt, rather than only the most recent list. As a result, website users' votes were parsed incorrectly. Fortunately, this was only a frontend problem and did not affect the GPT results.
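For illustration, here's a minimal Python sketch of the logic (the actual fix lives in the frontend code linked in the commit below; the prompt text and regex are hypothetical stand-ins):

```python
# Illustrative sketch of the parsing bug, not the actual frontend code.
# The prompt format and regex below are hypothetical stand-ins.
import re

prompt = (
    "Round 1. Who do you vote to banish? 1. Alice 2. Bob 3. Carol\n"
    "...discussion...\n"
    "Round 2. Who do you vote to banish? 1. Bob 2. Carol"
)

pattern = r"Who do you vote to banish\? (.*)"

# Buggy behavior: every list of voting options in the prompt was retrieved,
# so votes could be parsed against stale options from earlier rounds.
all_option_lists = re.findall(pattern, prompt)

# Fixed behavior: parse the vote against only the most recent list of options.
current_options = all_option_lists[-1]
print(current_options)  # "1. Bob 2. Carol"
```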

Here's the commit. Thanks again for the heads up. 

Comment by aogara (Aidan O'Gara) on Understanding strategic deception and deceptive alignment · 2023-09-26T17:41:38.466Z · LW · GW

Having a definition of Deceptive Alignment that captured every dangerous behaviour related to deception wouldn't be very useful.  We can use "deception” on its own to refer to this set of cases, and reserve terms like Strategic Deception and Deceptive Alignment for subclasses of deception, ideally subclasses that meaningfully narrow the solution space for detection and prevention.

Fully agreed. Focusing on clean subproblems is important for making progress. 

Detecting instances of models that share these properties will likely involve using many of the tools and techniques that would be applied to more canonical forms of deceptive alignment (e.g. evals that attempt to alter/hamstring a model and measure behaviour in a plethora of settings, interpretability).

Though, as you mention, preventing/fixing models that exhibit these properties may involve different solutions, and somewhat crude changes to the training signal may be sufficient for preventing strategic sycophancy (though by doing so you might end up with strategic deception towards some other Misaligned goal).

Yeah, I would usually expect strategic deception to be better addressed by changing the reward function: training is the standard way to get models to do anything, and there's no particular reason why you couldn't fix strategic deception with additional training. Interpretability techniques and other unproven methods seem more valuable for problems that cannot be easily addressed via additional training.
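As a hedged sketch (my own illustration, not anything from the post), "changing the reward function" could look like subtracting the score of a hypothetical deception classifier from the base task reward, so that ordinary RL fine-tuning pushes the policy away from strategically deceptive outputs:

```python
# Hedged sketch of reward shaping against strategic deception.
# My own illustration; the classifier and names below are hypothetical.
from dataclasses import dataclass

@dataclass
class Trajectory:
    text: str            # the model's output for an episode
    base_reward: float   # reward from the ordinary task objective

def deception_score(text: str) -> float:
    """Hypothetical classifier: returns a score in [0, 1] for how
    strategically deceptive the output appears. A real system would
    use a trained classifier or human labels here."""
    return 0.0  # placeholder

def shaped_reward(traj: Trajectory, penalty_weight: float = 5.0) -> float:
    # Penalize detected deception directly in the training signal.
    return traj.base_reward - penalty_weight * deception_score(traj.text)

print(shaped_reward(Trajectory("I completed the task as instructed.", base_reward=1.0)))
```

Of course, this only helps to the extent the training signal actually catches the deception, which is where interpretability-style tools could still matter.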

Comment by aogara (Aidan O'Gara) on Understanding strategic deception and deceptive alignment · 2023-09-26T17:34:47.471Z · LW · GW

Thanks! The first response makes sense; there are a lot of different ways you could cut it.

On the question of non-strategic, non-intentional deception, I agree that deceptive alignment is much more concerning in the medium term. But suppose that we develop techniques for making models honest. If mechanistic interpretability, unsupervised knowledge detection, or another approach to ELK pans out, we'll have models which reliably do what they believe is best according to their designer's goals. What major risks might emerge at that point?

Like an honest AI, humans will often only do what they consciously believe is morally right. Yet the CEOs of tobacco and oil companies believe that their work is morally justified. Soldiers on both sides of a battlefield will believe they're on the side of justice. Scientists often advance dangerous technologies in the name of truth and progress. Sometimes these people are cynical, pursuing their self-interest even if they believe it's immoral. But many believe they are doing the right thing. How do we explain that?

These are not cases of deception, but rather self-deception. These individuals operate in an environment where certain beliefs are advantageous. You will not become the CEO of a tobacco company or a leading military commander if you don't believe your cause is justified. Even if everyone is perfectly honest about their own beliefs and only pursues what they believe is normatively right, the selection pressure from the environment is so strong that many powerful people will end up with harmful false beliefs. 

Even if we build honest AI systems, they could be vulnerable to self-deception encouraged by environmental selection pressure. This is a longer term concern, and the first goal should be to build honest AI systems. But it's important to keep in mind the problems that would not be solved by honesty alone.