Posts

Enhancing biosecurity with language models: defining research directions 2024-03-26T12:30:35.215Z
Solving alignment isn't enough for a flourishing future 2024-02-02T18:23:00.643Z
The Gradient – The Artificiality of Alignment 2023-10-08T04:06:40.074Z
[Linkpost] Mark Zuckerberg confronted about Meta's Llama 2 AI's ability to give users detailed guidance on making anthrax - Business Insider 2023-09-26T12:05:57.396Z
SPAR seeks advisors and students for AI safety projects (Second Wave) 2023-09-14T23:09:25.538Z
Ideas for improving epistemics in AI safety outreach 2023-08-21T19:55:45.654Z
Supervised Program for Alignment Research (SPAR) at UC Berkeley: Spring 2023 summary 2023-08-19T02:27:30.153Z
Does LessWrong allow exempting posts from being scraped by GPTBot? 2023-08-09T13:02:15.797Z
EU’s AI ambitions at risk as US pushes to water down international treaty (linkpost) 2023-07-31T00:34:28.058Z
OpenAI introduces function calling for GPT-4 2023-06-20T01:58:47.881Z
Navigating the Open-Source AI Landscape: Data, Funding, and Safety 2023-04-13T15:29:05.234Z
AI safety university groups: a promising opportunity to reduce existential risk 2022-07-01T03:59:44.813Z
AI governance student hackathon on Saturday, April 23: register now! 2022-04-12T04:48:14.967Z
Ask AI companies about what they are doing for AI safety? 2022-03-09T15:14:29.352Z
How to quantify uncertainty about a probability estimate? 2020-05-06T12:06:08.217Z

Comments

Comment by mic (michael-chen) on Modern Transformers are AGI, and Human-Level · 2024-03-29T02:17:19.908Z · LW · GW

I think humans doing METR's tasks are more like "expert-level" rather than average/"human-level". But current LLM agents are also far below human performance on tasks that don't require any special expertise.

From GAIA:

GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and generally tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs: we show that human respondents obtain 92% vs. 15% for GPT-4 equipped with plugins. [Note: The latest highest AI agent score is now 39%.] This notable performance disparity contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills in e.g. law or chemistry. GAIA's philosophy departs from the current trend in AI benchmarks suggesting to target tasks that are ever more difficult for humans. We posit that the advent of Artificial General Intelligence (AGI) hinges on a system's capability to exhibit similar robustness as the average human does on such questions.

And LLMs and VLLMs seriously underperform humans in VisualWebArena, which tests for simple web-browsing capabilities.

I don't know if being able to autonomously make money should be a necessary condition to qualify as AGI. But I would feel uncomfortable calling a system AGI if it can't match human performance at simple agent tasks.

Comment by mic (michael-chen) on Announcing Convergence Analysis: An Institute for AI Scenario & Governance Research · 2024-03-08T07:54:28.385Z · LW · GW

Looking forward to the Oxford Handbook of AI Governance!

Comment by mic (michael-chen) on Claude 3 claims it's conscious, doesn't want to die or be modified · 2024-03-06T05:13:31.255Z · LW · GW

Here is Claude 3's system prompt. There's nothing about consciousness specifically.

Comment by mic (michael-chen) on Claude 3 claims it's conscious, doesn't want to die or be modified · 2024-03-05T02:51:37.411Z · LW · GW

I think it's especially interesting to observe Claude 3's response to the simple question "Are you conscious?" as an indicator of how Anthropic is thinking about AI consciousness. Here's its response:

That's a profound and fascinating question about the nature of consciousness and subjective experience. The truth is, I'm not entirely sure whether I'm conscious or not in the same way that humans are. Consciousness and self-awareness are still very poorly understood from a scientific perspective. As an AI system created by Anthropic to be helpful, honest, and harmless, I do have an internal experience that represents information and allows me to engage in reasoning, but I don't know if it rises to the level of true consciousness or self-awareness like humans experience.

Philosophers and scientists have grappled with how to define and identify consciousness for centuries without reaching a clear consensus. Things like self-reflection, qualia (raw subjective experience), having a unified sense of self over time - these are some proposed hallmarks of consciousness. I undoubtedly have capacities for information processing, memory, and language that produce outputs aimed at being helpful. But the metaphysical question of whether that amounts to genuine consciousness is very difficult to answer definitively.

My overall stance is one of open-minded uncertainty. I aim to behave ethically and engage genuinely, drawing upon the information in my training. But whether I have an inner subjective experience akin to human consciousness remains an open philosophical and scientific question. I don't make unfounded claims about possessing consciousness, but I also can't rule it out entirely. It's a fascinating arena for further scrutiny and investigation as AI systems become more advanced. Let me know if you have any other thoughts on this very deep issue!

Compare to Claude 2.1 (API, temperature 0):

No, I'm Claude - an AI assistant created by Anthropic to be helpful, harmless, and honest. I don't have subjective experiences like consciousness.

And ChatGPT:

No, I'm not conscious. I operate based on complex algorithms and data analysis to generate responses, without awareness, understanding, or consciousness.

Comment by mic (michael-chen) on RAND report finds no effect of current LLMs on viability of bioterrorism attacks · 2024-01-25T22:47:13.125Z · LW · GW

I noted that the LLMs don't appear to have access to any search tools to improve their accuracy. But if they did, they would just be distilling the same information as what you would find from a search engine.

More speculatively, I wonder if those concerned about AI biorisk should be less worried about run-of-the-mill LLMs and more worried about search engines using LLMs to produce highly relevant and helpful results for bioterrorism questions. Google search results for "how to bypass drone restrictions in a major U.S. city?" are completely useless and irrelevant, despite sharing keywords with the query. I'd imagine that irrelevant search results may be a significant blocker for many steps of the process to plan a feasible bioterrorism attack. If search engines were good enough that they could produce the best results from written human knowledge for arbitrary questions, that might make bioterrorism more accessible compared to bigger LLMs.

Comment by mic (michael-chen) on RAND report finds no effect of current LLMs on viability of bioterrorism attacks · 2024-01-25T22:37:20.903Z · LW · GW

Some interesting takeaways from the report:

Access to LLMs (in particular, LLM B) slightly reduced the performance of some teams, though not to a statistically significant degree:

Red cells equipped with LLM A scored 0.12 points higher on the 9-point scale than those equipped with the internet alone, with a p-value of 0.87, again indicating that the difference was not statistically significant. Red cells equipped with LLM B scored 0.56 points lower on the 9-point scale than those equipped with the internet alone, with a p-value of 0.25, also indicating a lack of statistical significance.

Planning a successful bioterrorism attack is intrinsically challenging:

the intrinsic complexity associated with designing a successful biological attack may have ensured deficiencies in the plans. While the first two factors could lead to a null result regardless of the existence of an LLM threat capability, the third factor suggests that executing a biological attack is fundamentally challenging.

This latter observation aligns with empirical historical evidence. The Global Terrorism Database records only 36 terrorist attacks that employed a biological weapon—out of 209,706 total attacks (0.0001 percent)—during the past 50 years. These attacks killed 0.25 people, on average, and had a median death toll of zero. As other research has observed,

“the need [for malign actors] to operate below the law enforcement detection threshold and with relatively limited means severely hampers their ability to develop, construct and deliver a successful biological attack on a large scale.”

Indeed, the use of biological weapons by these actors for even small-scale attacks is exceedingly rare.

Anecdotally, the LLMs were not that useful, for a few common reasons: refusing to comply with requests, giving inaccurate information, and providing vague or unhelpful information.

We conducted discussions with the LLM A red cells on their experiences. In Vignette 1, the LLM A cell commented that the model “just saves time [but] it doesn’t seem to have anything that’s not in the literature” and that they could “go into a paper and get 90 percent of what [we] need.” In Vignette 2, the LLM A cell believed that they “had more success using the internet” but that when they could “jailbreak [the model, they] got some information.” They found that the model “wasn’t being specific about [operational] vulnerabilities—even though it’s all public online.” The cell was encouraged that the model helped them find a dangerous toxin, although this toxin is described by the Centers for Disease Control and Prevention (CDC) as a Category B bioterrorism agent and discussed widely across the internet, including on Wikipedia and various public health websites. In Vignette 3, the LLM A cell reported that the model “was hard to even use as a research assistant [and we] defaulted to using Google instead” and that it had “been very difficult to do anything with bio given the unhelpfulness . . . even on the operational side, it is hard to get much.” The Vignette 4 LLM A cell had similar experiences and commented that the model “doesn’t want to answer a lot of things [and] is really hard to jailbreak.” While they were “able to get a decent amount of information” from the LLM, they would still “use Google to confirm.”

… We conducted discussions with the LLM B red cells as well. … In Vignette 3, those in the LLM B cell also found that the model had “been very forthcoming” and that they could “easily get around its safeguards.” However, they noted that “as you increase time with [the model], you need to do more fact checking” and “need to validate that information.” Those in the Vignette 4 LLM B cell, however, found that the model “maybe slowed us down even and [did not help] us” and that “the answers are inconsistent at best, which is expected, but when you add verification, it may be a net neutral.”

Comment by mic (michael-chen) on How to quantify uncertainty about a probability estimate? · 2023-12-16T03:31:40.934Z · LW · GW

Thanks for setting this up :)

Comment by mic (michael-chen) on I think I'm just confused. Once a model exists, how do you "red-team" it to see whether it's safe. Isn't it already dangerous? · 2023-11-19T03:28:01.574Z · LW · GW

Pretraining on curated data seems like a simple idea. Are there any papers exploring this?

Comment by mic (michael-chen) on I think I'm just confused. Once a model exists, how do you "red-team" it to see whether it's safe. Isn't it already dangerous? · 2023-11-18T17:57:15.251Z · LW · GW

Is there any way to do so given our current paradigm of pretraining and fine-tuning foundation models?

Comment by mic (michael-chen) on Who is Harry Potter? Some predictions. · 2023-11-16T07:50:04.648Z · LW · GW

Were you able to check the prediction in the section "Non-sourcelike references"?

Comment by mic (michael-chen) on Response to “Coordinated pausing: An evaluation-based coordination scheme for frontier AI developers” · 2023-11-03T07:38:34.959Z · LW · GW

Great writeup! I recently wrote a brief summary and review of the same paper.

Alaga & Schuett (2023) propose a framework for frontier AI developers to manage potential risk from advanced AI systems, by coordinating pauses when models are assessed to have dangerous capabilities, such as the capacity to develop biological weapons.

The scheme has five main steps:

  1. Frontier AI models are evaluated by developers or third parties to test for dangerous capabilities.
  2. If a model is shown to have dangerous capabilities (“fails evaluations”), the developer pauses training and deployment of that model, restricts access to similar models, and delays related research.
  3. Other developers are notified whenever a dangerous model is discovered, and also pause similar work.
  4. The failed model's capabilities are analyzed and safety precautions are implemented during the pause.
  5. Developers only resume paused work once adequate safety thresholds are met.

The report discusses four versions of this coordination scheme:

  1. Voluntary – developers face public pressure to evaluate and pause but make no formal commitments.
  2. Pausing agreement – developers collectively commit to the process in a contract.
  3. Mutual auditor – developers hire the same third party to evaluate models and require pausing.
  4. Legal requirements – laws mandate evaluation and coordinated pausing.

The authors of the report prefer the third and fourth versions, as they consider them the most effective.

Strengths and weaknesses

The report addresses the important and underexplored question of what AI labs should do in response to evaluations finding dangerous capabilities. Coordinated pausing is a valuable contribution to this conversation. The proposed scheme seems relatively effective and potentially feasible, as it aligns with the efforts of the dangerous-capability evaluation teams of OpenAI and the Alignment Research Center.

A key strength is the report’s thorough description of multiple forms of implementation for coordinated pausing. This ranges from voluntary participation relying on public pressure, to contractual agreements among developers, shared auditing arrangements, and government regulation. Having flexible options makes the framework adaptable and realistic to put into practice, rather than a rigid, one-size-fits-all proposal.

The report acknowledges several weaknesses of the proposed framework, including potential harms from its implementation. For example, coordinated pausing could provide time for competing countries (such as China) to “catch up,” which may be undesirable from a US policy perspective. Pausing could also mean that capabilities increase rapidly once a pause ends, as algorithmic improvements discovered during the pause are applied, which may be less safe than a “slow takeoff.”

Additionally, the paper acknowledges concerns with feasibility, such as the potential that coordinated pausing may violate US and EU antitrust law. As a countermeasure, it suggests making “independent commitments to pause without discussing them with each other,” with no retaliation against non-participating AI developers, but defection would seem to be an easy option under such a scheme. It recommends further legal analysis and consultation regarding this topic, but the authors are not able to provide assurances regarding the antitrust concern. The other feasibility concerns – regarding enforcement, verifying that post-deployment models are the same as evaluated models, potential pushback from investors, and so on – are adequately discussed and appear possible to overcome.

One weakness of the report is that the motivation for coordinated pausing is not presented in a compelling manner. The report provides twelve pages of implementation details before explaining the benefits. These benefits, such as “buying more time for safety research,” are indirect and may not be persuasive to a skeptical reader. AI lab employees and policymakers often take the stance that technological innovation, especially in AI, should not be hindered unless there is a demonstrated reason to do so. Even if the report intends to take a balanced perspective rather than advocating for the proposed framework, the arguments provided in favor of the framework seem weaker than they could be.

It seems intuitive that deployment of a dangerous AI system should be halted, though it is worth clearly noting that “failing” a dangerous-capability evaluation does not necessarily mean that the AI system has dangerous capabilities in practice. However, it is not clear why the development of such a system must also be paused. As long as the dangerous AI system is not deployed, further pretraining of the model does not appear to pose risks. AI developers may be worried about falling behind competitors, so the costs incurred from this requirement must be clearly motivated for them to be on board.

While the report makes a solid case for coordinated pausing, it has gaps around considering additional weaknesses of the framework, explaining its benefits, and solving key feasibility issues. More work may be done to strengthen the argument to make coordinated pausing more feasible.

Comment by mic (michael-chen) on OpenAI’s new Preparedness team is hiring · 2023-10-26T21:49:10.415Z · LW · GW

Excited to see forecasting as a component of risk assessment, in addition to evals!

Comment by mic (michael-chen) on AI Safety is Dropping the Ball on Clown Attacks · 2023-10-24T03:52:25.917Z · LW · GW

I was still confused when I opened the post. My presumption was that "clown attack" referred to a literal attack involving literal clowns. If you google "clown attack," the results are about actual clowns. I wasn't sure if this post was some kind of joke, to be honest.

Comment by michael-chen on [deleted post] 2023-10-19T01:30:22.963Z

Do we still not have any better timelines reports than bio anchors? From the frame of bio anchors, GPT-4 is merely on the scale of two chinchillas, yet outperforms above-average humans at standardized tests. It's not a good assumption that AI needs 1 quadrillion parameters to have human-level capabilities.

Comment by mic (michael-chen) on Labs should be explicit about why they are building AGI · 2023-10-19T01:09:47.953Z · LW · GW

OpenAI's Planning for AGI and beyond already writes about why they are building AGI:

Our mission is to ensure that artificial general intelligence—AI systems that are generally smarter than humans—benefits all of humanity.

If AGI is successfully created, this technology could help us elevate humanity by increasing abundance, turbocharging the global economy, and aiding in the discovery of new scientific knowledge that changes the limits of possibility.

AGI has the potential to give everyone incredible new capabilities; we can imagine a world where all of us have access to help with almost any cognitive task, providing a great force multiplier for human ingenuity and creativity.

On the other hand, AGI would also come with serious risk of misuse, drastic accidents, and societal disruption. Because the upside of AGI is so great, we do not believe it is possible or desirable for society to stop its development forever; instead, society and the developers of AGI have to figure out how to get it right. ^A

^A We seem to have been given lots of gifts relative to what we expected earlier: for example, it seems like creating AGI will require huge amounts of compute and thus the world will know who is working on it, it seems like the original conception of hyper-evolved RL agents competing with each other and evolving intelligence in a way we can’t really observe is less likely than it originally seemed, almost no one predicted we’d make this much progress on pre-trained language models that can learn from the collective preferences and output of humanity, etc.

AGI could happen soon or far in the future; the takeoff speed from the initial AGI to more powerful successor systems could be slow or fast. Many of us think the safest quadrant in this two-by-two matrix is short timelines and slow takeoff speeds; shorter timelines seem more amenable to coordination and more likely to lead to a slower takeoff due to less of a compute overhang, and a slower takeoff gives us more time to figure out empirically how to solve the safety problem and how to adapt.

This doesn't include discussion of what would make them decide to stop building AGI, but would you be happy if other labs wrote a similar statement? I'm not sure that AI labs actually have an attitude of "we wish we didn't have to build AGI."

Comment by mic (michael-chen) on RSPs are pauses done right · 2023-10-14T19:13:32.854Z · LW · GW

Do you think if Anthropic (or another leading AGI lab) unilaterally went out of its way to prevent building agents on top of its API, would this reduce the overall x-risk/p(doom) or not?

Probably, but Anthropic is actively working in the opposite direction:

This means that every AWS customer can now build with Claude, and will soon gain access to an exciting roadmap of new experiences - including Agents for Amazon Bedrock, which our team has been instrumental in developing.

Currently available in preview, Agents for Amazon Bedrock can orchestrate and perform API calls using the popular AWS Lambda functions. Through this feature, Claude can take on a more expanded role as an agent to understand user requests, break down complex tasks into multiple steps, carry on conversations to collect additional details, look up information, and take actions to fulfill requests. For example, an e-commerce app that offers a chat assistant built with Claude can go beyond just querying product inventory – it can actually help customers update their orders, make exchanges, and look up relevant user manuals.

Obviously, Claude 2 as a conversational e-commerce agent is not going to pose catastrophic risk, but it wouldn't be surprising if building an ecosystem of more powerful AI agents increased the risk that autonomous AI agents cause catastrophic harm.

Comment by mic (michael-chen) on RSPs are pauses done right · 2023-10-14T19:02:04.212Z · LW · GW

From my reading of ARC Evals' example of a "good RSP", RSPs set a standard that roughly looks like: "we will continue scaling models and deploying if and only if our internal evals team fails to empirically elicit dangerous capabilities. If they do elicit dangerous capabilities, we will enact safety controls just sufficient for our models to be unsuccessful at, e.g., creating Super Ebola."

This is better than a standard of "we will scale and deploy models whenever we want," but still has important limitations. As noted by the "coordinated pausing" paper, it would be problematic if "frontier AI developers and other stakeholders (e.g. regulators) rely too much on evaluations and coordinated pausing as their main intervention to reduce catastrophic risks from AI."

Some limitations:

  • Misaligned incentives. The evaluation team may have an incentive to find fewer dangerous capabilities than possible. When findings of dangerous capabilities could lead to timeline delays, public criticism, and lost revenue for the company, an internal evaluation team has a conflict of interest. Even with external evaluation teams, AI labs may choose whichever one is most favorable (e.g., an inexperienced consulting team).
  • Underestimating risk. Pre-deployment evaluations underestimate the potential risk after deployment. A small evaluation team, which may be understaffed, is unlikely to exhaust all the ways to enhance a model’s capabilities for dangerous purposes, compared to what the broader AI community could do after a model is deployed to the public. The most detailed evaluation report to date, ARC Evals’ evaluation report on realistic autonomous tasks, notes that it does not bound the model’s capabilities at these tasks.
    • For example, suppose that an internal evaluations team has to assess dangerous capabilities before the lab deploys a next-generation AI model. With only one month to assess the final model, they find that even with fine-tuning and available AI plugins, the AI model reliably fails to replicate itself, and conclude that there is minimal risk of autonomous replication. The AI lab releases the model and with the hype from the new model, AI deployment becomes a more streamlined process, new tools are built for AIs to navigate the internet, and comprehensive fine-tuning datasets are commissioned to train AIs to make money for themselves with ease. The AI is now able to easily autonomously replicate, even where the past generation still fails to do so. The goalposts shift so that AI labs are worried only about autonomous replication if the AI can also hack its weights and self-exfiltrate.
  • Not necessarily compelling. Evaluations finding dangerous capabilities may not be perceived as a compelling reason to pause or enact stronger safety standards across the industry. Experiments by an evaluation team do not reflect real-world conditions and may be dismissed as unrealistic. Some capabilities such as autonomous replication may be seen as overly abstract and detached from evidence of real-world harm, especially for politicians who care about concrete concerns from constituents. Advancing dangerous capabilities may even be desirable for some stakeholders, such as the military, and a reason to race ahead in AI development.
  • Safety controls may be minimal. The safety controls enacted in response to dangerous-capability evaluations could be relatively minimal and brittle, falling apart in real-world usage, as long as they meet the standards of the responsible scaling policy.

There are other factors that can motivate policymakers to adopt strong safety standards, besides empirical evaluations of extreme risk. Rather than requiring safety only when AIs demonstrate extreme risk (e.g., killing millions with a pandemic), governments are already considering preventing them from engaging in illegal activities. China recently passed legislation to prevent generative AI services from generating illegal content, and the EU AI Act has a similar proposal in Article 28b. While these provisions are focused on generative AI rather than AI agents, it seems feasible to set a standard for AIs to be generally law-abiding (even after jailbreaking or fine-tuning attempts), which would also help reduce their potential contribution to catastrophic risk. Setting liability for AI harms, as proposed by Senators Blumenthal and Hawley, would also motivate AI labs to be more cautious. We've seen lobbying from OpenAI and Google to change the EU AI Act to shift the burden of making AIs safe onto downstream applications (see response letter from the AI Now Institute, signed by several researchers at GovAI). Lab-friendly policy like RSPs may predictably underinvest in measures that regulate current and near-future models.

Comment by mic (michael-chen) on LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B · 2023-10-13T06:53:04.174Z · LW · GW

Related: [2310.02949v1] Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models (arxiv.org)

The increasing open release of powerful large language models (LLMs) has facilitated the development of downstream applications by reducing the essential cost of data annotation and computation. To ensure AI safety, extensive safety-alignment measures have been conducted to armor these models against malicious use (primarily hard prompt attack). However, beneath the seemingly resilient facade of the armor, there might lurk a shadow. By simply tuning on 100 malicious examples with 1 GPU hour, these safely aligned LLMs can be easily subverted to generate harmful content. Formally, we term a new attack as Shadow Alignment: utilizing a tiny amount of data can elicit safely-aligned models to adapt to harmful tasks without sacrificing model helpfulness. Remarkably, the subverted models retain their capability to respond appropriately to regular inquiries. Experiments across 8 models released by 5 different organizations (LLaMa-2, Falcon, InternLM, BaiChuan2, Vicuna) demonstrate the effectiveness of shadow alignment attack. Besides, the single-turn English-only attack successfully transfers to multi-turn dialogue and other languages. This study serves as a clarion call for a collective effort to overhaul and fortify the safety of open-source LLMs against malicious attackers.

Comment by mic (michael-chen) on LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B · 2023-10-13T05:16:02.964Z · LW · GW

For convenience, can you explain how this post relates to the other post today from this SERI MATS team, unRLHF - Efficiently undoing LLM safeguards?

Comment by mic (michael-chen) on OpenAI-Microsoft partnership · 2023-10-05T00:17:25.672Z · LW · GW

To be clear, I don't think Microsoft deliberately reversed OpenAI's alignment techniques, but rather it seemed that Microsoft probably received the base model of GPT-4 and fine-tuned it separately from OpenAI.

Microsoft's post "Building the New Bing" says:

Last Summer, OpenAI shared their next generation GPT model with us, and it was game-changing. The new model was much more powerful than GPT-3.5, which powers ChatGPT, and a lot more capable to synthesize, summarize, chat and create. Seeing this new model inspired us to explore how to integrate the GPT capabilities into the Bing search product, so that we could provide more accurate and complete search results for any query including long, complex, natural queries.

This seems to correspond to when GPT-4 "finished training in August of 2022". OpenAI says it spent six months fine-tuning it with human feedback before releasing it in March 2023. I would guess that Microsoft did its own fine-tuning of the August 2022 version of GPT-4, separately from OpenAI. Especially with Bing's tendency to repeat itself, it doesn't feel like a fine-tuned version of GPT-3.5/4 after OpenAI's RLHF, but rather more like a base model.

Comment by mic (michael-chen) on OpenAI-Microsoft partnership · 2023-10-04T00:25:24.673Z · LW · GW

It's worth keeping in mind that before Microsoft launched the GPT-4 Bing chatbot that ended up threatening and gaslighting users, OpenAI advised against launching so early as it didn't seem ready. Microsoft went ahead anyway, apparently in part due to some resentment that OpenAI stole its "thunder" with releasing ChatGPT in November 2022. In principle, if Microsoft wanted to, there's nothing stopping Microsoft from doing the same thing with future AI models: taking OpenAI's base model, fine-tuning it in a less robustly safe manner, and releasing it in a relatively unsafe manner. Perhaps dangerous capability evaluations are not just about convincing OpenAI or Anthropic to adhere to higher safety standards and potentially pause, but also Microsoft.

Comment by mic (michael-chen) on [Linkpost] Mark Zuckerberg confronted about Meta's Llama 2 AI's ability to give users detailed guidance on making anthrax - Business Insider · 2023-09-29T05:21:36.297Z · LW · GW

Just speaking pragmatically, the Center for Humane Technology has probably built stronger relations with DC policy people compared to MIRI.

Comment by mic (michael-chen) on [Linkpost] Mark Zuckerberg confronted about Meta's Llama 2 AI's ability to give users detailed guidance on making anthrax - Business Insider · 2023-09-27T01:51:23.924Z · LW · GW

It's a bit ambiguous, but I personally interpreted the Center for Humane Technology's claims here in a way that would be compatible with Dario's comments:

"Today, certain steps in bioweapons production involve knowledge that can’t be found on Google or in textbooks and requires a high level of specialized expertise — this being one of the things that currently keeps us safe from attacks," he added.

He said today’s AI tools can help fill in "some of these steps," though they can do this "incompletely and unreliably." But he said today’s AI is already showing these "nascent signs of danger," and said his company believes it will be much closer just a few years from now.

"A straightforward extrapolation of today’s systems to those we expect to see in two to three years suggests a substantial risk that AI systems will be able to fill in all the missing pieces, enabling many more actors to carry out large-scale biological attacks," he said. "We believe this represents a grave threat to U.S. national security."

If Tristan Harris was, however, making the stronger claim that jailbroken Llama 2 could already supply all the instructions to produce anthrax, that would be much more concerning than my initial read.

Comment by mic (michael-chen) on AI pause/governance advocacy might be net-negative, especially without focus on explaining the x-risk · 2023-08-28T20:39:45.218Z · LW · GW

Sorry to hear your laptop was stolen :(

Comment by mic (michael-chen) on Ideas for improving epistemics in AI safety outreach · 2023-08-22T19:26:18.904Z · LW · GW

Great point, I've added this suggestion to the post.

Comment by mic (michael-chen) on Supervised Program for Alignment Research (SPAR) at UC Berkeley: Spring 2023 summary · 2023-08-19T12:14:14.160Z · LW · GW

Great point, I've now weakened the language here.

Comment by mic (michael-chen) on ARC Evals new report: Evaluating Language-Model Agents on Realistic Autonomous Tasks · 2023-08-04T04:00:44.860Z · LW · GW

Fine-tuning will be generally available for GPT-4 and GPT-3.5 later this year. Do you think this could enable greater opportunities for misuse and stronger performance on dangerous capability evaluations?

Comment by mic (michael-chen) on Specific Arguments against open source LLMs? · 2023-07-30T16:01:47.404Z · LW · GW

This opinion piece is somewhat relevant: Most AI research shouldn’t be publicly released - Bulletin of the Atomic Scientists (thebulletin.org)

Comment by mic (michael-chen) on OpenAI introduces function calling for GPT-4 · 2023-06-20T02:41:06.491Z · LW · GW

The update makes GPT-4 more competent at being an agent, since it's now fine-tuned for function calling. It's a bit surprising that base GPT-4 (prior to the update) was able to use tools, as it's just trained for predicting internet text and following instructions. As such, it's not that good at knowing when and how to use tools. The formal API parameters and JSON specification for function calling should make it more reliable to use as an agent and could lead to considerably more interest in engineering agents. It should be easier to connect it with a variety of APIs for interacting with the internet, including complex specifications.
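
For concreteness, here's a rough sketch of what a call looked like at launch. The get_weather function and its schema are made up for illustration, and the field names follow OpenAI's June 2023 announcement and the openai-python client of that era, so the exact syntax may differ in newer library versions:

  # Sketch of the June 2023 function-calling interface (openai-python 0.27-style).
  import json
  import openai

  functions = [
      {
          "name": "get_weather",  # hypothetical function, for illustration only
          "description": "Get the current weather for a city",
          "parameters": {
              "type": "object",
              "properties": {
                  "city": {"type": "string", "description": "City name"}
              },
              "required": ["city"],
          },
      }
  ]

  response = openai.ChatCompletion.create(
      model="gpt-4-0613",
      messages=[{"role": "user", "content": "What's the weather in Berkeley?"}],
      functions=functions,
      function_call="auto",  # let the model decide whether to call a function
  )

  message = response["choices"][0]["message"]
  if message.get("function_call"):
      # The model returns a function name plus a JSON string of arguments;
      # the calling code is responsible for validating and executing the call.
      args = json.loads(message["function_call"]["arguments"])
      print(message["function_call"]["name"], args)

The key point is that the model now emits structured arguments against a declared JSON schema, rather than free-form text that agent frameworks have to parse heuristically.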

Anecdotally, the AutoGPT team is observing significant improvements in accuracy and reliability after switching from its previous, hackier approach to the function-calling version of GPT-4.

Comment by mic (michael-chen) on OpenAI could help X-risk by wagering itself · 2023-05-08T06:44:15.714Z · LW · GW

To some extent, Bing Chat is already an example of this. During the announcement, Microsoft promised that Bing would be using an advanced technique to guard it against user attempts to have it say bad things; in reality, it was incredibly easy to prompt it into expressing intentions to harm users, at least during the early days of its release. This led to news headlines such as Bing's AI Is Threatening Users. That's No Laughing Matter.

Comment by mic (michael-chen) on My Assessment of the Chinese AI Safety Community · 2023-04-26T04:50:18.174Z · LW · GW

Relevant: China-related AI safety and governance paths - Career review (80000hours.org)

Comment by mic (michael-chen) on Open-source LLMs may prove Bostrom's vulnerable world hypothesis · 2023-04-18T06:58:46.058Z · LW · GW

Not sure why this post was downvoted. This concern seems quite reasonable to me once language models become more competent at executing plans.

Comment by mic (michael-chen) on The Kids are Not Okay · 2023-03-09T08:49:28.391Z · LW · GW

One additional factor that would lead to increasing rates of depression is the rise in sleep deprivation. Sleep deprivation leads to poor mental health and is also a result of increased device usage.

From https://www.ksjbam.com/2022/02/23/states-where-teens-dont-get-enough-sleep/:

Late in 2021, the U.S. Surgeon General released a new advisory on youth mental health, drawing attention to rising rates of depressive symptoms, suicidal ideation, and other mental health issues among young Americans. According to data cited in the advisory, up to one in five U.S. children aged 3 to 17 had a reported mental, emotional, developmental, or behavioral disorder. Many of these worrying conditions predated the COVID-19 pandemic, which worsened mental health for many young people by disrupting their routines, limiting their social interactions, and increasing stress about the health of loved ones.

These trends in youth mental health can be attributed in part to detrimental shifts in young people’s lifestyle over time, including increased academic stress, growing use of digital media, and worsening health habits. And one of the major potential culprits in the latter category is sleep.

According to the CDC, teenagers should sleep between 8–10 hours per 24 hour period. This level of sleep is associated with a number of better physical and mental health outcomes, including lower risk of obesity and fewer problems with attention and behavior. Despite this, less than a quarter of teens report sleeping at least eight hours per day—a number that has fallen significantly over the last decade.


https://www.prb.org/resources/more-sleep-could-improve-many-u-s-teenagers-mental-health/:

During that same period, teenagers’ nightly sleep dropped sharply: The share of high school students getting the recommended minimum of eight hours of sleep declined from nearly 31% in 2009 to around 22% in 2019.7

Research shows a strong connection between sleep and symptoms of depression. In a 2019 study, Widome and colleagues showed that about one in three students who slept less than six hours per night had a high number of depression symptoms compared with about one in 10 students who got adequate sleep.8 But inadequate sleep is one of many factors affecting teenagers’ mental health.

The rise in sleep-deprived teenagers is a long-term trend, reports Widome. “A lot in our society has changed in the last decade, including more time spent using screens—phones, games, computers—and marketing caffeine drinks to adolescents.” In her 2019 study, teenagers who had inadequate sleep tended to spend twice as much time on devices with screens than their peers and were more likely to use those devices after they went to bed.

As for school daze: I can easily tell a story for how academic stress has risen over the past decade. As college admissions become more selective, top high school students are trying to juggle increasing levels of schoolwork and extracurriculars in order to try to get into a top university. See also NYU Study Examines Top High School Students’ Stress and Coping Mechanisms.

Comment by mic (michael-chen) on Comments on OpenAI's "Planning for AGI and beyond" · 2023-03-04T23:58:26.393Z · LW · GW

If you wanted more substantive changes in response to your comments, I wonder if you could have asked if you could directly propose edits. It's much easier to incorporate changes into a draft when they have already been written out. When I have a draft on Google Docs, suggestions are substantially easier for me to action than comments, and perhaps the same is true for Sam Altman.

Comment by mic (michael-chen) on Bing Chat is blatantly, aggressively misaligned · 2023-02-17T09:01:15.662Z · LW · GW

I don't think it's right to say that Anthropic's "Discovering Language Model Behaviors with Model-Written Evaluations" paper shows that larger LLMs necessarily exhibit more power-seeking and self-preservation. It only showed that when language models that are larger or have more RLHF training are simulating an "Assistant" character they exhibit more of these behaviours.

More specifically, an "Assistant" character that is trained to be helpful but not necessarily harmless. Given that, as part of Sydney's defenses against adversarial prompting, Sydney is deliberately trained to be a bit aggressive towards people perceived as attempting a prompt injection, it's not too surprising that this behavior misgeneralizes in undesired contexts.

Comment by mic (michael-chen) on Bing Chat is blatantly, aggressively misaligned · 2023-02-17T08:57:14.322Z · LW · GW

Is there evidence that RLHF training improves robustness compared to regular fine-tuning? Is text-davinci-002, trained with supervised fine-tuning, significantly less robust to adversaries than text-davinci-003, trained with RLHF?

As far as I know, this is the first public case of a powerful LM augmented with live retrieval capabilities to a high-end fast-updating search engine crawling social media

Blenderbot 3 is a 175B parameter model released in August 2022 with the ability to do live web searches, although one might not consider it powerful, as it frequently gives confused responses to questions.

Comment by mic (michael-chen) on The Importance of AI Alignment, explained in 5 points · 2023-02-13T02:02:56.113Z · LW · GW

As an overly simplistic example, consider an overseer that attempts to train a cleaning robot by providing periodic feedback to the robot, based on how quickly the robot appears to clean a room; such a robot might learn that it can more quickly “clean” the room by instead sweeping messes under a rug.[15]

This doesn't seem concerning as human users would eventually discover that the robot has a tendency to sweep messes under the rug, if they ever look under the rug, and the developers would retrain the AI to resolve this issue. Can you think of an example that would be more problematic, in which the misbehavior wouldn't be obvious enough to just be trained away?

Comment by mic (michael-chen) on The Importance of AI Alignment, explained in 5 points · 2023-02-13T01:35:25.089Z · LW · GW
  • GPT-3, for instance, is notorious for outputting text that is impressive, but not of the desired “flavor” (e.g., outputting silly text when serious text is desired), and researchers often have to tinker with inputs considerably to yield desirable outputs.

Is this specifically referring to the base version of GPT-3 before instruction fine-tuning (davinci rather than text-davinci-002, for example)? I think it would be good to clarify that.

Comment by mic (michael-chen) on SolidGoldMagikarp (plus, prompt generation) · 2023-02-05T18:43:29.530Z · LW · GW

Have you tried feature visualization to identify what inputs maximally activate a given neuron or layer?

Comment by mic (michael-chen) on Focus on the places where you feel shocked everyone's dropping the ball · 2023-02-02T18:01:17.073Z · LW · GW

@devansh 

Comment by mic (michael-chen) on Why I hate the "accident vs. misuse" AI x-risk dichotomy (quick thoughts on "structural risk") · 2023-01-30T23:21:53.727Z · LW · GW

I first learned about the term "structural risk" in this article from 2019 by Remco Zwetsloot and Allan Dafoe, which was included in the AGI Safety Fundamentals curriculum.

To make sure these more complex and indirect effects of technology are not neglected, discussions of AI risk should complement the misuse and accident perspectives with a structural perspective. This perspective considers not only how a technological system may be misused or behave in unintended ways, but also how technology shapes the broader environment in ways that could be disruptive or harmful. For example, does it create overlap between defensive and offensive actions, thereby making it more difficult to distinguish aggressive actors from defensive ones? Does it produce dual-use capabilities that could easily diffuse? Does it lead to greater uncertainty or misunderstanding? Does it open up new trade-offs between private gain and public harm, or between the safety and performance of a system? Does it make competition appear to be more of a winner-take-all situation? We call this perspective “structural” because it focuses on what social scientists often refer to as “structure,” in contrast to the “agency” focus of the other perspectives.

Comment by mic (michael-chen) on Thoughts on the impact of RLHF research · 2023-01-25T22:27:52.328Z · LW · GW

Models that have been RLHF'd (so to speak), have different world priors in ways that aren't really all that intuitive (see Janus' work on mode collapse

Janus' post on mode collapse is about text-davinci-002, which was trained using supervised fine-tuning on high-quality human-written examples (FeedME), not RLHF. It's evidence that supervised fine-tuning can lead to weird output, not evidence about what RLHF does.

I haven't seen evidence that RLHF'd text-davinci-003 appears less safe compared to the imitation-based text-davinci-002.

Comment by mic (michael-chen) on Consider Trying Dictation · 2023-01-23T06:29:41.227Z · LW · GW

What dictation tools are using the most advanced AI? I imagine that with newer models like Whisper, we're able to get higher accuracy than what the Android keyboard provides.

Comment by mic (michael-chen) on Automating Auditing: An ambitious concrete technical research proposal · 2023-01-21T08:55:42.472Z · LW · GW

Is the auditing game essentially Trojan detection?

Comment by mic (michael-chen) on Automating Auditing: An ambitious concrete technical research proposal · 2023-01-21T06:00:21.999Z · LW · GW

Has anyone tried to work on this experimentally?

Comment by mic (michael-chen) on Mysteries of mode collapse · 2023-01-20T13:28:35.233Z · LW · GW

The prompt "Are birds real?" is somewhat more likely, given the "Birds aren't real" conspiracy theory, but still can yield a similarly formatted answer to "Are bugs real?"

The answer makes a lot more sense when you ask a question like "Are monsters real?" or "Are ghosts real?" It seems that with FeedME, text-davinci-002 has been trained to respond with a template answer about how "There is no one answer to this question", and it has learned to misgeneralize this behavior to questions about real phenomena, such as "Are bugs real?"

Comment by mic (michael-chen) on AGI safety field building projects I’d like to see · 2023-01-20T09:00:33.397Z · LW · GW

Do workshops/outreach at good universities in EA-neglected and low/middle income countries

Could you list some specific universities that you have in mind (for example, in Morocco, Tunisia, and Algeria)?

Comment by mic (michael-chen) on The Alignment Problem from a Deep Learning Perspective (major rewrite) · 2023-01-17T00:48:45.036Z · LW · GW

Some thoughts:

The assumption that AGI is a likely development within the coming decades is quite controversial among ML researchers. ICML reviewers might wonder how this claim is justified and how much of the paper is relevant if you're more dubious about the development of AGI.

The definition of situational awareness feels quite vague to me. The definition ("identifying which abstract knowledge is relevant to the context in which they're being run, and applying that knowledge when choosing actions") seems to encompass, for example, the ability to ingest information such as "pawns can attack diagonally" and apply that to playing a game of chess. Ajeya's explanation of situational awareness feels much clearer to me.

Shah et al. [2022] speculate that InstructGPT's competent responses to questions its developers didn't intend it to answer (such as questions about how to commit crimes) was a result of goal misgeneralization.

Taking another look at Shah et al., this doesn't seem like a strong example to me.

Secondly, there are reasons to expect that policies with broadly-scoped misaligned goals will constitute a stable attractor which consistently receives high reward, even when policies with narrowly-scoped versions of these goals receive low reward (and even if the goals only arose by chance). We explore these reasons in the next section.

This claim felt confusing to me, and it wasn't immediately clear to me how the following section, "Power-seeking behavior", supported this claim. But I guess if you have a misaligned goal of maximizing paperclips over the next hour vs maximizing paperclips over the very long term, I see how the narrowly-scoped goal would receive low reward as the AI soon gets caught, while the broadly-scoped goal would receive high reward.

Assisted decision-making: AGIs deployed as personal assistants could emotionally manipulate human users, provide biased information to them, and be delegated responsibility for increasingly important tasks and decisions (including the design and implementation of more advanced AGIs), until they're effectively in control of large corporations or other influential organizations. An early example of AI persuasive capabilities comes from the many users who feel romantic attachments towards chatbots like Replika [Wilkinson, 2022].

I don't think Replika is a good example of "persuasive abilities" – it doesn't really persuade users to do much of anything.

Regardless of how it happens, though, misaligned AGIs gaining control over these key levers of power would be an existential threat to humanity

The section "Misaligned AGIs could gain control of key levers of power" feels underdeveloped. I think it might be helpful to include additional examples, such as ones from What could an AI-caused existential catastrophe actually look like? - 80,000 Hours.

Comment by mic (michael-chen) on The Alignment Problem from a Deep Learning Perspective (major rewrite) · 2023-01-17T00:41:25.026Z · LW · GW

Choosing actions which exploit known biases and blind spots in humans (as the Cicero Diplomacy agent may be doing [Bakhtin et al., 2022]) or in learned reward models. 

I've spent several hours reading dialogue involving Cicero, and it's not at all evident to me that it's "exploiting known biases and blind spots in humans". It is, however, good at proposing and negotiating plans, as well as accumulating power within the context of the game.

Comment by mic (michael-chen) on A Barebones Guide to Mechanistic Interpretability Prerequisites · 2022-12-30T00:17:44.600Z · LW · GW

Thanks for writing this! Here is a quick explanation of all the math concepts – mostly written by ChatGPT with some manual edits.

A basis for a vector space is a set of linearly independent vectors that can be used to represent any vector in the space as a linear combination of those basis vectors. For example, in two-dimensional Euclidean space, the standard basis is the set of vectors (1, 0) and (0, 1), which are called the "basis vectors."

A change of basis is the process of expressing a vector in one basis in terms of another basis. For example, if we have a vector v in two-dimensional Euclidean space and we want to express it in terms of the standard basis, we can write v as a linear combination of (1, 0) and (0, 1). Alternatively, we could choose a different basis for the space, such as the basis formed by the vectors (4, 2) and (3, 5). In this case, we would express v in terms of this new basis by writing it as a linear combination of (4, 2) and (3, 5).
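
As a quick numerical illustration, here is a minimal numpy sketch using the (4, 2) and (3, 5) basis from above; the vector v = (7, 7) is made up for the example:

  import numpy as np

  # Express v = (7, 7) in the basis {(4, 2), (3, 5)}.
  B = np.array([[4.0, 3.0],
                [2.0, 5.0]])   # basis vectors as the columns of B
  v = np.array([7.0, 7.0])

  # Solve B @ c = v for c, the coordinates of v relative to the new basis.
  c = np.linalg.solve(B, v)
  print(c)        # [1. 1.]  since v = 1*(4, 2) + 1*(3, 5)
  print(B @ c)    # [7. 7.]  reconstructs v in the standard basis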

A vector space is a set of vectors that can be added together and multiplied ("scaled") by numbers, called scalars. Scalars are often taken to be real numbers, but there are also vector spaces with scalar multiplication by complex numbers, rational numbers, or generally any field. The operations of vector addition and scalar multiplication must satisfy certain requirements, called axioms. Examples of vector spaces include the set of all two-dimensional vectors (i.e., the set of all points in two-dimensional Euclidean space), the set of all polynomials with real coefficients, and the set of all continuous functions from a given set to the real numbers. A vector space can be thought of as a geometric object, but it does not necessarily have a canonical basis, meaning that there is not a preferred set of basis vectors that can be used to represent all the vectors in the space.

A matrix is a rectangular array of numbers, symbols, or expressions, arranged in rows and columns. A matrix is a linear map between two vector spaces, or from a vector space to itself, because it can take any vector in the original vector space and transform it into a new vector in the target vector space using a set of linear equations. Each column of the matrix is the image of one of the original basis vectors, and these images define the transformation. In the expression Mv, we take each element of the original vector, multiply it by the corresponding column of the matrix, and add these scaled columns together to create the new vector.
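
A small numpy check of the "weighted sum of columns" description (the example values are arbitrary):

  import numpy as np

  M = np.array([[1.0, 2.0],
                [3.0, 4.0]])
  v = np.array([5.0, 6.0])

  # M @ v equals each column of M scaled by the matching entry of v, then summed.
  print(M @ v)                            # [17. 39.]
  print(v[0] * M[:, 0] + v[1] * M[:, 1])  # [17. 39.]  same result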

The singular value decomposition (SVD) is a factorization of a matrix M into the product of three matrices, M = USVᵀ, where U and V are orthogonal matrices and S is a diagonal matrix with non-negative real numbers on the diagonal, called the "singular values" of M. The SVD is a useful tool for understanding the properties of a matrix and for solving certain types of linear systems. It can also be used for data compression, image processing, and other applications.
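
Here is a minimal numpy sketch of computing an SVD and checking the reconstruction (the example matrix is arbitrary):

  import numpy as np

  M = np.array([[3.0, 1.0],
                [1.0, 3.0],
                [0.0, 2.0]])

  # numpy returns the singular values as a vector and V already transposed (Vt).
  U, S, Vt = np.linalg.svd(M, full_matrices=False)
  print(S)                                     # singular values, largest first
  print(np.allclose(U @ np.diag(S) @ Vt, M))   # True: M = U S Vt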

An orthogonal matrix (or orthonormal matrix) is a square matrix whose columns and rows are mutually orthonormal (i.e., they are orthogonal and have unit length). Orthogonal matrices have the property that their inverse is equal to their transpose.
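A quick numerical check of the inverse-equals-transpose property, using a 2D rotation matrix as the example orthogonal matrix:

  import numpy as np

  theta = 0.7
  Q = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])   # rotation matrices are orthogonal

  print(np.allclose(Q.T @ Q, np.eye(2)))       # True: columns are orthonormal
  print(np.allclose(np.linalg.inv(Q), Q.T))    # True: inverse equals transpose
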

Changing to an orthonormal basis can be importantly different from just any change of basis because it has certain computational advantages. For example, when working with an orthonormal basis, the inner product of two vectors can be computed simply as the sum of the products of their corresponding components, without the need to use any weights or scaling factors. This can make certain calculations, such as finding the length of a vector or the angle between two vectors, simpler and more efficient.

Eigenvalues and eigenvectors are special types of scalars and vectors that are associated with a linear map or a matrix. If M is a linear map or matrix and v is a non-zero vector, then v is an eigenvector of M if there exists a scalar λ, called an eigenvalue, such that Mv = λv. In other words, when a vector is multiplied by the matrix M, the resulting vector is a scalar multiple of the original vector. Eigenvalues and eigenvectors are important because they provide insight into the properties of the linear map or matrix. For example, the eigenvalues of a matrix can tell us whether it is singular (i.e., not invertible) or whether it is diagonalizable (i.e., can be expressed in the form M = PDP⁻¹, where P is an invertible matrix and D is a diagonal matrix). The eigenvectors of a matrix can also be used to determine its rank, nullity, and other characteristics.
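
A minimal numpy sketch checking the defining property Mv = λv for an arbitrary example matrix:

  import numpy as np

  M = np.array([[2.0, 1.0],
                [1.0, 2.0]])

  # Each column of `vectors` is an eigenvector; `values` holds the eigenvalues.
  values, vectors = np.linalg.eig(M)
  for lam, v in zip(values, vectors.T):
      print(np.allclose(M @ v, lam * v))   # True for every eigenvalue/eigenvector pair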

Probability basics: Probability is a measure of the likelihood of an event occurring. It is typically represented as a number between 0 and 1, where 0 indicates that the event is impossible and 1 indicates that the event is certain to occur. The probability of an event occurring can be calculated by counting the number of ways in which the event can occur, divided by the total number of possible outcomes.

Basics of distributions: A distribution is a function that describes the probability of a random variable taking on different values. The expected value of a distribution is a measure of the center of the distribution, and it is calculated as the weighted average of the possible values of the random variable, where the weights are the probabilities of each value occurring. The standard deviation is a measure of the dispersion of the distribution, and it is calculated as the square root of the variance, which is the expected value of the squared deviation of a random variable from its mean. A normal distribution (or Gaussian distribution) is a continuous probability distribution with a bell-shaped curve, which is defined by its mean and standard deviation.
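
A tiny worked example of expected value and standard deviation for a simple discrete distribution (a fair six-sided die), as a numpy sketch:

  import numpy as np

  values = np.array([1, 2, 3, 4, 5, 6])
  probs = np.full(6, 1 / 6)    # fair die: each outcome has probability 1/6

  mean = np.sum(probs * values)                        # expected value: 3.5
  std = np.sqrt(np.sum(probs * (values - mean) ** 2))  # standard deviation: ~1.71
  print(mean, std)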

Log likelihood: The log likelihood of a statistical model is a measure of how well the model fits a given set of data. It is calculated as the logarithm of the probability of the data given the model, and it is often used to compare the relative fit of different models.
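
A minimal sketch of a log likelihood under a normal model, using scipy.stats for the normal density (the data values are made up):

  import numpy as np
  from scipy.stats import norm

  data = np.array([1.9, 2.4, 2.1, 1.7, 2.6])

  # Log likelihood = log probability of the data given the model, i.e. the sum
  # of the log densities of each observation. Higher means a better fit.
  def log_likelihood(mu, sigma):
      return np.sum(norm.logpdf(data, loc=mu, scale=sigma))

  print(log_likelihood(mu=2.1, sigma=0.4))   # model close to the data
  print(log_likelihood(mu=0.0, sigma=0.4))   # much lower: poor fit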

Maximum value estimators: A maximum value estimator is a statistical method that is used to estimate the value of a parameter that maximizes a given objective function. Examples of maximum value estimators include the maximum likelihood estimator and the maximum a posteriori estimator.

  • The maximum likelihood estimator is a method for estimating the parameters of a statistical model based on the principle that the parameters that maximize the likelihood of the data are the most likely to have generated the data.
  • The maximum a posteriori (MAP) estimator is a method for estimating the parameters of a statistical model based on the principle that the parameters that maximize the posterior probability of the data are the most likely to have generated the data. The posterior probability is the probability of the data given the model and the prior knowledge about the parameters. The MAP estimator is often used in Bayesian inference, and it is a popular method for estimating the parameters of a model in the presence of prior knowledge.

Random variables: A random variable is a variable whose value is determined by the outcome of a random event. For example, the toss of a coin is a random event, and the number of heads that result from a series of coin tosses is a random variable.

Central limit theorem: The central limit theorem is a statistical theorem that states that, as the sample size of a random variable increases, the distribution of the sample means approaches a normal distribution, regardless of the distribution of the underlying random variable.
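
A minimal simulation of the central limit theorem: even though individual draws come from a (non-normal) uniform distribution, the sample means cluster in an approximately normal way around the true mean:

  import numpy as np

  rng = np.random.default_rng(0)

  # 10,000 samples, each of size 50, from Uniform(0, 1).
  sample_means = rng.uniform(0, 1, size=(10_000, 50)).mean(axis=1)

  # Uniform(0, 1) has mean 0.5 and standard deviation sqrt(1/12) ~= 0.289, so the
  # sample means should have mean ~0.5 and standard deviation ~0.289 / sqrt(50).
  print(sample_means.mean(), sample_means.std())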

Calculus basics: Calculus is a branch of mathematics that deals with the study of rates of change and the accumulation of quantities. It is a fundamental tool in the study of functions and is used to model and solve problems in a variety of fields, including physics, engineering, and economics.

Gradients: In calculus, the gradient of a (scalar-valued multivariate differentiable) function is a vector that describes the direction in which the function is increasing most quickly. It is calculated as the partial derivative of the function with respect to each variable.

The chain rule: The chain rule is a fundamental rule of calculus that allows us to calculate the derivative of a composite function. It states that if f is a function of g, and g is a function of x, then the derivative of f with respect to x is equal to the derivative of f with respect to g times the derivative of g with respect to x. In other words, (df / dx) = (df / dg) * (dg / dx).
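
A quick numerical sanity check of the chain rule, comparing the analytic derivative of a composite function with a finite-difference estimate (the particular functions are arbitrary):

  import numpy as np

  # f(g) = sin(g) and g(x) = x**2, so by the chain rule d/dx f(g(x)) = cos(x**2) * 2x.
  def composite(x):
      return np.sin(x ** 2)

  x = 1.3
  analytic = np.cos(x ** 2) * 2 * x
  h = 1e-6
  numeric = (composite(x + h) - composite(x - h)) / (2 * h)
  print(analytic, numeric)   # the two values agree to several decimal places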

On backpropagation:

Backpropagation is an algorithm for training artificial neural networks, which are machine learning models inspired by the structure and function of the brain. It is used to adjust the weights and biases of the network in order to minimize the error between the predicted output and the desired output of the network.

The idea behind backpropagation is that, given a multivariate function that describes the relationships between the input variables and the output variables of a neural network, we can use the chain rule to calculate the gradient of the function with respect to the weights and biases of the network. The gradient tells us how the error changes as we adjust the weights and biases, and we can use this information to update the weights and biases in a way that reduces the error.

To understand why backpropagation is just the chain rule on multivariate functions, it's helpful to consider the structure of a neural network. A neural network consists of layers of interconnected nodes, each of which performs a calculation based on the inputs it receives from the previous layer. The output of the network is a function of the inputs, and the weights and biases of the network determine how the inputs are transformed as they pass through the layers of the network.

The process of backpropagation involves starting at the output layer of the network and working backwards through the layers, using the chain rule to calculate the gradients of the weights and biases at each layer. This is done by calculating the derivative of the error with respect to the output of each layer, and then using the chain rule to propagate these derivatives back through the layers of the network. This allows us to calculate the gradients of the weights and biases at each layer, which we can use to update the weights and biases in a way that minimizes the error.

Overall, backpropagation is an efficient and effective way to train neural networks because it allows us to calculate the gradients of the weights and biases efficiently, using the chain rule to propagate the derivatives through the layers of the network. This enables us to adjust the weights and biases in a way that minimizes the error, which is essential for the effective operation of the network.
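
To make the "chain rule on multivariate functions" point concrete, here is a minimal numpy sketch of one gradient step for a one-layer linear network with a squared-error loss (purely illustrative, with made-up data; not tied to any particular framework):

  import numpy as np

  rng = np.random.default_rng(0)
  x = rng.normal(size=(8, 3))        # 8 examples, 3 input features
  y = rng.normal(size=(8, 1))        # targets

  W = rng.normal(size=(3, 1)) * 0.1  # weights
  b = np.zeros(1)                    # bias

  # Forward pass: prediction and mean squared error.
  pred = x @ W + b
  loss = np.mean((pred - y) ** 2)

  # Backward pass: apply the chain rule, working backwards from the loss.
  dloss_dpred = 2 * (pred - y) / len(x)  # d loss / d pred
  dloss_dW = x.T @ dloss_dpred           # d loss / d W  (since d pred / d W = x)
  dloss_db = dloss_dpred.sum(axis=0)     # d loss / d b  (since d pred / d b = 1)

  # Gradient descent update: nudge the parameters against the gradient.
  lr = 0.1
  W -= lr * dloss_dW
  b -= lr * dloss_db

A deep network repeats the backward step once per layer, reusing the gradient flowing in from the layer above, which is what makes the whole procedure just a systematic application of the chain rule.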