Posts

Akash's Shortform 2024-04-18T15:44:25.096Z
Cooperating with aliens and AGIs: An ECL explainer 2024-02-24T22:58:47.345Z
OpenAI's Preparedness Framework: Praise & Recommendations 2024-01-02T16:20:04.249Z
Speaking to Congressional staffers about AI risk 2023-12-04T23:08:52.055Z
Navigating emotions in an uncertain & confusing world 2023-11-20T18:16:09.492Z
International treaty for global compute caps 2023-11-09T18:17:04.952Z
Chinese scientists acknowledge xrisk & call for international regulatory body [Linkpost] 2023-11-01T13:28:43.723Z
Winners of AI Alignment Awards Research Contest 2023-07-13T16:14:38.243Z
AI Safety Newsletter #8: Rogue AIs, how to screen for AI risks, and grants for research on democratic governance of AI 2023-05-30T11:52:31.669Z
AI Safety Newsletter #7: Disinformation, Governance Recommendations for AI labs, and Senate Hearings on AI 2023-05-23T21:47:34.755Z
Eisenhower's Atoms for Peace Speech 2023-05-17T16:10:38.852Z
AI Safety Newsletter #6: Examples of AI safety progress, Yoshua Bengio proposes a ban on AI agents, and lessons from nuclear arms control 2023-05-16T15:14:45.921Z
AI Safety Newsletter #5: Geoffrey Hinton speaks out on AI risk, the White House meets with AI labs, and Trojan attacks on language models 2023-05-09T15:26:55.978Z
AI Safety Newsletter #4: AI and Cybersecurity, Persuasive AIs, Weaponization, and Geoffrey Hinton talks AI risks 2023-05-02T18:41:43.144Z
Discussion about AI Safety funding (FB transcript) 2023-04-30T19:05:34.009Z
Reframing the burden of proof: Companies should prove that models are safe (rather than expecting auditors to prove that models are dangerous) 2023-04-25T18:49:29.042Z
DeepMind and Google Brain are merging [Linkpost] 2023-04-20T18:47:23.016Z
AI Safety Newsletter #2: ChaosGPT, Natural Selection, and AI Safety in the Media 2023-04-18T18:44:35.923Z
Request to AGI organizations: Share your views on pausing AI progress 2023-04-11T17:30:46.707Z
AI Safety Newsletter #1 [CAIS Linkpost] 2023-04-10T20:18:57.485Z
Reliability, Security, and AI risk: Notes from infosec textbook chapter 1 2023-04-07T15:47:16.581Z
New survey: 46% of Americans are concerned about extinction from AI; 69% support a six-month pause in AI development 2023-04-05T01:26:51.830Z
[Linkpost] Critiques of Redwood Research 2023-03-31T20:00:09.784Z
What would a compute monitoring plan look like? [Linkpost] 2023-03-26T19:33:46.896Z
The Overton Window widens: Examples of AI risk in the media 2023-03-23T17:10:14.616Z
The Wizard of Oz Problem: How incentives and narratives can skew our perception of AI developments 2023-03-20T20:44:29.445Z
[Linkpost] Scott Alexander reacts to OpenAI's latest post 2023-03-11T22:24:39.394Z
Questions about Conjecture's CoEm proposal 2023-03-09T19:32:50.600Z
AI Governance & Strategy: Priorities, talent gaps, & opportunities 2023-03-03T18:09:26.659Z
Fighting without hope 2023-03-01T18:15:05.188Z
Qualities that alignment mentors value in junior researchers 2023-02-14T23:27:40.747Z
4 ways to think about democratizing AI [GovAI Linkpost] 2023-02-13T18:06:41.208Z
How evals might (or might not) prevent catastrophic risks from AI 2023-02-07T20:16:08.253Z
[Linkpost] Google invested $300M in Anthropic in late 2022 2023-02-03T19:13:32.112Z
Many AI governance proposals have a tradeoff between usefulness and feasibility 2023-02-03T18:49:44.431Z
Talk to me about your summer/career plans 2023-01-31T18:29:23.351Z
Advice I found helpful in 2022 2023-01-28T19:48:23.160Z
11 heuristics for choosing (alignment) research projects 2023-01-27T00:36:08.742Z
"Status" can be corrosive; here's how I handle it 2023-01-24T01:25:04.539Z
[Linkpost] TIME article: DeepMind’s CEO Helped Take AI Mainstream. Now He’s Urging Caution 2023-01-21T16:51:09.586Z
Wentworth and Larsen on buying time 2023-01-09T21:31:24.911Z
[Linkpost] Jan Leike on three kinds of alignment taxes 2023-01-06T23:57:34.788Z
My thoughts on OpenAI's alignment plan 2022-12-30T19:33:15.019Z
An overview of some promising work by junior alignment researchers 2022-12-26T17:23:58.991Z
Podcast: Tamera Lanham on AI risk, threat models, alignment proposals, externalized reasoning oversight, and working at Anthropic 2022-12-20T21:39:41.866Z
12 career-related questions that may (or may not) be helpful for people interested in alignment research 2022-12-12T22:36:21.936Z
Podcast: Shoshannah Tekofsky on skilling up in AI safety, visiting Berkeley, and developing novel research ideas 2022-11-25T20:47:09.832Z
Announcing AI Alignment Awards: $100k research contests about goal misgeneralization & corrigibility 2022-11-22T22:19:09.419Z
Ways to buy time 2022-11-12T19:31:10.411Z
Instead of technical research, more people should focus on buying time 2022-11-05T20:43:45.215Z

Comments

Comment by Akash (akash-wasil) on Express interest in an "FHI of the West" · 2024-04-19T00:46:02.968Z · LW · GW

To what extent would the organization be factoring in transformative AI timelines? It seems to me like the kinds of questions one would prioritize in a "normal period" look very different than the kinds of questions that one would prioritize if they place non-trivial probability on "AI may kill everyone in <10 years" or "AI may become better than humans on nearly all cognitive tasks in <10 years."

I ask partly because I personally would be more excited about a version of this that wasn't ignoring AGI timelines, but I think a version of this that's not ignoring AGI timelines would probably be quite different from the intellectual spirit/tradition of FHI.

More generally, perhaps it would be good for you to describe some ways in which you expect this to be different from FHI. I think calling it the FHI of the West, the explicit statement that it would have the intellectual tradition of FHI, and the announcement right when FHI dissolves might make it seem like "I want to copy FHI" as opposed to "OK obviously I don't want to copy it entirely I just want to draw on some of its excellent intellectual/cultural components." If your vision is the latter, I'd find it helpful to see a list of things that you expect to be similar/different.

Comment by Akash (akash-wasil) on peterbarnett's Shortform · 2024-04-19T00:25:23.573Z · LW · GW

I would strongly suggest considering hires who would be based in DC (or who would hop between DC and Berkeley). In my experience, being in DC (or being familiar with DC & having a network in DC) is extremely valuable for being able to shape policy discussions, know what kinds of research questions matter, know what kinds of things policymakers are paying attention to, etc.

I would go as far as to say something like "in 6 months, if MIRI's technical governance team has not achieved very much, one of my top 3 reasons for why MIRI failed would be that they did not engage enough with DC people//US policy people. As a result, they focused too much on questions that Bay Area people are interested in and too little on questions that Congressional offices and executive branch agencies are interested in. And relatedly, they didn't get enough feedback from DC people. And relatedly, even the good ideas they had didn't get communicated frequently enough or fast enough to relevant policymakers. And relatedly... etc etc."

I do understand this trades off against everyone being in the same place, which is a significant factor, but I think the cost is worth it. 

Comment by Akash (akash-wasil) on Akash's Shortform · 2024-04-19T00:18:28.934Z · LW · GW

I do think evaporative cooling is a concern, especially if everyone (or a very significant fraction of people) left. But I think on the margin more people should be leaving to work in govt. 

I also suspect that a lot of systemic incentives will keep a greater-than-optimal proportion of safety-conscious people at labs as opposed to governments (labs pay more, labs are faster and have less bureaucracy, lab people are much more informed about AI, labs are more "cool/fun/fast-paced", lots of govt jobs force you to move locations, etc.)

I also think it depends on the specific lab– EG in light of the recent OpenAI departures, I suspect there's a stronger case for staying at OpenAI right now than for DeepMind or Anthropic. 

Comment by Akash (akash-wasil) on AI #60: Oh the Humanity · 2024-04-18T15:48:33.702Z · LW · GW

Daniel Kokotajlo has quit OpenAI

I think now is a good time for people at labs to seriously consider quitting & getting involved in government/policy efforts.

I don't think everyone should leave labs (obviously). But I would probably hit a button that does something like "everyone on a lab governance team and many technical researchers spend at least 2 hours thinking/writing about alternative options they have & very seriously consider leaving."

My impression is that lab governance is much less tractable (lab folks have already thought a lot more about AGI) and less promising (competitive pressures are dominating) than government-focused work. 

I think governments still remain unsure about what to do, and there's a lot of potential for folks like Daniel K to have a meaningful role in shaping policy, helping natsec folks understand specific threat models, and raising awareness about the specific kinds of things governments need to do in order to mitigate risks.

There may be specific opportunities at labs that are very high-impact, but I think if someone at a lab is "not really sure if what they're doing is making a big difference", I would probably hit a button that allocates them toward government work or government-focused comms work.

Comment by Akash (akash-wasil) on Akash's Shortform · 2024-04-18T15:44:25.830Z · LW · GW

I think now is a good time for people at labs to seriously consider quitting & getting involved in government/policy efforts.

I don't think everyone should leave labs (obviously). But I would probably hit a button that does something like "everyone on a lab governance team and many technical researchers spend at least 2 hours thinking/writing about alternative options they have & very seriously consider leaving."

My impression is that lab governance is much less tractable (lab folks have already thought a lot more about AGI) and less promising (competitive pressures are dominating) than government-focused work. 

I think governments still remain unsure about what to do, and there's a lot of potential for folks like Daniel K to have a meaningful role in shaping policy, helping natsec folks understand specific threat models, and raising awareness about the specific kinds of things governments need to do in order to mitigate risks.

There may be specific opportunities at labs that are very high-impact, but I think if someone at a lab is "not really sure if what they're doing is making a big difference", I would probably hit a button that allocates them toward government work or government-focused comms work.

Written on a Slack channel in response to discussions about some folks leaving OpenAI. 

Comment by Akash (akash-wasil) on Paul Christiano named as US AI Safety Institute Head of AI Safety · 2024-04-17T14:27:02.108Z · LW · GW

I think this should be broken down into two questions:

  1. Before the EO, if we were asked to figure out where this kind of evals work should happen, what institution would we pick & why?
  2. After the EO, where does it make sense for evals-focused people to work?

I think the answer to #1 is quite unclear. I personally think that there was a strong case that a natsec-focused USAISI could have been given to DHS or DoE or some interagency thing. In addition to the point about technical expertise, it does seem relatively rare for Commerce/NIST to take on something that is so natsec-focused. 

But I think the answer to #2 is pretty clear. The EO clearly tasks NIST with this role, and now I think our collective goal should be to try to make sure NIST can execute as effectively as possible. Perhaps there will be future opportunities to establish new places for evals work, alignment work, risk monitoring and forecasting work, emergency preparedness planning, etc etc. But for now, whether we think it was the best choice or not, NIST/USAISI are clearly the folks who are tasked with taking the lead on evals + standards.

Comment by Akash (akash-wasil) on Paul Christiano named as US AI Safety Institute Head of AI Safety · 2024-04-16T20:48:23.535Z · LW · GW

I'm excited to see how Paul performs in the new role. He's obviously very qualified on a technical level, and I suspect he's one of the best people for the job of designing and conducting evals.

I'm more uncertain about the kind of influence he'll have on various AI policy and AI national security discussions. And I mean uncertain in the genuine "this could go so many different ways" kind of way. 

Like, it wouldn't be particularly surprising to me if any of the following occurred:

  • Paul focuses nearly all of his efforts on technical evals and doesn't get very involved in broader policy conversations
  • Paul is regularly asked to contribute to broader policy discussions, and he advocates for RSPs and other forms of voluntary commitments.
  • Paul is regularly asked to contribute to broader policy discussions, and he advocates for requirements that go beyond voluntary commitments and are much more ambitious than what he advocated for when he was at ARC.
  • Paul is regularly asked to contribute to broader policy discussions, and he's not very good at communicating his beliefs in ways that are clear/concise/policymaker-friendly, so his influence on policy discussions is rather limited.
  • Paul [is/isn't] able to work well with others who have very different worldviews and priorities.

Personally, I see this as a very exciting opportunity for Paul to form an identity as a leader in AI policy. I'm guessing the technical work will be his priority (and indeed, it's what he's being explicitly hired to do), but I hope he also finds ways to just generally improve the US government's understanding of AI risk and the likelihood of implementing reasonable policies. On the flipside, I hope he doesn't settle for voluntary commitments (especially as the Overton Window shifts) & I hope he's clear/open about the limitations of RSPs.

More specifically, I hope he's able to help policymakers reason about a critical question: what do we do after we've identified models with (certain kinds of) dangerous capabilities? I think the underlying logic behind RSPs could actually be somewhat meaningfully applied to USG policy. Like, I think we would be in a safer world if the USG had an internal understanding of ASL levels, took seriously the possibility of various dangerous capabilities thresholds being crossed, took seriously the idea that AGI/ASI could be developed soon, and had preparedness plans in place that allowed them to react quickly in the event of a sudden risk. 

Anyways, a big congratulations to Paul, and definitely some evidence that the USAISI is capable of hiring some technical powerhouses. 

Comment by Akash (akash-wasil) on What convincing warning shot could help prevent extinction from AI? · 2024-04-16T20:22:41.405Z · LW · GW

I'll admit I have only been loosely following the control stuff, but FWIW I would be excited about a potential @peterbarnett & @ryan_greenblatt dialogue in which you two try to identify & analyze any potential disagreements. Example questions:

  • What is the most capable system that you think we are likely to be able to control?
  • What kind of value do you think we could get out of such a system?
  • To what extent do you expect that system to be able to produce insights that help us escape the acute risk period (i.e., get out of a scenario where someone else can come along and build a catastrophe-capable system without implementing control procedures, or someone else comes along and scales to the point where the control procedures are no longer sufficient)?

Comment by Akash (akash-wasil) on Anthropic AI made the right call · 2024-04-15T23:02:10.940Z · LW · GW

Here are three possible scenarios:

Scenario 1, Active Lying– Anthropic staff were actively spreading the idea that they would not push the frontier.

Scenario 2, Allowing misconceptions to go unchecked– Anthropic staff were aware that many folks in the AIS world thought that Anthropic had committed to not pushing the frontier, and they allowed this misconception to go unchecked, perhaps because they realized that it was a misconception that favored their commercial/competitive interests.

Scenario 3, Not being aware– Anthropic staff were not aware that many folks had this belief. Maybe they heard it once or twice but it never really seemed like a big deal.

Scenario 1 is clearly bad. Scenarios 2 and 3 are more interesting. To what extent does Anthropic have the responsibility to clarify misconceptions (avoid scenario 2) and even actively look for misconceptions (avoid scenario 3)?

I expect this could matter tangibly for discussions of RSPs. My opinion is that the Anthropic RSP is written in such a way that readers can come away with rather different expectations of what kinds of circumstances would cause Anthropic to pause/resume.

It wouldn't be very surprising to me if we end up seeing a situation where many readers say "hey look, we've reached an ASL-3 system, so now you're going to pause, right?" And then Anthropic says "no no, we have sufficient safeguards– we can keep going now." And then some readers say "wait a second– what? I'm pretty sure you committed to pausing until your safeguards were better than that." And then Anthropic says "no... we never said exactly what kinds of safeguards we would need, and our leadership's opinion is that our safeguards are sufficient, and the RSP allows leadership to determine when it's fine to proceed."

In this (hypothetical) scenario, Anthropic never lied, but it benefitted from giving off a more cautious impression, and it didn't take steps to correct this impression.

I think avoiding these kinds of scenarios requires some mix of:

  • Clear, specific, falsifiable statements from labs.
  • Some degree of proactive attempts to identify and alleviate misconceptions

One counterargument is something like "Anthropic is a company, and there are lots of things to do, and this is demanding an unusually high amount of attention-to-detail and proactive communication that is not typically expected of companies." To which my response is something like "yes, but I think it's reasonable to hold companies to such standards if they wish to develop AGI. I think we ought to hold Anthropic and other labs to this standard, especially insofar as they want the benefits associated with being perceived as the kind of safety-conscious lab that refuses to push the frontier or commits to scaling policies that include tangible/concrete plans to pause."

Comment by Akash (akash-wasil) on What convincing warning shot could help prevent extinction from AI? · 2024-04-15T21:32:10.409Z · LW · GW

I like these examples. One thing I'll note, however, is that I think the "warning shot discourse" on LW tends to focus on warning shots that would be convincing to a LW-style audience.

If the theory of change behind the warning shot requires LW-types (for example, folks at OpenAI/DeepMind/Anthropic who are relatively familiar with AGI xrisk arguments) to become concerned, this makes sense.

But usually, when I think about winning worlds that involve warning shots, I think about government involvement as the main force driving coordination, an end to race dynamics, etc.

[Caveating that my models of the USG and natsec community are still forming, so epistemic status is quite uncertain for the rest of this message, but I figure some degree of speculation could be helpful anyways].

I expect the kinds of warning shots that would be concerning to governments/national security folks will look quite different than the kinds of warning shots that would be convincing to technical experts.

LW-style warning shots tend to be more– for lack of a better term– rational. They tend to be rooted in actual threat models (e.g., we understand that if an AI can copy its weights, it can create tons of copies and avoid being easily turned off, and we also understand that its general capabilities are sufficiently strong that we may be close to an intelligence explosion or highly capable AGI).

In contrast, without this context, I don't think that "we caught an AI model copying its weights" would necessarily be a warning shot for USG/natsec folks. It could just come across as "oh something weird happened but then the company caught it and fixed it." Instead, I suspect the warning shots that are most relevant to natsec folks might be less "rational", and by that I mean "less rooted in actual AGI threat models but more rooted in intuitive things that seem scary."

Examples of warning shots that I expect USG/natsec people would be more concerned about:

  • An AI system can generate novel weapons of mass destruction
  • Someone uses an AI system to hack into critical infrastructure or develop new zero-days that impress folks in the intelligence community.
  • A sudden increase in the military capabilities of AI

These don't relate as much to misalignment risk or misalignment-focused xrisk threat models. As a result, a disadvantage of these warning shots is that it may be harder to make a convincing case for interventions that focus specifically on misalignment. However, I think they are the kinds of things that might involve a sudden increase in the attention that the USG places on AI/AGI, the amount of resources it invests into understanding national security threats from AI, and its willingness to take major actions to intervene in the current AI race.

As such, in addition to asking questions like "what is the kind of warning shot that would convince me and my friends that we have something dangerous", I think it's worth separately asking "what is the kind of warning shot that would convince natsec folks that something dangerous or important is occurring, regardless of whether or not it connects neatly to AGI risk threat models." 

My impression is that the policy-relevant warning shots will be the most important ones to be prepared for, and the community may (for cultural/social/psychological reasons) be focusing too little effort on trying to prepare for these kinds of "irrational" warning shots. 

Comment by Akash (akash-wasil) on Speaking to Congressional staffers about AI risk · 2024-02-28T21:28:08.435Z · LW · GW

If memory serves me well, I was informed by Hendrycks' overview of catastrophic risks. I don't think it's a perfect categorization, but I think it does a good job laying out some risks that feel "less speculative" (e.g., malicious use, race dynamics as a risk factor that could cause all sorts of threats) while including those that have been painted as "more speculative" (e.g., rogue AIs).

I've updated toward the importance of explaining & emphasizing risks from sudden improvements in AI capabilities, AIs that can automate AI research, and intelligence explosions. I also think there's more appetite for that now than there used to be. 

Comment by Akash (akash-wasil) on Daniel Kokotajlo's Shortform · 2024-02-23T14:22:31.341Z · LW · GW

What work do you think is most valuable on the margin (for those who agree with you on many of these points)?

Comment by Akash (akash-wasil) on The case for ensuring that powerful AIs are controlled · 2024-01-26T10:48:32.835Z · LW · GW

I'd be curious to hear more about your main disagreements.

Comment by Akash (akash-wasil) on The case for ensuring that powerful AIs are controlled · 2024-01-26T10:47:15.352Z · LW · GW

Good point. I think it's helpful when people working on schemes with the rough flavor of "we do X, and then X helps us get to a useful AI that does not take over" try to specify roughly how capable they expect that "useful AI that does not take over" to be.

Would be curious to hear more about the kinds of tasks that Ryan and Buck expect the first "transformatively useful but still controllable" AI will be able to do (perhaps expressed in multiplier terms, like 30X human SW engineer, perhaps expressed in terms of the kinds of novel reasoning it can perform, and perhaps expressed in terms of the kinds of research tasks or policy tasks it would be able to meaningfully help with).

Comment by Akash (akash-wasil) on The case for ensuring that powerful AIs are controlled · 2024-01-25T12:19:55.079Z · LW · GW

I'm excited to see how the AI control research direction evolves. 

After skimming the post, it seems like it makes a few different claims, some of which seem more intuitive than others. For example:

  • Claim 1: Labs should control powerful AI systems (very intuitive, even obvious)
  • Claim 2: It is (much?) easier to evaluate control than alignment (somewhat intuitive, particularly if one buys the argument that alignment evals are very hard because you always have the "what if the AI is just scheming and subverting the eval" concern.) 
  • Claim 3: We (Buck and Ryan) have identified some specific techniques that we believe are promising for control (not intuitive; technical arguments needed).
  • Claim 4: If we succeed in controlling "transformatively useful AI", then we may be able to stop the race toward superintelligence and get AI labs or governments to agree to not build superintelligence until it can be controlled (not intuitive; non-technical arguments needed).

I'd be excited to see more posts that specifically engage with the strongest counterpoints to claims #2-4. 

Some more on #2 & #4:

The basic problem with evaluating alignment is that no matter what behaviors you observe, you have to worry that your model is just acting that way in order to make you think that it is aligned.

I think those pessimistic about control evals could say something like "the basic problem with evaluating control is that no matter what techniques your red-team uses, you have to worry that your model is better at finding attacks than your red-team." Of course, you note in the post some reason why we should expect our red-team to have advantages over models, but also you recognize that this won't scale toward arbitrarily powerful AIs.

In some ways, this feels analogous to the following situation:

  • Alice from the Scaling Department: Look, our current "alignment" techniques are working for GPT-4. So let's scale up more! I get that at some point we won't be able to align the AI, but we're not there yet, so we should be able to keep going.
  • Bob from the Preparedness Team: Well, the problem is we don't know when our alignment techniques will stop working... like, it's quite plausible that the next AI system will be able to game our evals.
  • Alice from the Scaling Department: OK, well let me know when you have clear evidence that our alignment techniques or dangerous capability evals will fail.

Here's the analogy for control:

  • Alice from the Scaling Department: Look, our current control techniques are working for GPT-4. So let's scale up more! I get that at some point our control evals aren't going to be able to control arbitrarily powerful AI, but we're not there yet, so we should be able to keep going.
  • Bob from the Preparedness Team: Well, the problem is we don't know when our control evals will stop working... like, it's quite plausible that the next AI system will be better at coming up with attacks than our red-team.
  • Alice from the Scaling Department: OK, well let me know when you have clear evidence that our control evals will fail.

I'd be curious to hear more about how you're thinking about this (and apologies if some sections of the post already deal with this– feel free to quote them if I missed them in my initial skim). Specific questions:

  1. How do you react to the analogy above? Do you expect that control evals will do a better job than other types of evals at determining when we have reached the point where scaling further is (directly/imminently) dangerous? Or more simply, will it be clear when we need to stop?
  2. More broadly, what do you think we should do once we're starting to see AI systems that are powerful enough to subvert the control evals (or develop attacks that are better than those of the human red teamers)?
  3. Suppose control evals do allow us to stop at "transformatively useful systems that do not cause a catastrophe." Roughly what kinds of capabilities do you expect this system to have? (My vague MIRI sim says something like "either the system is too weak to get us out of the critical risk window or the system is so strong that it's capable of subverting your control evals").

Comment by Akash (akash-wasil) on TurnTrout's shortform feed · 2024-01-20T11:56:27.137Z · LW · GW

More broadly, TurnTrout, I've noticed you using this whole "look, if something positive happened, LW would totally rip on it! But if something is presented negatively, everyone loves it!" line of reasoning a few times (e.g., I think this logic came up in your comment about Evan's recent paper). And I see you taking on a sort of "the people with high P(doom) just have bad epistemics" flag in some of your comments.

A few thoughts (written quickly, prioritizing speed over precision):

  1. I think that epistemics are hard & there are surely several cases in which people are biased toward high P(doom). Examples: Yudkowsky was one of the first thinkers/writers about AI, some people might have emotional dispositions that lead them toward anxious/negative interpretations in general, some people find it "cool" to think they're one of the few people who are able to accurately identify the world is ending, etc.
  2. I also think that there are plenty of factors biasing epistemics in the "hopeful" direction. Examples: The AI labs have tons of money and status (& employ large fractions of the community's talent), some people might have emotional dispositions that lead them toward overly optimistic/rosy interpretations in general, some people might find it psychologically difficult to accept premises that lead them to think the world is ending, etc.
  3. My impression (which could be false) is that you seem to be exclusively or disproportionately critical of poor arguments when they come from the "high P(doom)" side. 
  4. I also think there's an important distinction between "I personally think this argument is wrong" and "look, here's an example of propaganda + poor community epistemics." In general, I suspect community epistemics are better when people tend to respond directly to object-level points and have a relatively high bar for saying "not only do I think you're wrong, but also here are some ways in which you and your allies have poor epistemics." (IDK though, insofar as you actually believe that's what's happening, it seems good to say it aloud, and I think there's a version of this that goes too far and polices speech counterproductively, but I do think that statements like "community epistemics have been compromised by groupthink and fear" are pretty unproductive and could be met with statements like "community epistemics have been compromised by powerful billion-dollar companies that have clear financial incentives to make people overly optimistic about the trajectory of AI progress.")
  5. I am quite worried about tribal dynamics reducing the ability for people to engage in productive truth-seeking discussions. I think you've pointed out how some of the stylistic/tonal things from the "high P(doom)//alignment hard" side have historically made discourse harder, and I agree with several of your critiques. More recently, though, I think the "low P(doom)//alignment not hard" side seems to be falling into similar traps (e.g., attacking strawmen of those they disagree with, engaging in some sort of "ha, the other side is not only wrong but also just dumb/unreasonable/epistemically corrupted" vibe that predictably makes people defensive & makes discourse harder).

Comment by Akash (akash-wasil) on TurnTrout's shortform feed · 2024-01-20T11:27:06.191Z · LW · GW

My impression is that the Shoggoth meme was meant to be a simple meme that says "hey, you might think that RLHF 'actually' makes models do what we value, but that's not true. You're still left with an alien creature who you don't understand and could be quite scary."

Most of the Shoggoth memes I've seen look more like this, where the disgusting/evil aspects are toned down. They depict an alien that kinda looks like an octopus. I do agree that the picture evokes some sort of "I should be scared/concerned" reaction. But I don't think it does so in a "see, AI will definitely be evil" way– it does so in a "look, RLHF just adds a smiley face to a foreign alien thing. And yeah, it's pretty reasonable to be scared about this foreign alien thing that we don't understand."

To be a bit bolder, I think the Shoggoth meme is reacting to the fact that RLHF gives off a misleading impression of how safe AI is. If I were to use provocative phrasing, I could say that RLHF serves as "propaganda". Let's put aside the fact that you and I might disagree about how much "true evidence" RLHF provides RE how easy alignment will be. It seems pretty clear to me that RLHF [and the subsequent deployment of RLHF'd models] spreads an overly-rosy "meme" that gives people a misleading perspective of how well we understand AI systems, how safe AI progress is, etc.

From this lens, I see Shoggoth as a counter-meme. It basically says "hey look, the default is for people to think that these things are friendly assistants, because that's what the AI companies have turned them into, but we should remember that actually we are quite confused about the alien cognition behind the RLHF smiley face."

Comment by Akash (akash-wasil) on The Plan - 2023 Version · 2024-01-02T14:38:52.253Z · LW · GW

If the timer starts to run out, then slap something together based on the best understanding we have. 18-24 months is about how long I expect it to take to slap something together based on the best understanding we have.

Can you say more about what you expect to be doing after you have slapped together your favorite plans/recommendations? I'm interested in getting a more concrete understanding of how you see your research (eventually) getting implemented.

Suppose after the 18-24 month process, you have 1-5 concrete suggestions that you want AGI developers to implement. Is the idea essentially that you would go to the superalignment team (and the equivalents at other labs) and say "hi, here's my argument for why you should do X?" What kinds of implementation-related problems, if any, do you see coming up?

I ask this partially because I think some people are kinda like "well, in order to do alignment research that ends up being relevant, I need to work at one of the big scaling labs in order to understand the frames/ontologies of people at the labs, the constraints/restrictions that would come up if trying to implement certain ideas, get better models of the cultures of labs to see what ideas will simply be dismissed immediately, identify cruxes, figure out who actually makes decisions about what kinds of alignment ideas will end up being used for GPT-N, etc etc."

My guess is that you would generally encourage people to not do this, because they generally won't have as much research freedom & therefore won't be able to work on core parts of the problem that you see as neglected. I suspect many would agree that there is some "I lose freedom" cost, but that this might be outweighed by the "I get better models of what kinds of research the labs are actually likely to implement" benefit, and I'm curious how you view this trade-off (or if you don't even see this as a legitimate trade-off). 

Comment by Akash (akash-wasil) on The Plan - 2023 Version · 2023-12-30T11:24:28.772Z · LW · GW

Does the "rater problem" (raters have systematic errors) simply apply to step one in this plan? I agree that once you have a perfect reward model, you no longer need human raters.

But it seems like the "rater problem" still applies if we're going to train the reward model using human feedback. Perhaps I'm too anchored to thinking about things in an RLHF context, but it seems like at some point in the process we need to have some way of saying "this is true" or "this chain-of-thought is deceptive" that involves human raters. 

Is the idea something like:

  • Eliezer: Human raters make systematic errors
  • OpenAI: Yes, but this is only a problem if we have human raters indefinitely provide feedback. If human raters are expected to provide feedback on 10,000,000 responses under time-pressure, then surely they will make systematic errors.
  • OpenAI: But suppose we could train a reward model on a modest number of responses and we didn't have time-pressure. For this dataset of, say, 10,000 responses, we are super careful, we get a ton of people to double-check that everything is accurate, and we are nearly certain that every single label is correct. If we train a reward model on this dataset, and we can get it to generalize properly, then we can get past the "humans make systematic errors" problem.

Or am I totally off//the idea is different than this//the "yet-to-be-worked-out-techniques" would involve getting the reward model to learn stuff without ever needing feedback from human raters?

Comment by Akash (akash-wasil) on NYT is suing OpenAI&Microsoft for alleged copyright infringement; some quick thoughts · 2023-12-29T08:57:53.380Z · LW · GW

I'd also be curious to know why (some) people downvoted this.

Perhaps it's because you imply that some OpenAI folks were captured, and maybe some people think that that's unwarranted in this case?

Sadly, the more-likely explanation (IMO) is that policy discussions can easily become tribal, even on LessWrong.

I think LW still does better than most places at rewarding discourse that's thoughtful/thought-provoking and resisting tribal impulses, but I wouldn't be surprised if some people were doing something like "ah he is saying something Against AI Labs//Pro-regulation, and that is bad under my worldview, therefore downvote."

(And I also think this happens the other way around as well, and I'm sure people who write things that are "pro AI labs//anti-regulation" are sometimes unfairly downvoted by people in the opposite tribe.)

Comment by Akash (akash-wasil) on EU policymakers reach an agreement on the AI Act · 2023-12-21T14:39:03.530Z · LW · GW

I appreciate the comment, though I think there's a lack of specificity that makes it hard to figure out where we agree/disagree (or more generally what you believe).

If you want to engage further, here are some things I'd be excited to hear from you:

  • What are a few specific comms/advocacy opportunities you're excited about//have funded?
  • What are a few specific comms/advocacy opportunities you view as net negative//have actively decided not to fund?
  • What are a few examples of hypothetical comms/advocacy opportunities you've been excited about?
  • What do you think about EG Max Tegmark/FLI, Andrea Miotti/Control AI, The Future Society, the Center for AI Policy, Holly Elmore, PauseAI, and other specific individuals or groups that are engaging in AI comms or advocacy? 

I think if you (and others at OP) are interested in receiving more critiques or overall feedback on your approach, one thing that would be helpful is writing up your current models/reasoning on comms/advocacy topics.

In the absence of this, people simply notice that OP doesn't seem to be funding some of the main existing examples of comms/advocacy efforts, but they don't really know why, and they don't really know what kinds of comms/advocacy efforts you'd be excited about.

Comment by Akash (akash-wasil) on OpenAI: Preparedness framework · 2023-12-19T20:32:52.308Z · LW · GW

They mention three types of mitigations:

  • Asset protection (e.g., restricting access to models to a limited nameset of people, general infosec)
  • Restricting deployment (only models with a risk score of "medium" or below can be deployed)
  • Restricting development (models with a risk score of "critical" cannot be developed further until safety techniques have been applied that get it down to "high." Although they kind of get to decide when they think their safety techniques have worked sufficiently well.)

My one-sentence reaction after reading the doc for the first time is something like "it doesn't really tell us how OpenAI plans to address the misalignment risks that many of us are concerned about, but with that in mind, it's actually a fairly reasonable document with some fairly concrete commitments."

Comment by Akash (akash-wasil) on OpenAI: Preparedness framework · 2023-12-19T04:44:20.687Z · LW · GW

I believe labels matter, and I believe the label "preparedness framework" is better than the label "responsible scaling policy." Kudos to OpenAI on this. I hope we move past the RSP label.

I think labels will matter most when communicating to people who are not following the discussion closely (e.g., tech policy folks who have a portfolio of 5+ different issues and are not reading the RSPs or PFs in great detail).

One thing I like about the label "preparedness framework" is that it raises the question "prepared for what?", which is exactly the kind of question I want policy people to be asking. PFs imply that there might be something scary that we are trying to prepare for. 

Comment by Akash (akash-wasil) on EU policymakers reach an agreement on the AI Act · 2023-12-15T23:45:51.921Z · LW · GW

Thanks for this overview, Trevor. I expect it'll be helpful– I also agree with your recommendations for people to consider working at standard-setting organizations and other relevant EU offices.

One perspective that I see missing from this post is what I'll call the advocacy/comms/politics perspective. Some examples of this with the EU AI Act:

  • Foundation models were going to be included in the EU AI Act, until France and Germany (with lobbying pressure from Mistral and Aleph Alpha) changed their position.
  • This initiated a political/comms battle between those who wanted to exclude foundation models (led by France and Germany) and those who wanted to keep it in (led by Spain).
  • This political fight rallied lots of notable figures, including folks like Gary Marcus and Max Tegmark, to publicly and privately fight to keep foundation models in the act.
  • There were open letters, op-eds, and certainly many private attempts at advocacy.
  • There were attempts to influence public opinion, pieces that accused key lobbyists of lying, and a lot of discourse on Twitter.

It's difficult to know the impact of any given public comms campaign, but it seems quite plausible to me that many readers would have more marginal impact by focusing on advocacy/comms than focusing on research/policy development.

More broadly, I worry that many segments of the AI governance/policy community might be neglecting to think seriously about what ambitious comms/advocacy could look like in the space.

I'll note that I might be particularly primed to bring this up now that you work for Open Philanthropy. I think many folks (rightfully) critique Open Phil for being too wary of advocacy, campaigns, lobbying, and other policymaker-focused activities. I'm guessing that Open Phil has played an important role in shaping both the financial and cultural incentives that (in my view) leads to an overinvestment into research and an underinvestment into policy/advocacy/comms. 

(I'll acknowledge these critiques are pretty high-level and I don't claim that this comment provides compelling evidence for them. Also, you only recently joined Open Phil, so I'm of course not trying to suggest that you created this culture, though I guess now that you work there you might have some opportunities to change it).

I'll now briefly try to do a Very Hard thing which is like "put myself in Trevor's shoes and ask what I actually want him to do." One concrete recommendation I have is something like "try to spend at least 5 minutes thinking about ways in which you or others around you might be embedded in a culture that has blind spots to some of the comms/advocacy stuff." Another is "make a list of people you read actively or talked to when writing this post. Then ask if there were any other people/orgs you could've reached out to, particularly those that might focus more on comms+advocacy". (Also, to be clear, you might do both of these things and conclude "yea, actually I think my approach was very solid and I just had Good Reasons for writing the post the way I did.")

I'll stop here since this comment is getting long, but I'd be happy to chat further about this stuff. Thanks again for writing the post and kudos to OP for any of the work they supported/will support that ends up increasing P(good EU AI Act goes through & gets implemented). 

Comment by Akash (akash-wasil) on Language models seem to be much better than humans at next-token prediction · 2023-12-13T18:40:35.486Z · LW · GW

What are some examples of situations in which you refer to this point?

Comment by Akash (akash-wasil) on Speaking to Congressional staffers about AI risk · 2023-12-07T00:01:33.817Z · LW · GW

My own model differs a bit from Zach's. It seems to me like most of the publicly-available policy proposals have not gotten much more concrete. It feels a lot more like people were motivated to share existing thoughts, as opposed to people having new thoughts or having more concrete thoughts.

Luke's list, for example, is more of a "list of high-level ideas" than a "list of concrete policy proposals." It has things like "licensing" and "information security requirements"– it's not an actual bill or set of requirements. (And to be clear, I still like Luke's post and it's clear that he wasn't trying to be super concrete).

I'd be excited for people to take policy ideas and concretize them further. 

Aside:  When I say "concrete" in this context, I don't quite mean "people on LW would think this is specific." I mean "this is closer to bill text, text of a section of an executive order, text of an amendment to a bill, text of an international treaty, etc."

I think there are a lot of reasons why we haven't seen much "concrete policy stuff". Here are a few:

  • This work is just very difficult– it's much easier to hide behind vagueness when you're writing an academic-style paper than when you're writing a concrete policy proposal.
  • This work requires people to express themselves with more certainty/concreteness than academic-style research. In a paper, you can avoid giving concrete recommendations, or you can give a recommendation and then immediately mention 3-5 crucial considerations that could change the calculus. In bills, you basically just say "here is what's going to happen" and do much less "and here are the assumptions that go into this and a bunch of ways this could be wrong."
  • This work forces people to engage with questions that are less "intellectually interesting" to many people (e.g., which government agency should be tasked with X, how exactly are we going to operationalize Y?)
  • This work just has a different "vibe" to the more LW-style research and the more academic-style research. Insofar as LW readers are selected for (and reinforced for) liking a certain "kind" of thinking/writing, this "kind" of thinking/writing is different than the concrete policy vibe in a bunch of hard-to-articulate ways.
  • This work often has the potential to be more consequential than academic-style research. There are clear downsides of developing [and advocating for] concrete policies that are bad. Without any gatekeeping, you might have a bunch of newbies writing flawed bills. With excessive gatekeeping, you might create a culture that disincentivizes intelligent people from writing good bills. (And my own subjective impression is that the community erred too far on the latter side, but I think reasonable people could disagree here).

For people interested in developing the kinds of proposals I'm talking about, I'd be happy to chat. I'm aware of a couple of groups doing the kind of policy thinking that I would consider "concrete", and it's quite plausible that we'll see more groups shift toward this over time.  

Comment by Akash (akash-wasil) on Speaking to Congressional staffers about AI risk · 2023-12-05T21:06:30.952Z · LW · GW

Thanks for all of this! Here's a response to your point about committees.

I agree that the committee process is extremely important. It's especially important if you're trying to push forward specific legislation. 

For people who aren't familiar with committees or why they're important, here's a quick summary of my current understanding (there may be a few mistakes):

  1. When a bill gets introduced in the House or the Senate, it gets sent to a committee. The decision is made by the Speaker of the House or the presiding officer in the Senate. In practice, however, they often defer to a non-partisan "parliamentarian" who specializes in figuring out which committee would be most appropriate. My impression is that this process is actually pretty legitimate and non-partisan in most cases(?). 
  2. It takes some degree of skill to be able to predict which committee(s) a bill is most likely to be referred to. Some bills are obvious (like an agriculture bill will go to an agriculture committee). In my opinion, artificial intelligence bills are often harder to predict. There is obviously no "AI committee", and AI stuff can be argued to affect multiple areas. With all that in mind, I think it's not too hard to narrow things down to ~1-3 likely committees in the House and ~1-3 likely committees in the Senate.
  3. The most influential person in the committee is the committee chair. The committee chair is the highest-ranking member from the majority party (so in the House, all the committee chairs are currently Republicans; in the Senate, all the committee chairs are currently Democrats). 
  4. A bill cannot be brought to the House floor or the Senate floor (cannot be properly debated or voted on) until it has gone through committee. The committee is responsible for finalizing the text of the bill and then voting on whether or not they want the bill to advance to the chamber (House or Senate). 
  5. The committee chair typically has a lot of influence over the committee. The committee chair determines which bills get discussed in committee, for how long, etc. Also, committee chairs usually have a lot of "soft power"– members of Congress want to be in good standing with committee chairs. This means that committee chairs often have the ability to prevent certain legislation from getting out of committee.
  6. If you're trying to get legislation passed, it's ideal to have the committee chair think favorably of that piece of legislation. 
  7. It's also important to have at least one person on the committee who is willing to "champion" the bill. This means they view the bill as a priority & are willing to say "hey, committee, I really think we should be talking about bill X." A lot of bills die in committee because they were simply never prioritized. 
  8. If the committee chair brings the bill to a vote, and the majority of committee members vote in favor of the bill moving to the chamber, the bill can be discussed in the full chamber. Party leadership (Speaker of the House, Senate Majority Leader, etc.) typically play the most influential role in deciding which bills get discussed or voted on in the chambers. 
  9. Sometimes, bills get referred to multiple committees. This generally seems like "bad news" from the perspective of getting the bill passed, because it means that the bill has to get out of multiple committees. (Any single committee could essentially prevent the bill from being discussed in the chamber). 

(If any readers are familiar with the committee process, please feel free to add more info or correct me if I've said anything inaccurate.)

Comment by Akash (akash-wasil) on Speaking to Congressional staffers about AI risk · 2023-12-05T20:31:25.928Z · LW · GW

WTF do people "in AI governance" do?

Quick answer:

  1. A lot of AI governance folks primarily do research. They rarely engage with policymakers directly, and they spend much of their time reading and writing papers.
  2. This was even more true before the release of GPT-4 and the recent wave of interest in AI policy. Before GPT-4, many people believed "you will look weird/crazy if you talk to policymakers about AI extinction risk." It's unclear to me how true this was (in a genuine "I am confused about this & don't think I have good models of this" way). Regardless, there has been an update toward talking to policymakers about AI risk now that AI risk is a bit more mainstream. 
  3. My own opinion is that, even after this update toward policymaker engagement, the community as a whole is still probably overinvested in research and underinvested in policymaker engagement/outreach. (Of course, the two can be complementary, and the best outreach will often be done by people who have good models of what needs to be done & can present high-quality answers to the questions that policymakers have). 
  4. Among the people who do outreach/policymaker engagement, my impression is that there has been more focus on the executive branch (and less on Congress/congressional staffers). The main advantage is that the executive branch can get things done more quickly than Congress. The main disadvantage is that Congress is often required (or highly desired) to make "big things" happen (e.g., setting up a new agency or a licensing regime).

Comment by Akash (akash-wasil) on MATS Summer 2023 Retrospective · 2023-12-04T19:59:10.477Z · LW · GW

I would also suspect that #2 (finding/generating good researchers) is more valuable than #1 (generating or accelerating good research during the MATS program itself).

One problem with #2 is that it's usually harder to evaluate and takes longer to evaluate. #2 requires projections, often over the course of years. #1 is still difficult to evaluate (what is "good alignment research" anyways?) but seems easier in comparison.

Also, I would expect correlations between #1 and #2. Like, one way to evaluate "how good are we doing at training researchers//who are the best researchers" is to ask "how good is the research they are producing//who produced the best research in this 3-month period?"

This process is (of course) imperfect. For example, someone might have great output because their mentor handed them a bunch of ready-to-go-projects, but the scholar didn't actually have to learn the important skills of "forming novel ideas" or "figuring out how to prioritize between many different directions." 

But in general, I think it's a pretty decent way to evaluate things. If someone has produced high-quality and original research during the MATS program, that sure does seem like a strong signal for their future potential. Likewise, in the opposite extreme, if during the entire summer cohort there were 0 instances of useful original work, that doesn't necessarily mean something is wrong, but it would make me go "hmmm, maybe we should brainstorm possible changes to the program that could make it more likely that we see high-quality original output next time, and then we see how much those proposed changes trade off against other desiderata."

(It seems quite likely to me that the MATS team has already considered all of this; just responding on the off-chance that something here is useful!)

Comment by Akash (akash-wasil) on MATS Summer 2023 Retrospective · 2023-12-03T06:28:52.341Z · LW · GW

Thanks for writing this! I’m curious if you have any information about the following questions:

  1. What does the MATS team think are the most valuable research outputs from the program?

  2. Which scholars was the MATS team most excited about in terms of their future plans/work?

IMO, these are the two main ways I would expect MATS to have impact: research output during the program and future research output/career trajectories of scholars.

Furthermore, I’d suspect things to be fairly tails-based (where EG the top 1-3 research outputs and the top 1-5 scholars are responsible for most of the impact).

Perhaps MATS as a program feels weird about ranking output or scholars so explicitly, or feels like it’s not their place.

But I think this kind of information seems extremely valuable. If I were considering whether or not I wanted to donate, for instance, my main questions would be “is the research good?” and “is the career development producing impactful people?” (as opposed to things like “what is the average rating on the EOY survey?”, though of course that information may matter for other purposes).

Comment by Akash (akash-wasil) on Deception Chess: Game #2 · 2023-12-03T02:39:14.050Z · LW · GW

I'd expect the amount of time this all takes to be a function of the time-control.

Like, if I have 90 mins, I can allocate more time to all of this. I can consult each of my advisors at every move. I can ask them follow-up questions.

If I only have 20 mins, I need to be more selective. Maybe I only listen to my advisors during critical moves, and I evaluate their arguments more quickly. Also, this inevitably affects the kinds of arguments that the advisors give.

Both of these scenarios seem pretty interesting and AI-relevant. My all-things-considered guess would be that the 20 mins version yields high enough quality data (particularly for the parts of the game that are most critical/interesting & where the debate is most lively) that it's worth it to try with shorter time controls.

(Epistemic status: Thought about this for 5 mins; just vibing; very plausibly underestimating how time pressure could make the debates meaningless).

Comment by Akash (akash-wasil) on Deception Chess: Game #2 · 2023-12-03T00:24:58.348Z · LW · GW

Is there a reason you’re using a 3-hour time control? I’m guessing you’ve thought about this more than I have, but at first glance, it feels to me like this could be done pretty well with EG a 60-min or even 20-min time control.

I’d guess that having 4-6 games that each last 20-30 mins is better than having 1 game that lasts 2 hours.

(Maybe I’m underestimating how much time it takes for the players to give/receive advice. And ofc there are questions about the actual situations with AGI that we’re concerned about— EG to what extent do we expect time pressure to be a relevant factor when humans are trying to evaluate arguments from AIs?)

Comment by Akash (akash-wasil) on Integrity in AI Governance and Advocacy · 2023-11-19T20:04:22.052Z · LW · GW

Thanks! And why did Holden have the ability to choose board members (and be on the board in the first place)?

I remember hearing that this was in exchange for OP investment into OpenAI, but I also remember Dustin claiming that OpenAI didn’t actually need any OP money (would’ve just gotten the money easily from another investor).

Is your model essentially that the OpenAI folks just got along with Holden and thought he/OP were reasonable, or is there a different reason Holden ended up having so much influence over the board?

Comment by Akash (akash-wasil) on Integrity in AI Governance and Advocacy · 2023-11-19T10:54:43.798Z · LW · GW

Clarification/history question: How were these board members chosen?

Comment by Akash (akash-wasil) on How much to update on recent AI governance moves? · 2023-11-17T11:28:20.311Z · LW · GW

Thanks for this dialogue. I find Nate and Oliver's "here's what I think will actually happen" thoughts useful.

I also think I'd find it useful for Nate to spell out "conditional on good things happening, here's what I think the steps look like, and here's the kind of work that I think people should be doing right now. To be clear, I think this is all doomed, and I'm only saying this because Akash directly asked me to condition on worlds where things go well, so here's my best shot."

To be clear, I think some people do too much "play to your outs" reasoning. Taken to excess, this can lead to people just being like "well maybe all we need to do is beat China" or "maybe alignment will be way easier than we feared" or "maybe we just need to bet on worlds where we get a fire alarm for AGI."

I'm particularly curious to see what happens if Nate tries to reason in this frame, especially since I expect his "play to your outs" reasoning/conclusions might look fairly different from that of others in the community.

Some examples of questions for Nate (and others who have written more about what they actually expect to happen and less about what happens if we condition on things going well):

  • Condition on the worlds in which we see substantial progress in the next 6 months. What are some things that have happened in those worlds? What does progress look like?
  • Condition on worlds in which the actions of the AIS community end up having a strong positive influence in the next 6 months. What are some wins that the AIS community (or specific actors within it) achieve?
  • Suppose for the sake of this conversation that you are fully adopting a "play to your outs" mentality. What outs do you see? Regardless of the absolute probabilities you assign, which of these outs seem most likely and most promising?
  • All things considered, what do you currently see as the most impactful ways you can spend your time?
  • All things considered, what do you currently see as the most impactful ways that "highly talented comms/governance/policy people can be spending their time?" (can divide into more specific subgroups if useful). 

I'll also note that I'd be open to having a dialogue about this with Nate (and possibly other "doomy" people who have not written up their "play to your outs" thoughts).

Comment by Akash (akash-wasil) on Survey on the acceleration risks of our new RFPs to study LLM capabilities · 2023-11-11T15:01:29.064Z · LW · GW

Thanks for sharing this! I'm curious if you have any takes on Nate's comment or Oliver's comment:

Nate:

I don't think we have any workable plan for reacting to the realization that dangerous capabilities are upon us. I think that when we get there, we'll predictably either (a) optimize against our transparency tools or otherwise walk right off the cliff-edge anyway, or (b) realize that we're in deep trouble, and slow way down and take some other route to the glorious transhumanist future (we might need to go all the way to WBE, or at least dramatically switch optimization paradigms).

Insofar as this is true, I'd much rather see efforts go _now_ into putting hard limits on capabilities in this paradigm, and booting up alternative paradigms (that aren't supposed to be competitive with scaling, but that are hopefully competitive with what individuals can do on home computers). I could see evals playing a role in that policy (of helping people create sane capability limits and measure whether they're being enforced), but that's not how I expect evals to be used on the mainline.

Oliver:

I have a generally more confident take that slowing things down is good, i.e. don't find arguments that "current humanity is better suited to handle the singularity" very compelling. 

I think I am also more confident that it's good for people to openly and straightforwardly talk about existential risk from AI. 

I am less confident in my answer to the question of "is generic interpretability research cost-effective or even net-positive?". My guess is still yes, but I really feel very uncertain, and feel a bit more robust in my answer to your question than that question.

Comment by Akash (akash-wasil) on Integrity in AI Governance and Advocacy · 2023-11-09T14:33:21.436Z · LW · GW

Adding a datapoint here: I've been involved in the Control AI campaign, which was run by Andrea Miotti (who also works at Conjecture). Before joining, I had heard some integrity/honesty concerns about Conjecture. So when I decided to join, I decided to be on the lookout for any instances of lying/deception/misleadingness/poor integrity. (Sidenote: At the time, I was also wondering whether Control AI was just essentially a vessel to do Conjecture's bidding. I have updated against this– Control AI reflects Andrea's vision. My impression is that Conjecture folks other than Andrea have basically no influence over what Control AI does, unless they convince Andrea to do something.)

I've been impressed by Andrea's integrity and honesty. I was worried that the campaign might have some sort of "how do we win, even if it misleads people" vibe (in which case I would've protested or left), but there was constantly a strong sense of "are we saying things that are true? Are we saying things that we actually believe? Are we communicating clearly?" I was especially impressed given the high volume of content (it is especially hard to avoid saying untrue/misleading things when you are putting out a lot of content at a fast pace.)

In contrast, integrity/honesty/openness norms feel much less strong in DC. When I was in DC, I think it was fairly common to see people "withhold information for strategic purposes", "present a misleading frame (intentionally)", "focus on saying things you think the other person will want to hear", or "decide not to talk at all because sharing beliefs in general could be bad." It's plausible to me that these are "the highest EV move" in some cases, but if we're focusing on honesty/integrity/openness, I think DC scored much worse. (See also Olivia's missing mood point). 

The Bay Area scores well on honesty/integrity IMO (relative to other spaces), but it has its own problems, especially with groupthink/conformity/curiosity-killing: I think these norms are enforced in a way that comes with important tradeoffs. For instance, I think the Bay Area tends to punish people for saying things that are imprecise or "seem dumb", which leads to a lot of groupthink/conformity and a lot of "people just withholding their beliefs so that they don't accidentally say something incorrect and get judged for it." Also, I think high-status people in the Bay Area are often able to "get away with" low openness/transparency/clarity. There are lots of cases where people are like "I believe X because Paul believes X" and then when asked "why does Paul believe X" they're like "idk". This seems like an axis separate from honesty/integrity, but it still leads to pretty icky epistemic discourse.

(This isn't to say that people shouldn't criticize Conjecture– but I think there's a sad thing that happens where it starts to feel like both "sides" are just trying to criticize each other. My current position is much closer to something like "each of these communities has some relative strengths and weaknesses, and each of them has at least 1-3 critical flaws". Whereas in the status quo I think these discussions sometimes end up feeling like members of tribe A calling tribe B low integrity and then tribe B firing back by saying Tribe A is actually low integrity in an even worse way.) 

Comment by Akash (akash-wasil) on Integrity in AI Governance and Advocacy · 2023-11-07T14:59:08.230Z · LW · GW

I don't think aysja was endorsing "hope" as a strategy– at least, that's not how I read it. I read it as "we should hold leaders accountable and make it clear that we think it's important for people to state their true beliefs about important matters."

To be clear, I think it's reasonable for people to discuss the pros and cons of various advocacy tactics, and I think asking "to what extent do I expect X advocacy tactic will affect peoples' incentives to openly state their beliefs?" makes sense.

Separately, though, I think the "accountability frame" is important. Accountability can involve putting pressure on leaders to express their true beliefs, pushing back when we suspect people are trying to find excuses to hide their beliefs, and making it clear that we think openness and honesty are important virtues even when they might provoke criticism– perhaps especially when they might provoke criticism. I think this is especially important in the case of lab leaders and others who have clear financial interests or power interests in the current AGI development ecosystem.

It's not about hoping that people are honest– it's about upholding standards of honesty, and recognizing that we have some ability to hold people accountable if we suspect that they're not being honest. 

I would say that I see the main goal of outside-game advocacy work as setting up external incentives in such a way that pushes labs to good things rather than bad things

I'm currently most excited about outside-game advocacy that tries to get governments to implement regulations that make good things happen. I think this technically falls under the umbrella of "controlling the incentives through explicit regulation", but I think it's sufficiently different from outside-game advocacy work that tries to get labs to do things voluntarily that it's worth distinguishing the two.

Comment by Akash (akash-wasil) on TurnTrout's shortform feed · 2023-11-03T12:58:16.684Z · LW · GW

I think a lot of alignment folk have made positive updates in response to the societal response to AI xrisk.

This is probably different than what you're pointing at (like maybe your claim is more like "Lots of alignment folks only make negative updates when responding to technical AI developments" or something like that).

That said, I don't think the examples you give are especially compelling. I think the following position is quite reasonable (and I think fairly common):

  • Bing Chat provides evidence that some frontier AI companies will fail at alignment even on relatively "easy" problems that we know how to solve with existing techniques. Also, as Habryka mentioned, it's evidence that the underlying competitive pressures will make some companies "YOLO" and take excessive risk. This doesn't affect the absolute difficulty of alignment, but it affects the probability that Earth will actually align AGI.
  • ChatGPT provides evidence that we can steer the behavior of current large language models. People who predicted that it would be hard to align large language models should update. IMO, many people seem to have made mild updates here, but not strong ones, because they (IMO correctly) claim that their threat models never had strong predictions about the kinds of systems we're currently seeing and instead predicted that we wouldn't see major alignment problems until we get smarter systems (e.g., systems with situational awareness and more coherent goals).

(My "Alex sim"– which is not particularly strong– says that maybe these people are just post-hoc rationalizing– like if you had asked them in 2015 how likely we would be to be able to control modern LLMs, they would've been (a) wrong and (b) wrong in an important way– like, their model of how hard it would be to control modern LLMs is very interconnected with their model of why it would be hard to control AGI/superintelligence. Personally, I'm pretty sympathetic to the point that many models of why alignment of AGI/superintelligence would be hard seem relatively disconnected to any predictions about modern LLMs, such that only "small/mild" updates seem appropriate for people who hold those models.)

Comment by Akash (akash-wasil) on Dario Amodei’s prepared remarks from the UK AI Safety Summit, on Anthropic’s Responsible Scaling Policy · 2023-11-02T17:42:02.378Z · LW · GW

Thanks for sharing this! A few thoughts:

It is likely that at ASL-4 we will require a detailed and precise understanding of what is going on inside the model, in order to make an “affirmative case” that the model is safe.

I'd be extremely excited for Anthropic (or ARC or other labs) to say more about what they believe would qualify as an affirmative case for safety. I appreciate this sentence a lot, and I think a "strong version" of affirmative safety (one that goes beyond "we have not been able to detect danger" toward "we have an understanding of the system we are building and we can make some formal or near-formal guarantees about its dangers") would be excellent.

On the other hand, a "weak version" of affirmative safety (e.g., "look, we have shown you it's safe because the red-teamers could not jailbreak it using existing techniques, so now we're confident it's safe & we're going to deploy it widely & scale by another 10X") would be much worse than the "strong version".

So a lot of this will come down to how we interpret and enforce "affirmative safety", and I'd be excited to see governance proposals that center around this. 

Note that the recent FLI scorecard has a column related to affirmative safety ("Burden of proof on developer to demonstrate safety?"), and it currently states that Anthropic's RSP does not put the burden of proof on developers. I think this is an accurate characterization of Anthropic's current RSP. I hope that future RSPs (from Anthropic or other companies) score better on this dimension.

RSPs are not intended as a substitute for regulation, but rather a prototype for it

Glad that this was said explicitly. I think whether or not RSPs will be a good prototype or building block for regulation will depend a lot on how much RSPs end up adopting strong versions of "affirmative safety".

If I could wave a magic wand and add something to the statement, I'd add something like this:

In the event that companies cannot show affirmative safety, we may need to pause frontier AI development for a long period of time. Anthropic is open to the idea that AI development past a certain computing threshold should be prohibited, except in the context of a multinational organization dedicated to AGI safety. We encourage world leaders to pursue this option, and we would be eager to see progress made on the international agreements needed to make this idea into a reality. (Not a real quote from Dario).

Dario did not say this (or anything like it), and I think that's my biggest criticism of the statement. The statement reads as "let's let companies develop safety measures and race to the top"– but this still allows a race to AGI in the first place. 

I appreciate Dario for including the bit about affirmative safety. As a next step, I'd like to see him (and other lab leaders) acknowledge that affirmative safety might be extremely difficult and, given that possibility, say that they're actively excited to see progress on international coordination that could end the race to godlike AI.

(And of course, such statements don't commit Anthropic to stopping until/unless such international coordination is achieved.)

Comment by Akash (akash-wasil) on Thoughts on the AI Safety Summit company policy requests and responses · 2023-11-02T11:02:19.912Z · LW · GW

Did you mean ASL-2 here?

My understanding is that their commitment is to stop once their ASL-3 evals are triggered. They hope that their ASL-3 evals will be conservative enough to trigger before they actually have an ASL-3 system, but I think that's an open question. I've edited my comment to say "before Anthropic scales beyond systems that trigger their ASL-3 evals". See this section from their RSP below (bolding my own):

"We commit to define ASL-4 evaluations before we first train ASL-3 models (i.e., before continuing training beyond when ASL-3 evaluations are triggered)."

By design, RSPs are conditional pauses; you pause until you have met the standard, and then you continue.

Yup, this makes sense. I don't think we disagree on the definition of a conditional pause. But I think if a company says "we will do X before we keep scaling", and then X is a relatively easy standard to meet, I would think it's misleading to say "the company has specified concrete commitments under which they would pause." Even if technically accurate, it gives an overly-rosy picture of what happened, and I would expect it to systematically mislead readers into thinking that the commitments were stronger.

For the Anthropic RSP in particular, I think it's accurate & helpful to say "Anthropic has said that they will not scale past systems that substantially increase misuse risk [if they are able to identify this] until they have better infosec and until they have released a blog post defining ASL-4 systems and telling the world how they plan to develop those safely."

Then, separately, readers can decide for themselves how "concrete" or "good" these commitments are. In my opinion, these are not particularly concrete, and I was expecting much more when I heard the initial way that people were communicating about RSPs. 

the underlying belief of the RSP is that we can only see so far ahead thru the fog, and so we should set our guidelines bit-by-bit, rather than pausing until we can see our way all the way to an aligned sovereign. 

This feels a bit separate from the above discussion, and the "wait until we can see all the way to an aligned sovereign" framing is not an accurate characterization of my view, but here's how I would frame this.

My underlying problem with the RSP framework is that it presumes that companies should be allowed to keep scaling until there is clear and imminent danger, at which point we do [some unspecified thing]. I think a reasonable response from RSP defenders is something like "yes, but we also want stronger regulation and we see this as a step in the right direction." And then the crux becomes something like "OK, on balance, what effect will RSPs have on government regulations [perhaps relative to nothing, or perhaps relative to what would've happened if the energy that went into RSPs had gone into advocating for something else]?"

I currently have significant concerns that if the RSP framework, as it has currently been described, is used as the basis for regulation, it will lock in an incorrect burden of proof. In other words, governments might endorse some sort of "you can keep scaling until auditors can show clear signs of danger and prove that your safeguards are insufficient." This is the opposite of what we expect in other high-risk sectors.

That said, it's not impossible that RSPs will actually get us closer to better regulation– I do buy some sort of general "if industry does something, it's easier for governments to implement it" logic. But I want to see RSP advocates engage more with the burden of proof concerns.

To make this more concrete: I would be enthusiastic if ARC Evals released a blog post saying something along the lines of: "we believe the burden of proof should be on Frontier AI developers to show us affirmative evidence of safety. We have been working on dangerous capability evaluations, which we think will be a useful part of regulatory frameworks, but we would strongly support regulations that demand more evidence than merely the absence of dangerous capabilities. Here are some examples of what that would look like..."

Comment by Akash (akash-wasil) on Thoughts on the AI Safety Summit company policy requests and responses · 2023-11-01T18:56:21.895Z · LW · GW

I got the impression that Anthropic wants to do the following things before it scales beyond systems that trigger their ASL-3 evals:

  1. Have good enough infosec so that it is "unlikely" for non-state actors to steal model weights, and state actors can only steal them "with significant expense."
  2. Be ready to deploy evals at least once every 4X in effective compute
  3. Have a blog post that tells the world what they plan to do to align ASL-4 systems.

The security commitment is the most concrete, and I agree with Habryka that these don't seem likely to cause Anthropic to stop scaling:

Like, I agree that some of these commitments are costly, but I don't see how there is any world where Anthropic would like to continue scaling but finds itself incapable of doing so, which is what I would consider a "pause" to mean. Like, they can just implement their checklist of security requirements and then go ahead. 

Maybe this is quibbling over semantics, but it really does feel quite qualitatively different to me. When OpenAI said that they would spend some substantial fraction of their compute on "Alignment Research" while they train their next model, I think it would be misleading to say "OpenAI has committed to conditionally pausing model scaling".

The commitment to define ASL-4 and tell us how they plan to align it does not seem like a concrete commitment. A concrete commitment would look something like "we have solved X open problem, in alignment as verified via Y verification method" or "we have the ability to pass X test with Y% accuracy."

As is, the commitment is very loose. Anthropic could just publish a post saying "ASL-4 systems are systems that can replicate autonomously in the wild, perform at human-level at most cognitive tasks, or substantially boost AI progress. To align it, we will use Constitutional AI 2.0. And we are going to make our information security even better." 

To be clear, the RSP is consistent with a world in which Anthropic actually chooses to pause before scaling to ASL-4 systems. Like, maybe they will want their containment measures for ASL-4 to be really really good, which will require a major pause. But the RSP does not commit Anthropic to having any particular containment measures or any particular evidence that it is safe to scale to ASL-4; it only commits Anthropic to publishing a post about ASL-4 systems. This is why I don't consider the ASL-4 section to be a concrete commitment.

The same thing holds for the evals point– Anthropic could say "we feel like our evals are good enough" or they could say "ah, we actually need to pause for a long time to get better evals." But the RSP is consistent with either of these worlds, and Anthropic has enough flexibility/freedom here that I don't think it makes sense to call this a concrete commitment. 

Note though that the prediction market RE Anthropic's security commitments currently gives Anthropic a 35% chance of pausing for at least one month, which has updated me somewhat in the direction of "maybe the security commitment is more concrete than I thought". Though I still think it's a bad idea to train a model capable of making biological weapons if it can be stolen by state actors with significant expense. The commitment would be more concrete if it said something like "state actors would not be able to steal this model unless they spent at least $X, which we will operationally define as passing Y red-teaming effort by Z independent group."

Comment by Akash (akash-wasil) on On the Executive Order · 2023-11-01T18:29:30.091Z · LW · GW

Thank you for this, Zvi!

Reporting requirements for foundation models, triggered by highly reasonable compute thresholds.

I disagree, and I think the compute threshold is unreasonably high. I don't even mean this in a "it is unreasonable because an adequate civilization would do way better" sense– I currently mean it in a "I think our actual civilization, even with all of its flaws, could have expected better" sense.

There are very few companies training models above 10^20 FLOP, and it seems like it would be relatively easy to simply say "hey, we are doing this training run and here are some safety measures we are using."

I understand that people are worried about overregulation and stifling innovation in unnecessary ways. But these are reporting requirements– all they do is require someone to inform the government that they are engaging in a training run.

Many people think that 10^26 FLOP has a non-trivial chance of creating xrisk-capable AGI in the next 3-5 years (especially as algorithms get better). But that's not even the main crux for me– the main crux is that reporting requirements seem so low-cost relative to the benefit of the government being able to know what's going on, track risks, and simply have access to information that could help it know what to do.

It also seems very likely to me that the public and the media would be on the side of a lower threshold. If frontier AI companies complained, I think it's pretty straightforward to just be like "wait... you're developing technology that many of you admit could cause extinction, and you don't even want to tell the government what you're up to?"

With all that said, I'm glad the EO uses a compute threshold in the first place (we could've gotten something that didn't even acknowledge compute as a useful metric).

But I think 10^26 is extremely high for a reporting requirement, and I strongly hope that the threshold is lowered.

Comment by Akash (akash-wasil) on Thoughts on the AI Safety Summit company policy requests and responses · 2023-11-01T13:11:38.779Z · LW · GW

Since OpenAI hasn't released its RDP, I agree with the claim that Anthropic's RSP is currently more concrete than OpenAI's RDP. In other words, "Anthropic has released something, and OpenAI has not yet released something."

That said, I think this comment might make people think Anthropic's RSP is more concrete than it actually is. I encourage people to read this comment, as well as the ensuing discussion between Evan and Habryka. 

Especially this part of one of Habryka's comments:

The RSP does not specify the conditions under which Anthropic would stop scaling models (it only says that in order to continue scaling it will implement some safety measures, but that's not an empirical condition, since Anthropic is confident it can implement the listed security measures)

The RSP does not specify under what conditions Anthropic would scale to ASL-4 or beyond, though they have promised they will give those conditions. 

Before I read the RSP, people were saying "Anthropic has made concrete commitments that specify the conditions under which they would stop scaling", and I think this was misleading (at least based on the way I interpreted such comments). I offer some examples of what I thought a concrete/clear commitment would look like here.

That said, I do think it's great for Anthropic to be transparent about its current thinking. Anthropic is getting a disproportionate amount of criticism because they're the ones who spoke up, but it is indeed important to recognize that the other scaling labs do not have scaling policies at all.

I'll conclude by noting that I mostly think the "who is doing better than whom" frame can sometimes be useful, but I think right now it's mostly a distracting frame. Unless the RSPs or RDPs get much better (e.g., specify concrete conditions under which they would stop scaling, reconsider whether or not it makes sense to scale to ASL-4, make it clear that the burden of proof is on developers to show that their systems are safe/understandable/controllable as opposed to safety teams or evaluators to show that a model has dangerous capabilities), I think it's reasonable to conclude "some RSPs are better than others, but the entire RSP framework seems insufficient. Also, on the margin I want more community talent going into government interventions that are more ambitious than what the labs are willing or able to agree to in the context of a race to AGI."

I (Nate) would give all of the companies here an F. However, some get a much higher F grade than others.

Comment by Akash (akash-wasil) on We're Not Ready: thoughts on "pausing" and responsible scaling policies · 2023-10-27T21:39:13.071Z · LW · GW

I think it’s good for proponents of RSPs to be open about the sorts of topics I’ve written about above, so they don’t get confused with e.g. proposing RSPs as a superior alternative to regulation. This post attempts to do that on my part. And to be explicit: I think regulation will be necessary to contain AI risks (RSPs alone are not enough), and should almost certainly end up stricter than what companies impose on themselves.

Strong agree. I wish ARC and Anthropic had been more clear about this, and I would be less critical of their RSP posts if they had said this loudly & clearly. I think your post is loud and clear (you state multiple times, unambiguously, that you think regulation is necessary and that you wish the world had more political will to regulate). I appreciate this, and I'm glad you wrote this post.

I think it’d be unfortunate to try to manage the above risk by resisting attempts to build consensus around conditional pauses, if one does in fact think conditional pauses are better than the status quo. Actively fighting improvements on the status quo because they might be confused for sufficient progress feels icky to me in a way that’s hard to articulate.

A few thoughts:

  1. One reason I'm critical of the Anthropic RSP is that it does not make it clear under what conditions it would actually pause, or for how long, or under what safeguards it would determine it's OK to keep going. It is nice that they said they would run some evals at least once every 4X increase in effective compute and that they don't want to train catastrophe-capable models until their infosec makes it more expensive for actors to steal their models. It is nice that they said that once they get systems that are capable of producing biological weapons, they will at least write something up about what to do with AGI before they decide to just go ahead and scale to AGI. But I mostly look at the RSP and say "wow, these are some of the most bare minimum commitments I could've expected, and they don't even really tell me what a pause would look like and how they would end it."
  2.  Meanwhile, we have OpenAI (that plans to release an RSP at some point), DeepMind (rumor has it they're working on one but also that it might be very hard to get Google to endorse one), and Meta (oof). So I guess I'm sort of left thinking something like "If Anthropic's RSP is the best RSP we're going to get, then yikes, this RSP plan is not doing so well." Of course, this is just a first version, but the substance of the RSP and the way it was communicated about doesn't inspire much hope in me that future versions will be better.
  3. I think the RSP frame is wrong, and I don't want regulators to use it as a building block. My understanding is that labs are refusing to adopt an evals regime in which the burden of proof is on labs to show that scaling is safe. Given this lack of buy-in, the RSP folks concluded that the only thing left to do was to say "OK, fine, but at least please check to see if the system will imminently kill you. And if we find proof that the system is pretty clearly dangerous or about to be dangerous, then will you at least consider stopping?" It seems plausible to me that governments would be willing to start with something stricter and more sensible than this "just keep going until we can prove that the model has highly dangerous capabilities" regime.
  4. I think some improvements on the status quo can be net negative because they either (a) cement in an incorrect frame or (b) take a limited window of political will/attention and steer it toward something weaker than what would've happened if people had pushed for something stronger. For example, I think the UK government is currently looking around for substantive stuff to show their constituents (and themselves) that they are doing something serious about AI. If companies give them a milquetoast solution that allows them to say "look, we did the responsible thing!", it seems quite plausible to me that we actually end up in a worse world than if the AIS community had rallied behind something stronger. 
  5. If everyone communicating about RSPs was clear that they don't want it to be seen as sufficient, that would be great. In practice, that's not what I see happening. Anthropic's RSP largely seems devoted to signaling that Anthropic is great, safe, credible, and trustworthy. Paul's recent post is nuanced, but I don't think the "RSPs are not sufficient" frame was sufficiently emphasized (perhaps partly because he thinks RSPs could lead to a 10x reduction in risk, which seems crazy to me, and if he goes around saying that to policymakers, I expect them to hear something like "this is a good plan that would sufficiently reduce risks"). ARC's post tries to sell RSPs as a pragmatic middle ground and IMO pretty clearly does not emphasize (or even mention?) some sort of "these are not sufficient" message. Finally, the name itself sounds like it came out of a propaganda department– "hey, governments, look, we can scale responsibly". 
  6. At minimum, I hope that RSPs get renamed, and that those communicating about RSPs are more careful to avoid giving off the impression that RSPs are sufficient.
  7. More ambitiously, I hope that folks working on RSPs seriously consider whether or not this is the best thing to be working on or advocating for. My impression is that this plan made more sense when it was less clear that the Overton Window was going to blow open, Bengio/Hinton would enter the fray, journalists and the public would be fairly sympathetic, Rishi Sunak would host an xrisk summit, Blumenthal would run hearings about xrisk, etc. I think everyone working on RSPs should spend at least a few hours taking seriously the possibility that the AIS community could be advocating for stronger policy proposals and getting out of the "we can't do anything until we literally have proof that the model is imminently dangerous" frame. To be clear, I think some people who do this reflection will conclude that they ought to keep making marginal progress on RSPs. I would be surprised if the current allocation of community talent/resources was correct, though, and I think on the margin more people should be doing things like CAIP & Conjecture, and fewer people should be doing things like RSPs. (Note that CAIP & Conjecture both have important flaws/limitations– and I think this partly has to do with the fact that so much top community talent has been funneled into RSPs/labs relative to advocacy/outreach/outside game).
Comment by Akash (akash-wasil) on Responsible Scaling Policies Are Risk Management Done Wrong · 2023-10-26T18:22:30.644Z · LW · GW

Do you mind pointing me to the section? I skimmed your post again, and the only relevant thing I saw was this part:

  1. Seeing the existing RSP system in place at labs, governments step in and use it as a basis to enact hard regulation.
  2. By the time it is necessary to codify exactly what safety metrics are required for scaling past models that pose a potential takeover risk, we have clearly solved the problem of understanding-based evals and know what it would take to demonstrate sufficient understanding of a model to rule out e.g. deceptive alignment.
  3. Understanding-based evals are adopted by governmental RSP regimes as hard gating evaluations for models that pose a potential takeover risk.
  4. Once labs start to reach models that pose a potential takeover risk, they either:
    1. Solve mechanistic interpretability to a sufficient extent that they are able to pass an understanding-based eval and demonstrate that their models are safe.
    2. Get blocked on scaling until mechanistic interpretability is solved, forcing a reroute of resources from scaling to interpretability.

My summary of this is something like "maybe voluntary RSPs will make it more likely for governments to force people to do evals. And not just the inadequate dangerous capabilities evals we have now– but also the better understanding-based evals that are not yet developed, but hopefully we will have solved some technical problems in time."

I think this is better than no government regulation, but the main problem (if I'm understanding this correctly) is that it relies on evals that we do not have. 

IMO, a more common-sense approach would be "let's stop until we are confident that we can proceed safely", and I'm more excited about those who are pushing for this position.

Aside: I don't mean to nitpick your wording, but I think a "full plan" would involve many more details. In the absence of those details, it's hard to evaluate the plan. Examples of some details that would need to be ironed out:

  • Which systems are licensed under this regime? Who defines what a "model that poses a potential takeover risk" is, and how do we have inclusion criteria that are flexible enough to account for algorithmic improvement? 
  • Who in the government is doing this? 
  • Do we have an international body that is making sure that various countries comply?
  • How do we make sure the regulator doesn't get captured? 
  • What does solving mechanistic interpretability mean, and who is determining that? 

To be clear, I don't think you need to specify all of this, and some of these are pretty specific/nit-picky, but I don't think you should be calling this a "full plan."

Comment by Akash (akash-wasil) on Responsible Scaling Policies Are Risk Management Done Wrong · 2023-10-26T17:38:01.720Z · LW · GW

@evhub I think it's great when you and other RSP supporters make it explicit that (a) you don't think they're sufficient and (b) you think they can lead to more meaningful regulation.

With that in mind, I think the onus is on you (and institutions like Anthropic and ARC) to say what kind of regulations they support & why. And then I think most of the value will come from "what actual regulations are people proposing" and not "what is someone's stance on this RSP thing which we all agree is insufficient."

Except for the fact that there are ways to talk about RSPs that are misleading for policymakers and reduce the chance of meaningful regulations. See the end of my comment and see also Simeon's sections on misleading and how to move forward.

Also, fwiw, I imagine that timelines/takeoff speeds might be relevant cruxes. And IDK if it's the main disagreement that you have with Siméon, but I don't think it's the main disagreement you have with me.

Even if I thought we would have 3 more meaningful policy windows, I would still think that RSPs have not offered a solid frame/foundation for meaningful regulation, I would still think that they are being communicated about poorly, and I would still want people to focus more on proposals for other regulations & focus less on RSPs.

Comment by Akash (akash-wasil) on AI #35: Responsible Scaling Policies · 2023-10-26T13:58:34.721Z · LW · GW

I appreciated the section that contrasted the "reasonable pragmatist strategy" with what people in the "pragmatist camp" sometimes seem to be doing. 

EAs in AI governance would often tell me things along the lines of "trust me, the pragmatists know what they're doing. They support strong regulations, they're just not allowed to say it. At some point, when the time is right, they'll come out and actually support meaningful regulations. They appreciate the folks who are pushing the Overton Window etc etc."

I think this was likely wrong, or at least oversold. Maybe I was just talking to the wrong people, idk. 

To me, it seems like we're in an extremely opportune and important time for people in the "pragmatist camp" to come out and say they support strong regulations. Yoshua Bengio and Geoffrey Hinton and other respectable people with respectable roles are out here saying that this stuff could cause extinction.

I'm even fine with pragmatists adding caveats like "I would support X if we got sufficient evidence that this was feasible" or "I would support Y if we had a concrete version of that proposal, and I think it would be important to put thought into addressing Z limitation." 

They can also make it clear that they also support some of the more agreeable policies (e.g., infosec requirements). They can make it clear that some of the ambitious solutions would require substantial international coordination and there are certain scenarios in which they would be inappropriate.

Instead, I often see folks dismiss ideas that are remotely outside the Mainline Discourse. Concretely, I think it's inappropriate for people to confidently declare that things like global compute caps and IAEA-like governance structures are infeasible. There is a substantial amount of uncertainty around what's possible– world governments are just beginning to wake up to a technology that experts believe has a >20% chance of ending humanity.

We are not dealing with "normal policy"– we simply don't have a large enough sample size of "the world trying to deal with existentially dangerous technologies" to be able to predict what will happen.

It's common for people to say "it's really hard to predict capabilities progress". I wish it was more common for people to say "it's really hard to predict how quickly the Overton Window will shift and how world leaders will react to this extremely scary technology." As a result, maybe we shouldn't dismiss the ambitious strategies that would require greater-than-climate-change levels of international coordination. 

It isn't over yet. I think there's hope that some of the pragmatists will come out and realize that they are now "allowed" to say more than "maybe we should at least evaluate these systems with a >20% chance of ending humanity." I think Joe Collman's comment makes this point clearly and constructively:

The outcome I'm interested in is something like: every person with significant influence on policy knows that this is believed to be a good/ideal solution, and that the only reasons against it are based on whether it's achievable in the right form.

If ARC Evals aren't saying this, RSPs don't include it, and many policy proposals don't include it..., then I don't expect this to become common knowledge.

We're much less likely to get a stop if most people with influence don't even realize it's the thing that we'd ideally get.

Comment by Akash (akash-wasil) on Announcing Timaeus · 2023-10-23T12:39:07.654Z · LW · GW

Congrats on launching!

Besides funding, are there things other people could provide to help your group succeed? Are there particular types of people you’d be excited to collaborate with, get mentorship/advice from, etc?

Comment by Akash (akash-wasil) on Holly Elmore and Rob Miles dialogue on AI Safety Advocacy · 2023-10-20T22:53:03.578Z · LW · GW

I just want to say that I thought this was an excellent dialogue. 

It is very rare for two people with different views/perspectives to come together and genuinely just try to understand each other, ask thoughtful questions, and track their updates. This dialogue felt like a visceral, emotional reminder that this kind of thing is actually still possible. Even on a topic as "hot" as AI pause advocacy. 

Thank you to Holly, Rob, and Jacob. 

I'll also note that I've been proud of Holly for her activism. I remember speaking with her a bit when she was just getting involved. I was like: "she sure does have spirit and persistence– but I wonder if she'll really be able to make this work." But so far, I think she has. I'm impressed with how far she's come. 

I think she's been doing an excellent and thoughtful job so far. And this is despite navigating various tradeoffs, dealing with hostile reactions, being a pioneer in this space, and battling the lonely dissent that Rob mentioned.

I don't know what the future of AI pause advocacy will look like, and I'm not sure what the movement will become, but I'm very glad that Holly has emerged as one of its leaders.