Winners of AI Alignment Awards Research Contest 2023-07-13T16:14:38.243Z
AI Safety Newsletter #8: Rogue AIs, how to screen for AI risks, and grants for research on democratic governance of AI 2023-05-30T11:52:31.669Z
AI Safety Newsletter #7: Disinformation, Governance Recommendations for AI labs, and Senate Hearings on AI 2023-05-23T21:47:34.755Z
Eisenhower's Atoms for Peace Speech 2023-05-17T16:10:38.852Z
AI Safety Newsletter #6: Examples of AI safety progress, Yoshua Bengio proposes a ban on AI agents, and lessons from nuclear arms control 2023-05-16T15:14:45.921Z
AI Safety Newsletter #5: Geoffrey Hinton speaks out on AI risk, the White House meets with AI labs, and Trojan attacks on language models 2023-05-09T15:26:55.978Z
AI Safety Newsletter #4: AI and Cybersecurity, Persuasive AIs, Weaponization, and Geoffrey Hinton talks AI risks 2023-05-02T18:41:43.144Z
Discussion about AI Safety funding (FB transcript) 2023-04-30T19:05:34.009Z
Reframing the burden of proof: Companies should prove that models are safe (rather than expecting auditors to prove that models are dangerous) 2023-04-25T18:49:29.042Z
DeepMind and Google Brain are merging [Linkpost] 2023-04-20T18:47:23.016Z
AI Safety Newsletter #2: ChaosGPT, Natural Selection, and AI Safety in the Media 2023-04-18T18:44:35.923Z
Request to AGI organizations: Share your views on pausing AI progress 2023-04-11T17:30:46.707Z
AI Safety Newsletter #1 [CAIS Linkpost] 2023-04-10T20:18:57.485Z
Reliability, Security, and AI risk: Notes from infosec textbook chapter 1 2023-04-07T15:47:16.581Z
New survey: 46% of Americans are concerned about extinction from AI; 69% support a six-month pause in AI development 2023-04-05T01:26:51.830Z
[Linkpost] Critiques of Redwood Research 2023-03-31T20:00:09.784Z
What would a compute monitoring plan look like? [Linkpost] 2023-03-26T19:33:46.896Z
The Overton Window widens: Examples of AI risk in the media 2023-03-23T17:10:14.616Z
The Wizard of Oz Problem: How incentives and narratives can skew our perception of AI developments 2023-03-20T20:44:29.445Z
[Linkpost] Scott Alexander reacts to OpenAI's latest post 2023-03-11T22:24:39.394Z
Questions about Conjecture's CoEm proposal 2023-03-09T19:32:50.600Z
AI Governance & Strategy: Priorities, talent gaps, & opportunities 2023-03-03T18:09:26.659Z
Fighting without hope 2023-03-01T18:15:05.188Z
Qualities that alignment mentors value in junior researchers 2023-02-14T23:27:40.747Z
4 ways to think about democratizing AI [GovAI Linkpost] 2023-02-13T18:06:41.208Z
How evals might (or might not) prevent catastrophic risks from AI 2023-02-07T20:16:08.253Z
[Linkpost] Google invested $300M in Anthropic in late 2022 2023-02-03T19:13:32.112Z
Many AI governance proposals have a tradeoff between usefulness and feasibility 2023-02-03T18:49:44.431Z
Talk to me about your summer/career plans 2023-01-31T18:29:23.351Z
Advice I found helpful in 2022 2023-01-28T19:48:23.160Z
11 heuristics for choosing (alignment) research projects 2023-01-27T00:36:08.742Z
"Status" can be corrosive; here's how I handle it 2023-01-24T01:25:04.539Z
[Linkpost] TIME article: DeepMind’s CEO Helped Take AI Mainstream. Now He’s Urging Caution 2023-01-21T16:51:09.586Z
Wentworth and Larsen on buying time 2023-01-09T21:31:24.911Z
[Linkpost] Jan Leike on three kinds of alignment taxes 2023-01-06T23:57:34.788Z
My thoughts on OpenAI's alignment plan 2022-12-30T19:33:15.019Z
An overview of some promising work by junior alignment researchers 2022-12-26T17:23:58.991Z
Podcast: Tamera Lanham on AI risk, threat models, alignment proposals, externalized reasoning oversight, and working at Anthropic 2022-12-20T21:39:41.866Z
12 career-related questions that may (or may not) be helpful for people interested in alignment research 2022-12-12T22:36:21.936Z
Podcast: Shoshannah Tekofsky on skilling up in AI safety, visiting Berkeley, and developing novel research ideas 2022-11-25T20:47:09.832Z
Announcing AI Alignment Awards: $100k research contests about goal misgeneralization & corrigibility 2022-11-22T22:19:09.419Z
Ways to buy time 2022-11-12T19:31:10.411Z
Instead of technical research, more people should focus on buying time 2022-11-05T20:43:45.215Z
Resources that (I think) new alignment researchers should know about 2022-10-28T22:13:36.537Z
Consider trying Vivek Hebbar's alignment exercises 2022-10-24T19:46:40.847Z
Possible miracles 2022-10-09T18:17:01.470Z
7 traps that (we think) new alignment researchers often fall into 2022-09-27T23:13:46.697Z
Alignment Org Cheat Sheet 2022-09-20T17:36:58.708Z
Apply for mentorship in AI Safety field-building 2022-09-17T19:06:12.753Z
Understanding Conjecture: Notes from Connor Leahy interview 2022-09-15T18:37:51.653Z


Comment by Akash (akash-wasil) on Measuring and Improving the Faithfulness of Model-Generated Reasoning · 2023-07-21T19:50:50.672Z · LW · GW

I’m excited to see both of these papers. It would be a shame if we reached superintelligence before anyone had put serious effort into examining whether CoT prompting could be a useful way to understand models.

I also had some uneasiness while reading some sections of the post. For the rest of this comment, I’m going to try to articulate where that’s coming from.

“Overall, our results suggest that CoT can be faithful if the circumstances such as the model size and task are carefully chosen.”

I notice that I’m pretty confused when I read this sentence. I didn’t read the whole paper, and it’s possible I’m missing something. But one reaction I have is that this sentence presents an overly optimistic interpretation of the paper’s findings.

The sentence makes me think that Anthropic knows how to set the model size and task in such a way that it guarantees faithful explanations. Perhaps I missed something when skimming the paper, but my understanding is that we’re still very far away from that. In fact, my impression is that the tests in the paper are able to detect unfaithful reasoning, but we don’t yet have strong tests to confirm that a model is engaging in faithful reasoning.

Even supposing that the last sentence is accurate, I think there are a lot of possible “last sentences” that could have been emphasized. For example, the last sentence could have read:

  • “Overall, our results suggest that CoT is not reliably faithful, and further research will be needed to investigate its promise, especially in larger models.”
  • “Unfortunately, this finding suggests that CoT prompting may not yet be a promising way to understand the internal cognition of powerful models.” (note: “this finding” refers to the finding in the previous sentence: “as models become larger and more capable, they produce less faithful reasoning on most tasks we study.”)
  • “While these findings are promising, much more work on CoT is needed before it can reliably be used as a technique to produce faithful explanations, especially in larger models.”

I’ve had similar reactions when reading other posts and papers by Anthropic. For example, in Core Views on AI Safety, Anthropic presents (in my opinion) a much rosier picture of Constitutional AI than is warranted. It also nearly exclusively cites its own work and doesn’t really acknowledge other important segments of the alignment community.

But Akash, aren’t you just being naive? It’s fairly common for scientists to present slightly-overly-rosy pictures of their research. And it’s extremely common for profit-driven companies to highlight their own work in their blog posts.

I’m holding Anthropic to a higher standard. I think this is reasonable for any company developing technology that poses a serious existential risk. I also think this is especially important in Anthropic’s case.  Anthropic has noted that their theory of change heavily involves: (a) understanding how difficult alignment is, (b) updating their beliefs in response to evidence, and (c) communicating alignment difficulty to the broader world.

To the extent that we trust Anthropic not to fool themselves or others, this plan has a higher chance of working. To the extent that Anthropic falls prey to traditional scientific or corporate cultural pressures (and it has incentives to see its own work as more promising than it actually is), this plan is weakened. As noted elsewhere, epistemic culture also has an important role in many alignment plans (see Carefully Bootstrapped Alignment is organizationally hard). 

(I’ll conclude by noting that I commend the authors of both papers, and I hope this comment doesn’t come off as overly harsh toward them. The post inspired me to write about a phenomenon I’ve been observing for a while: I’ve gradually noticed similar feelings after reading various Anthropic comms, talking to folks who work at Anthropic, seeing Slack threads, analyzing their governance plans, etc. So this comment is not meant to pick on these particular authors; rather, this felt like a relevant opportunity to illustrate that feeling.)

Comment by Akash (akash-wasil) on AI #16: AI in the UK · 2023-06-18T21:31:29.264Z · LW · GW

Perhaps a podcast discussion between you two would be interesting and/or productive. Or perhaps a Slack discussion that turns into a post (sort of like this):

If you’re interested, I would be happy to moderate or help find a suitable moderator.

Comment by Akash (akash-wasil) on Lightcone Infrastructure/LessWrong is looking for funding · 2023-06-15T19:19:48.763Z · LW · GW

Thanks for this detailed response; I found it quite helpful. I maintain my "yeah, they should probably get as much funding as they want" stance. I'm especially glad to see that Lightcone might be interested in helping people stay sane/grounded as many people charge into the policy space. 

I ended up deciding to instead publish a short post, expecting that people will write a lot of questions in the comments, and then to engage straightforwardly and transparently there, which felt like a way that was more likely to end up with shared understanding.

This seems quite reasonable to me. I think it might've been useful to include something short in the original post that made this clear. I know you said "also feel free to ask any questions in the comments"; in an ideal world, this would probably be enough, but I'm guessing this isn't enough given power/status dynamics. 

For example, if ARC Evals released a post like this, I expect many people would experience friction that prevented them from asking (or even generating) questions that might (a) make ARC Evals look bad, (b) make the commenter seem dumb, or (c) potentially worsen the relationship between the commenter and ARC evals. 

To Lightcone's credit, I think Lightcone has maintained a (stronger) reputation of being fairly open to objections (and not penalizing people for asking "dumb questions" or something like that), but the Desire Not to Upset High-status People and the Desire Not to Look Dumb In Front of Your Peers By Asking Things You're Already Supposed to Know are strong. 

I'm guessing that part of why I felt comfortable asking (and even going past the "yay, I like Lightcone and therefore I support this post" to the mental motion of "wait, am I actually satisfied with this post? What questions do I have") is that I've had a chance to interact in-person with the Lightcone team on many occasions, so I felt considerably less psychological friction than most.

All things considered, perhaps an ideal version of the post would've said something short like "we understand we haven't given any details about what we're actually planning to do or how we'd use the funding. This is because Oli finds this stressful. But we actually really want you to ask questions, even "dumb questions", in the comments."

(To be clear I don't think the lack of doing this was particularly harmful, and I think your comment definitely addresses this. I'm nit-picking because I think it's an interesting microcosm of broader status/power dynamics that get in the way of discourse, and because I expect the Lightcone team to be unusually interested in this kind of thing.)

Comment by Akash (akash-wasil) on Lightcone Infrastructure/LessWrong is looking for funding · 2023-06-14T17:04:49.762Z · LW · GW

I'm a fan of the Lightcone team & I think they're one of the few orgs where I'd basically just say "yeah, they should probably just get as much funding as they want."

With that in mind, I was surprised by the lack of information in this funding request. I feel mixed about this: high-status AIS orgs often (accurately) recognize that they don't really need to spend time justifying their funding requests, but I think this often harms community epistemics (e.g., by leading to situations where everyone is like "oh X org is great-- I totally support them" without actually knowing much about what work they're planning to do, what models they have, etc.)

Here are some questions I'm curious about:

  • What are Lightcone's plans for the next 3-6 months? (is it just going to involve continuing the projects that were listed?)
  • How is Lightcone orienting to the recent rise in interest in AI policy? Which policy/governance plans (if any) does Lightcone support?
  • What is Lightcone's general worldview/vibe these days? (Is it pretty much covered in this post?) Where does Lightcone disagree with other folks who work on reducing existential risk?
  • What are Lightcone's biggest "wins" and "losses" over the past ~3-6 months?
  • How much funding does Lightcone think it could usefully absorb, and what would the money be used for?

Comment by akash-wasil on [deleted post] 2023-06-09T23:49:07.953Z

I generally don't find writeups of standards useful, but this piece was an exception. Below, I'll try to articulate why:

I think AI governance pieces-- especially pieces about standards-- often have overly vague language. People say things like "risk management practices" or "third-party audits", phrases that are generally umbrella terms that lack specificity. These sometimes serve as applause lights (whether the author intended this or not): who could really disagree with the idea of risk management?

I liked that this piece (fairly unapologetically) advocates for specific things that labs should be doing. As an added benefit, the piece causes the reader to realize "oh wow, there's a lot of stuff that our civilization already does to mitigate this other threat-- if we were actually prioritizing AI x-risk as seriously as pandemics, there's a bunch of stuff we'd be doing differently." 

Naturally, there are some areas that lack detail (e.g., how do biolabs do risk assessments and how would AI labs do risk assessments?), but the reader is at least left with some references that allow them to learn more. I think this moves us to an acceptable level of concreteness, especially for a shallow investigation.

I think this is also the kind of thing that I could actually see myself handing to a policymaker or staffer (in reality, since they have no time, I would probably show them a one-pager or two-pager version of this, with this longer version linked).

I'll likely send this to junior AI governance folks as a solid example of a shallow investigation that could be helpful. In terms of constructive feedback, I think the piece could've had a TLDR section (or table) that directly lists each of the recommendations for AI labs and each of the analogs in biosafety. [I might work on such a table and share it here if I produce it]. 

Comment by Akash (akash-wasil) on Announcing Apollo Research · 2023-05-31T18:16:03.347Z · LW · GW

Congratulations on launching!

On the governance side, one question I'd be excited to see Apollo (and ARC evals & any other similar groups) think/write about is: what happens after a dangerous capability eval goes off? 

Of course, the actual answer will be shaped by the particular climate/culture/zeitgeist/policy window/lab factors that are impossible to fully predict in advance.

But my impression is that this question is relatively neglected, and I wouldn't be surprised if sharp newcomers were able to meaningfully improve the community's thinking on this. 

Comment by Akash (akash-wasil) on Seeking (Paid) Case Studies on Standards · 2023-05-26T19:39:28.181Z · LW · GW

Excited to see this! I'd be most excited about case studies of standards in fields where people didn't already have clear ideas about how to verify safety.

In some areas, it's pretty clear what you're supposed to do to verify safety. Everyone (more-or-less) agrees on what counts as safe.

One of the biggest challenges with AI safety standards will be the fact that no one really knows how to verify that a (sufficiently-powerful) system is safe. And a lot of experts disagree on the type of evidence that would be sufficient.

Are there examples of standards in other industries where people were quite confused about what "safety" would require? Are there examples of standards that are specific enough to be useful but flexible enough to deal with unexpected failure modes or threats? Are there examples where the standards-setters acknowledged that they wouldn't be able to make a simple checklist, so they requested that companies provide proactive evidence of safety?

Comment by Akash (akash-wasil) on The Office of Science and Technology Policy put out a request for information on A.I. · 2023-05-25T13:37:50.099Z · LW · GW

I've been working on a response to the NTIA request for comments on AI Accountability over the last few months. It's likely that I'll also submit something to the OSTP request.

I've learned a few useful things from talking to AI governance and policy folks. Some of it is fairly intuitive but still worth highlighting (e.g., try to avoid jargon, remember that the reader doesn't share many assumptions that people in AI safety take for granted, remember that people have many different priorities). Some of it is less intuitive (e.g., what actually happens with the responses? How long should your response be? How important is it to say something novel? What kinds of things are policymakers actually looking for?)

If anyone is looking for advice, feel free to DM me. 

Comment by Akash (akash-wasil) on How MATS addresses “mass movement building” concerns · 2023-05-04T01:31:46.340Z · LW · GW

Glad to see this write-up & excited for more posts.

I think these are three areas that MATS feels like it has handled fairly well. I'd be especially excited to hear more about areas where MATS thinks it's struggling, MATS is uncertain, or where MATS feels like it has a lot of room to grow. Potential candidates include:

  • How is MATS going about talent selection and advertising for the next cohort, especially given the recent wave of interest in AI/AI safety?
  • How does MATS intend to foster (or recruit) the kinds of qualities that strong researchers often possess?
  • How does MATS define "good" alignment research? 

Other things I'd be curious about:

  • Which work from previous MATS scholars is the MATS team most excited about? What are MATS's biggest wins? Which individuals or research outputs is MATS most proud of?
  • Most people's timelines have shortened a lot since MATS was established. Does this substantially reduce the value of MATS (relative to worlds with longer timelines)?
  • Does MATS plan to try to attract senior researchers who are becoming interested in AI Safety (e.g., professors, people with 10+ years of experience in industry)? Or will MATS continue to recruit primarily from the (largely younger and less experienced) EA/LW communities?

Comment by Akash (akash-wasil) on AGI safety career advice · 2023-05-02T17:22:32.214Z · LW · GW

(Pasting this exchange from a comment thread on the EA Forum; bolding added)

Peter Park:

Thank you so much for your insightful and detailed list of ideas for AGI safety careers, Richard! I really appreciate your excellent post.

I would propose explicitly grouping some of your ideas and additional ones under a third category: “identifying and raising public awareness of AGI’s dangers.” In fact, I think this category may plausibly contain some of the most impactful ideas for reducing catastrophic and existential risks, given that alignment seems potentially difficult to achieve in a reasonable period of time (if ever) and the implementation of governance ideas is bottlenecked by public support.

For a similar argument that I found particularly compelling, please check out Greg Colbourn’s recent post:


Richard Ngo:

I don't actually think the implementation of governance ideas is mainly bottlenecked by public support; I think it's bottlenecked by good concrete proposals. And to the extent that it is bottlenecked by public support, that will change by default as more powerful AI systems are released.


I appreciate Richard stating this explicitly. I think this is (and has been) a pretty big crux in the AI governance space right now.

Some folks (like Richard) believe that we're mainly bottlenecked by good concrete proposals. Other folks believe that we have concrete proposals, but we need to raise awareness and political support in order to implement them.

I'd like to see more work going into both of these areas. On the margin, though, I'm currently more excited about efforts to raise awareness [well], acquire political support, and channel that support into achieving useful policies. 

I think this is largely due to (a) my perception that this work is largely neglected, (b) the fact that a few AI governance professionals I trust have also stated that they see this as the higher priority thing at the moment, and (c) worldview beliefs around what kind of regulation is warranted (e.g., being more sympathetic to proposals that require a lot of political will).

Comment by Akash (akash-wasil) on Reframing the burden of proof: Companies should prove that models are safe (rather than expecting auditors to prove that models are dangerous) · 2023-04-27T20:11:37.291Z · LW · GW

Nice-- very relevant. I agree with Evan that arguments about the training procedure will be relevant (I'm more uncertain about whether checking for deception behaviorally will be harder than avoiding it, but it certainly seems plausible). 

Ideally, I think the regulators would be flexible in the kind of evidence they accept. If a developer has evidence that the model is not deceptive that relies on details about the training procedure, rather than behavioral testing, that could be sufficient.

(In fact, I think arguments that meet some sort of "beyond-a-reasonable-doubt" threshold would likely involve providing arguments for why the training procedure avoids deceptive alignment.)

Comment by Akash (akash-wasil) on Reframing the burden of proof: Companies should prove that models are safe (rather than expecting auditors to prove that models are dangerous) · 2023-04-27T19:11:36.672Z · LW · GW

Can you say more about what part of this relates to a ban on AI development?

I think the claim "AI development should be regulated in a way such that the burden of proof is on developers to show beyond-a-reasonable-doubt that models are safe" seems quite different from the claim "AI development should be banned", but it's possible that I'm missing something here or communicating imprecisely. 

Comment by Akash (akash-wasil) on Reframing the burden of proof: Companies should prove that models are safe (rather than expecting auditors to prove that models are dangerous) · 2023-04-25T19:16:33.787Z · LW · GW

This makes sense. Can you say more about how aviation regulation differs from the FDA?

In other words, are there meaningful differences in how the regulatory processes are set up? Or does it just happen to be the case that the FDA has historically been worse at responding to evidence compared to the Federal Aviation Administration? 

(I think it's plausible that we would want a structure similar to the FDA even if the particular individuals at the FDA were bad at cost-benefit analysis, unless there are arguments that the structure of the FDA caused the bad cost-benefit analyses).

Comment by Akash (akash-wasil) on No Summer Harvest: Why AI Development Won't Pause · 2023-04-06T16:58:03.418Z · LW · GW

My understanding of your claim is something like:

  • Claim 1: Cooperation with China would likely require a strong Chinese AI safety community
  • Claim 2: The Chinese AI safety community is weak
  • Conclusion: Therefore, cooperation with China is infeasible

I don't have strong takes on claim 2, but I think I (at least at first glance) disagree with claim 1. It seems quite plausible to imagine international cooperation without requiring strong domestic AI safety communities in each country that opts in to the agreement. If the US tried sufficiently hard, and was willing to make trades/sacrifices, it seems plausible to me that it could get buy-in from other countries even if there weren't strong domestic AIS communities.

Also, traditionally when people talk about the Chinese AI Safety community, they often talk about people who are in some way affiliated with or motivated by EA/LW ideas. There are 2-3 groups that always get cited.

I think this is pretty limited. I expect that, especially as AI risk continues to get more mainstream, we're going to see a lot of people care about AI safety from different vantage points. In other words, there's still time to see new AI safety movements form in China (and elsewhere), even if they don't involve the 2-3 "vetted" AI safety groups calling the shots.

Finally, there are ultimately a pretty small number of major decision-makers. If the US "led the way" on AI safety conversations, it may be possible to get buy-in from those small number of decision-makers.

To be clear, I'm not wildly optimistic about unprecedented global cooperation. (There's a reason "unprecedented" is in the phrase!) But I do think there are some paths to success that seem plausible even if the current Chinese AI safety community is not particularly strong. (And note I don't claim to have informed views about how strong it is). 

Comment by Akash (akash-wasil) on [Linkpost] Critiques of Redwood Research · 2023-03-31T20:00:39.013Z · LW · GW

Copying over my comment from the EA Forum version.

I think it's great that you're releasing some posts that criticize/red-team some major AIS orgs. It's sad (though understandable) that you felt like you had to do this anonymously. 

I'm going to comment a bit on the Work Culture Issues section. I've spoken to some people who work at Redwood, have worked at Redwood, or considered working at Redwood.

I think my main comment is something like: "you've done a good job pointing at some problems, but it's pretty hard to figure out what should be done about them." To be clear, I think the post may be useful to Redwood (or the broader community) even if you only "point at problems", and I don't think people should withhold these write-ups unless they've solved all the problems.

But in an effort to figure out how to make these critiques more valuable moving forward, here are some thoughts:

  • If I were at Redwood, I would probably have a reaction along the lines of "OK, you pointed out a list of problems. Great. We already knew about most of these. What you're not seeing is that there are also 100 other problems that we are dealing with: lack of management experience, unclear models of what research we want to do, an ever-evolving AI progress landscape, complicated relationships we need to maintain, interpersonal problems, a bunch of random ops things, etc. This presents a tough bind: on one hand, we see some problems, and we want to fix them. On the other hand, we don't know any easy ways to fix them that don't trade-off against other extremely important priorities."
  • As an example, take the "intense work culture" point. The most intuitive reaction is "make the work culture less intense-- have people work fewer hours." But this plausibly has trade-offs with things like research output. You could make the claim that "on the margin, if Redwood employees worked 10 fewer hours per week, we expect Redwood would be more productive in the long-run because of reduced burnout and a better culture", but this is a substantially different (and more complicated) claim to make. And it's not obviously true. 
  • As another example, take the "people feel pressure to defer" point. I personally agree that this is a big problem for Redwood/Constellation/the Bay Area scene. My guess is Buck/Nate/Bill agree. It's possible that they don't think it's a huge deal relative to the other 100 things on their plate. And maybe they're wrong about that, but I think that needs to be argued for if you want them to prioritize it. Alternatively, the problem might be that they simply don't know what to do. Like, maybe they could put up a sign that says "please don't defer-- speak your mind!" Or maybe they could say "thank you" more when people disagree, or something. But I think often the problem is that people don't know what interventions would be able to fix well-known problems (again, without trading off against something else that is valuable).

I'm also guessing that there are some low-hanging fruit interventions that external red-teamers could identify. For example, here are three things that I think Redwood should do:

  1. Hire a full-time productivity coach/therapist for the Constellation offices. (I recommended this to Nate many months ago. He seemed to (correctly, imo) predict that burnout would be a big problem for Redwood employees, and he said he'd think about the therapist/coach suggestion. I believe they haven't hired one.) 
  2. Hire an external red-teamer to interview current and former employees, identify work culture issues, and identify interventions to improve things. Conditional on this person/team identifying useful (and feasible) interventions, work with leadership to actually get them implemented. (I'm not sure if they're doing this, and also maybe your group is already doing this, but the post focused on problems rather than interventions?)
  3. Have someone red-team communications around employee expectations, work-trial expectations, and expectation-setting during the onboarding process. I think I'm fine with some people opting-in to a culture that expects them to work X hours a week and has Y intensity aspects. I'm less fine with people feeling misled or people feeling unable to communicate about their needs. It seems plausible to me that many of the instances of "Person gets fired or quits and then feels negatively toward Redwood & encourages people not to work there" (which happens, btw) could be avoided/lessened through really good communication/onboarding/expectation-setting. (I have no idea what Redwood's current procedure is like, but I'd predict that a sharp red-teamer would be able to find 3+ improvements). 

These are three examples of interventions that seem valuable and (relatively) low-cost to me. I'd be excited to see if your team came up with any intervention ideas, and I'd be excited to see a "proposed intervention" section in future reports. (Though again, I don't think you should feel like you need to do this, and I think it's good to get things out there even if they're just raising awareness about problems).

Comment by Akash (akash-wasil) on My Objections to "We’re All Gonna Die with Eliezer Yudkowsky" · 2023-03-31T05:15:34.638Z · LW · GW

"if you're not willing to engage with people who give clearly genuine and high effort discussion about why they think the policy is unnecessary"

Briefly noting that the policy "I will not respond to every single high-effort criticism I receive" is very different from "I am not willing to engage with people who give high-effort criticism."

And the policy "sometimes I will ask people who write high-effort criticism to point me to their strongest argument and then I will engage with that" is also different from the two policies mentioned above.

Comment by Akash (akash-wasil) on Want to win the AGI race? Solve alignment. · 2023-03-29T23:42:48.295Z · LW · GW

I think I agree with a lot of the specific points raised here, but I notice a feeling of wariness/unease around the overall message. I had a similar reaction to Haydn's recent "If your model is going to sell, it has to be safe" piece. Let me try to unpack this:

On one hand, I do think safety is important for the commercial interests of labs. And broadly being better able to understand/control systems seems good from a commercial standpoint.

My biggest reservations can be boiled down into two points: 

  1. I don't think that commercial incentives will be enough to motivate people to solve the hardest parts of alignment. Commercial incentives will drive people to make sure their system appears to do what users want, which is very different from having systems that actually do what users want, or that robustly do what users want even as they become more powerful. Or to put it another way: near-term commercial incentives don't direct appropriate attention to things like situational awareness or deceptive alignment. I think commercial incentives will be sufficient to reduce the odds of Bingchat fiascos, but I don't think they'll motivate the kind of alignment research that's trying to handle deception, sharp left turns, or even the most ambitious types of scalable oversight work.
  2. The research that is directly incentivized by commercial interests is least likely to be neglected. I expect the most neglected research to be research that doesn't have any direct commercial benefit. I expect AGI labs will invest a substantial amount of resources to prevent future Bingchat scenarios and other instances of egregious deployment harms. The problem is that I expect many of these approaches (e.g., getting really good at RLHFing your model such that it no longer displays undesirable behaviors) will not generalize to more powerful systems. I think you (and many others) agree with this, but I think the important point here is that the economic incentives will favor RLHFy stuff over stuff that tackles problems that are not as directly commercially incentivized.

As a result, even though I agree with many of your subclaims, I'm still left thinking, "huh, the message I want to spread is not something like 'hey, in order to win the race or sell your product, you need to solve alignment.'"

But rather something more like "hey, there are some safety problems you'll need to figure out to sell/deploy your product. Cool that you're interested in that stuff. There are other safety problems-- often ones that are more speculative-- that the market is not incentivizing companies to solve. On the margin, I want more attention paid to those problems. And if we just focus on solving the problems that are required for profit/deployment, we will likely fool ourselves into thinking that our systems are safe when they merely appear to be safe, and we may underinvest in understanding/detecting/solving some of the problems that seem most concerning from an x-risk perspective."

Comment by Akash (akash-wasil) on Shutting Down the Lightcone Offices · 2023-03-15T15:36:40.493Z · LW · GW

It seems to me like one (often obscured) reason for the disagreement between Thomas and Habryka is that they are thinking about different groups of people when they define "the field."

To assess the % of "the field" that's doing meaningful work, we'd want to do something like [# of people doing meaningful work]/[total # of people in the field].

Who "counts" in the denominator? Should we count anyone who has received a grant from the LTFF with the word "AI safety" in it? Only the ones who have contributed object-level work? Only the ones who have contributed object-level work that passes some bar? Should we count the Anthropic capabilities folks? Just the EAs who are working there?

My guess is that Thomas was using a more narrowly defined denominator (e.g., not counting most people who got LTFF grants and went off to do PhDs without contributing object-level alignment stuff; not counting most Anthropic capabilities researchers who have never-or-minimally engaged with the AIS community), whereas Habryka was using a more broadly defined denominator.

I'm not certain about this, and even if it's true, I don't think it explains the entire effect size. But I wouldn't be surprised if roughly 10-30% of the difference between Thomas and Habryka might come from unstated assumptions about who "counts" in the denominator. 

(My guess is that this also explains "vibe-level" differences to some extent. I think some people who look out into the community and think "yeah, I think people here are pretty reasonable and actually trying to solve the problem and I'm impressed by some of their work" are often defining "the community" more narrowly than people who look out into the community and think "ugh, the community has so much low-quality work and has a bunch of people who are here to gain influence rather than actually try to solve the problem.")

Comment by Akash (akash-wasil) on Podcast Transcript: Daniela and Dario Amodei on Anthropic · 2023-03-07T23:12:13.187Z · LW · GW

Quick note that this is from a year ago: March 4, 2022. (Might be good to put this on top of the post so people don't think it's from 2023). 

Comment by Akash (akash-wasil) on What are MIRI's big achievements in AI alignment? · 2023-03-07T22:54:21.028Z · LW · GW

I think a lot of threat models (including modern threat models) are found in, or heavily inspired by, old MIRI papers. I also think MIRI papers provide unusually clear descriptions of the alignment problem, why MIRI expects it to be hard, and why MIRI thinks intuitive ideas won't work (see e.g., Intelligence Explosion: Evidence and Import, Intelligence Explosion Microeconomics, and Corrigibility). 

Regarding more recent stuff, MIRI has been focusing less on research output and more on shaping the discourse around alignment. They are essentially "influencers" in the alignment space. Some people I know label this as "not real research", which I think is true in some sense, but I think more about "what was the impact of this" than "does it fit into the definition of a particular term."

For specifics, List of Lethalities and Death with Dignity have had a pretty strong effect on discourse in the alignment community (whether or not this is "good" depends on the degree to which you think MIRI is correct and the degree to which you think the discourse has shifted in a good vs. bad direction). On how various plans miss the hard bits of the alignment challenge remains one of the best overviews/critiques of the field of alignment, and the sharp left turn post is a recent piece that is often cited to describe a particularly concerning (albeit difficult to understand) threat model. Six dimensions of operational adequacy is currently one of the best (and only) posts that tries to envision a responsible AI lab. 

Some people have found the 2021 MIRI Dialogues to be extremely helpful at understanding the alignment problem, understanding threat models, and understanding disagreements in the field. 

I believe MIRI occasionally advises people at other organizations (like Redwood, Conjecture, Open Phil) on various decisions. It's unclear to me how impactful their advice is, but it wouldn't surprise me if one or more orgs had changed their mind about meaningful decisions (e.g., grantmaking priorities or research directions) partially as a result of MIRI's advice. 

There's also MIRI's research, though I think this gets less attention at the moment because MIRI isn't particularly excited about it. But my guess is that if someone made a list of all the alignment teams, MIRI would currently have 1-2 teams in the top 20. 

Comment by Akash (akash-wasil) on Comments on OpenAI's "Planning for AGI and beyond" · 2023-03-05T01:10:45.464Z · LW · GW

With my comments, I was hoping to spark more of a back-and-forth. Having failed at that, I'm guessing part of the problem is that I didn't phrase my disagreements bluntly or strongly enough, while also noting various points of agreement, which might have overall made it sound like I had only minor disagreements.

Did you ask for more back-and-forth, or were you hoping Sam would engage in more back-and-forth without being explicitly prompted?

If it's the latter, I think the "maybe I made it seem like I only had minor disagreements" hypothesis is less likely than the "maybe Sam didn't even realize that I wanted to have more of a back-and-forth" hypothesis. 

I also suggest asking more questions when you're looking for back-and-forth. To me, a lot of your comments didn't seem to be inviting much back-and-forth, but adding questions would've changed this (even simple things like "what do you think?" or "Can you tell me more about why you believe X?")

Comment by Akash (akash-wasil) on Would more model evals teams be good? · 2023-03-03T05:55:32.575Z · LW · GW

Does this drive a "race to the bottom," where more lenient evals teams get larger market share

I appreciate you asking this, and I find this failure mode plausible. It reminds me of one of the failure modes I listed here (where a group proposing strict evals gets outcompeted by a group proposing looser evals).

Governance failure: We are outcompeted by a group that develops (much less demanding) evals/standards (~10%). Several different groups develop safety standards for AI labs. One group has expertise in AI privacy and data monitoring, another has expertise in ML fairness and bias, and a third is a consulting company that has advised safety standards in a variety of high-stakes contexts (e.g., biosecurity, nuclear energy). 

Each group proposes their own set of standards. Some decision-makers at top labs are enthusiastic about The Unified Evals Agreement, but others are skeptical. In addition to object-level debates, there are debates about which experts should be trusted. Ultimately, lab decision-makers end up placing more weight on teams with experience implementing safety standards in other sectors, and they go with the standards proposed by the consulting group. Although these standards are not mutually exclusive with The Unified Evals Agreement, lab decision-makers are less motivated to adopt new standards (“we just implemented evals-- we have other priorities right now.”). The Unified Evals Agreement is crowded out by standards that have much less emphasis on long-term catastrophic risks from AI systems. 

Nonetheless, the "vibe" I get is that people seem quite confident that this won't happen. Perhaps because labs care a lot about x-risk and want to have high-quality evals, perhaps because lots of the people working on evals have good relationships with labs, and perhaps because they expect there aren't many groups working on evals/standards (except for the xrisk-motivated folks).

Comment by Akash (akash-wasil) on Aspiring AI safety researchers should ~argmax over AGI timelines · 2023-03-03T05:46:09.401Z · LW · GW

However, many of these people might not have a sufficient “toolbox” or research experience to have much marginal impact in short timelines worlds.

I think this is true for some people, but I also think people tend to overestimate the number of years of research experience it takes to be able to contribute.

I think a few people have been able to make useful contributions within their first year (though in fairness they generally had backgrounds in ML or AI, so they weren't starting completely from scratch), and several highly respected senior researchers have just a few years of research experience. (And they, on average, had less access to mentorship/infrastructure than today's folks). 

I also think people often overestimate the amount of time it takes to become an expert in a specific area relevant to AI risk (like subtopics in compute governance, information security, etc.)

Finally, I think people should try to model community growth & neglectedness of AI risk in their estimates. Many people have gotten interested in AI safety in the last 1-3 years. I expect that many more will get interested in AI safety in the upcoming years. Being one researcher in a field of 300 seems more useful than being one researcher in a field of 1500. 

With all that in mind, I really like this exercise, and I expect that I'll encourage people to do this in the future:

  1. Write out your credences for AGI being realized in 2027, 2032, and 2042;
  2. Write out your plans if you had 100% credence in each of 2027, 2032, and 2042;
  3. Write out your marginal impact in lowering P(doom) via each of those three plans;
  4. Work towards the plan that is the argmax of your marginal impact, weighted by your credence in the respective AGI timelines.
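For concreteness, the weighting in step 4 can be sketched in a few lines of Python. All credences and impact numbers below are hypothetical placeholders (not recommendations), and step 2's qualitative plans are elided:

```python
# A minimal sketch of the four-step exercise above.
# All numbers are hypothetical placeholders.

# Step 1: credences for AGI being realized by each year.
credences = {2027: 0.2, 2032: 0.5, 2042: 0.3}

# Step 3: your estimated marginal reduction in P(doom) from the plan
# tailored to each timeline, conditional on that timeline being right.
marginal_impact = {2027: 0.003, 2032: 0.004, 2042: 0.005}

# Step 4: weight each plan's impact by your credence in its timeline,
# then work toward the argmax.
weighted = {year: credences[year] * marginal_impact[year] for year in credences}
best_plan = max(weighted, key=weighted.get)

print(best_plan)  # with these placeholder numbers, the 2032 plan wins
```

A fuller version would evaluate each plan's marginal impact across all three timeline-worlds (a 3x3 matrix) rather than only in the world it was tailored to, since a short-timelines plan may still have some value if timelines turn out long.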

Comment by Akash (akash-wasil) on Fighting without hope · 2023-03-01T19:50:40.869Z · LW · GW

I appreciate the comment and think I agree with most of it. Was there anything in the post that seemed to disagree with this reasoning?

Comment by Akash (akash-wasil) on A case for capabilities work on AI as net positive · 2023-03-01T04:44:40.379Z · LW · GW

I downvoted the post because I don't think it presents strong epistemics. Some specific critiques:

  • The author doesn't explain the reasoning that produced the updates. (They link to posts, but I don't think it's epistemically sound to simply say "I made updates, and you can find the reasons why in these posts." At best, people read the posts and then come away thinking "huh, I wonder which of these specific claims/arguments were persuasive to the poster.")
  • The author recommends policy changes (to LW and the field of alignment) that (in my opinion) don't seem to follow from the claims presented. (The claim "LW and the alignment community should shift their focuses" does not follow from "there is a 50-70% chance of alignment by default". See comment for more).
  • The author doesn't explain their initial threat model, why it was dominated by deception, and why they're unconvinced by other models of risk & other threat models.

I do applaud the author for sharing the update and expressing an unpopular view. I also feel some pressure to not downvote it because I don't want to be "downvoting something just because I disagree with it", but I think in this case it really is the post itself. (I didn't downvote the linked post, for example).

Comment by Akash (akash-wasil) on A case for capabilities work on AI as net positive · 2023-03-01T04:32:18.836Z · LW · GW

In other words, I now believe a significant probability, on the order of 50-70%, that alignment is solved by default.

Let's suppose that you are entirely right about deceptive alignment being unlikely. (So we'll set aside things like "what specific arguments caused you to update?" and tricky questions about modest epistemology/outside views).

I don't see how "alignment is solved by default with 50-70% probability" justifies claims like "capabilities progress is net positive" or "AI alignment should change purpose to something else."

If a doctor told me I had a disease that had a 50-70% chance to resolve on its own, otherwise it would kill me, I wouldn't go "oh okay, I should stop trying to fight the disease."

The stakes are also not symmetrical. Getting (aligned) AGI 1 year sooner is great, but it only leads to one extra year of flourishing. Getting unaligned AGI leads to a significant loss over the entire far-future. 

So even if we have a 50-70% chance of alignment by default, I don't see how your central conclusions follow.

Comment by Akash (akash-wasil) on Sam Altman: "Planning for AGI and beyond" · 2023-02-24T21:22:30.919Z · LW · GW

I don't agree with everything in the post, but I do commend Sam for writing it. I think it's a rather clear and transparent post that summarizes some important aspects of his worldview, and I expect posts like this to be extremely useful for discourse about AI safety.

Here are four parts I found especially clear & useful to know:

Thoughts on safety standards

We think it’s important that efforts like ours submit to independent audits before releasing new systems; we will talk about this in more detail later this year. At some point, it may be important to get independent review before starting to train future systems, and for the most advanced efforts to agree to limit the rate of growth of compute used for creating new models. We think public standards about when an AGI effort should stop a training run, decide a model is safe to release, or pull a model from production use are important. Finally, we think it’s important that major world governments have insight about training runs above a certain scale.

Thoughts on openness

We now believe we were wrong in our original thinking about openness, and have pivoted from thinking we should release everything (though we open source some things, and expect to open source more exciting things in the future!) to thinking that we should figure out how to safely share access to and benefits of the systems. We still believe the benefits of society understanding what is happening are huge and that enabling such understanding is the best way to make sure that what gets built is what society collectively wants (obviously there’s a lot of nuance and conflict here)

Connection between capabilities and safety

Importantly, we think we often have to make progress on AI safety and capabilities together. It’s a false dichotomy to talk about them separately; they are correlated in many ways. Our best safety work has come from working with our most capable models. That said, it’s important that the ratio of safety progress to capability progress increases.

Thoughts on timelines & takeoff speeds

AGI could happen soon or far in the future; the takeoff speed from the initial AGI to more powerful successor systems could be slow or fast. Many of us think the safest quadrant in this two-by-two matrix is short timelines and slow takeoff speeds; shorter timelines seem more amenable to coordination and more likely to lead to a slower takeoff due to less of a compute overhang, and a slower takeoff gives us more time to figure out empirically how to solve the safety problem and how to adapt.

It’s possible that AGI capable enough to accelerate its own progress could cause major changes to happen surprisingly quickly (and even if the transition starts slowly, we expect it to happen pretty quickly in the final stages). We think a slower takeoff is easier to make safe, and coordination among AGI efforts to slow down at critical junctures will likely be important (even in a world where we don’t need to do this to solve technical alignment problems, slowing down may be important to give society enough time to adapt).

Comment by Akash (akash-wasil) on AGI Safety FAQ / all-dumb-questions-allowed thread · 2023-02-20T08:08:01.284Z · LW · GW

You might be interested in this paper and this LessWrong tag.

Comment by Akash (akash-wasil) on Recommendation: Bug Bounties and Responsible Disclosure for Advanced ML Systems · 2023-02-17T21:09:23.019Z · LW · GW

Nevertheless, my guess is that it's more dignified for us to have these sorts of reporting systems than to not have them

Can you elaborate on this one? (I don't have a strong opinion one way or the other; seems unclear to me. If this system had been in place before Bing, and it had properly fixed all the issues with Bing, it seems plausible to me that this would've been net negative for x-risk reduction. The media coverage on Bing seems good for getting people to be more concerned about alignment and AI safety, reducing trust in a "we'll just figure it out as we go" mentality, increasing security mindset, and providing a wider platform for alignment folks.)

for cybersecurity experts to be involved in the creation of the systems surrounding AI than for them to not be involved

This seems good to me, all else equal (and might outweigh the point above). 

for people interested in the large-scale problems to be contributing (in dignity-increasing ways) to companies which are likely to be involved in causing those large-scale problems than not

This also seems good to me, though I agree that the case isn't clear. It also likely depends a lot on the individual and their counterfactual (e.g., some people might have strong comparative advantages in independent research or certain kinds of coordination/governance roles that require being outside of a lab).

Comment by Akash (akash-wasil) on My understanding of Anthropic strategy · 2023-02-16T20:50:32.741Z · LW · GW

It does! I think I'd make it more explicit, though, that the post focuses on the views/opinions of people at Anthropic. Maybe something like this (new text in bold):

This post is the first half of a series about my attempts understand Anthropic’s current strategy, and lay out the facts to consider in terms of whether Anthropic’s work is likely to be net positive and whether, as a given individual, you should consider applying. (The impetus for looking into this was to answer the question of whether I should join Anthropic's ops team.) As part of my research, I read a number of Anthropic’s published papers, and spoke to people within and outside of Anthropic. 

This post focuses on opinions that I heard from people who work at Anthropic. The second post will focus on my own personal interpretation and opinion on whether Anthropic's work is net positive (which is filtered through my worldview and which I think most people at Anthropic would disagree with.)

Comment by Akash (akash-wasil) on My understanding of Anthropic strategy · 2023-02-16T20:25:49.962Z · LW · GW

+1. I think this framing is more accurate than the current first paragraph (which, in my reading of it, seems to promise a more balanced and comprehensive analysis).

Comment by Akash (akash-wasil) on Hashing out long-standing disagreements seems low-value to me · 2023-02-16T20:23:02.095Z · LW · GW

Even when these discussions don't produce agreement, do you think they're helpful for the community?

I've spoken to several people who have found the MIRI dialogues useful as they enter the field, understand threat models, understand why people disagree, etc. It seems not-crazy to me that most of the value in these dialogues comes from their effects on the community (as opposed to their effects on the participants). 

Three other thoughts/observations:

  1. I'd be interested in seeing a list of strategies/techniques you've tried and offering suggestions. (I believe that you've tried a lot of things, though).
  2. I was surprised at how low the hour estimates were, particularly for the OP people (especially Holden) and even for Paul. I suppose the opportunity cost of the people listed is high, but even so, the estimates make me think "wow, seems like not that much concentrated time has been spent trying to resolve some critical disagreements about the world's most important topic" (as opposed to like "oh wow, so much time has been spent on this").
  3. How have discussions gone with some of the promising newer folks? (People with like 1-3 years of experience). Perhaps these discussions are less fruitful than marginal time with Paul (because newer folks haven't thought about things for very long), but perhaps they're more fruitful (because they have different ways of thinking, or maybe they just have different personalities/vibes). 

Comment by Akash (akash-wasil) on My understanding of Anthropic strategy · 2023-02-16T19:19:17.887Z · LW · GW

+1. A few other questions I'm interested in:

  • Which threat models is Anthropic most/least concerned about?
  • What are Anthropic's thoughts on AGI ruin arguments?
  • Would Anthropic merge-and-assist if another safety-conscious project comes close to building AGI?
  • What kind of evidence would update Anthropic away from (or more strongly toward) their current focus on empiricism/iteration?
  • What are some specific observations that made/continue-to-make Anthropic leadership concerned about OpenAI's commitment to safety?
  • What does Anthropic think about DeepMind's commitment to safety?
  • What is Anthropic's governance/strategy theory-of-change?
  • If Anthropic gets AGI, what do they want to do with it?

I'm sympathetic to the fact that it might be costly (in terms of time and possibly other factors like reputation) to respond to some of these questions. With that in mind, I applaud DeepMind's alignment team for engaging with some of these questions, I applaud OpenAI for publicly stating their alignment plan, and I've been surprised that Anthropic has engaged the least (at least publicly, to my knowledge) about these kinds of questions.

Comment by Akash (akash-wasil) on My understanding of Anthropic strategy · 2023-02-16T19:08:28.341Z · LW · GW

Thank you for sharing this; I'd be excited to see more writeups that attempt to analyze the strategy of AI labs.

This post is my attempt to understand Anthropic’s current strategy, and lay out the facts to consider in terms of whether Anthropic’s work is likely to be net positive and whether, as a given individual, you should consider applying.

I found that this introduction raised my expectations for the post and misled me a bit. After reading the introduction, I was expecting to see more analysis of the pros and cons of Anthropic's strategy, as well as more content from people who disagree with Anthropic's strategy. 

(For example, you have a section in which you list protective factors as reported by Anthropic staff, but there is no corresponding section that features criticisms from others-- e.g., independent safety researchers, OpenAI employees, etc.)

To be clear, I don't think you should have to do any of that to publish a post like this. I just think that the expectation-setting could have been better. (I plan to recommend this post, but I won't say "here's a post that lays out the facts to consider in terms of whether Anthropic's work is likely to be net positive"; instead, I'll say "here's a post where someone lists some observations about Anthropic's strategy, and my impression is that this was informed largely by talking to Anthropic staff and Anthropic supporters. It seems to underrepresent the views of critics, but I still think it's a valuable read.")

Comment by Akash (akash-wasil) on Qualities that alignment mentors value in junior researchers · 2023-02-16T01:32:56.408Z · LW · GW

+1. I'll note though that there are some socially acceptable ways of indicating "smarter" (e.g., better reasoning, better judgment, better research taste). I was on the lookout for these kinds of statements, and I rarely found them. The closest thing that came up commonly was the "strong and concrete models of AI safety" (which could be loosely translated into "having better and smarter thoughts about alignment"). 

Comment by Akash (akash-wasil) on A (EtA: quick) note on terminology: AI Alignment != AI x-safety · 2023-02-09T02:13:19.240Z · LW · GW

I appreciate this post and your previous post. Fwiw, I think these terminology concerns/confusions are harming discourse on AI existential safety, and I expect posts like these to help people talk-past-each-other less, notice subtle distinctions, deconfuse more quickly, etc. 

(I especially like the point about how increasing intent alignment on the margin doesn't necessarily help much with increasing intent alignment in the limit. Some version of this idea has come up a few times in discussions about OpenAI's alignment plan, and the way you presented it makes the point clearer/crisper imo). 

Comment by Akash (akash-wasil) on Evaluations (of new AI Safety researchers) can be noisy · 2023-02-06T14:43:45.414Z · LW · GW

Great post. I expect to recommend it at least 10 times this year. 

Semi-related point: I often hear people get discouraged when they don't have "good ideas" or "ideas that they believe in" or "ideas that they are confident would actually reduce x-risk." (These are often people who see the technical alignment problem as Hard or Very Hard).

I'll sometimes ask "how many other research agendas do you think meet your bar for 'an idea you believe in' or 'an idea that you are confident would actually reduce x-risk'?" Often, when considering the entire field of technical alignment, their answer is <5 or <10.

While reality doesn't grade on a curve, I think it has sometimes been helpful for people to reframe "I have no good ideas" --> "I believe the problem we are facing is Hard or Very Hard. Among the hundreds of researchers who are thinking about this, I think only a few of them have met the bar that I sometimes apply to myself & my ideas."

(This is especially useful when people are using a harsher bar to evaluate themselves than when they evaluate others, which I think is common).

Comment by Akash (akash-wasil) on Reflections on Deception & Generality in Scalable Oversight (Another OpenAI Alignment Review) · 2023-01-31T23:19:53.877Z · LW · GW

Good question. I'm using the term "idea" pretty loosely here, glossing over details.

Things that would meet this vague definition of "idea":

  • The ELK problem (like going from nothing to "ah, we'll need a way of eliciting latent knowledge from AIs")
  • Identifying the ELK problem as a priority/non-priority (generating the arguments/ideas that go from "this ELK thing exists" to "ah, I think ELK is one of the most important alignment directions" or "nope, this particular problem/approach doesn't matter much")
  • An ELK proposal
  • A specific modification to an ELK proposal that makes it 5% better. 

So new ideas could include new problems/subproblems we haven't discovered, solutions/proposals, code to help us implement proposals, ideas that help us prioritize between approaches, etc. 

How are you defining "idea" (or do you have a totally different way of looking at things)?

Comment by Akash (akash-wasil) on Advice I found helpful in 2022 · 2023-01-31T23:12:53.597Z · LW · GW

I spent the first half to three-quarters of 2022 focused on AIS field-building projects. In the last few months, I've been focusing more on understanding AI risk threat models & strategy/governance research projects.

Before 2022, I was a PhD student researching scalable mental health interventions (see here).

Comment by Akash (akash-wasil) on Reflections on Deception & Generality in Scalable Oversight (Another OpenAI Alignment Review) · 2023-01-28T19:41:37.909Z · LW · GW

Thanks for sharing this. I've been looking forward to your thoughts on OpenAI's plan, and I think you presented them succinctly/clearly. I found the "evaluation vs generation" section particularly interesting/novel.

One thought: I'm currently not convinced that we would need general intelligence in order to generate new alignment ideas.

The alignment problem needs high general intelligence, because it needs new ideas for solving alignment. It won’t be enough to input all the math around the alignment problem and have the AI solve that. It's a great improvement over what we have but it will only gain us speed, not insight.

It seems plausible to me that we could get original ideas out of systems that were subhuman in general intelligence but superhuman in particular domains. 

Example 1: Superhuman in processing speed. Imagine an AI that never came up with any original thought on its own. But it has superhuman processing speed, so it comes up with ideas 10000X faster than us. It never comes up with an idea that humans wouldn't ever have been able to come up with, but it certainly unlocks a bunch of ideas that we wouldn't have discovered by 2030, or 2050, or whenever the point-of-no-return is. 

Example 2: Superhuman in "creative idea generation". Anecdotally, some people are really good at generating lots of possible ideas, but they're not intelligent enough to be good at filtering them. I could imagine a safe AI system that is subhuman at idea filtering (or "pruning") but superhuman at idea generation (or "babbling"). 

Whether or not we will actually be able to produce such systems is a much harder question. I lack strong models of "how cognition works", and I wouldn't find it crazy if people with stronger models of cognition were like "wow, this is so unrealistic-- it's just not realistic for us to find an agent with superhuman creativity unless it's also superhuman at a bunch of other things and then it cuts through you like butter."

But at least conceptually, it seems plausible to me that we could get new ideas with systems that are narrowly intelligent in specific domains without requiring them to be generally intelligent. (And in fact, this is currently my greatest hope for AI-assisted alignment schemes).

Comment by Akash (akash-wasil) on My Model Of EA Burnout · 2023-01-26T03:43:32.175Z · LW · GW

I appreciate you writing this. I found myself agreeing with much of it. The post also helped me notice some feeling of "huh, something seems missing... there's something I think this isn't capturing... but what is it?" I haven't exactly figured out where that feeling is coming from, so apologies if this comment ends up being incoherent or tangential. But you've inspired me to try to double-click on it, so here it goes :) 

Suppose I meet someone for the first time, and I begin to judge them for reasons that I don't endorse. For example, maybe something about their appearance or their choice of clothing automatically triggers some sort of (unjustified) assumptions about their character. I then notice these thoughts, and upon reflection, I realize I don't endorse them, so I let them go. I don't feel like I'm crushing a part of myself that wants to be heard. If anything, I feel the version of myself that noticed the thoughts & reframed them is a more accurate reflection of the person I am/want to be.

I think EAs sometimes struggle to tell the difference between "true values" and "biases that, if removed, would actually make you feel like you're living a life more consistent with your true values." 

There is of course the tragic case of Alice: upon learning about EA, she crushes her "true values" of creativity and beauty because she believes she's "supposed to" care about impact (and only impact or impact-adjacent things). After reframing her life around impact, she suppresses many forms of motivation and joy she once possessed, and she burns out.

But there is also the fantastic case of Bob: upon learning about EA, he finds a way of clarifying and expressing his "true values" of impact and the well-being of others. He notices many ways in which his behaviors were inconsistent with these values, the ones he values above all else. After reframing his life around impact, he feels much more in-tune with his core sense of purpose, and he feels more motivated than ever before.

My (rough draft) hypothesis is that many EAs struggle to tell the difference between their "core values" and their "biases". 

My guess is that the social and moral pressures to be like Bob are strong, meaning many EAs err in the direction of "thinking too many of their real values are biases" and trying too hard to self-modify. In some cases, this is so extreme that it produces burnout. 

...But there is real value to self-modifying when it's appropriate. Sometimes, you don't actually want to be a photographer, and you would be acting in a way that's truly more consistent with your values (and feel the associated motivational benefits) if you quit photography and spent more time fighting for a cause you believe in.

To my knowledge, no one has been able to write the Ultimate Guide To Figuring Out What You Truly Value. If such a guide existed, I think it would help EAs navigate this tension.

A final thought is that this post seems to describe one component of burnout. I'm guessing there are a lot of other (relatively standard) explanations for why some EAs burn out (e.g., not having strong friendships, not spending enough time with loved ones, attaching their self-worth too much to the opinions of a tiny subset of people, not exercising enough or spending enough time outdoors, working really hard, not having enough interpersonal intimacy, getting trapped in concerns around status, feeling alienated from their families/old friends, navigating other classic challenges of young adulthood). 

Comment by Akash (akash-wasil) on Alexander and Yudkowsky on AGI goals · 2023-01-24T23:36:39.677Z · LW · GW

Thank you for releasing this dialogue-- lots of good object-level stuff here. 

In addition, I think Scott showcased some excellent conversational moves. He seemed very good at prompting Yudkowsky well, noticing his own confusions, noticing when he needed to pause/reflect before continuing with a thread, and prioritizing between topics. 

I hope that some of these skills are learnable. I expect the general discourse around alignment would be more productive if more people tried to emulate some of Scott's magic.

Some examples that stood out to me:

Acknowledging low familiarity in an area and requesting an explanation at an appropriate level:

Can you expand on sexual recombinant hill-climbing search vs. gradient descent relative to a loss function, keeping in mind that I'm very weak on my understanding of these kinds of algorithms and you might have to explain exactly why they're different in this way?

Acknowledging when he had made progress and the natural next step would be for him to think more (later) on his own:

Okay, I'm going to have to go over all my thoughts on this and update them manually now that I've deconfused that, so I'm going to abandon this topic for now and move on. Do you want to take a break or keep going?

Sending links instead of trying to explain concepts & deciding to move to a new thread (because he wanted to be time-efficient):

How do you feel about me sending you some links later, you can look at them and decide if this is still an interesting discussion, but for now we move on?

Acknowledging the purpose and impact of a "vague question":

I don't think it was a very laser-like consequentialist question, more a vague prompt to direct you into an area where I was slightly confused, and I think it succeeded.

Kudos to Scott. I think these strategies made the discussion more efficient and focused. Also impressive that he was able to do this in a context where he had much less domain knowledge than his conversational partner. 

Comment by Akash (akash-wasil) on "Status" can be corrosive; here's how I handle it · 2023-01-24T03:41:45.119Z · LW · GW

I think this is a reasonable critique.

The particular friend I refer to is unusually good at distilling things in ways that I find actionable/motivating, which might bias me a bit.

But of course it depends on the book and the topic and the person, and it would be unwise to think that most books could be easily summarized like this.

Notably, I think that many of the things that people commonly worry about RE status are easier to summarize than books. Examples:

  • Takeaways from a conference
  • Takeaways from a meeting with High-Status Person TM
  • Takeaways from a Google Doc written by High-Status Person TM

The main exception is when information is explicitly flagged as private. Even in these cases, I think people are often still able to reveal things like "the updates they made" without actually sharing the sensitive information. Or people are allowed to share the ideas but not their sources (e.g., the Chatham House Rule).

Comment by Akash (akash-wasil) on “PR” is corrosive; “reputation” is not. · 2023-01-24T02:57:31.923Z · LW · GW

I read this post for the first time in 2022, and I came back to it at least twice. 

What I found helpful

  • The proposed solution: I actually do come back to the “honor” frame sometimes. I have little Rob Bensinger and Anna Salamon shoulder models that remind me to act with integrity and honor. And these shoulder models are especially helpful when I’m noticing (unhelpful) concerns about social status.
  • A crisp and community-endorsed statement of the problem: It was nice to be like “oh yeah, this thing I’m experiencing is that thing that Anna Salamon calls PR.” And to be honest, it was probably helpful to be like “oh yeah, this thing I’m experiencing is that thing that Anna Salamon, the legendary wise rationalist, calls PR.” Sort of ironic, I suppose. But I wouldn’t be surprised if young/new rationalists benefit a lot from seeing some high-status or high-wisdom rationalist write a post that describes a problem they experience.
    • Note that I think this also applies to many posts in Replacing Guilt & The Sequences. To have Eliezer Yudkowsky describe a problem you face not only helps you see it; it also helps you be like ah yes, that’s a real/important problem that smart/legitimate people face
  • The post “aged well.” It seems extremely relevant right now (Jan 2023), both for collectives and for individuals. The EA community is dealing with a lot of debate around PR right now. Also, more anecdotally, the Bay Area AI safety scene has quite a strange Status Hierarchy Thing going on, and I think this is a significant barrier to progress. (One might even say that “feeling afraid to speak openly due to vague social pressures” is a relatively central problem crippling the world at scale, as well as our community.)
  • The post is so short!  

What could have been improved 

  • The PR frame. “PR” seems like a term that applies to organizations but not individuals. I think Anna could have pretty easily thrown in some more synonyms/near-synonyms that help people relate more to the post. (I think “status” and “social standing” are terms that I hear the younguns using these days.)
  • Implementation details. Anna could have provided more suggestions for how to actually cultivate the “honor mindset” or otherwise deal with (unproductive) PR concerns. Sadly, humans are not automatically strategic, so I expect many people will not find the most strategic/effective ways to implement the advice in this post.
  • Stories. I think it would’ve been useful for Anna to provide examples of scenarios where she used the “honor” mindset in her own life or navigated the “PR” mindset in her own life. 

But Akash, criticizing posts is easy. Why don’t you try to write your own post that attempts to address some of the limitations you pointed out?

Comment by Akash (akash-wasil) on Transcript of Sam Altman's interview touching on AI safety · 2023-01-20T17:27:02.043Z · LW · GW

Thank you for sharing! I found these two quotes to be the most interesting (bolding added by me):

Yeah that was my earlier point, I think society should regulate what the wide bounds are, but then I think individual users should have a huge amount of liberty to decide how they want their experience to go. So I think it is like a combination of society -- you know there are a few asterisks on the free speech rules -- and society has decided free speech is not quite absolute. I think society will also decide language models are not quite absolute. But there is a lot of speech that is legal that you find distasteful, that I find distasteful, that he finds distasteful, and we all probably have somewhat different definitions of that, and I think it is very important that that is left to the responsibility of individual users and groups. Not one company. And that the government, they govern, and not dictate all of the rules.

And the bad case -- and I think this is important to say -- is like lights out for all of us. I'm more worried about an accidental misuse case in the short term where someone gets a super powerful -- it's not like the AI wakes up and decides to be evil. I think all of the traditional AI safety thinkers reveal a lot more about themselves than they mean to when they talk about what they think the AGI is going to be like. But I can see the accidental misuse case clearly and that's super bad. So I think it's like impossible to overstate the importance of AI safety and alignment work. I would like to see much much more happening. 

But I think it's more subtle than most people think. You hear a lot of people talk about AI capabilities and AI alignment as in orthogonal vectors. You're bad if you're a capabilities researcher and you're good if you're an alignment researcher. It actually sounds very reasonable, but they're almost the same thing. Deep learning is just gonna solve all of these problems and so far that's what the progress has been. And progress on capabilities is also what has let us make the systems safer and vice versa surprisingly. So I think none of the sort of sound-bite easy answers work.

Comment by Akash (akash-wasil) on AGI safety field building projects I’d like to see · 2023-01-20T15:54:54.311Z · LW · GW

Ah, thanks for the clarifications. I agree with the clarified versions :)

Quick note on getting senior researchers:

  • It seems like one of the main bottlenecks is "having really good models of alignment."
  • It seems plausible to me that investing in junior alignment researchers today means we'll increase the number of senior alignment researchers (or at least "people who are capable of mentoring new alignment researchers, starting new orgs, leading teams, etc.).
  • My vibes-level guess is that the top junior alignment researchers are ready to lead teams within about a year or two of doing alignment research on their own. EG I expect some people in this post to be ready to mentor/advise/lead teams in the upcoming year. (And some of them already are).

Comment by Akash (akash-wasil) on AGI safety field building projects I’d like to see · 2023-01-20T14:00:32.261Z · LW · GW

A few thoughts:

  1. I agree that it would be great to have more senior researchers in alignment

  2. I agree that, ideally, it would be easier for independent researchers to get funding.

  3. I don’t think it’s necessarily a bad thing that the field of AI alignment research is reasonably competitive.

  4. My impression is that there’s still a lot of funding (and a lot of interest in funding) independent alignment researchers.

  5. My impression is that it’s still considerably easier to get funding for independent alignment research than many other forms of independent non-commercial research. For example, many PhD programs have acceptance rates <10% (and many require that you apply for independent grants or that you spend many of your hours as a teaching assistant).

  6. I think the past ~2 months has been especially tough for people seeking independent funding, given that funders have been figuring out what to do in light of the FTX stuff & have been more overwhelmed than usual.

  7. I am concerned that, in the absence of independent funding, people will be more inclined to join AGI labs even if that’s not the best option for them. (To be clear, I think some AGI lab safety teams are doing reasonable work. But I expect that they will obtain increasingly more money/prestige in the upcoming years, which could harm people’s ability to impartially assess their options, especially if independent funding is difficult to acquire).

Overall, I empathize with concerns about funding, but I wish the narrative included (a) the fact that the field is competitive is not necessarily a bad thing and (b) funding is still much more available than for most other independent research fields.

Finally, I think part of the problem is that people often don’t know what they’re supposed to do in order to (honestly and transparently) present themselves to funders, or even which funders they should be applying to, or even what they’re able to ask for. If you’re in this situation, feel free to reach out! I often have conversations with people about career & funding options in AI safety (Disclaimer: I’m not a grantmaker).

Comment by Akash (akash-wasil) on OpenAI’s Alignment Plan is not S.M.A.R.T. · 2023-01-18T20:27:56.714Z · LW · GW

Thanks for writing this up. I agree with several of the subpoints you make about how the plan could be more specific, measurable, etc. 

I'm not sure where I stand on some of the more speculative (according to me) claims about OpenAI's intentions. Put differently, I see your post making two big-picture claims: 

  1. The 1-2 short blog posts about the OpenAI plan failed to meet several desired criteria. Reality doesn't grade on a curve, so even though the posts weren't intended to spell out a bunch of very specific details, we should hold the world's leading AGI company to high standards, and we should encourage them to release SMARTer and more detailed plans. (I largely agree with this)  
  2. OpenAI is alignment-washing, and their safety efforts are not focused on AI x-risk. (I'm much less certain about this claim and I don't think there's much evidence presented here to support it). 

IMO, the observation that OpenAI's plan isn't "SMART" could mean that they're alignment-washing. But it could also simply mean that they're working toward making their plans SMARTer and more specific/measurable, and they wanted to share what they had so far (which I commend them for). Similarly, the fact that OpenAI is against pivotal acts could mean that they're not taking the "we need to escape the acute risk period" goal seriously, or it could mean that they reject one particular way of escaping the acute risk period and are trying to find alternatives. 

I also think I have some sort of prior that goes something like "you should have a high bar for confidently claiming that someone isn't pursuing the same goal as you, just because their particular approach to achieving that goal isn't yet specific/solid."

I'm also confused though, because you probably have a bunch of other pieces of evidence going into your model of OpenAI, and I don't believe that everyone should have to write up a list of 50 reasons in order to criticize the intentions of a lab.

All things considered, I think I land somewhere like "I think it's probably worth acknowledging more clearly that the accusations about alignment-washing are speculative, and the evidence in the post could be consistent with an OpenAI that really is trying hard to solve the alignment problem. Or acknowledge that you have other reasons for believing the alignment-washing claims that you've decided not to go into in the post."

Comment by Akash (akash-wasil) on How To Get Into Independent Research On Alignment/Agency · 2023-01-16T07:49:51.031Z · LW · GW

Reviewing this quickly because it doesn't have a review.

I've linked this post to several people in the last year. I think it's valuable for people (especially junior researchers or researchers outside of major AIS hubs) to be able to have a "practical sense" of what doing independent alignment research can be like, how the LTFF grant application process works, and some of the tradeoffs of doing this kind of work. 

This seems especially important for independent conceptual work, since this is the path that is least well-paved (relative to empirical work, which is generally more straightforward to learn, or working at an organization, where one has colleagues and managers to work with). 

I also appreciate John's emphasis of focusing on core problems & his advice to new researchers:

Probably the most common mistake people make when first attempting to enter the alignment/agency research field is to not have any model at all of the main bottlenecks to alignment, or how their work will address those bottlenecks. The standard (and strongly recommended) exercise to alleviate that problem is to start from the Hamming Questions:

  • What are the most important problems in your field (i.e. alignment/agency)?
  • How are you going to solve them?

I expect I'll continue to send this to people interested in independent alignment work & it'll continue to help people go from "what the heck does it mean to get a grant to do conceptual AIS work?" to "oh, gotcha... I can kinda see what that might look like, at least in this one case... but seeing even just one case of this makes the idea feel much more real."

Comment by Akash (akash-wasil) on ARC's first technical report: Eliciting Latent Knowledge · 2023-01-16T07:38:22.035Z · LW · GW

ELK was one of my first exposures to AI safety. I participated in the ELK contest shortly after moving to Berkeley to learn more about longtermism and AI safety. My review focuses on ELK’s impact on me, as well as my impressions of how ELK affected the Berkeley AIS community.

Things about ELK that I benefited from

Understanding ARC’s research methodology & the builder-breaker format. For me, most of the value of ELK came from seeing ELK’s builder-breaker research methodology in action. Much of the report focuses on presenting training strategies and counterexamples to those strategies. This style of thinking is straightforward and elegant, and I think the examples in the report helped me (and others) understand ARC’s general style of thinking.

Understanding the alignment problem. ELK presents alignment problems in a very “show, don’t tell” fashion. While many of the problems introduced in ELK have been written about elsewhere, ELK forces you to think through the reasons why your training strategy might produce a dishonest agent (the human simulator) as opposed to an honest agent (the direct translator). The interactive format helped me more deeply understand some of the ways in which alignment is difficult. 

Common language & a shared culture. ELK gave people a concrete problem to work on. A whole subculture emerged around ELK, with many junior alignment researchers using it as their first opportunity to test their fit for theoretical alignment research. There were weekend retreats focused on ELK. It was one of the main topics that people were discussing from Jan-Feb 2022. People shared their training strategy ideas over lunch and dinner. It’s difficult to know for sure what kind of effect this had on the community as a whole. But at least for me, my current best-guess is that this shared culture helped me understand alignment, increased the amount of time I spent thinking/talking about alignment, and helped me connect with peers/collaborators who were thinking about alignment. (I’m sympathetic, however, to arguments that ELK may have reduced the amount of independent/uncorrelated thinking around alignment & may have produced several misunderstandings, some of which I’ll point at in the next section). 

Ways I think ELK could be improved

Disclaimer: I think each of these improvements would have been (and still is) time-consuming, and I don’t think it’s crazy for ARC to say “yes, we could do this, but it isn’t worth the time-cost.” 

More context. ELK felt like a paper without an introduction or a discussion section. I think it could've benefitted from more context on why it's important, how it relates to previous work, how it fits into a broader alignment proposal, and what kinds of assumptions it makes.

  • Many people were confused about how ELK fits into a broader alignment plan, which assumptions ELK makes, and what would happen if ARC solved ELK. Here are some examples of questions that I heard people asking:
    • Is ELK the whole alignment problem? If we solve ELK, what else do we need to solve?
    • How did we get the predictor in the first place? Does ELK rely on our ability to build a superintelligent oracle that hasn’t already overpowered humanity? 
    • Are we assuming that the reporter doesn’t need to be superintelligent? If it does need to be superintelligent (in order to interpret a superintelligent predictor), does that mean we have to solve a bunch of extra alignment problems in order to make sure the reporter doesn’t overpower humanity? 
    • Does ELK actually tackle the “core parts” of the alignment problem? (This was discussed in this post (released 7 months after the ELK report), and this post (released 9 months after ELK) by Nate Soares. I think the discourse would have been faster, of higher-quality, and invited people other than Nate if ARC had made some of its positions clearer in the original report). 
  • One could argue that it’s not ARC’s job to explain any of this. However, my impression is that ELK had a major influence on how a new cohort of researchers oriented toward the alignment problem. This is partially because of the ELK contest, partially because ELK was released around the same time as several community-building efforts had ramped up, and partially because there weren't (and still aren’t) many concrete research problems to work on in alignment research.
  • With this in mind, I think the ELK report could have done a better job communicating the “big-picture” for readers. 

More justification for focusing on worst-case scenarios. The ELK report focuses on solving ELK in the worst case. If we can think of a single counterexample to a proposal, the proposal breaks. This seems strange to me. It feels much more natural to think about ELK proposals probabilistically, ranking proposals based on how likely they are to reduce the chance of misalignment. In other words, I broadly see the aim of alignment researchers as “come up with proposals that reduce the chance of AI x-risk as much as possible” as opposed to “come up with proposals that would definitely work.” 

While there are a few justifications for this in the ELK report, I didn’t find them compelling, and I would’ve appreciated more discussion of what an alternative approach would look like. For example, I would’ve found it valuable for the authors to (a) discuss their justification for focusing on the worst case in more detail, (b) discuss what it might look like for people to think about ELK in “medium-difficulty scenarios”, (c) clarify whether ARC thinks about ELK probabilistically (e.g., X solution seems to improve our chance of getting the direct translator by ~2%), and (d) identify factors that might push them away from working on worst-case ELK (e.g., if ARC believed AGI was arriving in 2 years and they still didn’t have a solution to worst-case ELK, what would they do?).

Clearer writing. One of the most common complaints about ELK is that it’s long and dense. This is understandable; ELK conveys a lot of complicated ideas from a pre-paradigmatic field, and in doing so it introduces several novel vocabulary words and frames. Nonetheless, I would feel more excited about a version of ELK that was able to communicate concepts more clearly and succinctly. Some specific ideas include offering more real-world examples to illustrate concepts, defining terms/frames more frequently, including a glossary, and providing more labels/captions for figures. 

Short anecdote 

I’ll wrap up my review with a short anecdote. When I first began working on ELK (in Jan 2022), I reached out to Tamera (a friend from Penn EA) and asked her to come to Berkeley so we could work on ELK together. She came, started engaging with the AIS community, and ended up moving to Berkeley to skill-up in technical AIS. She’s now a research resident at Anthropic who has been working on externalized reasoning oversight. It’s unclear if or when Tamera would’ve had the opportunity to come to Berkeley, but my best-guess is that this was a major speed-up for Tamera. I’m not sure how many other cases there were of people getting involved or sped-up by ELK. But I think it’s a useful reminder that some of the impact of ELK (whether positive or negative) will be difficult to evaluate, especially given the number of people who engaged with ELK (I’d guess at least 100+, and quite plausibly 500+).