ARENA 4.0 Impact Report

post by Chloe Li (chloe-li-1), JamesH (AtlasOfCharts), James Fox · 2024-11-27T20:51:54.844Z · LW · GW · 2 comments

Contents

  Summary
  Programme Information
    ARENA 4.0 Programme
    Main Changes
  Method
  Criteria 1: Sourcing high-quality participants[1]
    Selection process
    Who we selected
    Improvements
  Criteria 2: Upskilling
    Week 0: Fundamentals
    Week 1: Mechanistic Interpretability
    Week 2: LLM Evaluations
    Week 3: Reinforcement Learning
    Overall Learning Experience
  Criteria 3: Integration
        Connections to/feeling like a part of the AI safety community
        Access to top researchers
        Meeting like-minded talented people
        Motivation to learn and get unstuck
        Immediate access to TAs
  Criteria 4: Career Acceleration
  Overall Programme Experience
      Most valuable gain
        Meeting talented and like-minded people
        ML skills and knowledge
        Confidence to take on AI safety work
        Ability to quickly execute on a project
        Capstone Project
  Improvements
  Acknowledgments
2 comments

If you're interested in helping to run the ARENA program, note that we're currently hiring for an Operations Lead! For more details, and to apply, see here.

Summary

The purpose of this report is to evaluate ARENA 4.0’s impact according to our four success criteria:

  1. Source high-quality participants
  2. Upskill these talented participants in ML skills for AI safety work
  3. Integrate participants with the existing AI safety community and legitimise AI safety as a compelling field to work in
  4. Accelerate participants’ career transition into AI safety

Overall, this iteration of ARENA was successful according to our success criteria.

  1. Criteria 1: Our participants were of high calibre, with ten having more than 4 years of experience as software engineers at top companies and four having a PhD-level academic qualification or higher.
  2. Criteria 2: Our in-person programme lasts 4 to 5 weeks. The majority of participants felt they achieved the challenging concrete learning goals for each topic. Participants estimated the counterfactual time needed to learn the full ARENA content outside the programme as 10.2 weeks (even if they had access to ARENA’s material). We were particularly impressed with the capstone projects completed in the programme’s final week, two of which can be found here: capstone 1 [LW · GW], capstone 2 [LW · GW]. We expect to see even more great work in the future!
  3. Criteria 3: Participants rated the value of being in the LISA [LW · GW] environment as 8.9/10. The top cited “most valuable gain” from the programme was meeting talented and like-minded peers in the AI safety community, emphasising the importance of running the programme in person at an AI safety hub. We are also glad that many participants commented on now feeling confident to take on ML safety research projects independently.
  4. Criteria 4: Four participants had full-time AI safety roles at the end of the programme, and a further 24/33 participants planned to or were actively applying to AI safety roles at the end of the programme.

The structure of the report is as follows:

Programme Information

First, we outline when this programme occurred, what topics were covered, and the main changes made to the programme in contrast to previous iterations. For more information about our curriculum content, see our website.

ARENA 4.0 Programme

ARENA 4.0 ran from the 2nd of September until the 4th of October 2024. The schedule of the programme was as follows:

Main Changes

The main changes for ARENA 4.0 compared with ARENA 3.0 (which ran in Q1 2024) were:

Method

We surveyed our participants at the programme's start (on day 1) and at the end (on the last day). Our impact analysis is mainly based on responses at the end of the programme, although some results compare responses between the start and the end.

We collected three types of responses:

We evaluated open-ended responses using thematic analysis. We highlighted keywords in each response, identified recurring themes and patterns across responses, reviewed the themes, and then counted the frequency of each theme across participant responses. Each count comes from a different participant, but each participant can add to multiple theme counts if their response mentions them.
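The counting rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the analysis code used for this report: the participant IDs, the theme assignments, and the `count_themes` helper are all hypothetical, and only the theme labels are borrowed from later sections.

```python
from collections import Counter

# Hypothetical coded responses: participant id -> themes tagged in their answer.
# Theme labels are borrowed from this report; the assignments are invented.
coded_responses = {
    "p01": ["Meeting like-minded talented people", "Access to top researchers"],
    "p02": ["Meeting like-minded talented people", "Meeting like-minded talented people"],
    "p03": ["Immediate access to TAs"],
    "p04": [],  # answered, but no recurring theme was identified
}


def count_themes(coded):
    """Count how many distinct participants mentioned each theme."""
    counts = Counter()
    for themes in coded.values():
        # set() means a participant adds at most one count per theme,
        # while still contributing to every theme they mention.
        counts.update(set(themes))
    return counts


theme_counts = count_themes(coded_responses)
for theme, n in theme_counts.most_common():
    print(f"{n}  {theme}")
```

Deduplicating per participant keeps one long response from inflating a theme, which matches the rule that each count comes from a different participant.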

Criteria 1: Sourcing high-quality participants[1]

Overall, our selection procedure worked effectively. ARENA 4.0 had a geographically diverse cohort of high-quality participants, with 10 participants having more than four years' experience in software engineering, and four holding or pursuing a PhD.

Selection process

Initial applications for ARENA opened on the 6th of July 2024 and closed on the 20th of July. The coding test ran from the 25th of July until the 29th of July (with an extension until the 1st of August). Interviews ran from the 4th of August until the 7th of August.

Who we selected

We selected 33 participants from ~350 applications. Participants came from across the world, including the US, UK, EU, India, and South Africa. The current jobs of participants can be seen in Figure 1.

Figure 1: Current careers of participants (at the start of the programme)[2]

As shown in Figure 2, the highest academic degree for 55% of participants was a Bachelor’s degree, for 34% a Master’s degree, and for 13% a PhD.

Figure 2: Highest academic degree of participants

Improvements

The notice period was too short (2 months from announcement to the start of the programme, with only 1 month for selection) and needs to be increased. This likely caused us to lose out on strong talent. Several participants submitted late applications, which we accepted after the deadline, and several mentioned that the application timeline felt very tight. Future iterations will benefit from a more consistent schedule with longer lead times.

Criteria 2: Upskilling

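As an ML program, our core goal is to upskill participants to tackle technical problems in AI safety. The first four weeks of the ARENA in-person programme cover four technical topics: deep learning fundamentals (Week 0), mechanistic interpretability (Week 1), LLM evaluations (Week 2), and reinforcement learning (Week 3). More detail on each topic is provided in the relevant sections.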

Each topic lasts a week. We asked participants to rate these weeks according to three criteria at the end of the programme:

Overall, participants found the exercises both challenging and enjoyable. Participants rated ARENA exercises at 8.5/10 for enjoyment and 7.5/10 for difficulty. The total time participants estimated it would’ve taken them to learn the materials outside the programme (without TAs, lectures, or pair-programming) was 10.2 weeks. This suggests ARENA is achieving its upskilling goals and emphasises the importance of selection, as even our high-calibre participants found the exercises quite challenging.

Week 0: Fundamentals

The aim of this week is for participants to reinforce basic deep-learning concepts. This week had 25 participants, as it was optional for those with significant deep-learning experience. Topics covered include PyTorch, basics of neural networks, residual neural networks, CNNs, weights and biases, optimisation, and backpropagation.

Participants said, on average, that it would take 2.3 weeks to learn the deep learning fundamentals week’s materials on their own if they had access to ARENA materials, as seen in Figure 3 below.

Figure 3: Counterfactual learning time of fundamentals week

Week 1: Mechanistic Interpretability

The aim of this week is for participants to understand some of the methods that can be used to analyse model internals and replicate the results from key interpretability papers. Topics covered include the following: GPT models, training and sampling from transformers, TransformerLens, induction heads, indirect object identification, superposition, linear probes, inference-time intervention, and sparse autoencoders.

We asked participants to rate out of 7 their confidence in doing three concrete tasks in mechanistic interpretability (Figure 4); on average, participants rated their ability in these concrete outcomes at 5.64. Participants said, on average, that it would take 3.7 weeks to learn the mechanistic interpretability week’s materials on their own, as seen in Figure 5.

Figure 4: Concrete learning outcomes for mechanistic interpretability
Figure 5: Counterfactual learning time of mechanistic interpretability week

Week 2: LLM Evaluations

A new week of content on LLM evaluations was developed for this iteration of ARENA. This content aimed for participants to build alignment and dangerous capability evaluations in multiple-choice and agentic settings, and understand how to use these evaluations to gain information about current frontier LLMs. We added this week because evals have become a crucial field of AI safety, and there are no high-quality educational materials on safety evals to our knowledge. Topics covered include the following: threat modeling, using LLM APIs, implementing a pipeline to generate questions using LLMs, UK AISI’s inspect library, implementing LLM agents, and scaffolding LLM agents.

Overall, participants achieved the concrete learning outcomes to a similar extent as in the other weeks, which we see as a positive sign for the first run of this content! On average, participants rated their confidence in designing multiple-choice question evals and agent evals as 6.0/7 and said it would take 1.6 weeks to learn this content independently (see Figures 6 and 7).

Regarding the learning experience, participants gave this week an average enjoyment rating of 4.9/7 and an average usefulness rating of 5.0/7 across the days (based on responses from 12 of the 33 participants on an LLM evals feedback form). Participants particularly enjoyed the content on agentic dangerous-capabilities evaluations and the conceptual thinking on day 1 about threat-modeling and eval design.

We identified several areas of improvement to the learning experience. First, the most commonly cited issue in enjoyment ratings was code bugs in the materials (although this is sometimes unavoidable when producing new content). Second, running this content with participants helped us identify a more valuable set of learning outcomes (in particular, a reduced emphasis on manual prompting of LLMs and an increased emphasis on designing evals infrastructure and agentic evals). The materials would benefit from changes that put more emphasis on these learning outcomes, including adding clarifications and exercises on key concepts, restructuring the content to achieve a better flow, and adding content to challenge participants further.

Figure 6: Concrete learning outcomes for LLM evaluations week
Figure 7: Counterfactual learning time of LLM evaluations week

Week 3: Reinforcement Learning

This week's core aim is for participants to understand classical and deep RL methods and how RLHF, the dominant alignment method used today, is implemented on LLMs. Topics covered include the following: fundamentals of RL, gym & gymnasium environments, policy gradient optimisation, PPO, deep Q-learning, RLHF, HuggingFace, and fine-tuning LLMs.

We asked participants to rate out of 7 their confidence in doing four concrete tasks in RL; the results can be seen in Figure 8. Participants rated their ability in these concrete learning outcomes at 5.44 on average. Participants said, on average, that it would take 3.2 weeks to learn the RL week’s materials on their own (see Figure 9).

Figure 8: Concrete learning outcomes for RL week
Figure 9: Counterfactual learning time for RL week

Overall Learning Experience

Finally, we asked participants how they found the ARENA materials overall. This helps us calibrate participant calibre across different ARENA cohorts and elicit feedback on the quality of our teaching mechanisms. On average, participants gave ratings of 8.5 out of 10 for exercise enjoyment, 7.6 out of 10 for exercise difficulty, and 8.3 out of 10 for teaching quality.

Figure 10: Overall exercise enjoyment across all weeks
Figure 11: Overall exercise difficulty across all weeks
Figure 12: TA and lecture rating

Criteria 3: Integration

Our participants spent 4 to 5 weeks full-time in the LISA office in London. Overall, they enjoyed their time there! On average, participants gave a rating of 8.9/10 for “How valuable was being in the LISA environment” (see Figure 13).

The most cited value from participants’ open-ended feedback was feeling that they were part of the AI safety community in ways they had not been before (see Figure 14). We were particularly excited to see a few outcomes mentioned by participants: (1) that they met a group of like-minded peers who are at similar career stages and who can help/motivate each other to progress further, (2) that they met senior researchers who helped give future directions and legitimise the field. We see these as two of the most impactful values of ARENA. They underline the importance of running the program in person in an AI safety hub like LISA. We are incredibly glad to be able to make use of — and contribute to — LISA’s thriving community of AI safety researchers with the ARENA programme!

Figure 13: Participants’ rating of the value of the LISA environment
Figure 14: Analysis of participants’ comments on what was particularly valuable about LISA

Here are some representative quotes that illustrate these themes:

Connections to/feeling like a part of the AI safety community

Access to top researchers

Meeting like-minded talented people

Motivation to learn and get unstuck

Immediate access to TAs

Criteria 4: Career Acceleration

Finally, ARENA aims to accelerate participants' AI safety careers. We’re excited about the career outcomes for this cohort. Most participants (~70%) are actively pursuing AI safety careers, either in the planning or application stages (see Figure 16). Even more exciting: four participants have already secured full-time AI safety positions starting within the next four months. These results suggest that ARENA is successfully achieving one of its core goals of providing talented individuals with the skills to go directly into AI safety work. We’re especially encouraged to see participants moving directly into impactful roles so quickly after the conclusion of ARENA.

We also saw a change in participants’ confidence that AI safety is the right field for them. In response to “How confident are you that AI safety is the right field for you?”, participants gave an average rating of 7.7 out of 10 at the start of the programme and 8.2 out of 10 at the end. The average within-person change across the programme was +0.6 (see Figure 15). This suggests that ARENA increases participants’ confidence that AI safety is a good personal fit for them.
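As a rough sketch of how a per-participant change like this can be computed from paired start and end responses, consider the Python below. The ratings are invented for illustration and are not the survey data, and the `paired_change` helper is hypothetical. When the same participants answer both surveys, the mean of the individual changes equals the difference between the two means; small gaps between published figures can then come from rounding each number separately or from participants who answered only one survey (an assumption on our part, not something we checked here).

```python
from statistics import mean

# Hypothetical (start, end) confidence ratings out of 10, one pair per participant.
# These values are invented for illustration and are not the survey data.
ratings = {
    "p01": (7, 8),
    "p02": (9, 9),
    "p03": (6, 8),
    "p04": (8, 7),  # a participant whose confidence decreased
}


def paired_change(pairs):
    """Return per-participant changes plus the start, end and change means."""
    changes = {pid: end - start for pid, (start, end) in pairs.items()}
    start_mean = mean(start for start, _ in pairs.values())
    end_mean = mean(end for _, end in pairs.values())
    return changes, start_mean, end_mean, mean(changes.values())


changes, start_mean, end_mean, mean_change = paired_change(ratings)
# Participants whose confidence went down over the programme.
decreased = [pid for pid, delta in changes.items() if delta < 0]

print(f"start {start_mean:.1f}, end {end_mean:.1f}, mean change {mean_change:+.1f}")
print("decreased:", decreased)
```

Keeping the individual changes, rather than only the two averages, is what makes it possible to see how many participants moved in each direction.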

Two participants reported reduced certainty that AI safety is the right field for them. Still, it is also positive if ARENA can help participants who want to pursue a career transition test their fit for alignment engineering in a comparatively low-cost way. We expect a few participants to lower their confidence; this is consistent with ARENA’s materials being technical and challenging enough to test for fit.

Figure 15: Participants’ AI safety career confidence
Figure 16: Participants’ current AI safety career situation (end of programme)

Overall Programme Experience

We asked the participants, “How much did you enjoy the programme overall?” at the end. The average participant rating was 9.11 out of 10.

Figure 17: Overall programme enjoyment

Most valuable gain

We asked participants, “What was the most valuable thing you gained from the programme?” and thematically analysed their open-ended responses. We identified the following common themes.

Figure 18: Analysis of what participants found most valuable about ARENA

Notably, the most frequently mentioned “most valuable gain” was meeting talented and like-minded people. This reflects the value of LISA mentioned in the previous section and underlines the importance of running ARENA in person to achieve our third success criterion of integrating participants within the AI safety ecosystem. Furthermore, we are enthused that the third most commonly cited valuable gain was “confidence to take on ML work”, which was one of the principal outcomes that we aimed for under our upskilling goal. The programme is only 5 weeks long, so an impactful upskilling outcome is for participants to gain the confidence to bootstrap themselves and learn at a faster rate in the future.

Here are some representative responses for each of our core themes:

Meeting talented and like-minded people

ML skills and knowledge

Confidence to take on AI safety work

Many responses specifically mentioned a boost in confidence in undertaking AI safety work. We’ve separated this out from references to particular AI safety knowledge and experience (“ML skills and knowledge”).

Ability to quickly execute on a project

Capstone Project

Improvements

As a team, we endeavour to use feedback to improve the quality of ARENA for participants. Each iteration, we learn how to better run the programme so that its impact can grow for all of our participants. Although this programme was overall successful according to its four success criteria, we noticed some core improvements that would enable it to run even better. The key improvements we noticed in this iteration are:

Acknowledgments

This report was produced by @Chloe Li [LW · GW] (data analysis, writing), @JamesH [LW · GW] (writing), and @James Fox [LW · GW] (reviewing, editing) at ARENA. We thank @CallumMcDougall [LW · GW] for his comments on the final draft. We also thank Open Philanthropy for their generous support of the ARENA program.

  1. ^

     Note: Some details of this section have been redacted so that key aspects of how we select participants remain private. This is to avoid issues that could otherwise arise in our selection process in the future.

  2. ^

     Note: “Conducting alignment research” only includes those who are currently working full-time on alignment research (independently, as mentee/intern, or employed), not those who have in the past or are working part-time on alignment. This was not self-reported by the participants, but annotated by us based on their CV, so there may be some inaccuracies.

2 comments

Comments sorted by top scores.

comment by Jonathan Claybrough (lelapin) · 2024-11-28T21:06:44.661Z · LW(p) · GW(p)

Congratz on your successes and thank you for publishing this impact report. 

It leaves me unsatiated related to cost effectiveness though. With no idea of how much money was invested in this project to get this outcome, I don't know if Arena is cost effective compared to other training programs and counterfactual opportunities. Would you mind sharing at least something about the amount of funding this got? 

Re 

Still, it is also positive if ARENA can help participants who want to pursue a career transition test their fit for alignment engineering in a comparatively low-cost way.

it doesn't strike me that a 5 week all expenses paid program is a particularly low cost way to find out AI Safety isn't for you (as compared to for example participating in an Apart Hackathon)

Replies from: James Fox
comment by James Fox · 2024-12-03T04:41:18.222Z · LW(p) · GW(p)

Thank you for your comment.

We are confident that ARENA's in-person programme is among the most cost-effective technical AI safety training programmes: 
- ARENA is highly selective, and so all of our participants have the latent potential to contribute meaningfully to technical AI safety work
- The marginal cost per participant is relatively low compared to other AI safety programmes since we only cover travel and accommodation expenses for 4-5 weeks (we do not provide stipends)
- The outcomes set out in the above post seem pretty strong (4/33 immediate transitions to AI safety roles and 24/33 more actively pursuing them)
- There are lots of reasons why technical AI safety engineering is not the right career fit for everyone (even those with the ability). Therefore, I think that 2/33 people updating against working in AI safety after the programme is actually quite a low attrition rate. 
- Apart Hackathons have quite a different theory of change compared with ARENA. While hackathons can be valuable for some initial exposure, ARENA provides 4 weeks of comprehensive training in cutting-edge AI safety research (e.g., mechanistic interpretability, LLM evaluations, and RLHF implementation) that leads to concrete outputs through week-long capstone projects.