Apollo Research 1-year update

post by Marius Hobbhahn (marius-hobbhahn), Lee Sharkey (Lee_Sharkey), Lucius Bushnaq (Lblack), Dan Braun (Daniel Braun), Mikita Balesni (mykyta-baliesnyi), Jérémy Scheurer (JerrySch), Nicholas Goldowsky-Dill (nicholas-goldowsky-dill), StefanHex (Stefan42), jake_mendel, AlexMeinke (Paulawurm), rusheb · 2024-05-29

Contents

  About Apollo Research
  Executive Summary
  Completed work
    Evaluations
    Interpretability
    Governance
  Current & Future work
    Evaluations
    Interpretability
    Governance
  Operational Highlights
      Financials / Operations
  Challenges
  Forward Look

This is a linkpost for: www.apolloresearch.ai/blog/the-first-year-of-apollo-research

About Apollo Research

Apollo Research is an evaluation organization focusing on risks from deceptively aligned AI systems. We conduct technical research on AI model evaluations and interpretability and have a small AI governance team. As of 29 May 2024, we are one year old.

Executive Summary

For the UK AI Safety Summit, we developed a demonstration that Large Language Models (LLMs) can strategically deceive their primary users when put under pressure. The accompanying paper was referenced by experts and the press (e.g. the AI Insight Forum, BBC, Bloomberg) and accepted for oral presentation at the ICLR LLM agents workshop.

The evaluations team is currently working on capability evaluations for precursors of deceptive alignment, scheming model organisms, and a responsible scaling policy (RSP) on deceptive alignment. Our goal is to help governments and AI developers understand, assess, and address the risks of deceptively aligned AI systems.

The interpretability team published three papers: an improved training method for sparse dictionary learning, a new conceptual framework for ‘loss-landscape-based interpretability’, and an associated empirical paper. We are beginning to explore concrete white-box evaluations for deception and continue to work on fundamental interpretability research.

The governance team communicates our technical work to governments (e.g., on evaluations, AI deception and interpretability), and develops recommendations around our core research areas for international organizations and individual governments.

Apollo Research works with several organizations, including partnering with the UK AISI and being a member of the US AISI Consortium. As part of our partnership with UK AISI, we were contracted to develop deception evaluations. Additionally, we engage with various AI labs, e.g. red-teaming OpenAI’s fine-tuning API before deployment and consulting on the deceptive alignment section of an AI lab’s RSP.

Like any organization, we have also encountered various challenges. Some projects proved overly ambitious, resulting in delays and inefficiencies. We would have benefitted from establishing regular, dedicated exchanges with official external advisors earlier. Additionally, securing funding took more time and effort than expected.

We have room for more funding. Please reach out if you’re interested.

Completed work

Evaluations

We developed a demonstration that LLMs can strategically deceive their primary users when put under pressure, which was presented at the UK AI Safety Summit. It was referenced by experts and the press (e.g. Yoshua Bengio’s statement for Senator Schumer’s AI Insight Forum, BBC, Bloomberg, US Securities and Exchange Commission Chair Gary Gensler’s speech on AI and law, and many other media outlets). It was accepted for an oral presentation at this year’s ICLR LLM agents workshop.

In our role as an independent third-party evaluator, we work with a range of organizations. For example, we were contracted by the UK AISI to build deceptive capability evaluations with them. We also worked with OpenAI to red-team their fine-tuning API before deployment. 

We published multiple conceptual research pieces on evaluations, including A Causal Framework for AI Regulation and Auditing and A Theory of Change for AI Auditing. Furthermore, we published conceptual clarifications on deceptive alignment and strategic deception.

We were part of multiple collaborations, including:

To grow the field of evaluations and to increase its accessibility, we engaged in field building. For example, we wrote a starter guide on evaluations and argued for the necessity of a Science of Evals. Furthermore, we gave multiple lectures and workshops on evaluations and mentored scholars in programs like MATS.

To support our technical efforts, we developed an evals software stack that makes it easy to build and run benchmarks and LM agent evaluations efficiently at scale.

Interpretability

Following some of our members’ early work on sparse autoencoders (SAEs), we continued to make progress in this direction:

We developed a new interpretability framework and a method called the Local Interaction Basis (LIB), which we published as two papers:

While our LIB project mainly yielded negative empirical results, we think the theory is useful and may provide a basis for future interpretability methods. We also learned some important lessons during this project; for example, we prematurely tried to scale the project to large models before gaining sufficient confidence in the method on toy models.
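For readers less familiar with sparse dictionary learning, below is a minimal, illustrative sketch of training a sparse autoencoder on model activations. This is not our actual implementation; the dimensions, hyperparameters, and (random) training data are placeholders chosen only to convey the general idea of reconstructing activations under a sparsity penalty.

```python
# Minimal, hypothetical sketch of sparse dictionary learning on model activations.
# Not Apollo's codebase; all shapes and hyperparameters are illustrative.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)  # activations -> sparse codes
        self.decoder = nn.Linear(d_dict, d_model)  # sparse codes -> reconstruction

    def forward(self, activations: torch.Tensor):
        codes = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(codes)
        return reconstruction, codes


d_model, d_dict, l1_coeff = 512, 4096, 1e-3
sae = SparseAutoencoder(d_model, d_dict)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)

for _ in range(100):  # in practice: many batches of real LM activations
    activations = torch.randn(64, d_model)  # placeholder for activations from a language model
    reconstruction, codes = sae(activations)
    # Reconstruction loss plus an L1 penalty on the codes, encouraging each
    # activation to be explained by a small number of dictionary elements ("features").
    loss = ((reconstruction - activations) ** 2).mean() + l1_coeff * codes.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```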

Governance

Our governance team helps our organization communicate our technical work to decision-makers and inform policy discussions. The team is currently focused on direct government engagements and on the research and development of policy recommendations. As part of our policy research:

As part of our active engagements with governments:

Current & Future work

Evaluations

With our evals work, we aim to help governments, AI developers, and civil society better understand and assess the risks of AI deception. Additionally, we want to make it easier to prepare for and address these risks. To this end, we’re working on four concrete projects:

Interpretability

We continue to think that interpretability and white-box evaluations will be an important step in addressing risks in AI systems, especially deception. Therefore, we think the following two areas are likely to be priorities. 

Governance

Operational Highlights

Team Growth and Changes: We went from 7 to 15 FTE in ~8 months. While this was ambitious growth, we prioritized putting robust processes in place to support it. As a result, we have been able to onboard new employees quickly; they are typically ready to contribute to our core projects within one to two weeks.

Financials / Operations

Challenges

While we have seen some success, we also made a number of mistakes and faced challenges. 

Due to concerns about significant capability advancement externalities from publishing our research, we didn’t share our interpretability work for the first six months. Forgoing external feedback incurred larger costs than anticipated. We’re unsure if this was the right call given the information we had at the time. We expect that we will be more open with future projects and intend to invite more collaboration and feedback early on.

Some of our projects might have been too widely scoped. For example, we tried to develop an entirely new interpretability framework and accompanying technique from scratch in ~9 months. This made it hard to verify individual components independently, and it subsequently took longer than expected to get signal on the project as a whole.

We started our model organism project with a very ambitious goal: showing end-to-end how a model could become deceptively aligned through outcome-based training. This turned out to be a much larger project than we anticipated at the start, so we have scaled it down drastically to tackle a smaller, well-scoped subpart of the original research question. We continue to think that aiming for ambitious projects is the right approach, but in the future we will try to balance this by reacting faster and scaling down to a smaller-scoped problem when the ambitious version turns out to be too hard.

We should have had official advisors earlier. During the first 9 months, we regularly talked to subject-matter experts in an unofficial capacity, but we never had any explicit official advisors. In retrospect, this was a mistake, since more clearly defined advisorships would have ensured more detailed and consistent feedback. As of April 2024, we have three official advisors with whom we meet regularly: Tom McGrath, Owain Evans, and Daniel Kokotajlo. We are thankful for their contributions to our work and have already noticed the benefits of their advice.

Leadership spent a significant portion of their time on fundraising. This took up much more bandwidth than expected and came at the cost of executing object-level work. Our funding constraints also limited our ability to hire and retain talent: some exceptional candidates took industry roles instead because we could not make offers quickly and with certainty, nor match salary expectations.

Forward Look

Apollo’s goals for the near future are:
