AISC 2024 - Project Summaries

post by NickyP (Nicky) · 2023-11-27T22:32:23.555Z · LW · GW · 3 comments

Contents

  List of AISC Projects
  To not build uncontrollable AI
    1. Towards realistic ODDs for foundation model based AI offerings
    2. Luddite Pro: information for the refined luddite
    3. Lawyers (and coders) for restricting AI data laundering
    4. Assessing the potential of congressional messaging campaigns for AI
  Mechanistic-Interpretability
    5.  Modelling trajectories of language models
    6. Towards ambitious mechanistic interpretability
    7. Exploring toy models of agents
    8. High-level mechanistic interpretability and activation engineering library
    9. Out-of-context learning interpretability
    10. Understanding search and goal representations in transformers
  Evaluating and Steering Models
    11. Benchmarks for stable reflectivity
    12. SADDER: situational awareness datasets for detecting extreme risks
    13. TinyEvals: how language models speak coherent English?
    14. Evaluating alignment evaluations
    15. Pipelines for evaluating and steering LLMs towards faithful reasoning
    16. Steering of LLMs through addition of activation vectors with latent ethical valence
  Agent Foundations
    17. High actuation spaces
    18. Does sufficient optimization imply agent structure?
    19. Discovering agents in raw bytestreams
    20. The science algorithm
  Miscellaneous Alignment Methods
    21. SatisfIA – AI that satisfies without overdoing it
    22. How promising is automating alignment research? (literature review)
    23. Personalized fine-tuning token for AI value alignment
    24. Self-other overlap @AE Studio
    25. Asymmetric control in LLMs: model editing and steering that resists control for unalignment
    26. Tackling key challenges in Debate
  Other
    27. AI-driven economic safety nets: restricting the macroeconomic disruptions of AGI deployment
    28. Policy-based access to powerful models
    29. Organise the next Virtual AI Safety Unconference
  Apply Now
None
3 comments

Apply to AI Safety Camp 2024 by 1st December 2023. All mistakes here are my own.

Below are some summaries for each project proposal, listed in order of how they appear on the website. These are edited by me, and most have not yet been reviewed by the project leads. I think having a list like this makes it easier for people to navigate all the different projects, and the original post [LW · GW]/website did not have one, so I made this.

If a project catches your interest, click on the title to read more about it.  

Note that the summarisation here is lossy. The desired skills as here may be misrepresented, and if you are interested, you should check the original project for more details. In particular, many of the "desired skills" are often listed such that having only a few would be helpful, but this isn't consistent.

 

List of AISC Projects

To not build uncontrollable AI

1. Towards realistic ODDs for foundation model based AI offerings

Project Lead: Igor Krawczuk

Goal: Current methods for alignment applied to language models is akin to "blacklisting" behaviours that are bad. Operational Design Domain (OOD) is instead, akin to more exact "whitelisting" design principles, and now allowing deviations from this. The project wants to build a proof of concept, and show that this is hopefully feasible, economical and effective.

Team (Looking for 4-6 people):

 

2. Luddite Pro: information for the refined luddite

Project Lead: Brian Penny

Goal: Develop a news website filled with stories, information, and resources related to the development of artificial intelligence in society. Cover specific stories related to the industry and of widespread interest (e.g: Adobe’s Firefly payouts, start of the Midjourney, proliferation of undress and deepfake apps). Provide valuable resources (e.g: list of experts on AI, book lists, and pre-made letters/comments to USCO and Congress). The goal is to spread via social media and rank in search engines while sparking group actions to ensure a narrative of ethical and safe AI is prominent in everybody’s eyes.

Desired Skills (any of the below):

 

3. Lawyers (and coders) for restricting AI data laundering

Project Lead: Remmelt Ellen

Goal: Generative AI relies on laundering large amounts of data. Legal injunctions on companies laundering copyrighted data puts their training and deployment of large models on pause. The Creative Rights Coalition is an underground coalition of artists, writers, coders, and ML researchers. We need lawyers. Lawyers who are passionate about protecting society from (current and future) harms.

Team (looking for up to 5 people):

 

4. Assessing the potential of congressional messaging campaigns for AI

Project Lead: Tristan Williams

Goal: Figure out if congressional messaging campaigns (CMCs) work, and if they do, what messages of AI concern to promote, and how to promote them in a high-quality manner. Research general CMC effectiveness and write a report. If all goes well, extend the research to develop a best strategy for deploying a CMC for AIS. Time permitting, take the findings and deploy that best strategy, attempting to help fill the void with actionable steps on AI risk for those less involved.

Desired Skills (looking for 2-5 people):

 

Mechanistic-Interpretability

5.  Modelling trajectories of language models

Research Lead: Nicky Pochinkov (me!)

Goal: Rather than asking “What next token will the Language Model Predict?” or “What next action will an RL agent take?”, I think it is important to be able to model the longer-term behaviour of models, rather than just the immediate next token or action. I think there likely exist parameter- and compute-efficient ways to summarise what kinds of longer-term trajectories/outputs a model might output given an input and its activations.

Team (looking for 2-4 people):

 

6. Towards ambitious mechanistic interpretability

Project Lead: Alice Rigg

Goal: Transformers are capable of a huge variety of tasks, and for the most part we know very little about how. Mechanistic interpretability has been posed as an AI safety agenda addressing this, through a bottom-up approach. We start with low-level components and build up to an understanding of how the most capable systems are functioning internally. But for mechanistic interpretability to be plausible as an AI safety agenda, it needs to succeed ambitiously. This project aims to: 1) Push the Pareto frontier on quality vs realism of explanations. 2) Better automated interpretability and scale feature explanations. 3) Improve the metrics for measuring the quality of explanations

Desired Skills (looking for up to 4 people):

 

7. Exploring toy models of agents

Project Leads: Paul Colognese, Arun Jose

Goal: To help develop a theory of objectives that may lead to objective detection methods in the future that can help solve the inner alignment problem. This will involve: 1) Constructing a collection of toy models of agents. 2) developing probing-based infrastructure to explore objectives/target information in these models. 3) Using this infrastructure to perform empirical analysis. 4) Summarising and writing up any interesting findings. 

This project will probably look like extending this work: Understanding and controlling a maze-solving policy network [LW · GW] to new models and environments.

Desired Skills (looking for up to 3 people):

 

8. High-level mechanistic interpretability and activation engineering library

Project Lead: Jamie Coombes

Goal: A lack of unified software tooling and standardised interfaces results in duplicated effort as researchers build one-off implementations of various mech-interp methods. Existing libraries cover a range of explainable AI methods for shallow learning models. But contemporary research on large neural networks calls for new tooling. This project seeks to build a well-architected library specifically for current techniques in mechanistic interpretability and activation engineering.

Desired Skills (looking for up to 5 people):

 

9. Out-of-context learning interpretability

Project Lead: Víctor Levoso Fernández

Goal:A few months ago a paper titled Out-of-context Meta-learning in Large Language Models was published, talking about a phenomenon called out-of-context meta-learning. More recently, there have been other papers on related topics like Taken out of context: On measuring situational awareness in LLMs or about failures of models to generalise this way like the reversal curse paper. All of these papers have in common that the models learn to apply facts it learned during training in another context. The aim of this project is to use mechanistic interpretability research on toy tasks to understand in terms of circuits and training dynamics how this kind of learning and generalisation happens in models.

Desired Skills (looking for 3-5 people):

 

10. Understanding search and goal representations in transformers

Project Lead: Michael Ivanitskiy (+ Tilman Räuker, Alex Spies. See website)

Goal: To better understand on how internal search [AF · GW] and goal representations are processed within transformer models (and whether they exist at all!). In particular, we take inspiration from existing mechanistic [LW · GW] interpretability [? · GW] agendas [LW · GW] and work with toy transformer models trained to solve mazes. Robustly solving mazes is a task may require some kind of internal search process, and gives a lot of flexibility when it comes to exploring how distributional shifts affect performance — both understanding search [AF · GW] and learning to control mesa-optimizers are important for the safety of AI systems.

Desired Skills (looking for at least 1-2 people):

 

Evaluating and Steering Models

11. Benchmarks for stable reflectivity

Project Lead: Jacques Thibodeau

Goal: Future prosaic AIs will likely shape their own development or that of successor AIs. We're trying to make sure they don't go insane. There are two main ways AIs can get better: by improving their training algorithms or by improving their training data. We consider both scenarios and tentatively believe data-based improvement is riskier than architecture-based improvement. For the Supervising AIs Improving AIs agenda, we focus on ensuring stable alignment when AIs self-train or train new AIs and study how AIs may drift through iterative training. We aim to develop methods to ensure automated science processes remain safe and controllable. This form of AI improvement focuses more on data-driven improvements than architectural or scale-driven ones.

Desired Skills (looking for 2-4 people):

 

12. SADDER: situational awareness datasets for detecting extreme risks

Project Lead: Rudolf Laine

Goal: One worrying capability AIs could develop is situational awareness. In particular, threat models like successfully deceptive AIs and autonomous replication and adaptation seem to depend on high situational awareness. The goal of SADDER is to better understand situational awareness in current LLMs by running experiments and constructing evals. It will be building on the Situational Awareness Dataset (SAD), which benchmarked LLMs’ understanding of how they can influence the world, and ability to guess which lifecycle stage a given text excerpt is likely to have come from, by running more in-depth experiments and adding more categories.

Desired Skills (looking for up to 2 people):

 

13. TinyEvals: how language models speak coherent English?

Project Lead: Jett Janiak [LW · GW]

Goal: TinyStories is a suite of Small Language Models (SLMs) trained exclusively on children's stories generated by ChatGPT. The models use simple, yet coherent English, which far surpasses what was previously observed in other models of comparable size. I hope that most of the capabilities of these models can be thoroughly understood using currently available interpretability techniques. Doing so would represent a major milestone in the development of mechanistic interpretability (mech interp). The goal of this AISC project is to publish a paper that systematically identifies and characterises the range of capabilities exhibited by the TinyStories models.

Desired Skills (looking for 2-4 people):

 

14. Evaluating alignment evaluations

Project Lead: Maxime Riche

Goal: Alignment evaluations are used to evaluate LLM behavior on a wide range of situations. They are especially used to evaluate if LLMs write harmful content, have dangerous preferences, or obey to malevolent requests. Several alignment/behavioural evaluation techniques have been published or suggested (e.g: Self-reported preferences Inference from question answering, playing games, or looking at internal states. Behaviour evaluation under steering pressure.) This project aims to review and compare existing alignment evaluations to assess their usefulness. Optionally, we want to discover better alignment evaluations or improve the existing ones.

 Desired Skills (looking for 2-4 people):

 

15. Pipelines for evaluating and steering LLMs towards faithful reasoning

Project Lead: Henning Bartsch

Goal: The research project focuses on language model alignment by developing and testing techniques for (1) evaluating model-generated reasoning and (2) steering them towards more faithful behaviour. It builds on findings and future directions from scalable oversight, model evaluations and steering techniques.

The core parts are to: 1) Benchmark closed- and open-source LLMs on faithful reasoning. 2) Build ONE pipeline to generate a dataset for fine-tuning a LLaMA model. 3) Compare the effects of fine-tuning and test-time steering on faithfulness. 4) Analyse the model behaviour and results.

Desired Skills (looking for 3-5 people with diverse skillset):

 

16. Steering of LLMs through addition of activation vectors with latent ethical valence

Project Lead: Rasmus Herlo

Goal: The idea is to identify crucial modules and activations points in LLM-architectures that are associated with positive or negative ethical valence by caching the activations during forward passes induced by specifically developed binary ethical prompts. The identified linear subspaces following serve as intervention points for direct steering through activation addition. The ultimate hope is that these adjustments immediately generate a modified LLM architecture that complies better with ethical guidelines by default without the need of adjustment modules, as used in methods like RLHF. 

Team (looking for 3-4 people):

 

Agent Foundations

17. High actuation spaces

Project Lead: Sahil

Goal: This project is an investigation into building a science of almost-but-not-actually magical regimes. Spaces where actuation is extremely cheap and fast, but not free and instantaneous. Some examples: biochemical signalling, the formation of social structures, decision theory. The hope is to be able to articulate many general and often counterintuitive facts and confusions about the insides of mind-like entities in general, including ones that exist already and apply it to fundamental problems in the caringness of an AI, like value-loading/ontological identification/corrigibility. You might call this a “deconfusion” project along the above lines.

Desired skills (looking for 2-4 people):

 People at the intersection of: 

 

18. Does sufficient optimization imply agent structure?

Project Lead: Alex Altair

Goal: There is an intuition that if a system is capable of reliably achieving a goal in a wide range of environments, then it probably has certain kinds of internal processes, like building a model of the environment from input data, generating plans, and predicting the effects of its actions on the future states of the environment. That is, it probably has some modular internal structure. To what degree can these intuitions be formally justified? Can we prove that reliable optimization implies some kind of agent-like structure [LW · GW]?  I think one could make significant progress toward clarifying the parts, or showing weaker results for some of the parts. 

Desired Skills (looking for 1-3 people):


 

19. Discovering agents in raw bytestreams

Project Lead: Paul Bricman

Goal: Being able to identify and study agents is a recurring theme in many alignment proposals, ranging from eminently theoretical [LW · GW] to directly applicable ones. Previous work paved the way for agent discovery from observations, but required an explicit decomposition of the world into variables, as well as additional scaffolding. This project consists of working towards a pipeline for detecting agency in raw byte-streams with no hints as to the nature of the agents to be detected. This could eventually enable the quantification of gradient hacking and mesa-optimization.

Team (looking for 2 people)

 

20. The science algorithm

Project Lead: Johannes C. Mayer

Goal: Modern deep learning is about having a simple program (SGD) search over a space of possible programs (the weights of a neural network) and select one that performs well according to a loss function. Even though the search program is simple, the programs it finds are neither simple nor understandable. 

My goal is to build an AI system that enables a pivotal act by figuring out the algorithms of intelligence directly. The ideal outcome is to be able to write down the entire pivotal system as a non-self-modifying program explicitly, similar to how I can write down the algorithm for quicksort.

Desired Skills (Looking for 2-3 people):

 

Miscellaneous Alignment Methods

21. SatisfIA – AI that satisfies without overdoing it

Project Lead: Jobst Heitzig

Goal: Explore novel designs for generic AI agents – AI systems that can be trained to act autonomously in a variety of environments – and their implementation in software. We will study several versions of such “non-maximizing” agent designs and corresponding learning algorithms. Rather than aiming to maximize some objective function, our agents will aim to fulfill goals that are specified via constraints called “aspirations”. For example, I might want my AI butler to prepare 100–150 ml of tea, having a temperature of 70–80°C, taking for this at most 10 minutes, spending at most $1 worth of resources, and succeeding in this with at least 95% probability.

Desired Skills (looking for 3 people):

 

22. How promising is automating alignment research? (literature review)

Project Lead: Bogdan-Ionut Cirstea

Goal: This project aims to get more grounding into how promising automating alignment research is as a strategy, with respect to both advantages and potential pitfalls, with the OpenAI superalignment plan as a potential blueprint/example. This will be achieved by reviewing, distilling and integrating relevant research from multiple areas/domains, with a particular focus on the science of deep learning and on empirical findings in deep learning and language modelling. This could expand more broadly, such as reviewing and distilling relevant literature from AI governance, multidisciplinary intersections (e.g. neuroscience), relevant prediction markets, and the automation of larger parts of AI risk mitigation research (e.g. AI governance). This could also inform how promising it might be to start more automated alignment/AI risk mitigation projects or to dedicate more resources to existing ones. 

Desired Skills (looking for 4 people):

 

23. Personalized fine-tuning token for AI value alignment

Project Lead: Eleanor ‘Nell’ Watson

Goal: We're working on a new system that makes it easier for artificial intelligence to understand what's important to you personally, while also reducing unfair or biased decisions. Our system includes easy-to-use tools that help you identify and mark different situations where the AI might be used. These tools use special techniques, like breaking down text into meaningful parts and automatically labelling them, to make it simpler to create settings that are tailored to you. By doing this, we aim to address the problem of AI not fully grasping people's unique backgrounds, preferences, and cultural differences, which can sometimes lead to biased or unsafe outcomes. 

Team (looking for 2-3 people)

 

24. Self-other overlap @AE Studio

Project Lead: Marc Carauleanu

Goal: To investigate increasing self-other overlap while not significantly altering model performance. This is because an AI has to model others as different from oneself in order to deceive or be dangerously misaligned. Thus, if the model is deceptive and outputs statements/actions that just seem correct to an outer-aligned performance metric during training, we can favour honest solution by just increasing self-other overlap without altering performance. The goal of this research project is three-fold: 1) Better define and operationalise self-other overlap in LLMs. 2) Investigate the effect of self-other overlap on adversarial and cooperative behaviour in Multi-Agent Reinforcement Learning. 3) Investigate the effect of self-other overlap on adversarial and deceptive/sycophantic behaviour in Language Modelling.

Desired Skills (see this page):

 

25. Asymmetric control in LLMs: model editing and steering that resists control for unalignment

Project Lead: Domenic Rosati

Goal: Recent efforts in concept level model steering such as Activation Addition [LW · GW] or Representation EngineeringROME and LEACE are promising approaches towards natural language generation control that is aligned with human values. However these approaches could be equally used by bad actors to unalign models and inject misinformation. This project involves developing a research direction where control interventions would be ineffective for counterfactual editing or unaligned control but remain effective for factual editing and aligned control. We call this "asymmetric control" since control can only happen in a direction towards alignment with human values not away from it.

Team (looking for 2-4 people)

 

26. Tackling key challenges in Debate

Project Lead: Paul Bricman

Goal: Debate remains [LW(p) · GW(p)] a central approach to alignment at frontier labs. In brief, it consists in having LLMs adversarially debate each other before a judge, the aggregate of which forms a deliberative system that can be used to automatically reflect on appropriate courses of action. However, the debate agenda faces a number of key challenges, mostly having to do with designing reliable means of evaluating competing parties, so as to identify the party that is closer to the truth. 

Team (looking for 3 people):

 

Other

27. AI-driven economic safety nets: restricting the macroeconomic disruptions of AGI deployment

Project Lead: Joel Naoki Ernesto 

Goal: In the face of rapid AI and AGI advancements, this project aims to investigate potential socio-economic disruptions, especially within labor markets and income distribution. The focus will be on conceptualizing economic safety mechanisms to counteract the adverse effects of AGI deployment, ensuring a smoother societal transition.

Team (looking for 3-6 people)

  1. Team Coordinator: Organise meetings, ensure timelines are met, facilitate communication within the team, manage documentation.
  2. AI Lead(s) (1-2x): Provide insights into AGI's capabilities, future trajectories, and potential economic impacts. Bridging the gap between AI advancements and economic analyses. Should have understanding of AGI development, experience AI modelling, and familiarity with global economic structures.
  3. Economist(s) (1-4x): Lead the economic analysis, model potential scenarios of AGI deployment, and contribute to the policy framework design. Should have understanding of macroeconomics, experience in policy formulation, and an understanding of AGI's potential economic ramifications.

 

28. Policy-based access to powerful models

Project Lead: Pratyush Ranjan Tiwari

Goal: As machine learning models get more powerful, restricting query access based on a safety policy becomes more important. Given a setting where a model is stored securely in a hardware-isolated environment, access to the model can be restricted based on cryptographic signatures. Policy-based signatures allow signing messages that satisfy a pre-decided policy. There are many reasons why policy enforcement should be done cryptographically, including insider threats, tamper resistance and auditability. This project leverages existing cryptographic techniques and existing discourse on AI/ML safety to come up with reasonable policies and a consequent policy-based access model to powerful models. 

Team (looking for 3 people):

 

29. Organise the next Virtual AI Safety Unconference

Project Lead: Linda Linsefors

Goal: I have a design for an online unconference, that I have run a few times. I would like to find two people to take on the task of running the next Virtual AI Safety Unconference (VAISU). Even though I have a ready format, there is room for you to improve the event design too. The goal of this project is both to produce the event, and also to pass on my organising skills to people who will hopefully use them in the future. I’m therefore looking for team members who are interested in continuing on the path of being organisers, even after this project. I’ll teach you as much as I can, but you will do all the work. The reason I’m proposing this project is because I don’t want to organise the next VAISU, I want you to do it. 

Desired Skills (looking for 2 people):

 

Apply Now

Note again that these are summaries, and the descriptions or desired may not fully reflect the author's projects or views.

If you find any of the above AI Safety Camp projects interesting, and you have some of the skills listed, then make sure to apply before 1st December 2023.

3 comments

Comments sorted by top scores.

comment by Jonathan Claybrough (lelapin) · 2023-11-28T15:18:05.844Z · LW(p) · GW(p)

Jonathan Claybrough

Actually no, I think the project lead here is jonachro@gmail.com which I guess sounds a bit like me, but isn't me ^^

Replies from: Nicky
comment by NickyP (Nicky) · 2023-11-29T17:52:20.705Z · LW(p) · GW(p)

Sorry! I have fixed this now

Replies from: lelapin
comment by Jonathan Claybrough (lelapin) · 2023-11-30T14:12:16.523Z · LW(p) · GW(p)

Thanks, and thank you for this post in the first place!