AISC 2024 - Project Summaries

nicky

AISC 2024 - Project Summaries

post by NickyP (Nicky) · 2023-11-27T22:32:23.555Z · LW · GW · 3 comments

List of AISC Projects
To not build uncontrollable AI
1. Towards realistic ODDs for foundation model based AI offerings
2. Luddite Pro: information for the refined luddite
3. Lawyers (and coders) for restricting AI data laundering
4. Assessing the potential of congressional messaging campaigns for AI
Mechanistic-Interpretability
5. Modelling trajectories of language models
6. Towards ambitious mechanistic interpretability
7. Exploring toy models of agents
8. High-level mechanistic interpretability and activation engineering library
9. Out-of-context learning interpretability
10. Understanding search and goal representations in transformers
Evaluating and Steering Models
11. Benchmarks for stable reflectivity
12. SADDER: situational awareness datasets for detecting extreme risks
13. TinyEvals: how language models speak coherent English?
14. Evaluating alignment evaluations
15. Pipelines for evaluating and steering LLMs towards faithful reasoning
16. Steering of LLMs through addition of activation vectors with latent ethical valence
Agent Foundations
17. High actuation spaces
18. Does sufficient optimization imply agent structure?
19. Discovering agents in raw bytestreams
20. The science algorithm
Miscellaneous Alignment Methods
21. SatisfIA – AI that satisfies without overdoing it
22. How promising is automating alignment research? (literature review)
23. Personalized fine-tuning token for AI value alignment
24. Self-other overlap @AE Studio
25. Asymmetric control in LLMs: model editing and steering that resists control for unalignment
26. Tackling key challenges in Debate
Other
27. AI-driven economic safety nets: restricting the macroeconomic disruptions of AGI deployment
28. Policy-based access to powerful models
29. Organise the next Virtual AI Safety Unconference
Apply Now
None
3 comments

Apply to AI Safety Camp 2024 by 1st December 2023. All mistakes here are my own.

Below are some summaries for each project proposal, listed in order of how they appear on the website. These are edited by me, and most have not yet been reviewed by the project leads. I think having a list like this makes it easier for people to navigate all the different projects, and the original post [LW · GW]/website did not have one, so I made this.

If a project catches your interest, click on the title to read more about it.

Note that the summarisation here is lossy. The desired skills as here may be misrepresented, and if you are interested, you should check the original project for more details. In particular, many of the "desired skills" are often listed such that having only a few would be helpful, but this isn't consistent.

List of AISC Projects

To not build uncontrollable AI

1. Towards realistic ODDs for foundation model based AI offerings

Project Lead: Igor Krawczuk

Goal: Current methods for alignment applied to language models is akin to "blacklisting" behaviours that are bad. Operational Design Domain (OOD) is instead, akin to more exact "whitelisting" design principles, and now allowing deviations from this. The project wants to build a proof of concept, and show that this is hopefully feasible, economical and effective.

Team (Looking for 4-6 people):

"Spec Researcher": Draft the spec for guidelines, and publish a request for comments. Should have experience in safety settings
"Mining Researcher": Look for use cases, and draft the "slicing" of OOD.
"User Access Researcher": Write drafts on feasibility of KYC and user access levels.
"Lit Review Researcher(s)": Reading recent relevant literature on high-assurance methods for ML.
"Proof of Concept Researcher": build a proof of concept. Should have knowledge of OpenAI and interfacing with/architecting APIs.

2. Luddite Pro: information for the refined luddite

Project Lead: Brian Penny

Goal: Develop a news website filled with stories, information, and resources related to the development of artificial intelligence in society. Cover specific stories related to the industry and of widespread interest (e.g: Adobe’s Firefly payouts, start of the Midjourney, proliferation of undress and deepfake apps). Provide valuable resources (e.g: list of experts on AI, book lists, and pre-made letters/comments to USCO and Congress). The goal is to spread via social media and rank in search engines while sparking group actions to ensure a narrative of ethical and safe AI is prominent in everybody’s eyes.

Desired Skills (any of the below):

Art, design, and photography - Develop visual content to use as header images for every story. If you have any visual design skills, these are very necessary.
Journalism - journalistic and research backgrounds capable of interviewing subject-matter experts & writing long-form stories related to AI companies.
Technical Writing - Tutorials of technical tools like Glaze and Nightshade. Experience in technical writing & being familiar with these applications.
Wordpress/Web Development - Refine pages to be more user-friendly as well as help setting up templates for people to fill out for calls to action. Currently, the site is running a default WordPress template.
Marketing/PR - The website is filled with content, but it requires a lot of marketing and PR efforts to reach the target audience. If you have any experience working in an agency or in-house marketing/comms, we would love to hear from you.

3. Lawyers (and coders) for restricting AI data laundering

Project Lead: Remmelt Ellen

Goal: Generative AI relies on laundering large amounts of data. Legal injunctions on companies laundering copyrighted data puts their training and deployment of large models on pause. The Creative Rights Coalition is an underground coalition of artists, writers, coders, and ML researchers. We need lawyers. Lawyers who are passionate about protecting society from (current and future) harms.

Team (looking for up to 5 people):

Lawyers: File DMCA takedown requests, Pre-research for an EU Lawsuit. Should have law education (Master of Law) and basic knowledge of international copyright and/or data protection practices.
Coders: Create a tool to check if a creator's work is in an AI dataset. Should be able to improve code in these GitHub repos.

4. Assessing the potential of congressional messaging campaigns for AI

Project Lead: Tristan Williams

Goal: Figure out if congressional messaging campaigns (CMCs) work, and if they do, what messages of AI concern to promote, and how to promote them in a high-quality manner. Research general CMC effectiveness and write a report. If all goes well, extend the research to develop a best strategy for deploying a CMC for AIS. Time permitting, take the findings and deploy that best strategy, attempting to help fill the void with actionable steps on AI risk for those less involved.

Desired Skills (looking for 2-5 people):

Generalist Research Skills
Communication skills (outreach to various organisations)
Writing skills (making the report accessible)
Web design (To possibly test a messaging approach)
Policy Making (Having insight into how congressional offices work would be ideal)
CMC experience.

Mechanistic-Interpretability

5. Modelling trajectories of language models

Research Lead: Nicky Pochinkov (me!)

Goal: Rather than asking “What next token will the Language Model Predict?” or “What next action will an RL agent take?”, I think it is important to be able to model the longer-term behaviour of models, rather than just the immediate next token or action. I think there likely exist parameter- and compute-efficient ways to summarise what kinds of longer-term trajectories/outputs a model might output given an input and its activations.

Team (looking for 2-4 people):

Theorist: Conceptualising the best ways to summaries long trajectories into "chains of themes", thinking about how to measure uncertainty, trying to understand "goals". Math/Physics background would be ideal.
Software Engineer(s): Writing code to convert text generations into "chains of themes". Writing models that convert model neuron activations into predictions about chains of themes.
Distiller: Read and understand materials. Converting messy language and experiments from other people into more understandable and easy to read form.

6. Towards ambitious mechanistic interpretability

Project Lead: Alice Rigg

Goal: Transformers are capable of a huge variety of tasks, and for the most part we know very little about how. Mechanistic interpretability has been posed as an AI safety agenda addressing this, through a bottom-up approach. We start with low-level components and build up to an understanding of how the most capable systems are functioning internally. But for mechanistic interpretability to be plausible as an AI safety agenda, it needs to succeed ambitiously. This project aims to: 1) Push the Pareto frontier on quality vs realism of explanations. 2) Better automated interpretability and scale feature explanations. 3) Improve the metrics for measuring the quality of explanations

Desired Skills (looking for up to 4 people):

Software Engineering: Proficiency in PyTorch. Experience with Transformers. Familiarity with existing interpretability work preferred.

7. Exploring toy models of agents

Project Leads: Paul Colognese, Arun Jose

Goal: To help develop a theory of objectives that may lead to objective detection methods in the future that can help solve the inner alignment problem. This will involve: 1) Constructing a collection of toy models of agents. 2) developing probing-based infrastructure to explore objectives/target information in these models. 3) Using this infrastructure to perform empirical analysis. 4) Summarising and writing up any interesting findings.

This project will probably look like extending this work: Understanding and controlling a maze-solving policy network [LW · GW] to new models and environments.

Desired Skills (looking for up to 3 people):

Software Engineering: Familiarity with Python and PyTorch. Experience with GitHub/Collaborating with other engineers.

8. High-level mechanistic interpretability and activation engineering library

Project Lead: Jamie Coombes

Goal: A lack of unified software tooling and standardised interfaces results in duplicated effort as researchers build one-off implementations of various mech-interp methods. Existing libraries cover a range of explainable AI methods for shallow learning models. But contemporary research on large neural networks calls for new tooling. This project seeks to build a well-architected library specifically for current techniques in mechanistic interpretability and activation engineering.

Desired Skills (looking for up to 5 people):

Software Engineering: Experience with PyTorch, Jupyter, and data science workflows. Familiarity with transformer architectures and LLMs would be ideal.
Writing Skills: Research and Academic writing skill to help document methodologies and results for publication.
Low-level systems programming expertise to quickly pick up Mojo for performance-critical operations. This is by no means mandatory but is a plus.

9. Out-of-context learning interpretability

Project Lead: Víctor Levoso Fernández

Goal:A few months ago a paper titled Out-of-context Meta-learning in Large Language Models was published, talking about a phenomenon called out-of-context meta-learning. More recently, there have been other papers on related topics like Taken out of context: On measuring situational awareness in LLMs or about failures of models to generalise this way like the reversal curse paper. All of these papers have in common that the models learn to apply facts it learned during training in another context. The aim of this project is to use mechanistic interpretability research on toy tasks to understand in terms of circuits and training dynamics how this kind of learning and generalisation happens in models.

Desired Skills (looking for 3-5 people):

Software Engineering: basic CS and coding skills. Being knowledgeable about ML and or having Pytorch skills helps.
Math Knowledge

10. Understanding search and goal representations in transformers

Project Lead: Michael Ivanitskiy (+ Tilman Räuker, Alex Spies. See website)

Goal: To better understand on how internal search [AF · GW] and goal representations are processed within transformer models (and whether they exist at all!). In particular, we take inspiration from existing mechanistic [LW · GW] interpretability [? · GW] agendas [LW · GW] and work with toy transformer models trained to solve mazes. Robustly solving mazes is a task may require some kind of internal search process, and gives a lot of flexibility when it comes to exploring how distributional shifts affect performance — both understanding search [AF · GW] and learning to control mesa-optimizers are important for the safety of AI systems.

Desired Skills (looking for at least 1-2 people):

Software Engineering: Proficiency with Git, one of Pytorch/Tensorflow/Jax. Understanding of Transformers. Basic Familiarity with Alignment.

Evaluating and Steering Models

11. Benchmarks for stable reflectivity

Project Lead: Jacques Thibodeau

Goal: Future prosaic AIs will likely shape their own development or that of successor AIs. We're trying to make sure they don't go insane. There are two main ways AIs can get better: by improving their training algorithms or by improving their training data. We consider both scenarios and tentatively believe data-based improvement is riskier than architecture-based improvement. For the Supervising AIs Improving AIs agenda, we focus on ensuring stable alignment when AIs self-train or train new AIs and study how AIs may drift through iterative training. We aim to develop methods to ensure automated science processes remain safe and controllable. This form of AI improvement focuses more on data-driven improvements than architectural or scale-driven ones.

Desired Skills (looking for 2-4 people):

Software Engineering: Experience with Python. Either a good software engineer or a decent understanding of the basics of AI alignment and language models.
Ideal: Understanding of online/active learning of ML systems. Creating datasets with language models. Unsupervised learning techniques. Code for data pipelines (for the benchmarks) that could be easily integrated into AI training.
Understanding of how self-improving AI systems can evolve and understands all the capabilities we are trying to keep track of to prevent dangerous systems.

12. SADDER: situational awareness datasets for detecting extreme risks

Project Lead: Rudolf Laine

Goal: One worrying capability AIs could develop is situational awareness. In particular, threat models like successfully deceptive AIs and autonomous replication and adaptation seem to depend on high situational awareness. The goal of SADDER is to better understand situational awareness in current LLMs by running experiments and constructing evals. It will be building on the Situational Awareness Dataset (SAD), which benchmarked LLMs’ understanding of how they can influence the world, and ability to guess which lifecycle stage a given text excerpt is likely to have come from, by running more in-depth experiments and adding more categories.

Desired Skills (looking for up to 2 people):

AI Safety Knowledge: General awareness and some judgement about AI safety.
Software Engineering: More conceptual people should be able to do simple data science workflows (e.g. writing Python to graph results). More engineering people should show a well-structured clear code project you have written, or a track record of strong software engineering (e.g. internships/jobs).
Experimental reasoning skills: can you think of ways in which results could be wrong or misleading, and invent experiments that disambiguate between hypotheses?
(Ideally): Existing research experience.

13. TinyEvals: how language models speak coherent English?

Project Lead: Jett Janiak [LW · GW]

Goal: TinyStories is a suite of Small Language Models (SLMs) trained exclusively on children's stories generated by ChatGPT. The models use simple, yet coherent English, which far surpasses what was previously observed in other models of comparable size. I hope that most of the capabilities of these models can be thoroughly understood using currently available interpretability techniques. Doing so would represent a major milestone in the development of mechanistic interpretability (mech interp). The goal of this AISC project is to publish a paper that systematically identifies and characterises the range of capabilities exhibited by the TinyStories models.

Desired Skills (looking for 2-4 people):

Software Engineering: Knowledge of python, jupyter notebooks, git.
- Nice to have: PyTorch, HuggingFace, TransformerLens, Plotly, Mech interp experience, Research experience
Technical writing ability
Web development: HTML, CSS, JavaScript

14. Evaluating alignment evaluations

Project Lead: Maxime Riche

Goal: Alignment evaluations are used to evaluate LLM behavior on a wide range of situations. They are especially used to evaluate if LLMs write harmful content, have dangerous preferences, or obey to malevolent requests. Several alignment/behavioural evaluation techniques have been published or suggested (e.g: Self-reported preferences Inference from question answering, playing games, or looking at internal states. Behaviour evaluation under steering pressure.) This project aims to review and compare existing alignment evaluations to assess their usefulness. Optionally, we want to discover better alignment evaluations or improve the existing ones.

Desired Skills (looking for 2-4 people):

Reading Python code.
Having used LLM and knowing about their completion patterns.
Data science skills.

15. Pipelines for evaluating and steering LLMs towards faithful reasoning

Project Lead: Henning Bartsch

Goal: The research project focuses on language model alignment by developing and testing techniques for (1) evaluating model-generated reasoning and (2) steering them towards more faithful behaviour. It builds on findings and future directions from scalable oversight, model evaluations and steering techniques.

The core parts are to: 1) Benchmark closed- and open-source LLMs on faithful reasoning. 2) Build ONE pipeline to generate a dataset for fine-tuning a LLaMA model. 3) Compare the effects of fine-tuning and test-time steering on faithfulness. 4) Analyse the model behaviour and results.

Desired Skills (looking for 3-5 people with diverse skillset):

Software Engineering: Expertise in Python, software development practices, and the ability to create effective abstractions for classes and pipelines.
Research and conceptual work: Skills in experimental design, generating ideas and the capacity to effectively communicate new ideas within the team. We also need to analyse and interpret results and write up a paper.
Quick Prototyping: implementation and validation of ideas are important, especially in the early stage of the project we want to test and iterate quickly.
Familiarity with Concepts: Understanding of key AI safety ideas, and other ideas in the proposed document is important.

16. Steering of LLMs through addition of activation vectors with latent ethical valence

Project Lead: Rasmus Herlo

Goal: The idea is to identify crucial modules and activations points in LLM-architectures that are associated with positive or negative ethical valence by caching the activations during forward passes induced by specifically developed binary ethical prompts. The identified linear subspaces following serve as intervention points for direct steering through activation addition. The ultimate hope is that these adjustments immediately generate a modified LLM architecture that complies better with ethical guidelines by default without the need of adjustment modules, as used in methods like RLHF.

Team (looking for 3-4 people):

Code Architects (2-3x): Modify LLM-structure according to devised intervention points, and generate/save at least three altered versions of the LLM-architecture. Should be Proficient in Python and Git/GitHub.
- Scientific communication, Data-caching, figure production skills ideal.
Ethics Consultant (1x): Will help develop and design binary ethical prompts for LLMs, and the protocols to test the LLMs before/after ethical steering in both ethical and unethical direction. Reading of relevant literature.
- Experience with scientific methodology and producing ethical literature to a high scientific standard. (e.g: philosophy, psychology or anthropology). Should understand philosophical concepts like utilitarianism and 'trolley problems'.

Agent Foundations

17. High actuation spaces

Project Lead: Sahil

Goal: This project is an investigation into building a science of almost-but-not-actually magical regimes. Spaces where actuation is extremely cheap and fast, but not free and instantaneous. Some examples: biochemical signalling, the formation of social structures, decision theory. The hope is to be able to articulate many general and often counterintuitive facts and confusions about the insides of mind-like entities in general, including ones that exist already and apply it to fundamental problems in the caringness of an AI, like value-loading/ontological identification/corrigibility. You might call this a “deconfusion” project along the above lines.

Desired skills (looking for 2-4 people):

People at the intersection of:

of math and philosophy
of prosaic alignment/modern ML and agent foundations
of computer science and biological/sociological lenses
of rigor and ritual
of material and phenomenological investigation
of systematic and postsystematic modes
of strong agreement and subtle disagreement with MIRI-esque views on alignment
of intrigue and skepticism around shard theory

18. Does sufficient optimization imply agent structure?

Project Lead: Alex Altair

Goal: There is an intuition that if a system is capable of reliably achieving a goal in a wide range of environments, then it probably has certain kinds of internal processes, like building a model of the environment from input data, generating plans, and predicting the effects of its actions on the future states of the environment. That is, it probably has some modular internal structure. To what degree can these intuitions be formally justified? Can we prove that reliable optimization implies some kind of agent-like structure [LW · GW]? I think one could make significant progress toward clarifying the parts, or showing weaker results for some of the parts.

Desired Skills (looking for 1-3 people):

Trying to locate the right definitions of things like “agent” and “plan”, and for proofs to follow more easily.
Familiarity with any topics that could be candidates for formalizing the theorem.
A solid grasp of how mathematical formalisms relate to reality, and how proofs work.

19. Discovering agents in raw bytestreams

Project Lead: Paul Bricman

Goal: Being able to identify and study agents is a recurring theme in many alignment proposals, ranging from eminently theoretical [LW · GW] to directly applicable ones. Previous work paved the way for agent discovery from observations, but required an explicit decomposition of the world into variables, as well as additional scaffolding. This project consists of working towards a pipeline for detecting agency in raw byte-streams with no hints as to the nature of the agents to be detected. This could eventually enable the quantification of gradient hacking and mesa-optimization.

Team (looking for 2 people):

[Both] Software Engineering. Python fluency, designing maintainable codebases, familiarity with JAX. Strong technical writing is a plus.
[Paper 1] Differentiable Correlates of Complexity and Sophistication.
- 1) Benchmark estimates of complexity and sophistication employed in the broader agency detection pipeline. 2) Facilitate the integration of agent foundations with contemporary differentiable architectures more broadly.
[Paper 2] Recovering Agents in Gym Environments.
- Attempt to recover RL agents that were previously introduced in certain gym environments. Test if the method can correctly identify agents. Investigate how agent hyperparameters influence the detection of agents.

20. The science algorithm

Project Lead: Johannes C. Mayer

Goal: Modern deep learning is about having a simple program (SGD) search over a space of possible programs (the weights of a neural network) and select one that performs well according to a loss function. Even though the search program is simple, the programs it finds are neither simple nor understandable.

My goal is to build an AI system that enables a pivotal act by figuring out the algorithms of intelligence directly. The ideal outcome is to be able to write down the entire pivotal system as a non-self-modifying program explicitly, similar to how I can write down the algorithm for quicksort.

Desired Skills (Looking for 2-3 people):

Ability to think and work independently: You should be able to generate, evaluate, and execute upon ideas without the need for constant oversight. Recognise and challenge things if I say something questionable.
Software Engineering Skills: Writing programs of 100s lines of code, Thinking about how to structure a codebase. Reimplementing/coming up with algorithms. Writing parallelisable code.
Basic math skills: know what these symbols mean ∑, Π, ×, ∧, ∨. Able to proof simple statements, etc.

Miscellaneous Alignment Methods

21. SatisfIA – AI that satisfies without overdoing it

Project Lead: Jobst Heitzig

Goal: Explore novel designs for generic AI agents – AI systems that can be trained to act autonomously in a variety of environments – and their implementation in software. We will study several versions of such “non-maximizing” agent designs and corresponding learning algorithms. Rather than aiming to maximize some objective function, our agents will aim to fulfill goals that are specified via constraints called “aspirations”. For example, I might want my AI butler to prepare 100–150 ml of tea, having a temperature of 70–80°C, taking for this at most 10 minutes, spending at most $1 worth of resources, and succeeding in this with at least 95% probability.

Desired Skills (looking for 3 people):

Reinforcement Learning: Solid knowledge of tabular RL or deep RL desirable.
Probability Theory: Designing (Markov decision process) agents and algorithms
Software Engineering: implementing agents in software (Python/WebPPL), simulating their behaviour in selected test environments (AI safety grid-worlds),
Formulating hypotheses about agent behaviour, especially about its safety-relevant consequences, then trying to prove/disprove these hypotheses
Writing Skills: writing results in blog posts and (possibly) an academic paper.

22. How promising is automating alignment research? (literature review)

Project Lead: Bogdan-Ionut Cirstea

Goal: This project aims to get more grounding into how promising automating alignment research is as a strategy, with respect to both advantages and potential pitfalls, with the OpenAI superalignment plan as a potential blueprint/example. This will be achieved by reviewing, distilling and integrating relevant research from multiple areas/domains, with a particular focus on the science of deep learning and on empirical findings in deep learning and language modelling. This could expand more broadly, such as reviewing and distilling relevant literature from AI governance, multidisciplinary intersections (e.g. neuroscience), relevant prediction markets, and the automation of larger parts of AI risk mitigation research (e.g. AI governance). This could also inform how promising it might be to start more automated alignment/AI risk mitigation projects or to dedicate more resources to existing ones.

Desired Skills (looking for 4 people):

Minimum ML understanding, AI safety tech knowledge equivalent to having gone through AGISF, good communication (distillation) skills, basic research skills.
A wide variety of additional skills could be useful, especially good distillation (strong writing skills), strong generalist skills, more advanced ML/theoretical CS/math skills.

23. Personalized fine-tuning token for AI value alignment

Project Lead: Eleanor ‘Nell’ Watson

Goal: We're working on a new system that makes it easier for artificial intelligence to understand what's important to you personally, while also reducing unfair or biased decisions. Our system includes easy-to-use tools that help you identify and mark different situations where the AI might be used. These tools use special techniques, like breaking down text into meaningful parts and automatically labelling them, to make it simpler to create settings that are tailored to you. By doing this, we aim to address the problem of AI not fully grasping people's unique backgrounds, preferences, and cultural differences, which can sometimes lead to biased or unsafe outcomes.

Team (looking for 2-3 people):

Fine-Tuning/RLHF/Constitutional AI Expertise: We're interested in how our "values token" can be effectively used with techniques like RLHF and Fine Tuning. Theoretical approaches in representation engineering and contrastive preference modeling highlight potential means to accomplish this.
Cybersecurity Expertise: Help make our system (including potentially sensitive data about people's values) as secure as possible, including methods such as homomorphic encryption.
Vector Databases: We plan to turn Likert-scale responses about values into numerical vectors (think of it as 'value2vec'). We need people who are skilled in creating these kinds of vector databases.

24. Self-other overlap @AE Studio

Project Lead: Marc Carauleanu

Goal: To investigate increasing self-other overlap while not significantly altering model performance. This is because an AI has to model others as different from oneself in order to deceive or be dangerously misaligned. Thus, if the model is deceptive and outputs statements/actions that just seem correct to an outer-aligned performance metric during training, we can favour honest solution by just increasing self-other overlap without altering performance. The goal of this research project is three-fold: 1) Better define and operationalise self-other overlap in LLMs. 2) Investigate the effect of self-other overlap on adversarial and cooperative behaviour in Multi-Agent Reinforcement Learning. 3) Investigate the effect of self-other overlap on adversarial and deceptive/sycophantic behaviour in Language Modelling.

Desired Skills (see this page):

Minimum:
- Software Engineering: Extensive experience with Python software development and data engineering. Experience using AWS and libraries such as PyTorch
Desirable:
- A PhD in a relevant subject (CS, ML, Computational Neuroscience)
- Experience with AI Safety research (anything is fine)

25. Asymmetric control in LLMs: model editing and steering that resists control for unalignment

Project Lead: Domenic Rosati

Goal: Recent efforts in concept level model steering such as Activation Addition [LW · GW] or Representation Engineering, ROME and LEACE are promising approaches towards natural language generation control that is aligned with human values. However these approaches could be equally used by bad actors to unalign models and inject misinformation. This project involves developing a research direction where control interventions would be ineffective for counterfactual editing or unaligned control but remain effective for factual editing and aligned control. We call this "asymmetric control" since control can only happen in a direction towards alignment with human values not away from it.

Team (looking for 2-4 people):

Conceptual Alignment/AI Safety: Understanding major concepts in AI/Value Alignment literature. Thinking conceptually through motivation, experimental proof, theoretical proof of proposed interventions. Basic familiarity with tools such as first-order logic, upper and lower bound analysis, proofs, philosophical analysis
Risk Assessment/Evaluation: Understanding of the research landscape in AI Risk, and (technical) AI risk assessment and evaluation. Creative thinking around experimental design for evaluation
ML Engineering: Design, training, and evaluation of neural networks. Creative thinking around technical approaches to alignment
NLP: LLMs and Transformers. Interpretability and Representation Probing techniques. Modelling and evaluation of NLP techniques

26. Tackling key challenges in Debate

Project Lead: Paul Bricman

Goal: Debate remains [LW(p) · GW(p)] a central approach to alignment at frontier labs. In brief, it consists in having LLMs adversarially debate each other before a judge, the aggregate of which forms a deliberative system that can be used to automatically reflect on appropriate courses of action. However, the debate agenda faces a number of key challenges, mostly having to do with designing reliable means of evaluating competing parties, so as to identify the party that is closer to the truth.

Team (looking for 3 people):

[All] Software Engineering: Python fluency. Designing maintainable codebases. HuggingFace. Strong technical writing, and familiarity with Torch/JAX are a plus.
[1] Textual-Symbolic Interoperability by Unsupervised Machine Translation:
Self-distilling the ability to manipulate formal and natural language representations of statements in parallel. Follow the methodology of prior work on self-distillation for machine translation out of OpenAI and DeepMind.
[2] Quantifying Credence by Consistency Across Complete Logics:
Generalising Contrast Consistent Search. The previous technique implicitly relies on negation in most fuzzy logics (¬P = 1 - P), as well as probability theory, then optimises for credence probes that are consistent across any statement expressed using a functionally complete set of connectives (P ^ ¬Q = P * (1 - Q)).
[3] Granular Control Over Compute Expenditure by Tuned Lens Decoding:
Controlling the amount of computation employed in LLM inference. The key idea is to implement an efficient decoding method based on the tuned lens method so as to only work with the model’s “best guesses” at a given intermediate layer.

Other

27. AI-driven economic safety nets: restricting the macroeconomic disruptions of AGI deployment

Project Lead: Joel Naoki Ernesto

Goal: In the face of rapid AI and AGI advancements, this project aims to investigate potential socio-economic disruptions, especially within labor markets and income distribution. The focus will be on conceptualizing economic safety mechanisms to counteract the adverse effects of AGI deployment, ensuring a smoother societal transition.

Team (looking for 3-6 people):

Team Coordinator: Organise meetings, ensure timelines are met, facilitate communication within the team, manage documentation.
AI Lead(s) (1-2x): Provide insights into AGI's capabilities, future trajectories, and potential economic impacts. Bridging the gap between AI advancements and economic analyses. Should have understanding of AGI development, experience AI modelling, and familiarity with global economic structures.
Economist(s) (1-4x): Lead the economic analysis, model potential scenarios of AGI deployment, and contribute to the policy framework design. Should have understanding of macroeconomics, experience in policy formulation, and an understanding of AGI's potential economic ramifications.

28. Policy-based access to powerful models

Project Lead: Pratyush Ranjan Tiwari

Goal: As machine learning models get more powerful, restricting query access based on a safety policy becomes more important. Given a setting where a model is stored securely in a hardware-isolated environment, access to the model can be restricted based on cryptographic signatures. Policy-based signatures allow signing messages that satisfy a pre-decided policy. There are many reasons why policy enforcement should be done cryptographically, including insider threats, tamper resistance and auditability. This project leverages existing cryptographic techniques and existing discourse on AI/ML safety to come up with reasonable policies and a consequent policy-based access model to powerful models.

Team (looking for 3 people):

For either of the roles below, no experience in cryptography is required. Interest in AI safety policy and a broad math/theoretical CS background is beneficial.
Research Eng. Roles (2x): Experience prototyping ideas to code is required. Background in experimenting with powerful models/LLMs is useful. Experience reading research papers is essential
Researcher (1). Background in ML research + experience writing research papers/technical documentation. Useful to know: Similarity-driven NLP classification, Semantic Hashing, and general NLP techniques. Inclination towards learning new, cross-disciplinary techniques will go a long way.

29. Organise the next Virtual AI Safety Unconference

Project Lead: Linda Linsefors

Goal: I have a design for an online unconference, that I have run a few times. I would like to find two people to take on the task of running the next Virtual AI Safety Unconference (VAISU). Even though I have a ready format, there is room for you to improve the event design too. The goal of this project is both to produce the event, and also to pass on my organising skills to people who will hopefully use them in the future. I’m therefore looking for team members who are interested in continuing on the path of being organisers, even after this project. I’ll teach you as much as I can, but you will do all the work. The reason I’m proposing this project is because I don’t want to organise the next VAISU, I want you to do it.

Desired Skills (looking for 2 people):

Interest in event design: Willingness to do some amount of boring spreadsheet tasks. Automate as much as you want, but probably can't automate everything.
Noticing what you don’t know yet: Good at extracting information from me. I have not been very successful at delegating in the past. Key Information often gets lost in transmission.
Good written communication. When producing written information for the unconference participants, you need to look at your own text and tell if it’s good enough, and improve what needs improving.
Paying attention to everything: You don’t need to know how to solve all problems. Just noticing there is a problem and asking for advice is good enough.

Apply Now

Note again that these are summaries, and the descriptions or desired may not fully reflect the author's projects or views.

If you find any of the above AI Safety Camp projects interesting, and you have some of the skills listed, then make sure to apply before 1st December 2023.

3 comments

Comments sorted by top scores.

comment by Jonathan Claybrough (lelapin) · 2023-11-28T15:18:05.844Z · LW(p) · GW(p)

Jonathan Claybrough

Actually no, I think the project lead here is jonachro@gmail.com which I guess sounds a bit like me, but isn't me ^^

Replies from: Nicky

↑ comment by NickyP (Nicky) · 2023-11-29T17:52:20.705Z · LW(p) · GW(p)

Sorry! I have fixed this now

Replies from: lelapin

↑ comment by Jonathan Claybrough (lelapin) · 2023-11-30T14:12:16.523Z · LW(p) · GW(p)

Thanks, and thank you for this post in the first place!

AISC 2024 - Project Summaries

Contents

List of AISC Projects

To not build uncontrollable AI

1. Towards realistic ODDs for foundation model based AI offerings

2. Luddite Pro: information for the refined luddite

3. Lawyers (and coders) for restricting AI data laundering

4. Assessing the potential of congressional messaging campaigns for AI

Mechanistic-Interpretability

5. Modelling trajectories of language models

6. Towards ambitious mechanistic interpretability

7. Exploring toy models of agents

8. High-level mechanistic interpretability and activation engineering library

9. Out-of-context learning interpretability

10. Understanding search and goal representations in transformers

Evaluating and Steering Models

11. Benchmarks for stable reflectivity

12. SADDER: situational awareness datasets for detecting extreme risks

13. TinyEvals: how language models speak coherent English?

14. Evaluating alignment evaluations

15. Pipelines for evaluating and steering LLMs towards faithful reasoning

16. Steering of LLMs through addition of activation vectors with latent ethical valence

Agent Foundations

17. High actuation spaces

18. Does sufficient optimization imply agent structure?

19. Discovering agents in raw bytestreams

20. The science algorithm

Miscellaneous Alignment Methods

21. SatisfIA – AI that satisfies without overdoing it

22. How promising is automating alignment research? (literature review)

23. Personalized fine-tuning token for AI value alignment

24. Self-other overlap @AE Studio

25. Asymmetric control in LLMs: model editing and steering that resists control for unalignment

26. Tackling key challenges in Debate

Other

27. AI-driven economic safety nets: restricting the macroeconomic disruptions of AGI deployment

28. Policy-based access to powerful models

29. Organise the next Virtual AI Safety Unconference

Apply Now

3 comments