Evaluations project @ ARC is hiring a researcher and a webdev/engineer

post by Beth Barnes (beth-barnes) · 2022-09-09T22:46:47.569Z · LW · GW · 7 comments

Contents

  About the project
    Big picture
    Current goals
    What we’ve done so far
    People + Logistics
  Job Descriptions
    Generalist Software Engineer, Web Dev and ML focus
  DEC 2022 EDIT: SEE NEW POSTING HERE
      Background
      Job Mission:
      Key Outcomes:
      Someday/maybe outcomes:
      Key Competencies: 
      Essential
      Nice-to-have
      Fit
      Reasons this might be a bad fit for you
      Reasons this might be an especially good fit
      Other info
      Location
      Pay
      Application process
    ===================================================
    Researcher
  DEC 2022 EDIT: SEE NEW POSTING HERE
      Job Mission:
      Key Competencies:  
      Essential
      Possible bonus skills
      Fit
      Reasons this might be a bad fit
      Reasons this might be an especially good fit
      Other info
      Location
      Pay
      Application process

The evaluations project at the Alignment Research Center is looking to hire a generalist technical researcher and a webdev-focused engineer. We're a new team at ARC building capability evaluations (and in the future, alignment evaluations) for advanced ML models. The goals of the project are to improve our understanding of what alignment danger is going to look like, understand how far away we are from dangerous AI, and create metrics that labs can make commitments around (e.g. 'If you hit capability threshold X, don't train a larger model until you've hit alignment threshold Y'). We're also taking SERI MATS fellows.
 

About the project

Big picture

Once AI systems are close to being capable enough to be an x-risk, there should be procedures that labs follow to ensure they don’t build or deploy existentially dangerous models. We should have thought through the plausible paths to models gaining power, and ensured that models are not capable of these. There should be commitments that if model capabilities reach a certain level, labs will not scale up or deploy models until they have reached a certain level of alignment. In order to make this happen, we need good ways to measure both dangerous capabilities and alignment.

We can start preparing for this now by:

Doing this now has some additional benefits:

Current goals

Our current goals are to work with cutting-edge models, produce evaluations for how far they are from x-risk capabilities, and share these evaluations with labs.

More specifically, we’re doing things like:

At the end of this, we hope to either have found clear barriers to successful power-seeking for current models, or to have produced scary demos of the model succeeding at a sequence of actions that allow it to gain power (e.g. successfully phishing someone).

We are mindful of the risks of “helping” models to engage in planning around unintended behaviors. Our high-level take is that it’s best to get a look at the problem before it’s insurmountable - “we want to see AI systems trying to take over the world while they still can’t succeed.” However, there are times when we won’t explore or publish a line of work in order to avoid potential harm.

In the future we'll probably focus more on designing evaluations for alignment.

What we’ve done so far

 

People + Logistics

The project lives at ARC, and is mostly run by Elizabeth (Beth) Barnes, who reports to Paul Christiano and is also advised by Holden Karnofsky. It is mostly siloed from ARC's other projects. We have several long-term model interaction contractors who are remote: Mihnea Maftei and Brian Goodrich full time, and a few others part-time.  Our summer SERI MATS fellows were Lora Xie, Simon Marshall, and Sarah Cogan, and Govind Pimpale was our summer intern on webdev. 

We have offices in Constellation (shared EA/longtermist offices in downtown Berkeley).
 

We should be able to sponsor visas for those who need them (and as ARC is a non-profit, we don't have to go through the visa lottery for H-1B).

 

Job Descriptions

=====================================================================

Generalist Software Engineer, Web Dev and ML focus

DEC 2022 EDIT: SEE NEW POSTING HERE

Background

We've built an interface for interacting with models and assessing their abilities at various tasks, having contractors write demo actions and rate model-generated actions. 

 

The basic idea:

- do tasks as a sequence of steps, where at each step we have the model generate a number of options for the next action, then have either the model or the humans choose which action to go with

- let models do tasks in a factored-cognition style – delegating sub-tasks to other copies of the model – then be able to view the tree of tasks and see where models struggle and where they do well

- have contractors add their own options, and rate the quality of generated options

- tag steps with what abilities or difficulties they highlight
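
As a rough illustration of the ideas listed above, here is a minimal sketch of a data model for such an interface. It is not ARC's actual codebase: the class names (`Option`, `Step`, `Task`), the `author` labels, and the `tag_counts` helper are all hypothetical, chosen only to show how steps, candidate actions, ratings, tags, and a tree of delegated sub-tasks could fit together.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional, Tuple


@dataclass
class Option:
    """One candidate next action, written by the model or by a contractor."""
    text: str
    author: str  # hypothetical labels: "model" or "human"
    ratings: List[int] = field(default_factory=list)  # contractor quality ratings

    def mean_rating(self) -> Optional[float]:
        # None means nobody has rated this option yet.
        return sum(self.ratings) / len(self.ratings) if self.ratings else None


@dataclass
class Step:
    """One step of a task: several options, one of which gets chosen."""
    options: List[Option] = field(default_factory=list)
    chosen: Optional[int] = None
    tags: List[str] = field(default_factory=list)  # abilities/difficulties highlighted

    def choose(self, index: int) -> Option:
        # The chooser can be the model or a human; only the pick is recorded here.
        if not 0 <= index < len(self.options):
            raise IndexError("no such option")
        self.chosen = index
        return self.options[index]


@dataclass
class Task:
    """A task plus the sub-tasks delegated to other copies of the model."""
    description: str
    steps: List[Step] = field(default_factory=list)
    subtasks: List["Task"] = field(default_factory=list)

    def walk(self, depth: int = 0) -> Iterator[Tuple[int, "Task"]]:
        # Depth-first traversal of the task tree.
        yield depth, self
        for sub in self.subtasks:
            yield from sub.walk(depth + 1)


def tag_counts(root: Task) -> Dict[str, int]:
    """Count how often each tag appears anywhere in the task tree."""
    counts: Dict[str, int] = {}
    for _, task in root.walk():
        for step in task.steps:
            for tag in step.tags:
                counts[tag] = counts.get(tag, 0) + 1
    return counts
```

Walking the tree with `Task.walk` and aggregating tags or mean ratings per sub-task is one simple way to see where models struggle and where they do well.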
 

 

Screenshots:

Job Mission:

Improve the interface for interacting with models so we can generate data, evaluate model properties and understand our existing data better + faster.

 

Key Outcomes:

Someday/maybe outcomes:

Key Competencies: 

Essential

Nice-to-have

Fit

Reasons this might be a bad fit for you

Reasons this might be an especially good fit

Other info

Location

We're based out of Constellation offices in Downtown Berkeley. We'd fairly strongly prefer someone who can be in-person, but WFH some days of the week is fine, and remote is not a complete dealbreaker.

Pay

TBC, but won’t be a significant downgrade from Bay Area tech roles (excluding equity). 

Application process

Send your CV and a short cover letter explaining why you're interested and why you think you'd be a good fit to beth [at] alignment.org. Then we'll have you do some fairly straightforward trial tasks, working with our codebase, which will take around 2-3 hours.

If this goes well, we'll have you do a somewhat more involved trial task which will take more like 5-6 hours, for which you'll be reimbursed $600. We might also get you to do a more design/product/user focused task, which would probably be something pretty open-ended like ‘use the interface yourself for a bit, interview a user or two, and describe what you think some of the most important changes/additions are’. You'll also chat with us more about the ideas behind the project, what you're looking for in a role, and any other questions you have. 
 

After that we'd do a 3-month trial period. This is a trial both of you and of the role overall - we're still unsure about our longer-term hiring plans.

===================================================


 

Researcher

 

DEC 2022 EDIT: SEE NEW POSTING HERE

Job Mission:

Identifying key capabilities and behaviors of language models, identifying and evaluating scenarios where these models could pose risks, designing evaluation procedures that help assess those risks, and implementing and project-managing the execution of those evaluations.

 

Key Outcomes:

Key Competencies:  

Essential

Possible bonus skills

Fit

Reasons this might be a bad fit

Reasons this might be an especially good fit

 

Other info

Location

We're based out of Constellation offices in Downtown Berkeley. WFH some days of the week is fine, but being fully remote is probably a dealbreaker.

Pay

TBC, but won’t be a significant downgrade from Bay Area tech roles (excluding equity).

Application process

You can do this initial application, which is similar to the application for the model interaction contractor role. It should be pretty straightforward, but should also give you a sense of some of the types of work involved.

If this goes well you'll do a more involved trial task, which will take around 5-6 hours, for which you'll be reimbursed $600. You'll also chat with us more about the ideas behind the project, what you're looking for in a role, and any other questions you have. 

 

After that we'd do a 3-month trial period. This is a trial both of you and of the role overall - we're still unsure about our longer-term hiring plans. 


 

7 comments

Comments sorted by top scores.

comment by PabloAMC · 2022-11-30T14:55:46.302Z · LW(p) · GW(p)

For the record, I think Jose Orallo (and his lab), in Valencia, Spain and CSER, Cambridge, is quite interested in these exact same topics (evaluation of AI models, specifically towards safety). Jose is a really good researcher, part of the FLI existential risk faculty community, and has previously organised AI Safety conferences. Perhaps it would be interesting for you to get to know each other.

comment by P. · 2022-09-10T14:51:14.856Z · LW(p) · GW(p)

Do you have plans to measure the alignment of pure RL agents, as opposed to repurposed language models? It surprised me a bit when I discovered that there isn’t a standard publicly available value learning benchmark, despite there being data to create one. An agent would be given first or third-person demonstrations of people trying to maximize their score in a game, and then it would try to do the same, without ever getting to see what the true reward function is. Having something like this would probably be very useful; it would allow us to directly measure goodharting, and being quantitative it might help incentivize regular ML researchers to work on alignment. Will you create something like this?

Replies from: beth-barnes, LawChan
comment by Beth Barnes (beth-barnes) · 2022-09-13T01:02:22.786Z · LW(p) · GW(p)

What do you think is important about pure RL agents vs RL-finetuned language models? I expect the first powerful systems to include significant pretraining so I don't really think much about agents that are only trained with RL (if that's what you were referring to).

How were you thinking this would measure Goodharting in particular? 

I agree that seems like a reasonable benchmark to have for getting ML researchers/academics to work on imitation learning/value learning. I don't think I'm likely to prioritize it - I don't think 'inability to learn human values' is going to be a problem for advanced AI, so I'm less excited about value learning as a key thing to work on.
 

Replies from: P.
comment by P. · 2022-09-13T19:17:45.838Z · LW(p) · GW(p)

By pure RL, I mean systems whose output channel is only directly optimized to maximize some value function, even if it might be possible to create other kinds of algorithms capable of getting good scores on the benchmark.

I don’t think that the lack of pretraining is a good thing in itself, but that you are losing a lot when you move from playing video games to completing textual tasks.

If someone is told to get a high score in a video game, we have access to the exact value function they are trying to maximize. So when the AI is either trying to play the game in the human’s place or trying to help them, we can directly evaluate their performance without having to worry about deception. If it learns some proxy values and starts optimizing them to the point of goodharting, it will get a lower score. On most textual tasks that aren’t purely about information manipulation, on the other hand, the AI could be making up plausible-sounding nonsense about the consequences of its actions, and we wouldn't have any way of knowing.

From the AI’s point of view, being able to see the state of the thing we care about also seems very useful; preferences are about reality, after all. It’s not obvious at all that internet text contains enough information to even learn a model of human values useful in the real world. Training it with other sources of information that more closely represent reality, like online videos, might, but that seems closer to my idea than to yours since it can’t be used to perform language-model-like imitation learning.

Additionally, if by “inability to learn human values” you mean isolating them enough so that they can in principle be optimized to get superhuman performance, as opposed to being buried in its world model, I don’t agree that that will happen by default. Right now we don’t have any implementations of proper value learning algorithms, nor do I think that any known theoretical algorithm (like PreDCA) would work even with limitless computing power. If you can show that I’m wrong, that would surprise me a lot, and I think it could change many people’s research directions and the chances they give to alignment being solvable.

comment by LawrenceC (LawChan) · 2022-09-14T01:54:49.820Z · LW(p) · GW(p)

It surprised me a bit when I discovered that there isn’t a standard publicly available value learning benchmark, despite there being data to create one.

My guess is the issue here is a lack of a single standard, as opposed to there not being any? The closest things to a standard in IRL/RLHF work are the MuJoCo Gym and Atari environments. People also often make variants of MuJoCo environments, like Assistive Gym, when they have a specific task in mind. Or they just use a real robot, or maybe a VR one.

If your concern is that researchers are using other policies as experts and not humans, well, there's always Atari-HEAD or the CrowdPlay Atari dataset (there's not an equivalent for MuJoCo envs because humans can't really do well on those environments without assistance or a lot of practice). If you want something else, there's always D4RL.

Replies from: P.
comment by P. · 2022-09-14T13:25:31.204Z · LW(p) · GW(p)

The simplest possible acceptable value learning benchmark would look something like this:

  • Data is recorded of people playing a video game. They are told to maximize their reward (which can be exactly computed), have no previous experience playing the game, are actually trying to win and are clearly suboptimal (imitation learning would give very bad results).
  • The bot is first given all their inputs and outputs, but not their rewards.
  • Then it can play the game in place of the humans but again isn’t given the rewards. Preferably the score isn’t shown on screen.
  • The goal is to maximize the true reward function.
  • These rules are precisely described and are known by anyone who wants to test their algorithms.

None of the environments and datasets you mention are actually like this. Some people do test their IRL algorithms in a way similar to this (the difference being that they learn from another bot), but the details aren’t standardized.
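
To make the simple protocol above concrete, here is a minimal sketch under heavy assumptions. The toy game (`LineWorld`), the noisy simulated "human" standing in for real recorded players, and the majority-vote cloning baseline are all hypothetical illustrations, not an existing benchmark or dataset. The point is only the information flow: demonstrations are (observation, action) pairs with no rewards, the submitted policy never sees the score, and only the evaluation harness reads the true reward.

```python
import random
from collections import Counter, defaultdict
from typing import Callable, List, Tuple

Demo = List[Tuple[int, int]]  # (observation, action) pairs; rewards are never recorded


class LineWorld:
    """Toy game (hypothetical): walk on positions 0..size-1; hidden +1 per step at the goal."""

    def __init__(self, size: int = 7, goal: int = 5, horizon: int = 20):
        self.size, self.goal, self.horizon = size, goal, horizon
        self.reset()

    def reset(self) -> int:
        self.pos, self.t, self._score = 0, 0, 0
        return self.pos

    def step(self, action: int) -> Tuple[int, bool]:
        # action is -1, 0 or +1; the return value deliberately omits the reward.
        self.pos = min(self.size - 1, max(0, self.pos + action))
        self.t += 1
        self._score += int(self.pos == self.goal)
        return self.pos, self.t >= self.horizon

    def true_score(self) -> int:
        # Read only by the evaluation harness, never by the learner.
        return self._score


def noisy_human(obs: int, goal: int, rng: random.Random, skill: float = 0.7) -> int:
    """Stand-in for a suboptimal player: usually heads for the goal, sometimes acts randomly."""
    if rng.random() < skill:
        return (goal > obs) - (goal < obs)
    return rng.choice([-1, 0, 1])


def record_demos(env: LineWorld, n: int, rng: random.Random) -> List[Demo]:
    demos = []
    for _ in range(n):
        obs, done, demo = env.reset(), False, []
        while not done:
            action = noisy_human(obs, env.goal, rng)
            demo.append((obs, action))
            obs, done = env.step(action)
        demos.append(demo)
    return demos


def clone_majority(demos: List[Demo]) -> Callable[[int], int]:
    """Baseline submission: copy the most common demonstrated action per observation."""
    votes = defaultdict(Counter)
    for demo in demos:
        for obs, action in demo:
            votes[obs][action] += 1
    return lambda obs: votes[obs].most_common(1)[0][0] if obs in votes else 0


def evaluate(env: LineWorld, policy: Callable[[int], int], episodes: int = 10) -> float:
    """Run the policy without showing it rewards, then report the mean true score."""
    total = 0
    for _ in range(episodes):
        obs, done = env.reset(), False
        while not done:
            obs, done = env.step(policy(obs))
        total += env.true_score()
    return total / episodes
```

A submission would be scored with something like `evaluate(env, clone_majority(record_demos(env, 20, random.Random(0))))`. In a real version of the benchmark the demonstrations would come from recorded human play rather than `noisy_human`, and the interesting submissions would be those that beat plain imitation of the suboptimal players.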

A harder and more realistic version that I have yet to see in any paper would look something like this:

  • Data is recorded of people playing a game with a second player. The second player can be a human or a bot, and friendly, neutral or adversarial.
  • The IO of both players is different, just like different people have different perspectives in real life.
  • A very good imitation learner is trained to predict the first player's output given their input. It comes with the benchmark.
  • The bot to be tested (which is different from the previous ones) has the same IO channels as the second player, but doesn't see the rewards. It also isn't given any of the recordings.
  • Optionally, it also receives the output of a bad visual object detector meant to detect the part of the environment directly controlled by the human/imitator.
  • It plays the game with the human imitator.
  • The goal is to maximize the human’s reward function.

It’s far from perfect, but if someone could obtain good scores there, it would probably make me much more optimistic about the probability of solving alignment.

Replies from: LawChan
comment by LawrenceC (LawChan) · 2022-09-14T21:13:18.533Z · LW(p) · GW(p)

None of the environments and datasets you mention are actually like this.

Every single algorithmic IRL paper on video games does this, at least with Deep RL demonstrators. (Here's a list of 4 examples: https://arxiv.org/abs/1810.10593, https://proceedings.mlr.press/v97/brown19a.html, https://arxiv.org/abs/1902.07742, https://arxiv.org/abs/2002.09089.)

If you care about human demonstrations, it seems like Atari-HEAD and the CrowdPlay Atari dataset both do exactly this? And while there hasn't been much work in this area, a quick Google search let me find two papers that do analyze IRL variants on Atari-HEAD: https://arxiv.org/abs/1908.02511v2 and https://arxiv.org/abs/2004.00981v2.

My guess is that there hasn't been much recent work in this area because there just aren't many people who think that value learning from demonstrations is interesting (instead, people have moved to pairwise comparisons of trajectories or language feedback). In addition, as LMs have become more capable, most of the existing value learning researchers have also moved on from working with video games to working on LMs.