Which evals resources would be good?

post by Marius Hobbhahn (marius-hobbhahn) · 2024-11-16T14:24:48.012Z


I want to make a serious effort to create a bigger evals field. I’m very interested in which resources you think would be most helpful. I’m also looking for others to contribute and potentially some funding. 

  1. Which resources would be most helpful? Suggestions include:
    1. Open Problems in Evals list: A long list of relevant open problems & projects in evals.
    2. More Inspect Tutorials/Examples: Specifically, examples with deep explanations, examples of complex agent setups, and more detailed examples of the components around the evals (a minimal sketch of an Inspect eval appears after this list).
    3. Evals playbook: A detailed guide on how to build evals with detailed examples for agentic and non-agentic evals.
    4. Salient demos: I don’t even necessarily want to build scary demos; I just want normal people to feel less hit-by-a-truck when they learn about LM agent capabilities in the near future.
    5. More “day-to-day” evals process resources: A lot of evals expertise is latent knowledge that professional evaluators have accumulated over time. 
    6. An evals Slack
    7. Walkthroughs of the most important evals papers
    8. Evals workshops and conferences
  2. Are you interested in contributing to any of these efforts?
    1. Right now I think the following types of contributions would be helpful:
      1. People who want to contribute more detailed Inspect tutorials/examples as described here. I think this is a great opportunity for people who are considering working full-time on evals and want to test their fit / make a name for themselves.
      2. Someone who wants to help me coordinate these efforts part-time or full-time. This would be a mix of playing around with evals yourself, coordinating with other actors in evals, and probably a decent amount of writing.
    2. I’m very interested in other contributions as well. 
    3. If you’re interested, fill out this form.
  3. Is anyone interested in funding these efforts?
    1. I’d be happy to function as a regrantor. I wouldn’t take a cut myself. I imagine most grants would be between $5k and $15k to pay for someone building resources full-time for a few months.
    2. I’m also happy to be more of a coordinator / filter and forward people to funders if a funder is willing to do fast-turnaround grants. 
    3. I think the total amount of funding that would be helpful to get this off the ground is $50k-100k for the next year.
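To make the Inspect tutorials item above concrete, here is roughly the smallest possible Inspect eval (a sketch assuming the inspect_ai package’s current API; the task name and sample are purely illustrative):

```python
# Minimal Inspect eval: one sample, a plain generate() solver, exact-match scoring.
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate

@task
def arithmetic_smoke_test():
    return Task(
        dataset=[Sample(input="What is 12 * 12? Answer with only the number.", target="144")],
        solver=generate(),
        scorer=match(),
    )

# Run with e.g.: inspect eval arithmetic_smoke_test.py --model openai/gpt-4o
```

Tutorials would ideally start from something this small and build up to complex agent setups and the scaffolding around them.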

 

I wrote a more detailed post here (relevant parts copied below), but I’m primarily interested in other people’s ideas and preferences.

 

##########################################################

 

Future plans and missing resources 

Note: the following is largely “copy-paste what Neel Nanda did for the mechanistic interpretability community but for evals”. I think his work was great for the field, so why not give it a shot?

Broadly speaking, I want to make it easy and attractive for people to get involved with evals.

  1. Easy: Optimally, getting into evals should be as simple as logging into Netflix - one click, and you’re instantly part of a curated experience. There should be plenty of resources that feel enjoyable and neither too easy nor too hard for your level of experience.
  2. Attractive: When I first read through Neel Nanda’s A Mechanistic Interpretability Analysis of Grokking, I was hooked. I just needed to understand why the network was grokking. I couldn’t help myself and spent a few days replicating the results and answering some additional questions about grokking. I didn’t do it because I had carefully reasoned that this was the most effective use of my time; I did it because I was intrigued. I think there are many similar situations in evals, e.g. when an LM agent unexpectedly solves a hard eval and you feel like you need to understand what else it is capable of, when it solves the eval in a clever way you hadn’t anticipated, or when you find a new powerful jailbreak. I think the evals field has so far not done a good job of conveying these aspects, and that should change.

I think the resources listed at the top of this post would be good.

If you’re keen to be involved in any of the above, please reach out. If you want to spend a few months producing evals materials, I might be able to find funding for it (but no promises). If you’re a funder and want to support these kinds of efforts, please contact me. I would only serve as a regrantor and not take any cut myself.

4 comments


comment by Lukas Petersson (lukas-petersson-1) · 2024-11-16T22:13:00.354Z

This is probably not the first barrier to getting into evals, but I have an AI safety startup that designs evals, and we don't have the capacity to also do good elicitation. I think we lose a lot of signal from our evals because our agent is too weak to explore properly. We're currently using Inspect's basic_agent. METR's modular_public is better, but we prefer Inspect over Vivaria otherwise. I think open-sourcing a better agent would be positive for the evals community without contributing to capabilities.
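For readers who haven’t used it, wiring basic_agent into an Inspect task looks roughly like this (a sketch assuming the inspect_ai API; the dataset, tools, and limits here are illustrative):

```python
# Rough shape of an agentic Inspect task built on basic_agent.
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import includes
from inspect_ai.solver import basic_agent, system_message
from inspect_ai.tool import bash

@task
def agentic_smoke_test():
    return Task(
        dataset=[Sample(input="Find the largest file under /data and report its name.",
                        target="big.bin")],  # illustrative sample
        solver=basic_agent(
            init=system_message("You are an autonomous agent; use the bash tool to solve the task."),
            tools=[bash(timeout=180)],
            max_attempts=2,
        ),
        scorer=includes(),
        sandbox="docker",  # agentic evals typically run tools inside a sandbox
    )
```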

comment by Daniel Tan (dtch1997) · 2024-11-17T18:39:00.412Z

As someone with very little working knowledge of evals, I think the following open-source resources would be useful for pedagogy:

  • A brief overview of the field covering central concepts, goals, challenges
  • A list of starter projects for building skills / intuition
  • A list of more advanced projects that address timely / relevant research needs

Maybe similar in style to https://www.neelnanda.io/mechanistic-interpretability/quickstart

 

It's also hard to overstate the importance of tooling that is:

  • Streamlined: i.e. handles most relevant concerns by default, in a reasonable way, such that new users won't trip on them (e.g. for evals tooling, it would be good to have simple and reasonably effective elicitation strategies available off-the-shelf)
  • Well-documented: both at an API level, and with succinct end-to-end examples of doing important things 

I suspect TransformerLens + associated Colab walkthroughs have had a huge impact in popularising mechanistic interpretability.

comment by lemonhope (lcmgcd) · 2024-11-17T06:54:32.230Z

Maybe it should be a game that everyone can play.

comment by bayesian_kitten · 2024-11-19T01:31:06.760Z

Hi there! I'm Ameya, currently at the University of Tübingen. I share similar broad interests and am particularly enthusiastic about working on evaluations. I'd love to be part of a broader evals group if one is created (Slack/Discord)!

We organized an evals workshop recently! It had a broader focus and wasn't specifically related to AI safety, but it was a great experience; we are planning to keep running more iterations of it and sharpen its focus.