Which evals resources would be good?

post by Marius Hobbhahn (marius-hobbhahn) · 2024-11-16T14:24:48.012Z · LW · GW · 4 comments

I want to make a serious effort to create a bigger evals field. I’m very interested in which resources you think would be most helpful. I’m also looking for others to contribute and potentially some funding.

Which resources would be most helpful? Suggestions include:
1. Open Problems in Evals list: A long list of open relevant problems & projects in evals.
2. More Inspect Tutorials/Examples: Specifically, examples with deep explanations, examples of complex agent setups, and more detailed examples of the components around the evals.
3. Evals playbook: A detailed guide on how to build evals with detailed examples for agentic and non-agentic evals.
4. Salient demos: I don’t even necessarily want to build scary demos, I just want normal people to be less hit-by-a-truck when they learn about LM agent capabilities in the near future.
5. More “day-to-day” evals process resources: A lot of evals expertise is latent knowledge that professional evaluators have accumulated over time.
6. An evals slack
7. Talk through the most important evals papers
8. Evals workshops and conferences
Are you interested in contributing to any of these efforts?
1. Right now I think the following types of contributions would be helpful:
  1. People who want to contribute more detailed Inspect tutorials / Examples as described here. I think this is a great opportunity for people who are considering working full-time on evals and want to test fit / make a name for themselves.
  2. Someone who wants to help me coordinate these efforts part time or full-time. This would be a mix of playing around with evals yourself, coordinating with other actors in evals and probably a decent amount of writing.
2. I’m very interested in other contributions as well.
3. If you’re interested, fill out this form.
Is anyone interested in funding these efforts?
1. I’d be happy to function as a regrantor. I wouldn’t take a cut myself. I imagine most grants would be between $5k and $15k to pay for someone building resources full-time for a few months.
2. I’m also happy to be more of a coordinator / filter and forward people to funders if a funder is willing to do fast-turnaround grants.
3. I think the total amount of funding that would be helpful to get this off the ground is $50k-100k for the next year.

I wrote a more detailed post here (relevant parts copied below) but I’m primarily interested in other people’s ideas and preferences.

##########################################################

Future plans and missing resources

Note: the following is largely “copy-paste what Neel Nanda did for the mechanistic interpretability community but for evals”. I think his work was great for the field, so why not give it a shot?

Broadly speaking, I want to make it easy and attractive for people to get involved with evals.

Easy: Optimally, getting into evals should be as simple as logging into Netflix - one click, and you’re instantly part of a curated experience. There should be plenty of resources that feel enjoyable and neither too easy nor too hard for your level of experience.
Attractive: When I first read through Neel Nanda’s A Mechanistic Interpretability Analysis of Grokking, I was hooked. I just needed to understand why the network is grokking. I couldn’t help myself, and spent a few days replicating the results and answering some additional questions about grokking. I didn’t do it because I had carefully reasoned that this was the most effective use of my time, I did it because I was intrigued. I think there are many similar situations for evals, e.g. when an LM agent unexpectedly solves a hard eval and you feel like you need to understand what else it is capable of, when it solves the eval in a clever way you hadn’t anticipated, or when you find a new powerful jailbreak. I think so far, the evals field has not done a good job conveying these aspects and that should be changed.

I think the following resources would be good:

Open Problems in Evals list: A long list of open relevant problems & projects in evals. They should be a mix of ready-to-go well-scoped problems for newcomers as well as more complex, less well-scoped problems, e.g. for PhD students to work on. I have started such a list and multiple people have already confirmed that they will contribute to it. If you’re interested in contributing, please reach out.
More Inspect Tutorials/Examples: Inspect already has a repository of examples, which is great. When I’m not familiar with a tool, I typically look for detailed examples and copy-paste or adapt someone else’s ideas. The more examples there are, the higher the chance that one of them overlaps with any given person’s use case. The kind of examples I would like to see more of are:
1. Examples with deep explanations, e.g. to get a much better sense of the considerations the evaluator is going through when designing the eval.
2. Examples of complex agent setups, e.g. with multiple agents or new tools, to show how to build them.
3. More detailed examples of the components around the evals, e.g. how to write detailed tests for the different parts of the eval. Another good post would be going through the paper “Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations”, showing results on real evals and making the code available.
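As a rough illustration of the kind of error-bar computation such a post might start from, here is a minimal sketch that reports an eval's mean score with a standard error and an approximate 95% confidence interval based on the normal approximation. It is not a reproduction of the paper's full method, the function name `mean_with_ci` is made up for this example, and the scores are hypothetical placeholder data rather than real eval results.

```python
import math
from statistics import mean, stdev


def mean_with_ci(scores, z=1.96):
    """Return (mean, standard_error, (low, high)) for per-question scores.

    Uses the sample standard deviation and a normal approximation,
    so the interval is only a rough guide for small sample sizes.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to estimate a standard error")
    m = mean(scores)
    sem = stdev(scores) / math.sqrt(n)  # standard error of the mean
    return m, sem, (m - z * sem, m + z * sem)


# Hypothetical per-question results (1 = correct, 0 = incorrect); not real eval data.
scores = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
m, sem, (low, high) = mean_with_ci(scores)
print(f"accuracy = {m:.2f}, standard error = {sem:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

With only ten questions the interval is very wide, which is the practical point: small evals often cannot distinguish between models whose scores look different at first glance.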
Evals playbook: A detailed guide on how to build evals with detailed examples for agentic and non-agentic evals. It should contain our current best guess and be updated over time. At some point, I started such a doc, but I think almost all of the benefits come from the nitty-gritty details and examples, and I don’t have the time to write those. If someone wants to work with me on this, I might be able to give high-level guidance.
Salient demos: I think many people, including policymakers who follow the space of AI fairly closely, do not have a good understanding of a) what current LM agents are capable of and b) what the world will look like in a year or so when millions of LM agents are widely integrated in the economy. I think it would be great to have more demos that are very salient to people who are not working with frontier LLMs every day. These demos could cover safety-related aspects like scheming and CBRN misuse but I’m primarily looking for something that feels relevant to the life of normal people. I think CivAI has a few good demos in that vein but I’d be even more inclusive with topics. I don’t even necessarily want to build scary demos, I just want normal people to be less hit-by-a-truck when they learn about LM agent capabilities in the near future.
More “day-to-day” evals process resources: A lot of evals expertise is latent knowledge that professional evaluators have accumulated over time. It would be great to get a better sense of how they work on a day-to-day basis. For example, I would welcome a video of someone building a semi-realistic eval for 60-120 minutes or a video of someone running an eval, e.g. to understand which evidence they are looking for, how they decide which scripts to read, how they track interesting findings, and more seemingly mundane aspects like that. This would be along the lines of Neel Nanda’s Real-Time Research Recording: Can a Transformer Re-Derive Positional Info? (though I would want it to be a little bit more prepared and fast-forwarding through the debugging parts xD).
An evals slack: Maybe an evals slack would be nice. I’m currently not sure if the demand is large enough but if I get enough positive signal, I’d make one and put some effort into getting it off the ground, e.g. to make it useful for its members. UK AISI recently launched the Inspect Slack, which I will follow closely to assess the demand for, and feasibility of, a more general evals slack independent of a specific framework.
Talk through the most important evals papers: Neel has made some videos of him going through interpretability papers he found interesting. I may do the same for evals papers (or maybe a bit closer to Yannic Kilcher’s style). I’d probably start with the top papers from our opinionated evals reading list but welcome other suggestions.
Evals workshops and conferences: It would be nice to have specific workshops at big ML conferences. I don’t intend to spend any effort on this myself but if someone is keen to organize one, please do.

If you’re keen to be involved in any of the above, please reach out. In case you want to spend a few months on producing evals materials, I might be able to find funding for it (but no promises). If you’re a funder and want to support these kinds of efforts, please contact me. I would only serve as a regrantor and not take any cut myself.

4 comments

Comments sorted by top scores.

comment by Lukas Petersson (lukas-petersson-1) · 2024-11-16T22:13:00.354Z · LW(p) · GW(p)

This is probably not the first barrier to getting into evals, but I have an AI safety startup that designs evals. However, we don't have the capacity to also do good elicitation. I think we lose a lot of signal from our evals because our agent is too weak to explore properly. We're currently using Inspect's basic_agent. METR's modular_public is better, but we otherwise prefer Inspect over Vivaria. I think open-sourcing a better agent would be positive for the evals community without contributing to capabilities.

comment by Daniel Tan (dtch1997) · 2024-11-17T18:39:00.412Z · LW(p) · GW(p)

As someone with very little working knowledge of evals, I think the following open-source resources would be useful for pedagogy:

A brief overview of the field covering central concepts, goals, challenges
A list of starter projects for building skills / intuition
A list of more advanced projects that address timely / relevant research needs

It's also hard to overstate the importance of tooling that is:

Streamlined: i.e. handles most relevant concerns by default, in a reasonable way, such that new users won't trip on them (e.g. for evals tooling, it would be good to have simple and reasonably effective elicitation strategies available off-the-shelf)
Well-documented: both at an API level, and with succinct end-to-end examples of doing important things

I suspect TransformerLens + associated Colab walkthroughs has had a huge impact in popularising mechanistic interpretability.

comment by lemonhope (lcmgcd) · 2024-11-17T06:54:32.230Z · LW(p) · GW(p)

Maybe it should be a game that everyone can play

comment by bayesian_kitten · 2024-11-19T01:31:06.760Z · LW(p) · GW(p)

Hi there! I'm Ameya, currently at the University of Tübingen. I share similar broad interests and am particularly enthusiastic about working on evaluations. I would love to be part of a broader evals group if one is created (slack/discord)!

We organized an evals workshop recently! It had a broader focus and wasn't specifically related to AI safety, but it was a great experience -- we are planning to keep running more iterations of it and sharpen its focus.
