Tips for Empirical Alignment Research

post by Ethan Perez (ethan-perez) · 2024-02-29T06:04:54.481Z · LW · GW · 4 comments

Contents

  What success generally looks like
  Tactical Research Tips & Approach
  Workflow
  Reading research papers:
  General Mindset Tips
  Three modes of research
  Work Habits
  Machine Learning/Engineering Footguns
  Default Norms for Projects with Me
4 comments

TLDR: I’ve collected some tips for research that I’ve given to other people and/or used myself, which have sped things up and helped put people in the right general mindset for empirical AI alignment research. Some of these are opinionated takes, based partly on what has helped me. Researchers can be successful in different ways, but I still stand by the tips here as a reasonable default.

What success generally looks like

Here, I’ve included specific criteria that strong collaborators of mine tend to meet, with rough weightings on the importance, as a rough north star for people who collaborate with me (especially if you’re new to research). These criteria are for the specific kind of research I do (highly experimental LLM alignment research, excluding interpretability); some examples of research areas where this applies are e.g. scalable oversight, adversarial robustness, chain-of-thought faithfulness, process-based oversight, and model organisms of misalignment. The exact weighting will also vary heavily depending on what role you’re serving on the team/project. E.g., I’d probably upweight criteria where you’re differentially strong or differentially contributing on the team, since I generally guide people towards working on things that line up with their skills. For more junior collaborators (e.g., first time doing a research project, where I’ve scoped out the project), this means I generally weigh execution-focused criteria more than direction-setting criteria (since here I’m often the person doing the direction setting). Also, some of the criteria as outlined below are a really high bar, and e.g. I only recently started to meet them myself after 5 years of doing research and/or I don’t meet other criteria myself. This is mainly written to be a north star for targets to aim for. That said, I think most people can get to a good-to-great spot on these criteria with 6-18 months of trying, and I don’t currently think that many of these criteria are particularly talent/brains bottlenecked vs. just doing a lot of deliberate practice and working to get better on these criteria (I was actively bad at some of the criteria below like implementation speed even ~6 months into me doing research, but improved a lot since then with practice). With that context, here are the rough success criteria I’d outline:

Tactical Research Tips & Approach

For highly empirical research, it’s critical to get quick feedback and iterate on ideas rapidly. Jacob Steinhardt has a great blog post describing that a really good strategy for doing research is to “reduce uncertainty at the fastest possible rate” — with language model and alignment research, you can often reduce uncertainty really quickly, as little as a single message to GPT4 on your phone or Claude in Slack. This can be a huge win over e.g. launching a large training run or set of API calls to get back results, and means you can gain 1+ OOMs more information (per unit time) about what will work well, just by being careful to derisk ideas in the quickest way possible. (Most of the ideas in this doc are focused around this idea, and I’m not discussing project selection which is also important but orthogonal to the details below on how to do research.)

Workflow

Below, I’m including my workflow for “getting models to do something” as quickly as possible. It’s the general strategy I’ve used for prototyping ideas like generating evaluations with LMs, red teaming language models with language models, training language models with language feedback, etc. (If this workflow doesn’t work directly for some research task you’re doing, e.g. interpretability, then it’s at least an illustrative example of how to prioritize experiments in another setting.)

Try versions of an idea in this order (only skip a step if you have a very strong reason to do so):

Other practical tips:

  1. If looking through lots of sampled text:
    1. Find some automatic, proxy metric to capture what you’re looking for. Looking at raw samples is good, but hard if you’d otherwise need to look at a lot of samples e.g. to tune some sampling hyperparameter.
    2. Do a data-atlas visualization (e.g. UMAP for dimensionality reduction, then make an interactive/hoverable plot with plotly — something like this)
  2. If sample quality isn’t high enough, or the model is making too many mistakes (e.g., on math or coding):
    1. Lower the temperature (e.g. 0, 0.6, or 0.8) or top-p (e.g., 0.8 or 0.95).
    2. Use best-of-N sampling (if you aren’t already), with a higher N (8 is alright, 100 is great)
  3. Learn keyboard shortcuts: for basically everything you do. You should rarely touch your mouse/trackpad. Keyboard shortcuts help you not lose your flow if you e.g. need to switch tabs, jump to a new place in the codebase, etc. Pairing with people is a great way to pick up new tips and tricks here!
  4. Always be thinking about what the best next experiment you run should be: When you show experimental results (in meetings or in slack), you should also include discussion of your proposed next possible steps immediately after (and proposed prioritization). The best researchers are able to iterate between running experiments and deciding on the best next step independently. Getting in the practice of proposing the next experiments is helpful for:
    1. Seeding the discussion on what we should do next (the person on the project will often have the most context on what makes sense to try from having looked at the experimental results most closely)
    2. Getting practice at figuring out what the next experiment is to run, and getting your other collaborator’s feedback on how your thinking around what experiment to run next could be improved
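
As a rough illustration of the best-of-N tip above, here is a minimal sketch in Python. The names best_of_n, sample_fn, and score_fn are hypothetical and not taken from any library: sample_fn stands in for whatever call draws one completion from your model, and score_fn for whatever proxy metric or preference model you use to rank samples. The toy sampler and scorer exist only so the sketch runs on its own.

```python
import random
from typing import Callable, List, Tuple


def best_of_n(
    prompt: str,
    sample_fn: Callable[[str], str],
    score_fn: Callable[[str, str], float],
    n: int = 8,
) -> Tuple[str, float]:
    """Draw n completions for prompt and return (best completion, its score)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates: List[str] = [sample_fn(prompt) for _ in range(n)]
    scores = [score_fn(prompt, c) for c in candidates]
    # Pick the index of the highest score (first one wins on ties).
    best_index = max(range(n), key=lambda i: scores[i])
    return candidates[best_index], scores[best_index]


# Toy stand-ins (hypothetical): a "model" that guesses an integer and a
# "reward" that prefers guesses close to 42. Swap in real model calls.
_rng = random.Random(0)


def toy_sample(prompt: str) -> str:
    return str(_rng.randint(0, 100))


def toy_score(prompt: str, completion: str) -> float:
    return -float(abs(int(completion) - 42))


if __name__ == "__main__":
    best, score = best_of_n("What is 6 * 7?", toy_sample, toy_score, n=100)
    print(best, score)
```

Raising n costs more samples but can only improve the top score within a given batch, which is why a larger N is a cheap first thing to try when sample quality is too low.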

Reading research papers:

  1. Generally not very important — low value of information relative to running your own experiments. Exceptions:
    1. You’re starting a new project, and need to learn what’s been done vs. not before, to know where to make a contribution. Also to pick up tips and tricks on the domain you’re in, if there’s actually relevant stuff to what you’re working on.
    2. You’re in the middle of a project, and a really relevant paper comes out (in which case you should read it closely)
    3. You're in an area of alignment where a lot of actually-relevant prior work has been done (e.g. adversarial robustness, backdoors)
  2. Where to find relevant papers:
    1. For being up to date in general, Twitter + what people share on Slack + what your collaborators/colleagues send you is a pretty good 80/20. For Twitter, you can follow some of the people I do (link), especially if any in particular stand out to you. (Aran Komatsuzaki and AK are pretty good sources of interesting LLM papers)
  3. How to read research papers:
    1. For tangentially related research — Just read until you have the main idea: Title+Abstract, Figure 1, maybe the intro if it’s poorly written and you still don’t get what they did. Usually stop there — if reading more seems pretty helpful, then consider skimming the remaining tables/figures, or (in rare cases, if the paper seems very relevant) keep reading the paper like a normal person, until you get the main idea (skip related work)
    2. For directly related research to what you’re doing — Read the whole paper: From start to finish, maybe even checking the appendix where it seems relevant (from looking at the appendix references in the paper). A paper directly in your research area is usually pretty rare and a great gift which can give you a lot of tips for your next project.

General Mindset Tips

  1. Every experiment is a win: What matters is whether or not you’re learning about a problem. If we’re learning, we’re winning, even if the experiment didn’t “work.” If you get weird or unexpected results, you're getting a lot of bits of information about what’s going on. This mindset is both more accurate and more sustainable than the alternative (rooting for every experiment to succeed), since you’ll be more robust to the inevitable many cases where your experiment fails. There’s almost always some interesting direction to pivot to (esp. nowadays, when the most recent scale of models is so unexplored). If you’re hitting diminishing returns on the project, then it’s totally fine and great to switch projects. It’s easy to get tunnel-visioned into thinking that the direction you’re working on is the only direction out there, but you’ll usually realize there’s no dearth of exciting projects once you start chatting with other people about projects and brainstorming more generally.
  2. You’re almost never actually scooped: Junior researchers (including/especially my past self) tend to look at work that other people have done and see it as closer to what they’re doing than it actually is. Especially in alignment, very few people (sadly) are actually working on the same problems as we are, so it’s common for people to do something directionally related (e.g. Chain of Thought) but not directly related / with the same motivation (Chain of Thought Faithfulness).
  3. It’s a win if you are scooped: In the rare case you are actually scooped, that’s a win! Someone else did your research for you, and you can bootstrap off of what they did to answer newer, cooler, and even more frontier questions than before. Also, you’ll get to compare your results to theirs, and there will almost certainly be interesting differences (e.g., they looked at pretrained LMs and you’re looking at RLHF models). Similarities in the results can validate your results/uncertainties, and differences often show interesting or surprising things relative to what you’ve found. So it’s often just a gift when someone else has also run the experiments you did (it’s often the best way to get signal on what you’re working on)

Three modes of research

(Most projects will involve some amount of each)

  1. Exploratory phase: (Beginning of a new project)
    1. Talk with various people on your team to get a sense of what the important problems to work on are, and which ones are tractable. For junior researchers working with someone more senior, take their guidance on the general direction and explore ideas at the medium- or low- level (e.g., “given that I’m working on X topic/direction, what are the methods that seem likely to work best given prior work”)
    2. Run some quick experiments to prototype various ideas, using the workflow described earlier (anything you can run in a day or less, e.g. preferring faster-to-implement/run experiments). Write these up and get feedback to see how promising the idea is to keep pursuing, if there are lots of follow-up ideas, etc.
    3. Read/skim papers to get a sense of what’s been done in the area(s) you’re thinking about. (Talking with people can often be superior though, in that they’ll often be able to point you to what’s been done or related projects others are doing)
  2. Execution phase: (Vast majority of the time on a project)
    1. Generally aim to always have some experiment running, basically 24/7. This doesn’t always make sense (e.g., if your experiments run super quickly), but if your experiments take ≥8h, you should have experiments running at least 50% of the time (e.g., running something overnight, looking at the results + implementing a follow-up during the day).
    2. Tailor your experiments to take no longer than ~16h if you’re running them frequently. This lets you run the following loop:
      1. Run an experiment overnight
      2. Look at the results in the morning, figure out what to run next, implement it, and then go back to step 1
    3. Anything longer than 16h makes it really hard to get quick feedback, and it’s almost never worth it. Tips for reducing the runtime of your job:
      1. If you’re running on a model with slow latency (e.g., GPT4 or LLAMA 65B), think hard if there’s a way to run it on a smaller model (e.g. GPT3.5 or LLAMA 7B), or faster somehow (e.g., using quantization during inference/finetuning, LoRA during finetuning, using SGD instead of Adam to use less GPU memory, fewer epochs but larger learning rates and smaller batch sizes, etc.)
      2. If you’re running on the full test set of some dataset(s), try reducing the number of examples you’re running on to 1k or even 300. (300 is the minimum for e.g. plotting scaling laws and getting clean trends in some experiments we did for the inverse scaling prize, and I probably wouldn’t recommend going lower if you want clean signal, 1k is pretty safe)
      3. If you’re running an RL experiment, try a Best-of-N (BoN) version (or at least try BoN versions first, to derisk your RL experiment) – BoN will let you see what high reward samples look like (without requiring any training or fancy code). That will quickly tell you if there’s anything wrong e.g. with your reward function
      4. Shorten your prompt, e.g., by using fewer (but maybe more carefully-chosen) few-shot examples, or by using shorter few-shot examples. Or potentially use a very short few-shot prompt (but with a larger N for BoN sampling)
      5. If you’re sampling and using custom/unusual stop sequences/tokens, make sure those stop sequences are getting hit when you sample (so you’re not accidentally sampling more than you expect)
      6. Generate fewer tokens, e.g., by reducing your max number of tokens sampled, and/or getting the model to do the task while needing to output a smaller number of tokens (or biasing it with few-shot examples which indicate the model should give shorter responses)
  3. Writing phase: For communication internally or externally.
    1. Internal communication: I’ll often just start a doc as I’m running experiments, and then add info about the experimental results as they come in (e.g. various plots/stats, and my observations/takeaways about the results — see this example of research log I’ve kept in the past). It’s helpful to keep a record for myself, for discussing in weekly meetings, and for pointing others to. Depending on how detailed your experimental log is, you might want to make a more cleaned up version for others’ consumption (e.g., weekly meetings). Probably fine to not spend too much time on this (and just spend time as a function of the number and amount of context that people reading it will have)
    2. External communication: Usually writing a paper for arXiv — this takes usually 2-4+ weeks (and on the longer end if it’s your first time), not just strictly because of the writing itself, but because you’ll need to run more experiments to really clarify your results (e.g., maybe you have the main results, but you’ll realize that there are some missing experiments to really know what’s going on or make the point you’re trying to make clearly, and then you’ll need to run those).
    3. Writing Tips: See https://ethanperez.net/easy-paper-writing-tips/ for good ML paper writing tips. Main text includes mostly style/clarity tips, with links to more substantial recommendations for Computer Science paper writing at the bottom. I’d recommend it for anyone writing papers to a machine learning audience, and require it for any paper where I’m the main supervisor. If you want examples of papers that follow these guidelines (and general good style for ML-accessible papers), you can read any of my most recent first-author papers (where I try to incorporate all of the things I’ve learned about writing ML papers, in particular my paper on red teaming language models with language models)
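
For the execution-phase tips on cutting runtime, here is a minimal sketch of two cheap checks: subsampling an evaluation set to a fixed size, and spotting samples that ran to the token limit (a sign that stop sequences may not be firing). All function names are hypothetical. The binomial standard error is a standard formula added here for intuition; the 300 and 1k figures above come from experience, not from this calculation.

```python
import math
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def subsample(examples: Sequence[T], k: int = 1000, seed: int = 0) -> List[T]:
    """Return a random subset of at most k examples, identical for a given seed."""
    if k >= len(examples):
        return list(examples)
    return random.Random(seed).sample(list(examples), k)


def accuracy_standard_error(accuracy: float, num_examples: int) -> float:
    """Binomial standard error of an accuracy estimated on num_examples examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / num_examples)


def fraction_truncated(token_counts: Sequence[int], max_tokens: int) -> float:
    """Fraction of samples that used the full token budget.

    A high value suggests the stop sequences are rarely being hit, so you
    are paying for more generated tokens than you intended.
    """
    if not token_counts:
        raise ValueError("token_counts must be non-empty")
    hit_limit = sum(1 for count in token_counts if count >= max_tokens)
    return hit_limit / len(token_counts)


if __name__ == "__main__":
    test_set = list(range(20_000))  # placeholder for real examples
    small = subsample(test_set, k=300)
    print(len(small))
    # Worst case (accuracy 0.5): about 0.029 at 300 examples, 0.016 at 1000.
    print(accuracy_standard_error(0.5, 300), accuracy_standard_error(0.5, 1000))
```

Fixing the seed keeps the subset the same across runs, so differences between experiments reflect the change you made and not a different draw of examples.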

Work Habits

  1. Hard work pays off a lot: The kind of very empirical/experimental work that is typical of alignment work on large language models just benefits a lot from running as many experiments as you can, tinkering a lot, and trying lots of stuff (more so than being smart or knowledgeable in a lot of cases). Often, there are a lot of reasonable-sounding ideas to try, and it’s just actually unclear what will work, so you need to take a lot of shots on goal to find something that works (and e.g. the 5th or 20th thing on the list of things to try is the first one that works). As a bonus, you get a lot better at figuring out what experiments to run, picking the right experiment to run the first time, etc., so there’s a strong rich get richer effect here (which is especially important for junior researchers in getting momentum). Also, since more things work when you try more stuff, it’s easier to stay motivated (since you minimize the amount of time that nothing’s working). It also goes without saying that what matters is the number of productive hours you’re spending (often in empirical research, basically how much time you’re spending coding and running experiments), rather than the absolute number of hours, and it’s often easier to optimize how productive your hours are over increasing the amount of time you’re working.
  2. Work sustainably: That said, it’s also really, really important to make sure you’re working sustainably, and this is a particular luxury of doing research as opposed to e.g. product/engineering (no immediate deadlines). Lots of researchers burn out by working in unsustainable ways (also since the object-level work can be hard and goals ill-defined), or by developing an aversion to what they do, and it’s really good to avoid this (also, burn out just sucks). I (and I think other researchers) often build up towards working more hours over the course of months or years (only increasing hours when it feels comfortable and great to do so)

Machine Learning/Engineering Footguns

  1. A single preference model (PM) score in isolation doesn’t have a clear interpretation — PM scores are only trained to make sense when compared to other PM scores given the same context. If the context is different, it’s unclear how to interpret the PM scores in comparison to each other.
    1. To make PM scores more comparable across different contexts, it’s common to construct a reference response (“I’m sorry, I can’t help with that.”), and then use the difference between the PM score on the current response and the reference response. This measure effectively gets you something like a probability that the current response is better than the reference response. To get an actual probability, you should treat the PM scores of the current response and the reference response as logits, and then take a softmax over both logits. This will lead to a calibrated probability (since this is how the PM is trained, and our PMs are good at producing calibrated probabilities)
  2. The probabilities from LMs and PMs are calibrated (for PMs, you’ll need to compare a response against a reference), but those from RLHF models are not
    1. In general, don’t try to interpret the actual raw/continuous probabilities from RLHF model output distributions. The rankings of tokens are meaningful (basically since it’s the RLHF model’s prediction of which token would get higher reward), but actual probabilities can be viewed as just an artifact of undertraining (if you’re not using a KL penalty, at least). A fully-trained RLHF model should be deterministic, and probabilities are just a way for earlier checkpoints in RLHF to explore into higher reward policies.
    2. Example: When using an RLHF model to classify some text, the top-1 predictions make sense to look at but not e.g. the average probabilities on the RLHF model. If you want to look at probabilities, use a prompted LM or PM (I typically find good results with a PM), or tune the temperature of the softmax used for computing the output distribution of your RLHF model (to calibrate the probabilities, as in Kadavath et al. 2022)
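
The two calculations described above can be sketched in a few lines of Python. The function names are hypothetical, and the scores are assumed to be plain floats returned by your preference model. A softmax over two logits is the same as a sigmoid of their difference, which is what the first function computes. The second function only applies a temperature; choosing that temperature (e.g. by fitting it on held-out labeled data) is left out of this sketch.

```python
import math
from typing import List, Sequence


def pm_win_probability(score_response: float, score_reference: float) -> float:
    """P(response is preferred over reference), treating both PM scores as logits.

    Softmax over two logits reduces to sigmoid(score_response - score_reference).
    """
    diff = score_response - score_reference
    # Numerically stable sigmoid: never exponentiate a large positive number.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    exp_diff = math.exp(diff)
    return exp_diff / (1.0 + exp_diff)


def softmax_with_temperature(logits: Sequence[float], temperature: float) -> List[float]:
    """Softmax of logits / temperature. Changes confidence, never the ranking."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Hypothetical PM scores for a response and the fixed reference response.
    print(pm_win_probability(score_response=2.0, score_reference=0.0))
    # The same logits look less confident at a higher temperature.
    print(softmax_with_temperature([4.0, 0.0], temperature=4.0))
```

Because dividing by a positive temperature preserves the order of the logits, top-1 predictions are unchanged; only the probabilities move.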

Default Norms for Projects with Me

Below are the default norms I follow for most external-to-Anthropic projects I supervise (might be helpful for both new researchers and new research mentors, e.g. for SERI MATS). I think they're reasonable defaults (especially if you're not sure where to start with project norms or project management).

The working style here is particularly tailored towards junior collaborators, so if you’re a more experienced collaborator (e.g., 2+ first or co-first author ML papers under your belt), feel free to work in a more independent manner than described below. Any of these are up for discussion e.g. if someone on the project thinks a different way of working together would work better, relative to what's described below.

  1. Please default to presenting results during weekly meetings as a slideshow (e.g., including all of the plots, tables, and text that you’d like to discuss). This helps to streamline the discussion significantly, minimize the amount of time that is needed to dig up some relevant results, and also helps to focus on the most relevant points where you’d like feedback from me. I personally also find it much easier to pay attention to than e.g. purely verbal discussion of results or results presented in a doc with a lot of text, though this probably varies a little by supervisor (a doc of results is probably the next best option, and can be fine if you prefer).
  2. It helps to have a concrete agenda for what to discuss during meetings (can take as little as 5-10 minutes to think about before a meeting, or up to e.g. 1 hour if it involves more explanation/plotting/etc. and the meeting is short / there’s a lot to cover). In particular, it helps to have:
    1. Plots, tables, or other concrete results to show (e.g., ideally in a notion or google doc), so there are specific things to discuss, point at, and look at. Concrete results help to ground the discussion and give high enough information to people who are supervising the project to get useful feedback, notice things that you might not have, etc. — helpful for catching things that talking over results out loud / at a high-level wouldn’t catch
    2. A concrete list of proposed, prioritized next steps, given your existing results and the overall project goal. Even if you’re unsure of what to do next, it’s easier for others at the meeting to discriminate (rank a proposed list of steps) than to generate (ideas for what next steps to take), especially given that they don’t have as much context as you do on the project
    3. A concrete list of questions and places where it’d be helpful to get input from others at the meeting
    4. A sense of how long to spend discussing each point above, so that everything gets discussed without running over time. It’s common to spend disproportionately more time on points brought up early on in a meeting, since there’s a feeling of more time, which leaves much less time (or no time) for other important discussion points. Alternatively, bring the most important points up for discussion at the start of a meeting.
  3. I (and I think many other research supervisors) generally leave it to the main people actively working on the project / running experiments to determine what to discuss and present, since they have the most context on what needs to be discussed. I’ll sometimes come with specific questions to the meeting, but by default I won’t.
  4. If you have a conflict for a meeting (like a 1:1), please try to cancel or reschedule the calendar invite in advance, as far in advance as you can. For meetings that are in the morning in my time zone, please try to cancel the previous day (so I can get some extra sleep :) ) – otherwise, it’s pretty low-cost to cancel a meeting last-minute (since I’ll just get the time back).
  5. Feel free to book a time on my calendar any time you think there’s something that’d be helpful to chat about (feel free to book liberally, and I’ll let you know if the frequency is too high – that’s basically never happened though, and the usual failure mode is that people underbook meetings, especially 1:1s)
  6. General rules for scheduling meetings/invites:
    1. Make them modifiable by anyone on the invite
    2. Have a Google Meet link attached
    3. Have a specific where-to-meet location (e.g. booking a physical room if we're in the same office space)
    4. Get Google Meet premium so meetings don't end at 1h
    5. "picture-in-picture mode" with google meet lets you see people's faces while sharing your screen / being on other tabs, which is helpful for being able to see people's reactions while presenting results / sharing screen
    6. Generally get comfortable with different screen sharing options, e.g., sharing tab-only, full window, joining in presentation mode, etc., so that if there are any hiccups, you can quickly fix any issues that come up
    7. (A bunch of the above can be automatic/defaults, so it can be a one-time change)

4 comments

Comments sorted by top scores.

comment by Henry Sleight (ResentHighly) · 2024-02-29T18:01:18.780Z · LW(p) · GW(p)

First off: as one of Ethan's current Astra Fellows (and having worked with him since ~last October) I especially think his collaborators in MATS and Astra historically underweight how valuable overcommunicating with Ethan is, and routinely underbook meetings to ask for his support.

Second, I think this post is so dense with useful advice that I made Anki flashcards of it using GPT-4 (generated via ankibrain [https://ankiweb.net/shared/info/1915225457], with small manual edits).

You can find them here: https://drive.google.com/file/d/1G4i7iZbILwAiQ7FtasSoLx5g7JIOWgeD/view?usp=sharing

comment by qxcv · 2024-03-02T00:35:59.640Z · LW(p) · GW(p)

For highly empirical research, it’s critical to get quick feedback and iterate on ideas rapidly. Jacob Steinhardt has a great blog post describing that a really good strategy for doing research is to “reduce uncertainty at the fastest possible rate”

Michael Bernstein's slides on velocity are a great resource for learning this mindset as well. I particularly like his metaphor of the "swamp". This is the place you get stuck when you really want technique X to work for the project to progress, but none of the ways that you've tried applying it have succeeded. The solution is to have high velocity: that is, to test out as many ideas as possible per unit time until you get out of the swamp. Other highlights of the slide deck include the focus on answering questions rather than doing engineering, and the related core-periphery distinction between things that are strictly needed to answer a question & those that can be ignored/mocked up/replaced for testing (which echoes the ideas in the "workflow" section of this post).

(Although they're similar, I'd argue that Michael's approach is easier to apply to empirical alignment research than Jacob's "stochastic decision process" approach. That's because falsifying abstract research ideas in empirical deep learning is hard (impossible?), and you don't get much generalizable knowledge from failing to get one idea to work. The real aim is to find one deep insight that does generalize—hence the focus on trying many distinct approaches.)

Replies from: ethan-perez
comment by Ethan Perez (ethan-perez) · 2024-03-03T04:38:14.416Z · LW(p) · GW(p)

Yeah, I think this is one of the ways that velocity is really helpful. I'd probably add one caveat specific to research on LLMs, which is that, since the field/capabilities are moving so quickly, there's much, much more low-hanging fruit in empirical research than almost any other field of research. This means that, for LLM research specifically, you should rarely be in a swamp, because that means that you've probably run through the low-hanging fruit on that problem/approach, and there's other low-hanging fruit in other areas that you probably want to be picking instead.

(High velocity is great for both picking low-hanging fruit and for getting through swamps when you really need to solve a particular problem, so it's useful to have either way)

comment by Review Bot · 2024-03-01T21:10:05.670Z · LW(p) · GW(p)

The LessWrong Review runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2025. The top fifty or so posts are featured prominently on the site throughout the year. Will this post make the top fifty?