Tips for Empirical Alignment Research

ethan-perez

post by Ethan Perez (ethan-perez) · 2024-02-29T06:04:54.481Z · LW · GW · 4 comments

  What success generally looks like
  Tactical Research Tips & Approach
  Workflow
  Reading research papers:
  General Mindset Tips
  Three modes of research
  Work Habits
  Taking health seriously
  Machine Learning/Engineering Footguns
  Default Norms for Projects with Me

TLDR: I’ve collected some tips for research that I’ve given to other people and/or used myself, which have sped things up and helped put people in the right general mindset for empirical AI alignment research. Some of these are opinionated takes based on what has helped me. Researchers can be successful in different ways, but I still stand by the tips here as a reasonable default.

What success generally looks like

Here are specific criteria that strong collaborators of mine tend to meet, with rough weightings on their importance, as a north star for people who collaborate with me (especially if you’re new to research). These criteria are for the specific kind of research I do (highly experimental LLM alignment research, excluding interpretability); examples of research areas where they apply include scalable oversight, adversarial robustness, chain-of-thought faithfulness, process-based oversight, and model organisms of misalignment.

The exact weighting will also vary heavily depending on what role you’re serving on the team/project. E.g., I’d probably upweight criteria where you’re differentially strong or differentially contributing on the team, since I generally guide people towards working on things that line up with their skills. For more junior collaborators (e.g., first time doing a research project, where I’ve scoped out the project), this means I generally weigh execution-focused criteria more than direction-setting criteria (since here I’m often the person doing the direction setting).

Also, some of the criteria outlined below are a really high bar: I only recently started to meet some of them myself after 5 years of doing research, and there are others I still don’t meet. This is mainly written as a north star of targets to aim for. That said, I think most people can get to a good-to-great spot on these criteria with 6-18 months of trying, and I don’t currently think that many of these criteria are particularly talent/brains bottlenecked, as opposed to requiring a lot of deliberate practice (I was actively bad at some of the criteria below, like implementation speed, even ~6 months into doing research, but have improved a lot since then with practice). With that context, here are the rough success criteria I’d outline:

[70%] Getting ideas to work quickly
- [45%] Implementation speed
  - Able to quickly implement a well-scoped idea. An example of doing really well here is if we talk about an idea one day and decide it’s exciting/worth doing, and you tell me the next day whether it worked
  - Able to run a high volume of experiments. You’re doing really well here if it’s hard for your supervisor to keep up with the volume of the experiments/results you’re showing; 30m or even 60m weekly 1:1 meetings should feel like not long enough to discuss all of the results you have, and you have to filter what we discuss in our weekly meetings to just the most important and decision-relevant results. If some experiments take a while to run, you’re running a lot of other project-relevant experiments in parallel or implementing the next experiment (Exceptions: the experiments you’re running take more than overnight/18h to run and there’s no way to design them to be shorter; or the experiments are very implementation-heavy)
  - Able to design a minimal experiment to test a mid/high-level idea. You run experiments in a way such that you’re rarely compute or experiment-time bottlenecked (especially early in a project), and your experiments are designed to be easy/quick to implement
  - You trade off code quality and implementation speed in the best way for long-run productivity. You bias heavily towards speed in general, except when you notice the project is significantly slowed down by a lack of code quality / good tooling, in which case you’re able to speed up the project by making relevant code improvements
  - Noticing when you’re going slowly and quickly and shamelessly asking for help
- [25%] Ability to get things to work:
  - Things “just work” when you implement them and run them. If you’re trying to get a number to go up, that number goes up. After iterating, your method is the SOTA method.
  - You suggest and choose good next steps fairly independently in between weekly meetings. If I'm supervising your project and we don’t chat between weekly meetings, I’d strongly endorse the decisions you made throughout the week, around what things to try. You’ll have tried great ideas that I hadn’t thought of, because you’re spending more time thinking about the problem and looking at results.
  - You diagnose why something’s not working and make fixes to get things to work
  - You’re able to get ideas working that other people couldn’t, by being effective at coming up with good ideas for quick things to try, predicting which ones will work, and/or trying lots of ideas
[20%] Driving the project direction
- [10%] Medium/low-level, day-to-day direction
  - Knowing/determining which well-motivated research questions are important to answer, and prioritizing/designing experiments to answer those questions
  - Making good decisions about what approach to take, evaluation metrics, etc. Running the right version of an experiment the first time
  - Effective at working independently, e.g., 1h/week meeting + Slack DMs/tags are sufficient to keep you on track and ensure you’re doing maximally valuable work
  - Noticing when you’re not sure you’re tackling the most important research questions or not sure about the next step and quickly asking for help
  - In our weekly meetings, you proactively propose/suggest great next steps for what we should do the next week, with a good prioritization of those steps. The next steps you propose are better than what I’d suggest, and I basically just give you a thumbs up each week, or fairly minor feedback.
- [10%] High-level/conceptual direction
  - Able to independently determine what research directions are important to pursue, either by (1) proactively sourcing ideas from others (including those not on our project) and filtering them for quality/importance/tractability, or (2) figuring out what seems important based on your own ideas/reading/experiments.
  - Noticing when we could probably answer a more important research question or when we’re somewhat lost in direction, then taking the initiative to deconfuse us and e.g. write a doc, lead a discussion, have a 1:1 chat with other team members, or organize a meeting with external discussion to unblock us
  - You’re able to spot great project ideas, and (even better) “just go for it”: try it out without asking, and show a working prototype, effectively derisking new ideas/directions on your own.
[5%] Communicating ideas clearly
- Your slack messages/plots and live presentations/discussions of results are clear and easy to understand. It takes very little time to process information you send over Slack, and a minimal number of back and forths to understand the results (separate from discussing their implications). You find that I and other project supervisors are giving helpful or very helpful feedback during our weekly meetings.
[5%] Other – Varies by person, but includes things like:
- You’re a great teammate, e.g. helping others out where you notice good opportunities, taking initiative to improve things that you see could be improved, doing what needs to get done on the project even if the work isn’t as exciting, and finding great collaborators for the project
- You’re easy to manage. Not necessarily in terms of time needed but along other axes: how receptive you are to feedback, how much emotional energy you add vs. require from members of the team/me in our interactions, whether you’re transparent/communicative about issues you’re facing, whether we have a high trust relationship where it’s easy to discuss various topics
- You’re great at noticing and calling out room for improvement, e.g. in how we’re working together, things I could be doing better, ways our team could be coordinating better

Tactical Research Tips & Approach

For highly empirical research, it’s critical to get quick feedback and iterate on ideas rapidly. Jacob Steinhardt has a great blog post describing that a really good strategy for doing research is to “reduce uncertainty at the fastest possible rate” — with language model and alignment research, you can often reduce uncertainty really quickly, as little as a single message to GPT4 on your phone or Claude in Slack. This can be a huge win over e.g. launching a large training run or set of API calls to get back results, and means you can gain 1+ OOMs more information (per unit time) about what will work well, just by being careful to derisk ideas in the quickest way possible. (Most of the ideas in this doc are focused around this idea, and I’m not discussing project selection which is also important but orthogonal to the details below on how to do research.)

Workflow

Below, I’m including my workflow for “getting models to do something” as quickly as possible. It’s the general strategy I’ve used for prototyping ideas like generating evaluations with LMs, red teaming language models with language models, training language models with language feedback, etc. (If this workflow doesn’t work directly for some research task you’re doing, e.g. interpretability, then it’s at least an illustrative example of how to prioritize experiments in another setting.)

Try versions of an idea in this order (only skip a step if you have a very strong reason to do so):

Zero-shot, high-volume playgrounding your idea in a chat interface (e.g., ChatGPT, Claude.ai, or Gemini):
1. Send 10-100 messages to various models, to investigate the behavior in question or prototype your idea (e.g., do language models have some form of situational awareness, or can language models be used to generate evaluation data)
2. Update the prompt based on the behavior/mistakes you see — e.g., be very explicit about the behavior you want, include all of the context of what you’re trying to do in the prompt, include explicit instructions to guide the model away from common mistakes it makes, etc. There are some relevant guides from Anthropic and OpenAI on how to prompt models, which have helpful tips.
Manual few-shot prompting, playgrounding in a chat interface:
1. Add 1-10 gold examples of what you want the model to do, and see how that improves the model’s behavior on your task
Few-shot prompting — find a source of labeled data for your task, and put as many of those examples into your prompt (maybe even max-ing out the context length for your model)
Best-of-N (BoN) sampling: sample N times and pick the best sample according to a classifier (e.g., a prompted language model) or Reward/Preference Model (PM). This generally improves the quality of samples (sometimes by quite a bit) and often gets comparable or better reward than RL (e.g., in the WebGPT paper). It doesn’t need much hyperparameter tuning: a temperature of 1 and top-p of 1 work reasonably well. Lower the temperature (e.g. 0.8) and top-p (e.g. 0.95) if you need samples of higher quality
Supervised Finetuning — Can take some hyperparameter tuning to get working well. The most important hyperparameters are typically learning rate, batch size, and number of epochs. OpenAI's finetuning API sets pretty good defaults and has tips for how to adjust them (and it's easy to use) (see here for their guidelines on how they set hyperparameters, if you're interested in finetuning other models). I'd usually recommend starting with that (and then switching to open-source models if needed after). Other tips:
1. Make sure your test loss goes below random chance. You can compute the random-chance baseline loss by evaluating the log prob of the uniform distribution over answers; e.g., if you’re doing 2-way classification, the uniform distribution will give 50% probability to each answer choice, so the baseline loss will be -log(0.5) (≈ 0.69 with the natural log)
2. Look at both loss and accuracy — sometimes, these will show different results, and e.g. test loss will start to get worse while accuracy gets better.
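
The random-chance baseline from tip 1 can be sketched in a few lines. This is a minimal illustration using natural logs; the function names and the example probabilities are hypothetical, not taken from any particular finetuning API.

```python
import math

def chance_loss(num_choices: int) -> float:
    """Cross-entropy (in nats) of a uniform guess over `num_choices` answers."""
    return -math.log(1.0 / num_choices)

def mean_nll(probs_of_correct: list[float]) -> float:
    """Average negative log-likelihood, given the probability the model
    assigned to the correct answer on each test example."""
    return -sum(math.log(p) for p in probs_of_correct) / len(probs_of_correct)

baseline = chance_loss(2)  # 2-way classification: -log(0.5)

# Hypothetical probabilities a model put on the correct answer for four examples.
test_loss = mean_nll([0.9, 0.7, 0.6, 0.8])

print(f"baseline={baseline:.3f}  test_loss={test_loss:.3f}  beats_chance={test_loss < baseline}")
```

If the test loss stays at or above `baseline`, the model has not learned anything beyond guessing, whatever the accuracy curve looks like.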
RL(HF) — Generally a last resort, since it takes longer/more compute to run experiments and also requires more complicated code.
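
The Best-of-N step above can be sketched as follows. This is a minimal sketch, not a reference implementation: `sample_fn` and `score_fn` are hypothetical callables standing in for your model's sampling call and your classifier/PM, and the toy versions below exist only so the example runs on its own.

```python
from itertools import cycle
from typing import Callable

def best_of_n(prompt: str,
              sample_fn: Callable[[str], str],
              score_fn: Callable[[str, str], float],
              n: int = 8) -> str:
    """Draw n samples for `prompt` and return the highest-scoring one."""
    candidates = [sample_fn(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: score_fn(prompt, response))

# Toy stand-ins (assumptions for illustration only): a "model" that cycles
# through canned responses, and a "preference model" that prefers longer ones.
_canned = cycle([
    "No.",
    "Maybe, it depends.",
    "Yes, because the loss decreased on held-out data.",
])

def toy_sample(prompt: str) -> str:
    return next(_canned)

def toy_score(prompt: str, response: str) -> float:
    return float(len(response))

best = best_of_n("Did the finetuning run work?", toy_sample, toy_score, n=3)
print(best)
```

In practice `sample_fn` would wrap an API or local model call at your chosen temperature/top-p, and `score_fn` would wrap a prompted classifier or PM. Keeping them as plain callables makes it easy to swap either one without touching the selection logic.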

Other practical tips:

If looking through lots of sampled text:
1. Find some automatic, proxy metric to capture what you’re looking for. Looking at raw samples is good, but hard if you’d otherwise need to look at a lot of samples e.g. to tune some sampling hyperparameter.
2. Do a data-atlas visualization (e.g. UMAP for dimensionality reduction, then make an interactive/hoverable plot with plotly — something like this)
If sample quality isn’t high enough, or the model is making too many mistakes (e.g., on math or coding):
1. Lower the temperature (e.g. 0, 0.6, or 0.8) or top-p (e.g., 0.8 or 0.95).
2. Use best-of-N sampling (if you aren’t already), with a higher N (8 is alright, 100 is great)
Learn keyboard shortcuts: for basically everything you do. You should rarely touch your mouse/trackpad. Keyboard shortcuts help you not lose your flow if you e.g. need to switch tabs, jump to a new place in the codebase, etc. Pairing with people is a great way to pick up new tips and tricks here!
Always be thinking about what the best next experiment you run should be: When you show experimental results (in meetings or in slack), you should also include discussion of your proposed next possible steps immediately after (and proposed prioritization). The best researchers are able to iterate between running experiments and deciding on the best next step independently. Getting in the practice of proposing the next experiments is helpful for:
1. Seeding the discussion on what we should do next (the person on the project will often have the most context on what makes sense to try from having looked at the experimental results most closely)
2. Getting practice at figuring out what the next experiment to run is, and getting your collaborators’ feedback on how your thinking around what experiment to run next could be improved
Plot your data with log-scaled axes, and look for clean trends (e.g. power laws): Many (most?) plots involving results about LLMs probably should have one or both axes log-scaled, e.g. if they plot numbers related to parameter count, amount of data, amount of compute, loss, and many others. Moreover, plotting with log-scale x and y axes lets you spot power laws, which look like lines on a log-log plot.
1. One common mistake is to not notice when you're plotting a power law, because you're plotting the wrong metric or wrong axes. For example, power laws often show up when plotting loss vs. some other metric (e.g. model training compute). But you might be plotting accuracy instead of loss, or you might be plotting loss, but without additionally log-scaling both axes. This could make you miss important observations like that your scaling trends are following a very predictable and extrapolatable pattern.
2. In the best-of-N jailbreaking paper, we only observed power laws by plotting our jailbreaking algorithm's -log(attack success rate) (rather than attack success rate directly), and then we needed to plot this metric on a log-scale (y-axis) against log-scaled test-time compute (x-axis), in order to observe a power law.
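
To make the log-log point concrete, here is a small sketch that checks for a power law by fitting a straight line in log-log space. The data is synthetic and the constants (5.0 and 0.3) are made up for illustration; it is not data from the best-of-N jailbreaking paper. It also shows the -log(attack success rate) transform described above.

```python
import math

def fit_power_law(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Least-squares fit of log(y) = log(a) + b * log(x).
    Returns (a, b) such that y is approximately a * x**b."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    b = (sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
         / sum((u - mean_x) ** 2 for u in lx))
    a = math.exp(mean_y - b * mean_x)
    return a, b

# Synthetic attack success rates (ASR), constructed so that
# -log(ASR) = 5.0 * N**-0.3 for N samples of test-time compute.
n_samples = [1, 10, 100, 1000, 10000]
asr = [math.exp(-5.0 * n ** -0.3) for n in n_samples]

# Plotting ASR directly hides the trend; transform first, then fit in log-log space.
neg_log_asr = [-math.log(p) for p in asr]
a, b = fit_power_law(n_samples, neg_log_asr)
print(f"-log(ASR) ~ {a:.2f} * N^{b:.2f}")
```

When plotting with matplotlib, calling `ax.set_xscale("log")` and `ax.set_yscale("log")` on the axes gives the same view: a power law shows up as a straight line whose slope is the exponent `b`.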

Reading research papers:

Generally not very important — low value of information relative to running your own experiments. Exceptions:
1. You’re starting a new project, and need to learn what’s been done vs. not before, to know where to make a contribution. Also to pick up tips and tricks on the domain you’re in, if there’s actually relevant stuff to what you’re working on.
2. You’re in the middle of a project, and a really relevant paper comes out (in which case you should read it closely)
3. You're in an area of alignment where a lot of actually-relevant prior work has been done (e.g. adversarial robustness, backdoors)
Where to find relevant papers:
1. For being up to date in general, Twitter + what people share on Slack + what your collaborators/colleagues send you is a pretty good 80/20. For Twitter, you can follow some of the people I do (link), especially if any in particular stand out to you. (Aran Komatsuzaki and AK are pretty good sources of interesting LLM papers)
How to read research papers:
1. For tangentially related research — Just read until you have the main idea: Title+Abstract, Figure 1, maybe the intro if it’s poorly written and you still don’t get what they did. Usually stop there — if reading more seems pretty helpful, then consider skimming the remaining tables/figures, or (in rare cases, if the paper seems very relevant) keep reading the paper like a normal person, until you get the main idea (skip related work)
2. For directly related research to what you’re doing — Read the whole paper: From start to finish, maybe even checking the appendix where it seems relevant (from looking at the appendix references in the paper). A paper directly in your research area is usually pretty rare and a great gift which can give you a lot of tips for your next project.

General Mindset Tips

Every experiment is a win: What matters is whether or not you’re learning about a problem. If we’re learning, we’re winning, even if the experiment didn’t “work.” If you get weird or unexpected results, you're getting a lot of bits of information about what’s going on. This mindset is both more accurate and more sustainable than the alternative (rooting for every experiment to succeed), since you’ll be more robust to the inevitable many cases where your experiment fails. There’s almost always some interesting direction to pivot to (esp. nowadays, when the most recent scale of models is so unexplored). If you’re hitting diminishing returns on the project, then it’s totally fine and great to switch projects. It’s easy to get tunnel-visioned into thinking that the direction you’re working on is the only direction out there, but you’ll usually realize there’s no dearth of exciting projects once you start chatting with other people about projects and brainstorming more generally
You’re almost never actually scooped: Junior researchers (including/especially my past self) tend to look at work that other people have done and see it as closer to what they’re doing than it actually is. Especially in alignment, very few people (sadly) are actually working on the same problems as we are, so it’s common for people to do something directionally related (e.g. Chain of Thought) but not directly related / with the same motivation (e.g. Chain of Thought Faithfulness).
It’s a win if you are scooped: In the rare case you are actually scooped, that’s a win! Someone else did your research for you, and you can bootstrap off of what they did to answer newer, cooler, and even more frontier questions than before. Also, you’ll get to compare your results to theirs, and there will almost certainly be interesting differences (e.g., they looked at pretrained LMs and you’re looking at RLHF models). Similarities in the results can validate your results/uncertainties, and differences often show interesting or surprising things relative to what you’ve found. So it’s often just a gift when someone else has also run the experiments you did (it’s often the best way to get signal on what you’re working on)

Three modes of research

(Most projects will involve some amount of each)

Exploratory phase: (Beginning of a new project)
1. Talk with various people on your team to get a sense of what the important problems to work on are, and which ones are tractable. For junior researchers working with someone more senior, take their guidance on the general direction and explore ideas at the medium- or low- level (e.g., “given that I’m working on X topic/direction, what are the methods that seem likely to work best given prior work”)
2. Run some quick experiments to prototype various ideas, using the workflow described earlier (anything you can run in a day or less, e.g. preferring faster-to-implement/run experiments). Write these up and get feedback to see how promising the idea is to keep pursuing, if there are lots of follow-up ideas, etc.
3. Read/skim papers to get a sense of what’s been done in the area(s) you’re thinking about. (Talking with people can often be superior though, in that they’ll often be able to point you to what’s been done or related projects others are doing)
Execution phase: (Vast majority of the time on a project)
1. Generally aim to always have some experiment running, basically 24/7. This doesn’t always make sense (e.g., if your experiments run super quickly), but if your experiments take ≥8h, you should have experiments running at least 50% of the time (e.g., running something overnight, looking at the results + implementing a follow-up during the day).
2. Tailor your experiments to take no longer than ~16h if you’re running them frequently. This lets you run the following loop:
  1. Run an experiment overnight
  2. Look at the results in the morning, figure out what to run next, implement it, and then go back to step 1
3. Anything longer than 16h makes it really hard to get quick feedback, and it’s almost never worth it. Tips for reducing the runtime of your job:
  1. If you’re running on a model with high latency (e.g., GPT4 or LLAMA 65B), think hard about whether there’s a way to run it on a smaller model (e.g. GPT3.5 or LLAMA 7B), or faster somehow (e.g., using quantization during inference/finetuning, LoRA during finetuning, using SGD instead of Adam to use less GPU memory, fewer epochs but larger learning rates and smaller batch sizes, etc.)
  2. If you’re running on the full test set of some dataset(s), try reducing the number of examples you’re running on to 1k or even 300. (300 is the minimum for e.g. plotting scaling laws and getting clean trends in some experiments we did for the inverse scaling prize, and I probably wouldn’t recommend going lower if you want clean signal, 1k is pretty safe)
  3. If you’re running an RL experiment, try a Best-of-N (BoN) version (or at least try BoN versions first, to derisk your RL experiment) – BoN will let you see what high reward samples look like (without requiring any training or fancy code). That will quickly tell you if there’s anything wrong e.g. with your reward function
  4. Shorten your prompt, e.g., by using fewer (but maybe more carefully-chosen) few-shot examples, or by using shorter few-shot examples. Or potentially use a very short few-shot prompt (but with a larger N for BoN sampling)
  5. If you’re sampling and using custom/unusual stop sequences/tokens, make sure those stop sequences are getting hit when you sample (so you’re not accidentally sampling more than you expect)
  6. Generate fewer tokens, e.g., by reducing your max number of tokens sampled, and/or getting the model to do the task while needing to output a smaller number of tokens (or biasing it with few-shot examples which indicate the model should give shorter responses)
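
One way to build intuition for how far you can shrink a test set (tip 2 above) is to look at the standard error of an accuracy estimate as a function of the number of examples. This is a generic binomial calculation that assumes examples are independent and the metric is accuracy; it is offered as a rough sanity check, not as the reasoning behind the 300/1k numbers above.

```python
import math

def accuracy_standard_error(accuracy: float, n: int) -> float:
    """Standard error of an accuracy estimate from n independent examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)

# Worst case is accuracy = 0.5, where the variance of a single example is largest.
for n in (100, 300, 1000, 10000):
    se = accuracy_standard_error(0.5, n)
    print(f"n={n:>5}  standard error={se:.3f}  approx 95% interval=+/-{1.96 * se:.3f}")
```

Because the error shrinks with the square root of n, going from 1k to 10k examples costs 10x the compute for roughly a 3x tighter estimate, which is why small subsets are often enough early in a project.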
Writing phase: For communication internally or externally.
1. Internal communication: I’ll often just start a doc as I’m running experiments, and then add info about the experimental results as they come in (e.g. various plots/stats, and my observations/takeaways about the results; see this example of a research log I’ve kept in the past). It’s helpful to keep a record for myself, for discussing in weekly meetings, and for pointing others to. Depending on how detailed your experimental log is, you might want to make a more cleaned up version for others’ consumption (e.g., weekly meetings). It’s probably fine to not spend too much time on this (scale the effort to the number of readers and how much context they have)
2. External communication: Usually this means writing a paper for arXiv, which typically takes 2-4+ weeks (on the longer end if it’s your first time), not just because of the writing itself, but because you’ll need to run more experiments to really clarify your results (e.g., maybe you have the main results, but you’ll realize that some experiments are missing to really know what’s going on or to make your point clearly, and then you’ll need to run those).
3. Writing Tips: See https://ethanperez.net/easy-paper-writing-tips/ for good ML paper writing tips. Main text includes mostly style/clarity tips, with links to more substantial recommendations for Computer Science paper writing at the bottom. I’d recommend it for anyone writing papers to a machine learning audience, and require it for any paper where I’m the main supervisor. If you want examples of papers that follow these guidelines (and general good style for ML-accessible papers), you can read any of my most recent first-author papers (where I try to incorporate all of the things I’ve learned about writing ML papers, in particular my paper on red teaming language models with language models)

Work Habits

Hard work pays off a lot: The kind of very empirical/experimental work that is typical of alignment work on large language models just benefits a lot from running as many experiments as you can, tinkering a lot, and trying lots of stuff (more so than being smart or knowledgeable in a lot of cases). Often, there are a lot of reasonable-sounding ideas to try, and it’s just actually unclear what will work, so you need to take a lot of shots on goal to find something that works (and e.g. the 5th or 20th thing on the list of things to try is the first one that works). As a bonus, you get a lot better at figuring out what experiments to run, picking the right experiment to run the first time, etc., so there’s a strong rich get richer effect here (which is especially important for junior researchers in getting momentum). Also, since more things work when you try more stuff, it’s easier to stay motivated (since you minimize the amount of time that nothing’s working). It also goes without saying that what matters is the number of productive hours you’re spending (often in empirical research, basically how much time you’re spending coding and running experiments), rather than the absolute number of hours, and it’s often easier to optimize how productive your hours are over increasing the amount of time you’re working.
Work sustainably: That said, it’s also really, really important to make sure you’re working sustainably, and this is a particular luxury of doing research as opposed to e.g. product/engineering (no immediate deadlines). Lots of researchers burn out by working in unsustainable ways (also since the object-level work can be hard and goals ill-defined), or by developing an aversion to what they do, and it’s really good to avoid this (also, burn out just sucks). I (and I think other researchers) often build up towards working more hours over the course of months or years (only increasing hours when it feels comfortable and great to do so)

Taking health seriously

COVID and flu:

Get your COVID/flu shots: One of the highest ROI interventions many people don't do. There's nothing worse than getting sick right before a paper deadline or getting your whole group of collaborators sick.
Rest aggressively if you're sick, especially with COVID: Long COVID can potentially take you out for months or indefinitely, and exerting yourself too early in recovery (physically or mentally) is one of the main risk factors for long COVID. So if you have COVID, you should be resting very aggressively. It can be very tempting to check on or run experiments while you're sick, but it's really not worth the risk of extending your sickness or of long-term health effects. Most people I work with don't rest enough while sick.

Machine Learning/Engineering Footguns

A single preference model (PM) score in isolation doesn’t have a clear interpretation — PM scores are only trained to make sense when compared to other PM scores given the same context. If the context is different, it’s unclear how to interpret the PM scores in comparison to each other.
1. To make PM scores more comparable across different contexts, it’s common to construct a reference response (“I’m sorry, I can’t help with that.”), and then use the difference between the PM score on the current response and the PM score on the reference response. This difference effectively gets you something like a probability that the current response is better than the reference response. To get an actual probability, treat the PM scores of the current response and reference response as logits, and then take a softmax over both logits (equivalently, a sigmoid of the score difference). This will lead to a calibrated probability (since this is how the PM is trained, and our PMs are good at producing calibrated probabilities)
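
The reference-response comparison above amounts to a two-way softmax, which reduces to a sigmoid of the score difference. A minimal sketch, assuming you already have the two raw PM scores as floats (the function name and example scores are hypothetical):

```python
import math

def prob_better_than_reference(pm_score: float, pm_score_reference: float) -> float:
    """Softmax over the two PM scores treated as logits, returning the
    probability assigned to the current response. Equivalent to
    sigmoid(pm_score - pm_score_reference), computed in a numerically stable way."""
    diff = pm_score - pm_score_reference
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    exp_diff = math.exp(diff)
    return exp_diff / (1.0 + exp_diff)

# Hypothetical scores: the response scores 1.7 and the reference scores 0.4
# under the same context.
print(prob_better_than_reference(1.7, 0.4))
```

Only the difference between the two scores matters, so adding any constant to both (as a different context might effectively do) leaves the probability unchanged.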
The probabilities from LMs and PMs are calibrated (for PMs, you’ll need to compare a response against a reference), but those from RLHF models are not
1. In general, don’t try to interpret the actual raw/continuous probabilities from RLHF model output distributions. The rankings of tokens are meaningful (basically since it’s the RLHF model’s prediction of which token would get higher reward), but actual probabilities can be viewed as just an artifact of undertraining (if you’re not using a KL penalty, at least). A fully-trained RLHF model should be deterministic, and probabilities are just a way for earlier checkpoints in RLHF to explore into higher reward policies.
2. Example: When using an RLHF model to classify some text, the top-1 predictions make sense to look at but not e.g. the average probabilities on the RLHF model. If you want to look at probabilities, use a prompted LM or PM (I typically find good results with a PM), or tune the temperature of the softmax used for computing the output distribution of your RLHF model (to calibrate the probabilities, as in Kadavath et al. 2022)
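
The temperature-tuning option mentioned above can be sketched with a simple grid search that picks the softmax temperature minimizing negative log-likelihood on held-out labeled examples. This is a generic temperature-scaling sketch under my own assumptions (a small labeled calibration set, per-class logits available, a fixed grid); it is not the exact procedure from Kadavath et al. 2022, and the toy logits are made up.

```python
import math

def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def mean_nll(all_logits: list[list[float]], labels: list[int], temperature: float) -> float:
    """Average negative log-likelihood of the true labels at a given temperature."""
    losses = [-math.log(softmax(logits, temperature)[label])
              for logits, label in zip(all_logits, labels)]
    return sum(losses) / len(losses)

def fit_temperature(all_logits: list[list[float]], labels: list[int]) -> float:
    """Grid search over temperatures 0.1, 0.2, ..., 10.0 for the lowest NLL."""
    grid = [0.1 * k for k in range(1, 101)]
    return min(grid, key=lambda t: mean_nll(all_logits, labels, t))

# Toy calibration set: an overconfident model that always outputs logits [8, 0]
# (about 99.97% on class 0) but is right only 3 times out of 4.
toy_logits = [[8.0, 0.0]] * 4
toy_labels = [0, 0, 0, 1]

temperature = fit_temperature(toy_logits, toy_labels)
calibrated = softmax([8.0, 0.0], temperature)[0]
print(f"temperature={temperature:.1f}  calibrated_prob={calibrated:.3f}")
```

Dividing the logits by a single temperature never changes the ranking of classes, which fits the point above: the top-1 predictions stay as they were, and only the probabilities are rescaled.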

Default Norms for Projects with Me

Below are the default norms I follow for most external-to-Anthropic projects I supervise (might be helpful for both new researchers and new research mentors, e.g. for SERI MATS). I think they're reasonable defaults (especially if you're not sure where to start with project norms or project management).

The working style here is particularly tailored towards junior collaborators, so if you’re a more experienced collaborator (e.g., 2+ first or co-first author ML papers under your belt), feel free to work in a more independent manner than described below. Any of these are up for discussion e.g. if someone on the project thinks a different way of working together would work better, relative to what's described below.

Slack Updates: We’ll make a Slack channel for each project. Please put all project-relevant messages in that channel, so that everyone on the project is on the same page about what’s going on and what everyone’s working on.
- ~Daily Updates: Aim to post an update every full day of work you’ve done (we can tone this down later in the project if it gets to be too much, but high bandwidth communication is generally quite helpful and worthwhile).
  - In your update, please include
    - Last: What you worked on last and how that went
    - Next: what you plan to work on next
    - Blockers: Any blockers, fear/uncertainty/doubt you have, or anything that’s slowing you down
  - I’d encourage others on the project to check in on the updates (and chime in with comments if there’s anything to say). I’ll also try to check in, though I might not always respond (or respond immediately); it’ll still be useful as a pulse check for me to skim the updates, as a quick way for me to see if there’s anything I can help with, in terms of unblocking you, or to check that you’re working on the most important stuff to be working on.
- Rationale: As a research supervisor, I find it helpful for my reports/mentees to post frequent updates (e.g., daily) on what they’re up to in the project’s Slack channel. I might not always respond, but I’ll at least have more opportunity to catch cases where I think your time could be better spent elsewhere, and I’ll generally have a better sense of whether you're more blocked on something than I'd have anticipated (so I can either help you out or suggest that you triage some task). Otherwise, our weekly meeting will be my main chance to give feedback that saves you time. It's not a huge deal if you don't want to, but I might be able to help you better if you post daily updates. You could post things like what your plan for the day is, what you're focusing on, etc. For people who are more senior and independent in what they’re working on, I’m happy to just stick to weekly meetings + ad hoc Slack updates/DMs.
- Tag me in any messages (in this channel or in general) if you think I might be interested in seeing it, or if you’re interested in getting my take
- If there’s an update that requires some background context, then it’d be helpful to include the complete relevant context for understanding the experiment. This is helpful since I’m involved in a lot of projects, so I’ve often forgotten the specifics of the experiment that we last discussed. Also, there are often others who were not present in the discussion but would be interested in understanding the experiment, so providing the full context is helpful for understanding what’s going on. Often, missing 1-2 details on how the experiment was run is enough to not be able to interpret the experimental results (and then it might be another 12+ hours before I next check your clarification)
- I’m not always immediately responsive on Slack (e.g., may miss some messages, or take 1-7 days to respond), but being able to read through messages about what’s going on is still quite valuable (e.g., for knowing if you’re investing effort in something that’s unnecessary, or knowing where I can help unblock you)
- I’m always excited to chat live (e.g. over video chat or in person) if there’s something you’d like to get my feedback on, so feel free to book a meeting with me any time on my calendar when it’d be helpful to chat / if I’m not responsive enough (see below)
Live discussion: Feel free to book a time on my calendar to chat, basically any time you think it'd be useful (I share a Calendly link with my collaborators). Most people under-book ad hoc meetings. Please book a time to chat 1:1 every 4 weeks, even if there’s nothing specific to discuss. It’s helpful to have free-form time to catch unknown unknowns in how the project is going, as well as other issues that might not come up in a group discussion.
Meeting Norms:

Please default to presenting results during weekly meetings as a slideshow (e.g., including all of the plots, tables, and text that you’d like to discuss). This helps to streamline the discussion significantly, minimize the amount of time that is needed to dig up some relevant results, and also helps to focus on the most relevant points where you’d like feedback from me. I personally also find it much easier to pay attention to than e.g. purely verbal discussion of results or results presented in a doc with a lot of text, though this probably varies a little by supervisor (a doc of results is probably the next best option, and can be fine if you prefer).
It helps to have a concrete agenda for what to discuss during meetings (can take as little as 5-10 minutes to think about before a meeting, or up to e.g. 1 hour if it involves more explanation/plotting/etc. and the meeting is short / there’s a lot to cover). In particular, it helps to have:
1. Plots, tables, or other concrete results to show (e.g., ideally in a Notion or Google doc), so there are specific things to discuss, point at, and look at. Concrete results help to ground the discussion and give the people supervising the project enough information to offer useful feedback, notice things that you might not have, etc. This catches things that talking over results out loud or at a high level wouldn’t.
2. A concrete list of proposed, prioritized next steps, given your existing results and the overall project goal. Even if you’re unsure of what to do next, it’s easier for others at the meeting to discriminate (rank a proposed list of steps) than to generate (ideas for what next steps to take), especially given that they don’t have as much context as you do on the project
3. A concrete list of questions and places where it’d be helpful to get input from others at the meeting
4. A sense of how long to spend discussing each point above, so that everything gets discussed without running over time. It’s common to spend disproportionately more time on points brought up early on in a meeting, since there’s a feeling of more time, which leaves much less time (or no time) for other important discussion points. Alternatively, bring the most important points up for discussion at the start of a meeting.
I (and I think many other research supervisors) generally leave it to the main people actively working on the project / running experiments to determine what to discuss and present, since they have the most context on what needs to be discussed. I’ll sometimes come with specific questions to the meeting, but by default I won’t.
If you have a conflict for a meeting (like a 1:1), please try to cancel or reschedule the calendar invite in advance, as far in advance as you can. For meetings that are in the morning in my time zone, please try to cancel the previous day (so I can get some extra sleep :) ) – otherwise, it’s pretty low-cost to cancel a meeting last-minute (since I’ll just get the time back).
Feel free to book a time on my calendar any time you think there’s something that’d be helpful to chat about (feel free to book liberally, and I’ll let you know if the frequency is too high – that’s basically never happened though, and the usual failure mode is that people underbook meetings, especially 1:1s)
General rules for scheduling meetings/invites:
1. Make them modifiable by anyone on the invite
2. Have a Google Meet link attached
3. Have a specific where-to-meet location (e.g. booking a physical room if we're in the same office space)
4. Get Google Meet premium so meetings don't end at 1h
5. "Picture-in-picture mode" in Google Meet lets you see people's faces while sharing your screen or being on other tabs, which is helpful for seeing people's reactions while you present results
6. Generally get comfortable with different screen sharing options, e.g., sharing tab-only, full window, joining in presentation mode, etc., so that if there are any hiccups, you can quickly fix any issues that come up
7. (A bunch of the above can be automatic/defaults, so it can be a one-time change)

Meeting with Collaborators: When you’re full-time on the project, you should probably be having at least one other meeting with other collaborator(s) per week to sync and get feedback (outside of the full group meeting where I’ll be there). It’s not strictly necessary (and will depend on the preferences of other collaborators on the project), but it will probably help ensure you’re getting the guidance you need on the project.
Overcommunication: Overcommunicating with me is really helpful, especially if I’m supervising you on a project. Otherwise, I’ll end up spending a lot of effort (often unsuccessfully) trying to figure out whether you’re happy, where you’re stuck, how I can help, etc. I think I’m good at helping to come up with solutions or unblock people if I know something’s going wrong, but I find that hard when I have to guess at what the potential problems are. If you think there’s something I can improve at, telling me ASAP would be super helpful and appreciated, so I can improve our working relationship quickly; usually other people have the same thing in mind as you do too.
Feedback: Please absolutely give me (or other project supervisors) feedback ASAP if you feel that anything is suboptimal:
- You feel like you’re struggling to do your best work for whatever reason (e.g., not enough collaboration, having some trouble debugging some kind of issue, feeling stuck on conceptual issues, etc.)
- You’re having issues with a collaborator
- You think I could be supporting you better in some way
- Rationale: Many project supervisors are supervising many projects at once (or working on their own projects full-time). Especially if we don't have a regular 1:1, it’ll be crucial for you to bring up issues during the weekly group meeting (where it’s an issue that can be discussed in the context of others) or otherwise privately (over DM or booking a 1:1 time on my calendar). This will be the primary way we can nip issues in the bud to help you have a great experience, and it relies on (over)communication on your part.
Direct Messages: I’d highly encourage you to DM me if anything ever comes up that’s time sensitive or private that is at all worth discussing
Taking Agency: I’d highly encourage you to take agency and organize whatever you think will be helpful for you (coworking, discussion groups, standups, etc.)
Other great advice on how to run research meetings
Learning from Mentorship: Mastery is a great book for understanding how to learn from mentors. Close mentorship is maybe the fastest path to become an expert in a domain, and there's a lot of evidence to support this. The main thing that's important is building a strong mental model of what your mentor would say (e.g., next steps on a project, feedback on your results, what they'd critique).

4 comments

comment by Henry Sleight (ResentHighly) · 2024-02-29T18:01:18.780Z

First off: as one of Ethan's current Astra Fellows (and having worked with him since ~last October) I especially think his collaborators in MATS and Astra historically underweight how valuable overcommunicating with Ethan is, and routinely underbook meetings to ask for his support.

Second, I think this post is so dense with useful advice that I made Anki flashcards of it using GPT-4 (generated via ankibrain [https://ankiweb.net/shared/info/1915225457], with small manual edits).

You can find them here: https://drive.google.com/file/d/1G4i7iZbILwAiQ7FtasSoLx5g7JIOWgeD/view?usp=sharing

comment by qxcv · 2024-03-02T00:35:59.640Z

For highly empirical research, it’s critical to get quick feedback and iterate on ideas rapidly. Jacob Steinhardt has a great blog post describing that a really good strategy for doing research is to “reduce uncertainty at the fastest possible rate”

Michael Bernstein's slides on velocity are a great resource for learning this mindset as well. I particularly like his metaphor of the "swamp". This is the place you get stuck when you really want technique X to work for the project to progress, but none of the ways that you've tried applying it have succeeded. The solution is to have high velocity: that is, to test out as many ideas as possible per unit time until you get out of the swamp. Other highlights of the slide deck include the focus on answering questions rather than doing engineering, and the related core-periphery distinction between things that are strictly needed to answer a question & those that can be ignored/mocked up/replaced for testing (which echoes the ideas in the "workflow" section of this post).

(Although they're similar, I'd argue that Michael's approach is easier to apply to empirical alignment research than Jacob's "stochastic decision process" approach. That's because falsifying abstract research ideas in empirical deep learning is hard (impossible?), and you don't get much generalizable knowledge from failing to get one idea to work. The real aim is to find one deep insight that does generalize—hence the focus on trying many distinct approaches.)

Replies from: ethan-perez

↑ comment by Ethan Perez (ethan-perez) · 2024-03-03T04:38:14.416Z

Yeah, I think this is one of the ways that velocity is really helpful. I'd probably add one caveat specific to research on LLMs, which is that, since the field/capabilities are moving so quickly, there's much, much more low-hanging fruit in empirical research than in almost any other field of research. This means that, for LLM research specifically, you should rarely be in a swamp, because that means that you've probably run through the low-hanging fruit on that problem/approach, and there's other low-hanging fruit in other areas that you probably want to be picking instead.

(High velocity is great for both picking low-hanging fruit and for getting through swamps when you really need to solve a particular problem, so it's useful to have either way)

comment by Review Bot · 2024-03-01T21:10:05.670Z

The LessWrong Review runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2025. The top fifty or so posts are featured prominently on the site throughout the year.

Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?
