Tips and Code for Empirical Research Workflows

post by John Hughes (john-hughes), Ethan Perez (ethan-perez) · 2025-01-20T22:31:51.498Z · 6 comments

Contents

  Quick Summary
  Part 1: Workflow Tips
    Terminal
    Integrated Development Environment (IDE)
    Git, GitHub and Pre-Commit Hooks
  Part 2: Useful Tools
    Software/Subscriptions
    LLM Tools
    LLM Providers
    Command Line and Python Packages
  Part 3: Experiment Tips
    De-risk and extended project mode
    Tips for both modes 
    Tips for extended project mode
  Part 4: Shared AI Safety Tooling Repositories
    Repo 1: safety-tooling
    Repo 2: safety-examples
  Acknowledgements

Our research is centered on empirical research with LLMs. If you are conducting similar research, these tips and tools may help streamline your workflow and increase experiment velocity. We are also releasing two repositories to promote sharing more tooling within the AI safety community.

John Hughes is an independent alignment researcher working with Ethan Perez and was a MATS mentee in the summer of 2023. In Ethan's previous writeup on research tips, he explains the criteria that strong collaborators often have, putting 70% weight on "getting ideas to work quickly." Part of being able to do this is knowing what tools are at your disposal.

This post, written primarily by John, shares the tools and principles we both use to increase our experimental velocity. Many readers will already know much of this, but we wanted to be comprehensive, so it is a good resource for new researchers (e.g., those starting MATS). If you are a well-versed experimentalist, we recommend checking out the tools in Part 2—you might find some new ones to add to your toolkit. We're also excited to learn from the community, so please feel free to share what works for you in the comments!

Quick Summary

Part 1: Workflow Tips

Terminal

Efficient terminal navigation is essential for productivity, especially when working on tasks like running API inference jobs or GPU fine-tuning on remote machines. Managing directories, editing files, or handling your Git repository can feel tedious when relying solely on bash commands in a standard terminal. Here are some ways to make working in the terminal more intuitive and efficient.

Note: there are many recommendations here, which can be overwhelming, but all of this is automated in John's dotfiles (including installing zsh and tmux, changing key repeat speeds on macOS, and setting up aliases). So, if you'd like to get going quickly, we recommend following the README to install and deploy this configuration.

Integrated Development Environment (IDE)

Choosing the right IDE can enhance your productivity, especially when using LLM coding assistants. A good IDE simplifies code navigation, debugging, and version control.

Git, GitHub and Pre-Commit Hooks

Mastering Git, GitHub, and pre-commit hooks is key to maintaining a smooth and reliable workflow. These tools help you manage version control, collaborate effectively, and automate code quality checks to prevent errors before they happen.

Part 2: Useful Tools

Not all of these recommendations are directly related to research (e.g., time-tracking apps), but they are excellent productivity tools worth knowing about. The goal of this list is to make you aware of what’s available—not to encourage you to adopt all of these tools at once, but to provide options you can explore and incorporate as needed.

Software/Subscriptions

LLM Tools

LLM Providers

Command Line and Python Packages

Part 3: Experiment Tips

De-risk and extended project mode

First, we'd like to explain that a research project is usually in one of two modes: de-risk mode or extended project mode. These modes significantly change how you should approach experiments, coding style, and project management.

  1. De-risk mode focuses on rapidly answering high-priority questions with minimal overhead.
    1. This mode is ideal for:
      1. Quick experimentation using Python notebooks that minimize time-to-insight.
      2. Keeping engineering investment minimal: avoid extensive documentation, strict coding standards, or generalized pipelines.
    2. This mode is still common in collaborative group settings. It is more important to communicate the experiment's goals and frequently discuss next steps than to perform thorough code reviews.
  2. Extended project mode emphasizes engineering rigour and longer-term maintainability.
    1. This mode is especially important for longer-term collaborations or experiments that require significant compute and complicated infrastructure, where bugs can lead to costly reruns. It also ensures that knowledge and progress can be shared across contributors.
    2. Key practices in extended project mode include:
      1. Transitioning from notebooks to structured scripts, modules, or pipelines.
      2. Applying code reviews, testing, and version control.
      3. Using tools like pre-commit hooks and CI/CD workflows to enforce quality.

The workflow should always be conditioned on the situation:

Ethan tends to be in de-risk mode for 75% of his work, and he uses Python notebooks to explore ideas (for example, many-shot jailbreaking was de-risked in a notebook with ~50 lines of code). The Alignment Science team at Anthropic is also primarily in de-risk mode for initial alignment experiments and sometimes switches to extended project mode for larger, sustained efforts.

Note: Apollo defines similar modes, which they call "individual sprint mode" and "standard mode", in their Engineering Guide. We opt for different names since much of the research we are involved in can remain primarily in de-risk mode for a long period of time.

Tips for both modes 

Tips for extended project mode

Part 4: Shared AI Safety Tooling Repositories

For many early-career researchers, there's an unnecessarily steep learning curve just to figure out what good norms for their research code should look like in the first place. We're all for people learning and trying things for themselves, but we think it would be great to have the option to do that on top of a solid foundation that has been proven to work for others. That's why resources like the ARENA curriculum are so valuable.

However, there aren't standardised templates/repos for most of the work in empirical alignment research. We think this probably slows down new researchers a lot, requiring them to unnecessarily duplicate work and make decisions that they might not notice are slowing them down. ML research, in general, involves so much tinkering and figuring things out that building from a strong template can be a meaningful speedup and provide a helpful initial learning experience.

For the MATS 7 scholars mentored by Ethan, Jan, Fabien, Mrinank, and others from the Anthropic Alignment Science team, we have created a GitHub organization called safety-research so that everyone can easily discover and benefit from each other's code. We are piloting two repositories: 1) shared tooling, such as inference and fine-tuning utilities, and 2) a template repo to clone at the start of a project, with examples of how to use the shared tooling. We are open-sourcing these two repositories and would love for others to join us!

Repo 1: safety-tooling

Repo 2: safety-examples

Note: We are very excited about UK AISI's Inspect framework, which also implements much of what is in safety-tooling and much more (such as tool usage and extensive model-graded evaluations). We love the VSCode extension for inspecting log files and the terminal viewer for tracking experiment progress across models and tasks. We aim to build a bigger portfolio of research projects that use Inspect within safety-examples, and to add research tools that Inspect doesn't support to safety-tooling.

Acknowledgements

We'd like to thank Jack Youstra and Daniel Paleka, as many of the useful tool suggestions stem from conversations with them. For more of their recommendations, check out their blogs here and here. John would like to thank Ed Rees and others at Speechmatics, from whom he's borrowed and adapted dotfiles functionality over the years. Thanks to Sara Price, James Chua, Henry Sleight and Dan Valentine for providing feedback on this post.  

6 comments


comment by Neel Nanda (neel-nanda-1) · 2025-01-21T01:00:20.974Z

This looks extremely comprehensive and useful, thanks a lot for writing it! Some of my favourite tips (like clipboard managers and rectangle) were included, which is always a good sign. And I strongly agree with "Cursor/LLM-assisted coding is basically mandatory".

I passed this on to my mentees - not all of this transfers to mech interp, in particular the time between experiments is often much shorter (e.g. a few minutes, or even seconds) and often almost an entire project is in de-risking mode, but much of it transfers. And the ability to get shit done fast is super important.

comment by John Hughes (john-hughes) · 2025-01-23T19:47:40.741Z

Thanks Neel! I'm glad you found it helpful. If you or your scholars recommend any other tools not mentioned in the post, I'd be interested to hear more.

comment by Neel Nanda (neel-nanda-1) · 2025-01-24T10:58:57.047Z

I've been really enjoying voice to text + LLMs recently, via a great Mac App called Super Whisper (which can work with local speech to text models, so could also possibly be used for confidential stuff) - combining Super Whisper and Claude and Cursor means I can just vaguely ramble at my laptop about what experiments should happen and they happen, it's magical!

comment by jan betley (jan-betley) · 2025-01-21T08:03:03.860Z

I have one question:

"asyncio is very important to learn for empirical LLM research since it usually involves many concurrent API calls"

I have lots of asyncio experience, but I've never seen a reason to use it for concurrent API calls, because concurrent.futures, especially ThreadPoolExecutor, works just as well for concurrent API calls and is more convenient than asyncio (you don't need await, you don't need the event loop, etc.).

Am I missing something? Or is this just a matter of taste?

comment by Isaac Dunn (isaac-dunn) · 2025-01-21T20:43:19.013Z

I recently switched from using threads to using asyncio, even though I had never used asyncio before.

It was a combination of:

  • Me using cheaper "batch" LLM API calls, which can take hours to return a result
  • Therefore wanting to run many thousands of tasks in parallel from within one program (to make up for the slow sequential speed of each task)
  • But at some point, the thread pool raised a generic "can't start a new thread" exception, without giving much more information. It must have hit a limit somewhere (memory? a hardcoded thread limit?), although I couldn't work out where.

Maybe the general point is that threads have more overhead, and if you're doing many thousands of things in parallel, asyncio can handle it more reliably.

comment by John Hughes (john-hughes) · 2025-01-22T10:34:11.731Z

Threads are managed by the OS, and each thread has overhead in starting up and switching. Asyncio coroutines are more lightweight since they are managed within the Python runtime (rather than by the OS) and share memory within the main thread. This allows you to run tens of thousands of coroutines, which isn't possible with threads AFAIK. So I recommend asyncio for LLM API calls since, in my experience, I often need to scale up to thousands of concurrent requests. In my opinion, learning asyncio is very high ROI for empirical research.
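To make this concrete, here is a minimal sketch of the pattern (a generic illustration, not code from safety-tooling): call_api is a hypothetical placeholder for whatever async LLM client call you use, and an asyncio.Semaphore caps how many requests are in flight at once.

```python
import asyncio


async def call_api(prompt: str) -> str:
    # Hypothetical placeholder for an async LLM client call.
    await asyncio.sleep(0.1)  # simulate network latency
    return f"response to: {prompt}"


async def call_with_limit(prompt: str, sem: asyncio.Semaphore) -> str:
    # The semaphore, not the scheduler, limits how many requests run at once.
    async with sem:
        return await call_api(prompt)


async def main(prompts: list[str], max_concurrency: int = 1000) -> list[str]:
    sem = asyncio.Semaphore(max_concurrency)
    # Coroutines are cheap, so scheduling tens of thousands of them is fine.
    tasks = [call_with_limit(p, sem) for p in prompts]
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    results = asyncio.run(main([f"prompt {i}" for i in range(10_000)]))
    print(len(results))
```

The equivalent with ThreadPoolExecutor also works, but each thread carries OS-level stack and scheduling overhead, which is where errors like the "can't start a new thread" exception above tend to appear at high parallelism.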
