luciaquirke's Shortform

post by luciaquirke · 2023-10-05T11:32:14.145Z · LW · GW · 3 comments

comment by luciaquirke · 2023-10-05T11:32:14.238Z · LW(p) · GW(p)

Working on Remote Machines

Unless you have access to your own GPU or a private cluster, you will probably want to rent a machine. The best guide to getting set up with Vast.AI, the cheapest and (close to) the most reliable provider, is James Dao's doc: https://docs.google.com/document/d/18-v93_lH3gQWE_Tp9Ja1_XaOkKyWZZKmVBPKGSnLuL8/edit?usp=sharing

In addition,

  • Vast.AI is not always available, so be prepared to switch providers
  • Fast setup is useful, so consider a custom Docker image

Providers

The main providers are Vast.AI, RunPod, Lambda Labs, and newcomer TensorDock. Occasionally someone will rent out all the top machines across multiple providers, so be mentally prepared to switch between them.

Lambda Labs has persistent storage in some regions, the most reliable machines, and the best UX. Its biggest issues are that it's more expensive and has limited availability - it's common for all the machines to be rented out. Lambda Labs is also the only service that doesn't let you add credit beforehand. Instead, it takes your card and charges your bank account as you use compute, which is a bit scary because it's easy to leave a machine running overnight.
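One mitigation for the "left it running overnight" problem - my own sketch, not something any provider documents; it assumes you have root and the image ships a systemd-style `shutdown` - is to schedule an automatic power-off as soon as you start work:

```shell
#!/usr/bin/env bash
# Schedule a power-off N minutes out as a billing safety net.
# Pass DRY_RUN=1 to print the command instead of running it.
auto_stop() {
    local minutes="${1:-480}"   # default: 8 hours
    local cmd=(shutdown "+${minutes}" "auto-stop: billing safety net")
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "would run: ${cmd[*]}"
    else
        "${cmd[@]}"
    fi
}

# Cancel a pending shutdown with: shutdown -c
```

If you're still working when the deadline approaches, `shutdown -c` cancels it and you can schedule a fresh one.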

RunPod also has persistent storage and nice UX. Their biggest issue is that they don't document the many, many rough edges of the service. Notably:

  • Their Community Cloud machines often suffer from extremely slow network speeds, but their Secure Cloud machines are more reliable. 
  • They seem to bake necessary functionality into their Docker images such that custom images don't work without tinkering (https://github.com/runpod/containers/blob/main/official-templates/pytorch/Dockerfile).
  • They don't let you specify a startup script. Their instance configuration lets you specify a "Docker command" which is a literal Docker CMD instruction except you have to exclude the CMD bit at the start.
  • Their provided Docker images are missing basic packages (e.g. no rsync)
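To illustrate the "Docker command" point (the setup-script path here is hypothetical):

```
# Dockerfile form:
CMD bash -c "/root/setup.sh; sleep infinity"

# RunPod "Docker command" field - the same instruction, minus the CMD keyword:
bash -c "/root/setup.sh; sleep infinity"
```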

Vast.AI uses tmux for the terminal, which is not user-friendly but lets you close the SSH tunnel without killing your processes. I recommend keeping a cheat sheet handy. The most important command is Ctrl-b [ to enable scrolling.
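If you'd rather not dig through the man page, a minimal cheat sheet (the session name `train` is just an example):

```
tmux new -s train        start a named session
Ctrl-b d                 detach; your processes keep running
Ctrl-b [                 enter scroll mode (press q to exit)
tmux attach -t train     reattach after reconnecting over SSH
tmux ls                  list running sessions
```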

Setup

You generally lose your files when you stop your machine, so you need to be able to set up from scratch quickly. To enable this, most providers let you specify a Docker image and an 'on startup' script that run by default when you rent a machine.

I use my on-startup script to clone git repositories and store credentials, but I haven't gotten it to work for installing packages (possibly a fixable problem, idk).

One way to get around this is to copy and paste a package installation script into the terminal every morning. This is annoying but works well for most packages; PyTorch/TensorFlow, however, are too large and complex to install this way, so select a Docker image that comes with the machine learning framework you use.
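One way to make the pasted script cheap to re-run is to skip anything that's already importable. A sketch of mine, not the post's actual script - it assumes each package's pip name matches its import name:

```shell
#!/usr/bin/env bash
# Install a pip package only if it can't already be imported, so pasting
# this at the start of each session is nearly a no-op after first setup.
ensure() {
    python3 -c "import $1" 2>/dev/null || python3 -m pip install --quiet "$1"
}

# Example usage (package names here are illustrative):
#   ensure tqdm
#   ensure einops
```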

To avoid the package installation script entirely, pre-install all your packages into a Docker image and set that as the default image in your compute provider. My Dockerfile is just the PyTorch Docker image plus a few packages specific to mechanistic interpretability.

Dockerfile

FROM pytorch/pytorch:latest

RUN apt-get update && apt-get install -y git

RUN pip install tqdm einops seaborn plotly-express kaleido \
    scikit-learn torchmetrics ipykernel ipywidgets nbformat \
    git+https://github.com/neelnanda-io/TransformerLens \
    git+https://github.com/callummcdougall/CircuitsVis.git#subdirectory=python \
    git+https://github.com/neelnanda-io/neelutils.git \
    git+https://github.com/neelnanda-io/neel-plotly.git

Startup script

runuser -l root -c 'export GITHUB_USER=<username> GITHUB_PAT=<PAT> GIT_NAME=<name> GIT_EMAIL=<email>; git config --global user.name "$GIT_NAME"; git config --global user.email "$GIT_EMAIL"; git clone https://$GITHUB_USER:$GITHUB_PAT@github.com/<GitHub repo>.git; cd <GitHub repo>; conda init; exec bash'

IDE

I found JetBrains' support for remote machines lacking - the debugger doesn't work over SSH, and the IDE is slow to download and boot up. I use VSCode instead: its debugger is so slow it's almost useless, but it downloads quickly and has a setting for automatically installing your chosen extensions on remote machines, which works smoothly.
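The setting in question is the Remote - SSH extension's `remote.SSH.defaultExtensions`; in your local settings.json it looks something like this (the extension IDs are examples):

```json
{
  "remote.SSH.defaultExtensions": [
    "ms-python.python",
    "ms-toolsai.jupyter"
  ]
}
```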

comment by philip_b (crabman) · 2023-10-12T05:45:00.562Z · LW(p) · GW(p)

You say Vast.AI is the "most reliable provider". In my experience, it's an unreliable mess, with sometimes-buggy, improperly working servers and non-existent support. I will also say the same about runpod.io. On the other hand, Lambda Labs has been very reliable in my experience and has much better UX. The main problem with Lambda Labs is that nowadays it pretty often has no available servers.

comment by luciaquirke · 2023-10-17T06:26:34.794Z · LW(p) · GW(p)

Thanks for the comment, fair point! I found Vast.AI frustratingly unreliable when I started using it, but it seems to have improved over the last three months, to the point where it feels comparable to (how I remember) Lambda Labs. Lambda Labs definitely has the best UI/UX though. I've amended the post to clarify.

I've had one great and one average experience with RunPod customer service, but haven't interacted with anyone from the other services.