Anecdote from the NYC rationalist (OBNYC) group: Something I think we'd want that other groups might want too is an easy way for people organizing meetups to post to multiple channels like a website, mailing list, LessWrong and meetup.com.
Another issue we have that others may have too is that we tend to host meetups at people's apartments, and the people hosting don't necessarily want their addresses to be posted publicly. We currently handle this by only posting the address on a Google Group which is configured so you have to "apply" with a text box, and then we basically accept every application that seems like a reasonable human or mentions how they found the group. But Google doesn't give us any way to say what to put in the box, or make the box less intimidating. I know when I first visited NYC the "application" almost intimidated me out of joining, and made me more hesitant to show up to some person's apartment in case I was intruding on a social group I wasn't really welcome in. I imagine a lower-friction and more welcoming way to put up a small roadblock to seeing the address would help recruiting.
You mention having a second office in "the city proper": would that mean Bellingham and Peekskill, or Seattle and NYC? Alternatively, would working from home some days of the week be viable for many employees?
I ask this because to me these would make the difference for the viability of living mainly in Seattle/NYC and spending 3 days a week at the campus, as opposed to the reverse case of living mainly near the campus and going into the city on weekends.
This isn't a huge difference from the perspective of doing things on weekends, but it makes a difference for having a significant other who lives in the city and for going to meetups. It means that if you want to live with someone who has to commute from the city, you get to spend 5 of 7 evenings a week with them instead of 3 of 7, which to me seems like a pretty big difference, and I suspect it would seem like a big difference to prospective partners as well. I also find weekly meetups like OBNYC to be a great foundation for a social life, and taking the train both ways for them would be a bit much. So given that OBNYC meetups are on Tuesdays, any MIRI employees living in NYC who spent Wed/Thurs/Fri on campus would be free to attend them.
Personally, dating and socializing concerns mean that I'd find the "live in NYC, spend 3 days a week on a campus in nature" option rather appealing, but the "live on campus, spend weekends at partner's place or hotel in NYC" much less appealing.
I went and checked, and as far as I can tell they used the same 1024 batch size for both the 12-hour and 6-hour times. The changes I noticed were better normalization, label smoothing, a somewhat tweaked input pipeline (not sure if optimization or refactoring) and updating TensorFlow a few versions (which plausibly includes a bunch of hardware optimizations like the ones you're talking about).
The things they took from fast.ai for the 2x speedup were training on progressively larger image sizes and the better triangular learning rate schedule. Separately, for their later submissions, which don't include a single-GPU figure, fast.ai came up with better methods of cropping and augmentation that improve accuracy. I don't necessarily think the pace of 2x speedups through clever ideas is sustainable; lots of the fast.ai ideas seem to be pretty low-hanging fruit.
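For anyone unfamiliar with the two techniques named above, here's a rough sketch of what they look like. This is not fast.ai's or Google's actual code: the function names, learning rates, epoch boundaries and image sizes are all made-up values for illustration.

```python
def triangular_lr(step, total_steps, base_lr=0.1, peak_lr=1.0, warmup_frac=0.5):
    """Triangular schedule: ramp linearly up to peak_lr, then linearly back down.

    All default values are hypothetical, chosen only to make the shape visible.
    """
    if not 0 <= step <= total_steps:
        raise ValueError("step must be within [0, total_steps]")
    if not 0 < warmup_frac < 1:
        raise ValueError("warmup_frac must be strictly between 0 and 1")
    peak_step = warmup_frac * total_steps
    if step <= peak_step:
        frac = step / peak_step  # rising edge of the triangle
    else:
        frac = (total_steps - step) / (total_steps - peak_step)  # falling edge
    return base_lr + (peak_lr - base_lr) * frac


def image_size_for_epoch(epoch, schedule=((0, 128), (15, 224), (30, 288))):
    """Progressive resizing: train on small images first, larger ones later.

    `schedule` is a tuple of (start_epoch, side_length) pairs sorted by
    start_epoch; the numbers are assumptions, not the benchmark's settings.
    """
    size = schedule[0][1]
    for start_epoch, side in schedule:
        if epoch >= start_epoch:
            size = side
    return size


if __name__ == "__main__":
    for step in (0, 25, 50, 75, 100):
        print(step, round(triangular_lr(step, 100), 3))
    for epoch in (0, 14, 15, 40):
        print(epoch, image_size_for_epoch(epoch))
```

The point is that neither idea needs new hardware or much code: early epochs run on cheaper small images, and the learning rate peaks mid-training instead of decaying from the start.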
I basically agree with the quoted part of your take; I just don't think it explains enough of the apathy towards training speed that I see, although it might more fully explain the situation at OpenAI and DeepMind. I'm making more of a revealed-preferences, efficient-markets kind of argument: the fact that those low-hanging fruits weren't picked, and aren't incorporated into the vast majority of deep learning projects, suggests that researchers are sufficiently unconstrained by training times that it isn't worth their time to optimize things.
Like I say in the article though, I'm not super confident and I could be underestimating the zeal for faster training because of sampling error of what I've seen, read and thought of, or it could just be inefficient markets.
A relevant paper came out 3 days ago talking about how AlphaGo used Bayesian hyperparameter optimization and how that improved performance: https://arxiv.org/pdf/1812.06855v1.pdf
It's interesting to set the OpenAI compute article's graph to linear scale so you can see that the compute that went into AlphaGo utterly dwarfs everything else. It seems like DeepMind is definitely ahead of nearly everyone else on the engineering effort and money they've put into scaling.
I just checked and it seems it was fp32. I agree this makes it less impressive; I forgot to check that originally. I still think this somewhat counts as a software win, because getting fp16 training to work required a bunch of programmer effort to take advantage of the hardware, just like optimization to make better use of cache would.
However, there's also a different set of same-machine datapoints available in the benchmark, where training time on a single Cloud TPU v2 went down from 12 hours 30 minutes to 2 hours 44 minutes, which is a roughly 4.5x speedup, similar to the 5x achieved on the V100. The Cloud TPU is special-purpose hardware, and training on it used bfloat16 from the start, so that's a similar-magnitude improvement that is more clearly due to software. The history shows incremental progress down to 6 hours and then a 2x speedup once the fast.ai team published and the Google Brain team incorporated their techniques.
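As a quick sanity check on those numbers, here's the arithmetic spelled out. The start and end times are the ones quoted above; treating the intermediate point as exactly 6 hours is my assumption, since I only have it as a round figure.

```python
from datetime import timedelta


def speedup(before, after):
    """Ratio of two training times; dividing timedeltas yields a float."""
    return before / after


tpu_start = timedelta(hours=12, minutes=30)  # first Cloud TPU v2 time quoted above
tpu_mid = timedelta(hours=6)                 # assumed exact; quoted only as "6 hours"
tpu_end = timedelta(hours=2, minutes=44)     # latest Cloud TPU v2 time quoted above

overall = speedup(tpu_start, tpu_end)
incremental = speedup(tpu_start, tpu_mid)
after_fastai = speedup(tpu_mid, tpu_end)

print(f"overall:      {overall:.2f}x")       # overall:      4.57x
print(f"incremental:  {incremental:.2f}x")   # incremental:  2.08x
print(f"after 6h:     {after_fastai:.2f}x")  # after 6h:     2.20x
```

So under that assumption the total splits into roughly 2x from incremental tuning and roughly 2x from the fast.ai techniques.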