It's really useful to ask the simple question "what tests could have caught the most costly bugs we've had?"
At one job, our code had a lot of math, and the worst bugs were when our data pipelines ran without crashing but gave the wrong numbers, sometimes due to weird stuff like "a bug in our vendor's code caused them to send us numbers denominated in pounds instead of dollars". This is pretty hard to catch with unit tests, but we ended up applying a layer of statistical checks that ran every hour or so and raised an alert if something was anomalous, and those alerts probably saved us more money than all other tests combined.
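As a rough illustration of the kind of check (a minimal sketch, not our actual code – the threshold and structure are made up):

import numpy as np

def check_hourly_metric(history, latest, n_sigma=4):
    # history: past hourly values of one pipeline output; latest: the newest value.
    # Alert when the newest value is far outside the historical distribution, which
    # catches "ran without crashing but produced the wrong numbers" failures.
    mean, std = np.mean(history), np.std(history)
    if abs(latest - mean) > n_sigma * std:
        raise RuntimeError(
            f"Anomalous value {latest:.2f}: more than {n_sigma} sigma "
            f"from the historical mean {mean:.2f}"
        )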
There was a serious bug in this post that invalidated the results, so I took it down for a while. The bug has now been fixed and the posted results should be correct.
One sort-of counterexample would be The Unreasonable Effectiveness of Mathematics in the Natural Sciences, where a lot of math has been surprisingly accurate even when the assumptions were violated.
The Mathematical Theory of Communication by Shannon and Weaver. It's an extended version of Shannon's original paper that established Information Theory, with some extra explanations and background. 144 pages.
Atiyah & Macdonald's Introduction to Commutative Algebra fits. It's 125 pages long, and it's possible to do all the exercises in 2-3 weeks – I did them over winter break in preparation for a course.
Lang's Algebra and Eisenbud's Commutative Algebra are both supersets of Atiyah & Macdonald; I've studied each of those as well and thought A&M was significantly better.
Unfortunately, I think it isn't very compatible with the way management works at most companies. Normally there's pressure to get your tickets done quickly, which leaves less time for "refactor as you go".
I've heard this a lot, but I've worked at 8 companies so far, and none of them have had this kind of time pressure. Is there a specific industry or location where this is more common?
A big piece is that companies are extremely siloed by default. It's pretty easy for a team to improve things in their silo, it's significantly harder to improve something that requires two teams, it's nearly impossible to reach beyond that.
Uber is particularly siloed: they have a huge number of microservices with small teams, at least according to their engineering talks on YouTube. Address validation is probably a separate service from anything related to maps, which in turn is separate from contacts.
Because of silos, companies have to make an extraordinary effort to actually end up with good UX. Apple was an example of this, where the effort was literally driven by the founder & CEO of the company. Tumblr was known for this as well. But from what I've heard, Travis was more of a logistics person than a UX person, etc.
(I don't think silos explain the bank validation issue)
Cooking:
- Smelling ingredients & food is a good way to develop intuition about how things will taste when combined
- Salt early is generally much better than salt late
Data Science:
- Interactive environments like Jupyter notebooks are a huge productivity win, even with their disadvantages
- Automatic code reloading makes Jupyter much more productive (e.g. autoreload for Python, or Revise for Julia)
- Bootstrapping gives you fast, accurate statistics in a lot of areas without needing to be too precise about theory (see the sketch after this list)
Programming:
- Do everything in a virtual environment or the equivalent for your language. Even if you use literally one environment on your machine, the tooling around these tends to be much better
- Have some form of reasonably accurate, reasonably fast feedback loop(s). Types, tests, whatever – the best choice depends a lot on the problem domain. But the worst default is no feedback loop
Ping-pong:
- People adapt to your style very rapidly, even within a single game. Learn 2-3 complementary styles and switch them up when somebody gets used to one
Friendship:
- Set up easy, default ways to interact with your friends, such as getting weekly coffees, making it easy for them to visit, hosting board game nights etc.
- Take notes on what your friends like
- When your friends have persistent problems, take notes on what they've tried. When you hear something they haven't tried, recommend it. This is both practical, and the fact that you've customized the recommendation is generally appreciated
Conversations:
- Realize that small amounts of awkwardness, silence, etc. are generally not a problem. For a long time I was implicitly following a strategy of trying to absolutely minimize awkwardness, which was a bad idea
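On the bootstrapping point above, a minimal sketch (a percentile bootstrap; the lognormal sample is just made-up data for illustration):

import numpy as np

def bootstrap_ci(data, stat=np.mean, n_resamples=10_000, alpha=0.05):
    # Resample with replacement, recompute the statistic each time, and take
    # percentiles of the resampled statistics as a confidence interval.
    data = np.asarray(data)
    stats = np.array([
        stat(np.random.choice(data, size=len(data), replace=True))
        for _ in range(n_resamples)
    ])
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])

samples = np.random.lognormal(size=500)  # skewed data where normal-theory intervals are awkward
low, high = bootstrap_ci(samples, stat=np.median)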
- using vector syntax is much faster than loops in Python
To generalize this slightly, using Python to call C/C++ is generally much faster than pure Python. For example, built-in operations in Pandas tend to be pretty fast, while using .apply() is usually pretty slow.
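A quick sketch of the Pandas point (hypothetical columns; exact timings will vary by machine):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "height": np.random.rand(1_000_000) + 1.5,      # metres
    "weight": np.random.rand(1_000_000) * 50 + 50,  # kg
})

# Vectorized: dispatches to compiled code, typically fast
bmi_fast = df["weight"] / df["height"] ** 2

# .apply with axis=1: a Python-level loop over rows, usually far slower
bmi_slow = df.apply(lambda row: row["weight"] / row["height"] ** 2, axis=1)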
I didn't know about that, thanks!
I found Loop Hero much better with higher speed, which you can get by modifying a variables.ini file: https://www.pcinvasion.com/loop-hero-speed-mod/
I've used Optim.jl for similar problems with good results; here's an example: https://julianlsolvers.github.io/Optim.jl/stable/#user/minimization/
The general lesson is that "magic" interfaces which try to 'do what I mean' are nice to work with at the top-level, but it's a lot easier to reason about composing primitives if they're all super-strict.
100% agree. In general I usually aim to have a thin boundary layer that does validation and converts everything to nice types/data structures, and then a much stricter core of inner functionality. Part of the reason I chose to write about this example is because it's very different from what I normally do.
Important caveat for the pass-through approach: if any of your build_dataset() functions accept **kwargs, you have to be very careful about how they're handled to preserve the property that "calling a function with unused arguments is an error". It was a lot of work to clean this up in Matplotlib...
To make the pass-through approach work, the build_dataset functions do accept excess parameters and throw them away. That's definitely a cost. The easiest way to handle it is to have the build_dataset functions themselves just pass the actually needed arguments to a stricter, core function, e.g.:

def build_dataset(a, b, **kwargs):
    # Ignore any extra keyword arguments; only forward what the strict core needs
    return build_dataset_strict(a, b)

build_dataset(**parameters)  # Succeeds as long as keys named "a" and "b" are in parameters
This is a perfect example of the AWS Batch API 'leaking' into your code. The whole point of a compute resource pool is that you don't have to think about how many jobs you create.
This is true. We're using AWS Batch because it's the best tool we could find for other jobs that actually do need hundreds/thousands of spot instances, and this particular job goes in the middle of those. If most of our jobs looked like this one, using Batch wouldn't make sense.
You get language-level validation either way. The assert statements are superfluous in that sense. What they do add is in effect check_dataset_params(), whose logic probably doesn't belong in this file.
You're right. In the explicit example, it makes more sense to have that sort of logic at the call site.
The reason to be explicit is to be able to handle control flow.
The datasets aren't dependent on each other, though some of them use the same input parameters.
If your jobs are independent, then they should be scheduled as such. This allows jobs to run in parallel.
Sure, there's some benefit to breaking down jobs even further. There's also overhead to spinning up workers. Each of these functions takes ~30s to run, so it ends up being more efficient to put them in one job instead of multiple.
Your errors would come out just as fast if you ran check_dataset_params() up front.
So then you have to maintain check_dataset_params, which gives you a level of indirection. I don't think this is likely to be much less error-prone.
The benefit of the pass-through approach is that it uses language-level features to do the validation – you simply check whether the parameters dict has keywords for each argument the function is expecting.
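A minimal illustration of that language-level check, reusing the build_dataset/parameters names from above:

def build_dataset(a, b):
    ...

build_dataset(**{"a": 1})
# TypeError: build_dataset() missing 1 required positional argument: 'b'

build_dataset(**{"a": 1, "b": 2, "c": 3})
# TypeError: build_dataset() got an unexpected keyword argument 'c'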
A good way to increase feedback rate is to write better tests.
I agree in general, but I don't think there are particularly good ways to test this without introducing indirection.
Failure in production should be the exception, not the norm.
The failure you're talking about here is tripping a try clause. I agree that exceptions aren't the best control flow – I would prefer it if the pattern I'm talking about could be implemented with if statements – but it's not really a major failure, and it's (unfortunately) a pretty common pattern in Python.
"refine definite theories"
Where does this quote come from – is it in the book?
Is there a reason you recommend Hy instead of Clojure? I would suggest Clojure to most people interested in Lisp these days, due to the overwhelmingly larger community and ecosystem, and the existence of ClojureScript.
Ah, that's a great example, thanks for spelling it out.
This is sometimes true in functional programming, but only if you're careful.
I think this overstates the difficulty; referential transparency is the norm in functional programming, not something unusual.
For example, suppose the expression is a function call, and you change the function's definition and restart your program. When that happens, you need to delete the out-of-date entries from the cache or your program will read an out-of-date answer.
As I understand it, this system is mostly useful if you're using it for almost every function. In that case, your inputs are hashes which contain the source code of the function that generated them, and therefore your caches will be invalidated if an upstream function's source code changes.
Also, since you're using the text of an expression for the cache key, you should only use expressions that don't refer to any local variables.
Agreed.
So this might be okay in simple cases when you are working alone and know what you're doing, but it likely would result in confusion when working on a team.
I agree that it's essentially a framework, and you'd need buy-in from a team in order to consistently use it in a repository. But I've seen teams buy into heavier frameworks pretty regularly; this version seems unusual but not particularly hard to use/understand. It's worth noting that bad caching systems are pretty common in data science, so something like this is potentially a big improvement there.
This is very cool. The focus on caching a code block instead of just the inputs to the function makes it significantly more stable, since your cache will be automatically invalidated if you change the code in any way.
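A minimal sketch of the idea (not the library being discussed – just a toy decorator keyed on the function's own source plus its arguments; it assumes the function is defined in a file so inspect.getsource works, and that the arguments and result are picklable; the real system also propagates hashes from upstream functions, which this toy version doesn't):

import hashlib
import inspect
import os
import pickle

def code_aware_cache(func):
    # Key the cache on a hash of the function's source code plus its arguments,
    # so editing the function automatically invalidates stale entries.
    def wrapper(*args, **kwargs):
        key_material = (inspect.getsource(func), args, sorted(kwargs.items()))
        key = hashlib.sha256(pickle.dumps(key_material)).hexdigest()
        path = os.path.join(".cache", f"{func.__name__}-{key}.pkl")
        if os.path.exists(path):
            with open(path, "rb") as f:
                return pickle.load(f)
        result = func(*args, **kwargs)
        os.makedirs(".cache", exist_ok=True)
        with open(path, "wb") as f:
            pickle.dump(result, f)
        return result
    return wrapper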
If you're using non-modal editing, in that example you could press Alt+rightarrow three times, use cmd+f, the end key (and back one word), or cmd+rightarrow (and back one word). That's not even counting shortcuts specific to another IDE or editor. Why, in your mental model, does the non-modal version feel like fewer choices? I suspect it's just familiarity – you've settled on some options you use the most, rather than trying to calculate the optimum fewest keystrokes each time.
Have you ever seen an experienced vim user? 3-5 seconds latency is completely unrealistic. It sounds to me like you're describing the experience of someone who's a beginner at vim and has spent half their life in non-modal editing, and in that case, of course you're going to be much faster with the second. And to be fair, vim is extremely beginner-unfriendly in ways that are bad and could be fixed without harming experts – Kakoune (https://kakoune.org/) is similar but vastly better designed for learning.
As a side note, this is my last post in this conversation. I feel like we have mostly been repeating the same points and going nowhere.
I ended up using cmd+shift+i which opens the find/replace panel with the default set to backwards.
So, one of the arguments you've made at several points is that we should expect Vim to be slower because it has more choices. This seems incorrect to me: even a simple editor like Sublime Text has about a thousand keyboard shortcuts, which are mostly ad hoc and need to be memorized separately. In contrast, Vim has a small, (mostly) composable language. I just counted lsusr's post, and it has fewer than 30 distinct components – most of the text is showing different ways to combine them.
The other thing to consider is that most programmers will use at least a dozen editors/IDEs in their careers. I have 5 open on my laptop right now, and it's not because I want to! Vim provides a unified set of key bindings across practically every editor, whereas the editors' own defaults normally have very different ways of doing things.
So that's a 10x-100x reduction in vocabulary size, which should at least make you consider the idea that Vim has lower latency.
I did :Tutor on Neovim and only did commands that actually involved editing text; it took 5:46.
Now trying in Sublime Text. Edit: 8:38 in Sublime, without vim mode – a big difference! It felt like it was mostly uniform, but one area where I was significantly slower was search and replace, because I couldn't figure out how to go backwards easily.
This is a great experiment – I'll try it out too. I also have pretty decent habits for non-vim editing, so it'll be interesting to see.
Some IDEs are just very accommodating about this, e.g. PyCharm. So that's great.
Some of them aren't, like VS Code. For those, I just manually reconfigure the clashing key bindings. It's annoying, but it only takes ~15 minutes total.
I would expect using VIM to increase latency. While you are going to press fewer keys you are likely going to take slightly longer to press the keys as using any key is more complex.
This really isn't my experience. Once you've practiced something enough that it becomes a habit, the latency is significantly lower. Anecdotally, I've pretty consistently seen people who're used to vim accomplish text editing tasks much faster than people who aren't, unless the latter is an expert in keyboard shortcuts of another editor such as emacs.
There's the paradox of choice and having more choices to accomplish a task costs mental resources. Vim forces me to spend cognitive resources to choose between different alternatives of how to accomplish a task.
All the professional UX people seem to advocate making interfaces as simple as possible.
You want simple interfaces for beginners. Interfaces popular among professionals tend to be pretty complex, see e.g. Bloomberg Terminal or Photoshop or even Microsoft Excel.
As far as I know, there's almost no real measurement of how developer tools affect productivity. Without data, I think there are two main categories in which editor features, including keyboard shortcuts, can make you more productive:
- By making difficult tasks medium to easy
- By making ~10s tasks take ~1s
An example of the first would be automatically syncing your code to a remote development instance. An example of the second would be adding a comma to the end of several lines at once using a macro. IDEs tend to focus on the first, text editors on the second.
In general, I think it's very likely that the first case makes you more productive. What about the second?
My recollection is that in studies of how humans respond to feedback, even relatively small differences in latency have large effects. Something like vim gives you hundreds of these small latency reductions (learning another editor's keyboard shortcuts very well probably does too). I can point to dozens of little things that are easier with vim; conversely, nothing is harder, because you can always just drop into insert mode.
I agree that this isn't nearly as convincing as actual studies would be, but constructing a reasonable study on this seems pretty difficult.
Very cool, thanks for writing this up. Hard-to-predict access in loops is an interesting case, and it makes sense that AoS would beat SoA there.
Yeah, SIMD is a significant point I forgot to mention.
It's a fair amount of work to switch between SoA and AoS in most cases, which makes benchmarking hard! StructArrays.jl makes this pretty doable in Julia, and Jonathan Blow talks about making it simple to switch between SoA and AoS in his programming language Jai. I would definitely like to see more languages making it easy to just try one and benchmark the results.
"Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%." – Donald Knuth
Yup, these are all reasons to prefer column orientation over row orientation for analytics workloads. In my opinion data locality trumps everything, but compression and fast transmission are definitely very nice.
Until recently, numpy and pandas were row oriented, and this was a major bottleneck. A lot of pandas's strange API is apparently due to working around row orientation. See e.g. this article by Wes McKinney, creator of pandas: https://wesmckinney.com/blog/apache-arrow-pandas-internals/#:~:text=Arrow's%20C%2B%2B%20implementation%20provides%20essential,optimized%20for%20analytical%20processing%20performance
I see where that intuition comes from, and at first I thought that would be the case. But the machine is very good at iterating through pairs of arrays. Continuing the previous example:
function my_double_sum(data)
    sum_heights_and_weights = 0
    for row in data
        sum_heights_and_weights += row.weight + row.height
    end
    return sum_heights_and_weights
end

@btime(my_double_sum(heights_and_weights))
> 50.342 ms (1 allocation: 16 bytes)

function my_double_sum2(heights, weights)
    sum_heights_and_weights = 0
    for (height, weight) in zip(heights, weights)
        sum_heights_and_weights += height + weight
    end
    return sum_heights_and_weights
end

@btime(my_double_sum2(just_heights, just_weights))
> 51.060 ms (1 allocation: 16 bytes)
There's also a transcript: https://www.cs.virginia.edu/~robins/YouAndYourResearch.html
I've re-read it at least 5 times and highly recommend it.
Of course it's easy! You just compare how much you've made, and how long you've stayed solvent, against the top 1% of traders. If you've already done just as well as them, you're in the top 1%. Otherwise, you aren't.
This object-level example is actually harder than it appears: the performance of a fund or trader in one time period generally has very low correlation with its performance in the next, e.g. see this paper: https://www.researchgate.net/profile/David-Smith-256/publication/317605916_Evaluating_Hedge_Fund_Performance/links/5942df6faca2722db499cbce/Evaluating-Hedge-Fund-Performance.pdf
There's a fair amount of debate over how much data you need to evaluate whether a person is a consistently good trader; in my moderately informed opinion, a trader who does well over 2 years is significantly more likely to be lucky than skilled.
An incomplete list of caveats to Sharpe off the top of my head:
- We can never measure the true Sharpe of a strategy (how it would theoretically perform on average over all time), only the observed Sharpe ratio (sketched after this list), which can be radically different, especially for strategies with significant tail risk. There are a wide variety of strategies that might have a very high observed Sharpe over a few years, but a much lower true Sharpe
- Sharpe typically doesn't measure costs like infrastructure or salaries, just losses to the fund itself. So e.g. you could view working at a company and earning a salary as a financial strategy with a nearly infinite Sharpe, but that's not necessarily appealing. There are actually a fair number of hedge funds whose function is more similar to providing services in exchange for relatively guaranteed pay
- High-Sharpe strategies are often constrained by capacity. For example, my friend once offered to pay me $51 on Venmo if I gave her $50 in cash, which is a very high return on investment given that the transaction took just a few minutes, but I doubt she would have been willing to do the same thing at a million times the scale. Similarly, there are occasionally investment strategies with very high Sharpes that can only handle a relatively small amount of money
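For reference, a sketch of the observed-Sharpe calculation mentioned above (toy daily returns; 252 trading days assumed for annualization):

import numpy as np

def observed_sharpe(returns, risk_free=0.0, periods_per_year=252):
    # Annualized observed Sharpe from periodic returns. This is only an estimate of the
    # "true" Sharpe, and can differ wildly from it for strategies with significant tail risk.
    excess = np.asarray(returns) - risk_free
    return np.sqrt(periods_per_year) * excess.mean() / excess.std(ddof=1)

daily = np.random.normal(0.0005, 0.01, size=750)  # ~3 years of made-up daily returns
print(observed_sharpe(daily))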
This is very, very cool. Having come from the functional programming world, I frequently miss these features when doing machine learning in Python, and haven't been able to easily replicate them. I think there's a lot of easy optimization that could happen in day-to-day exploratory machine learning code that bog standard pandas/scikit-learn doesn't do.
If N95 masks work, R95-100 and P95-100 masks should also work, and potentially be more effective – the stuff they filter is a superset of what N95s filter. They're normally more expensive, but in the current situation I've actually found P100s cheaper than N95s.
I don't really understand what you mean by "from first principles" here. Do you mean in a way that's intuitive to you? Or in a way that includes all the proofs?
Any field of math is typically more general than any one intuition allows, so it's a little dangerous to think in terms of what it's "really" doing. I find that most people learn best by starting with a small number of concrete intuitions – e.g., groups of symmetries for group theory, or posets for category theory – and gradually expanding.
In the case of Complex Analysis, I find the intuition of the Riemann Sphere to be particularly useful, though I don't have a good book recommendation.
One major confounder is that caffeine is also a painkiller, many people have mild chronic pain, and I think there's a very plausible mechanism by which painkillers improve productivity, i.e. simply by allowing someone to focus better.
Anecdotally, I've noticed that "resetting" caffeine tolerance is very quick compared to most drugs, taking something like 2-3 days without caffeine for several people I know, including myself.
The studies I could find on caffeine are highly contradictory, e.g. from Wikipedia, "Caffeine has been shown to have positive, negative, and no effects on long-term memory."
I'm under the impression that there's no general evidence for stimulants increasing productivity, although there are several specific cases, such as e.g. treating ADHD.
One key dimension is decomposition – I would say any gears model provides decomposition, but models can have it without gears.
For example, the error in any machine learning model can be broken down into bias + variance, which provides a useful model for debugging. But these don't feel like gears in any meaningful sense, whereas, say, bootstrapping + weak learners feel like gears in understanding Random Forests.
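For concreteness, the standard squared-error form of that decomposition (a sketch of the usual textbook statement, with $\hat{f}$ the learned model, $f$ the true function, and $\sigma^2$ the irreducible noise):

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}} + \sigma^2$$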
I think it is true that gears-level models are systematically undervalued, and that part of the reason is because of the longer payoff curve.
A simple example is debugging code: a gears-level approach is to try to understand what the code is doing and why it doesn't do what you want; a black-box approach is to try changing things somewhat randomly. Most programmers I know will agree that the gears-level approach is almost always better, but that they at least sometimes end up doing the black-box approach when tired/frustrated/stuck.
And companies that focus too much on short-term results (most of them, IMO) will push programmers to spend far more time on black-box debugging than is optimal.
Perhaps part of the reason why the choice appears to typically be obvious is that gears methods are underestimated.
Black-box approaches often fail to generalize within the domain, but generalize well across domains. Neural Nets may teach you less about medicine than a PGM, but they'll also get you good results in image recognition, transcription, etc.
This can lead to interesting principal-agent problems: an employee benefits more from learning something generalizable across businesses and industries, while employers will generally prefer the best domain-specific solution.
Nit: giving IQ tests is not super cheap, because it puts companies at a nebulous risk of being sued for disparate impact (see e.g. https://en.wikipedia.org/wiki/Griggs_v._Duke_Power_Co.).
I agree with all the major conclusions though.
For the orthogonal decomposition, don't you need two scalars? E.g. $v = a\,u + b\,u^{\perp}$. For example, in $\mathbb{R}^2$, let $u = (1, 0)$, $u^{\perp} = (0, 1)$, and $v = (2, 3)$. Then $v = 2u + 3u^{\perp}$, and there's no way to write $v$ as $a\,u + u^{\perp}$.
My favorite book, by far, is Functional Programming in Scala. This book has you derive most of the concepts from scratch, to the point where even complex abstractions feel like obvious consequences of things you've already built.
If you want something more Haskell-focused, a good choice is Programming in Haskell.
I didn't downvote, but I agree that this is a suboptimal meme – though the prevailing mindset of "almost nobody can learn Calculus" is much worse.
As a datapoint, it took me about two weeks of obsessive, 15 hour/day study to learn Calculus to a point where I tested out of the first two courses when I was 16. And I think it's fair to say I was unusually talented and unusually motivated. I would not expect the vast majority of people to be able to grok Calculus within a week, though obviously people on this site are not a representative sample.
A good exposition of the related theorems is in Chapter 6 of Understanding Machine Learning (https://www.amazon.com/Understanding-Machine-Learning-Theory-Algorithms/dp/1107057132/ref=sr_1_1?crid=2MXVW7VOQH6FT&keywords=understanding+machine+learning+from+theory+to+algorithms&qid=1562085244&s=gateway&sprefix=understanding+machine+%2Caps%2C196&sr=8-1)
There are several related theorems. Roughly:
1. The error on real data will be similar to the error on the training set + epsilon, where epsilon is roughly proportional to (VC dimension / datapoints). This is the one I linked above.
2. The error on real data will be similar to the error of the best hypothesis in the hypothesis class, with similar proportionality
3. Special case of 2 – if the true hypothesis is in the hypothesis class, then the absolute error will be < epsilon (since the absolute error is just the difference from the true, best hypothesis.)
3 is probably the one you're thinking of, but you don't need the hypothesis to be in the class.
Yes, roughly speaking, if you multiply the VC dimension by n, then you need n times as much training data to achieve the same performance. (More precise statement here: https://en.wikipedia.org/wiki/Vapnik%E2%80%93Chervonenkis_dimension#Uses) There are also a few other bounds you can get based on VC dimension. In practice these bounds are way too large to be useful, but an algorithm with much higher VC dimension will generally overfit more.
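For reference, one common form of the bound behind these statements (a sketch; constants and log factors vary by source), with $d$ the VC dimension and $n$ the number of datapoints: with probability at least $1 - \delta$ over the training sample, for every hypothesis $h$ in the class,

$$\operatorname{err}(h) \;\le\; \widehat{\operatorname{err}}(h) + \sqrt{\frac{d\left(\ln\tfrac{2n}{d} + 1\right) + \ln\tfrac{4}{\delta}}{n}}$$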