Posts

Two Designs 2021-04-22T16:32:26.425Z
Moving Data Around is Slow 2021-03-22T02:36:08.480Z
Kalman Filter for Bayesians 2018-10-22T17:06:02.783Z
Systemizing and Hacking 2018-03-23T18:01:33.212Z
Inference & Empiricism 2018-03-20T15:47:00.316Z
Examples of Mitigating Assumption Risk 2017-11-30T02:09:23.852Z
Competitive Truth-Seeking 2017-11-01T12:06:59.314Z
How to Learn from Experts 2013-10-04T17:02:01.502Z
Systematic Lucky Breaks 2013-10-03T01:46:26.652Z

Comments

Comment by SatvikBeri on Always know where your abstractions break · 2022-11-28T17:42:59.755Z · LW · GW

One sort-of counterexample would be The Unreasonable Effectiveness of Mathematics in the Natural Sciences, where a lot of Math has been surprisingly accurate even when its assumptions were violated.

Comment by SatvikBeri on Request for small textbook recommendations · 2022-05-25T23:12:39.429Z · LW · GW

The Mathematical Theory of Communication by Shannon and Weaver. It's an extended version of Shannon's original paper that established Information Theory, with some extra explanations and background. 144 pages.

Comment by SatvikBeri on Request for small textbook recommendations · 2022-05-25T23:10:22.798Z · LW · GW

Atiyah & Macdonald's Introduction to Commutative Algebra fits. It's 125 pages long, and it's possible to do all the exercises in 2-3 weeks – I did them over winter break in preparation for a course.

Lang's Algebra and Eisenbud's Commutative Algebra are both supersets of Atiyah & Macdonald. I've studied each of those as well and thought A&M was significantly better.

Comment by SatvikBeri on Taking the outside view on code quality · 2021-05-07T23:01:12.953Z · LW · GW

Unfortunately, I think it isn't very compatible with the way management works at most companies. Normally there's pressure to get your tickets done quickly, which leaves less time for "refactor as you go".

I've heard this a lot, but I've worked at 8 companies so far, and none of them have had this kind of time pressure. Is there a specific industry or location where this is more common?

Comment by SatvikBeri on Why are the websites of major companies so bad at core functionality? · 2021-05-06T18:34:09.604Z · LW · GW

A big piece is that companies are extremely siloed by default. It's pretty easy for a team to improve things in their silo, it's significantly harder to improve something that requires two teams, it's nearly impossible to reach beyond that.

Uber is particularly siloed: they have a huge number of microservices with small teams, at least according to their engineering talks on YouTube. Address validation is probably a separate service from anything related to maps, which in turn is separate from contacts.

Because of silos, companies have to make an extraordinary effort to actually end up with good UX. Apple was an example of this, where the effort was literally driven by the founder & CEO of the company. Tumblr was known for this as well. But from what I heard, Travis was more of a logistics person than a UX person, etc.

(I don't think silos explain the bank validation issue)

Comment by SatvikBeri on [link] If something seems unusually hard for you, see if you're missing a minor insight · 2021-05-05T16:23:43.259Z · LW · GW

Cooking: 

  • Smelling ingredients & food is a good way to develop intuition about how things will taste when combined
  • Salt early is generally much better than salt late

Data Science:

  • Interactive environments like Jupyter notebooks are a huge productivity win, even with their disadvantages
  • Automatic code reloading makes Jupyter much more productive (e.g. autoreload for Python, or Revise for Julia)
  • Bootstrapping gives you fast, accurate statistics in a lot of areas without needing to be too precise about theory (minimal sketch below)
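
A minimal sketch of what I mean by bootstrapping, using numpy – the data, resample count, and 95% interval are just for illustration:

import numpy as np

def bootstrap_ci(data, stat=np.mean, n_resamples=10_000, alpha=0.05):
    # Resample the data with replacement and recompute the statistic each time.
    rng = np.random.default_rng()
    stats = [stat(rng.choice(data, size=len(data), replace=True))
             for _ in range(n_resamples)]
    # The empirical quantiles of the resampled statistics give a confidence interval.
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])

heights = np.random.default_rng(0).normal(170, 10, size=200)  # toy data
print(bootstrap_ci(heights))  # roughly a 95% CI for the mean height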

Programming:

  • Do everything in a virtual environment or the equivalent for your language. Even if you use literally one environment on your machine, the tooling around these tends to be much better
  • Have some form of reasonably accurate, reasonably fast feedback loop(s). Types, tests, whatever – the best choice depends a lot on the problem domain. But the worst default is no feedback loop

Ping-pong:

  • People adapt to your style very rapidly, even within a single game. Learn 2-3 complementary styles and switch them up when somebody gets used to one

Friendship:

  • Set up easy, default ways to interact with your friends, such as getting weekly coffees, making it easy for them to visit, hosting board game nights etc.
  • Take notes on what your friends like
  • When your friends have persistent problems, take notes on what they've tried. When you hear something they haven't tried, recommend it. This is practical, and the fact that you've tailored the suggestion to them is generally appreciated

Conversations:

  • Realize that small amounts of awkwardness, silence etc. are generally not a problem. I was implicitly following a strategy that tried to absolutely minimize awkwardness for a long time, which was a bad idea

Comment by SatvikBeri on [link] If something seems unusually hard for you, see if you're missing a minor insight · 2021-05-05T15:36:12.065Z · LW · GW

  • using vector syntax is much faster than loops in Python

To generalize this slightly, using Python to call C/C++ is generally much faster than pure Python. For example, built-in operations in Pandas tend to be pretty fast, while using .apply() is usually pretty slow.
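
A hedged illustration with toy data – the exact speedup varies a lot, but the vectorized version is usually orders of magnitude faster here:

import numpy as np
import pandas as pd

df = pd.DataFrame({"height": np.random.rand(100_000),
                   "weight": np.random.rand(100_000)})

# Vectorized: the loop over rows happens in C inside pandas/numpy.
fast = df["height"] + df["weight"]

# .apply with axis=1: the loop runs row by row in pure Python.
slow = df.apply(lambda row: row["height"] + row["weight"], axis=1)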

Comment by SatvikBeri on Two Designs · 2021-04-30T16:51:54.614Z · LW · GW

I didn't know about that, thanks!

Comment by SatvikBeri on Facebook is Simulacra Level 3, Andreessen is Level 4 · 2021-04-28T19:13:04.686Z · LW · GW
Comment by SatvikBeri on Spoiler-Free Reviews: Monster Slayers, Dream Quest and Loop Hero · 2021-04-26T16:35:56.828Z · LW · GW

I found Loop Hero much better with higher speed, which you can fix by modifying a variables.ini file: https://www.pcinvasion.com/loop-hero-speed-mod/

Comment by SatvikBeri on Is there a good software solution for mathematical questions? · 2021-04-25T23:16:03.232Z · LW · GW

I've used Optim.jl for similar problems with good results, here's an example: https://julianlsolvers.github.io/Optim.jl/stable/#user/minimization/

Comment by SatvikBeri on Two Designs · 2021-04-23T03:39:31.744Z · LW · GW

The general lesson is that "magic" interfaces which try to 'do what I mean' are nice to work with at the top-level, but it's a lot easier to reason about composing primitives if they're all super-strict.

100% agree. In general I usually aim to have a thin boundary layer that does validation and converts everything to nice types/data structures, and then a much stricter core of inner functionality. Part of the reason I chose to write about this example is because it's very different from what I normally do. 

Important caveat for the pass-through approach: if any of your build_dataset() functions accept **kwargs, you have to be very careful about how they're handled to preserve the property that "calling a function with unused arguments is an error". It was a lot of work to clean this up in Matplotlib...

To make the pass-through approach work, the build_dataset functions do accept excess parameters and throw them away. That's definitely a cost. The easiest way to handle it is to have the build_dataset functions themselves just pass the actually needed arguments to a stricter, core function, e.g.:

def build_dataset(a, b, **kwargs):
    # Accept and discard any extra keyword arguments, then delegate to the strict core.
    return build_dataset_strict(a, b)


build_dataset(**parameters) # Succeeds as long as keys named "a" and "b" are in parameters

Comment by SatvikBeri on Two Designs · 2021-04-22T20:59:30.478Z · LW · GW

This is a perfect example of the AWS Batch API 'leaking' into your code. The whole point of a compute resource pool is that you don't have to think about how many jobs you create.
 

This is true. We're using AWS Batch because it's the best tool we could find for other jobs that actually do need hundreds/thousands of spot instances, and this particular job goes in the middle of those. If most of our jobs looked like this one, using Batch wouldn't make sense.

You get language-level validation either way. The assert statements are superfluous in that sense. What they do add is in effect check_dataset_params(), whose logic probably doesn't belong in this file.

You're right. In the explicit example, it makes more sense to have that sort of logic at the call site. 

Comment by SatvikBeri on Two Designs · 2021-04-22T19:42:35.100Z · LW · GW

The reason to be explicit is to be able to handle control flow.

The datasets aren't dependent on each other, though some of them use the same input parameters.

If your jobs are independent, then they should be scheduled as such. This allows jobs to run in parallel.

Sure, there's some benefit to breaking down jobs even further. There's also overhead to spinning up workers. Each of these functions takes ~30s to run, so it ends up being more efficient to put them in one job instead of multiple.

Your errors would come out just as fast if you ran check_dataset_params() up front.

So then you have to maintain check_dataset_params, which gives you a level of indirection. I don't think this is likely to be much less error-prone.

The benefit of the pass-through approach is that it uses language-level features to do the validation – you simply check whether the parameters dict has keywords for each argument the function is expecting.
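
Concretely, with the hypothetical build_dataset from my other comment, a missing key fails loudly at the call site:

def build_dataset(a, b, **kwargs):
    ...

parameters = {"a": 1, "c": 3}  # no "b"
build_dataset(**parameters)    # TypeError: build_dataset() missing 1 required positional argument: 'b'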

A good way to increase feedback rate is to write better tests.

I agree in general, but I don't think there are particularly good ways to test this without introducing indirection.

Failure in production should be the exception, not the norm.

The failure you're talking about here is just an exception being caught by a try/except block. I agree that exceptions aren't the best control flow – I would prefer it if the pattern I'm talking about could be implemented with if statements – but it's not really a major failure, and it's (unfortunately) a pretty common pattern in Python. 

Comment by SatvikBeri on Thiel on secrets and indefiniteness · 2021-04-22T18:02:12.709Z · LW · GW

"refine definite theories"

Where does this quote come from – is it in the book?

Comment by SatvikBeri on [Letter] Advice for High School #1 · 2021-04-20T16:10:37.477Z · LW · GW

Is there a reason you recommend Hy instead of Clojure? I would suggest Clojure to most people interested in Lisp these days, due to the overwhelmingly larger community, ecosystem, & existence of Clojurescript. 

Comment by SatvikBeri on Place-Based Programming - Part 1 - Places · 2021-04-16T02:58:39.456Z · LW · GW

Ah, that's a great example, thanks for spelling it out.

Comment by SatvikBeri on Place-Based Programming - Part 1 - Places · 2021-04-15T16:32:11.284Z · LW · GW

This is sometimes true in functional programming, but only if you're careful.

I think this overstates the difficulty: referential transparency is the norm in functional programming, not something unusual.

For example, suppose the expression is a function call, and you change the function's definition and restart your program. When that happens, you need to delete the out-of-date entries from the cache or your program will read an out-of-date answer.

As I understand it, this system is mostly useful if you're using it for almost every function. In that case, your inputs are hashes which contain the source code of the functions that generated them, so your caches will be invalidated if an upstream function's source code changes.
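
A minimal sketch of the kind of cache key I have in mind – this is my own toy decorator, not the post's implementation, and a real version would persist the cache to disk:

import hashlib
import inspect

_cache = {}

def cached(fn):
    # Key on the function's own source code plus its arguments, so editing the
    # function invalidates its old entries. If the arguments are themselves
    # hashes produced the same way, changes to upstream functions propagate too.
    src = inspect.getsource(fn)
    def wrapper(*args):
        key = hashlib.sha256(repr((src, args)).encode()).hexdigest()
        if key not in _cache:
            _cache[key] = fn(*args)
        return _cache[key]
    return wrapper

You'd then decorate the expensive functions with @cached and pass the resulting values (or their hashes) between them.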

Also, since you're using the text of an expression for the cache key, you should only use expressions that don't refer to any local variables.

Agreed.

So this might be okay in simple cases when you are working alone and know what you're doing, but it likely would result in confusion when working on a team.

I agree that it's essentially a framework, and you'd need buy-in from a team in order to consistently use it in a repository. But I've seen teams buy into heavier frameworks pretty regularly; this version seems unusual but not particularly hard to use/understand. It's worth noting that bad caching systems are pretty common in data science, so something like this is potentially a big improvement there.

Comment by SatvikBeri on Place-Based Programming - Part 1 - Places · 2021-04-14T23:39:12.677Z · LW · GW
Comment by SatvikBeri on Place-Based Programming - Part 1 - Places · 2021-04-14T23:15:36.771Z · LW · GW

This is very cool. The focus on caching a code block instead of just the inputs to the function makes it significantly more stable, since your cache will be automatically invalidated if you change the code in any way.

Comment by SatvikBeri on Vim · 2021-04-09T16:13:39.535Z · LW · GW

If you're using non-modal editing, in that example you could press Alt+rightarrow three times, use cmd+f, the end key (and back one word), or cmd+rightarrow (and back one word). That's not even counting shortcuts specific to another IDE or editor. Why, in your mental model, does the non-modal version feel like fewer choices? I suspect it's just familiarity – you've settled on some options you use the most, rather than trying to calculate the optimum fewest keystrokes each time.

Have you ever seen an experienced vim user? 3-5 seconds of latency is completely unrealistic. It sounds to me like you're describing the experience of someone who's a beginner at vim and has spent half their life in non-modal editing, and in that case, of course you're going to be much faster with the latter. And to be fair, vim is extremely beginner-unfriendly in ways that are bad and could be fixed without harming experts – Kakoune (https://kakoune.org/) is similar but vastly better designed for learning.

As a side note, this is my last post in this conversation. I feel like we have mostly been repeating the same points and going nowhere.

Comment by SatvikBeri on Vim · 2021-04-09T15:45:35.952Z · LW · GW

I ended up using cmd+shift+i which opens the find/replace panel with the default set to backwards.

Comment by SatvikBeri on Vim · 2021-04-08T13:35:55.409Z · LW · GW

So, one of the arguments you've made at several points is that we should expect Vim to be slower because it has more choices. This seems incorrect to me: even a simple editor like Sublime Text has about a thousand keyboard shortcuts, which are mostly ad hoc and need to be memorized separately. In contrast, Vim has a small, (mostly) composable language. I just counted lsusr's post, and it has fewer than 30 distinct components – most of the text is showing different ways to combine them.

The other thing to consider is that most programmers will use at least a dozen editors/IDEs in their careers. I have 5 open on my laptop right now, and it's not because I want to! Vim provides a unified set of key bindings among practically every editor, which normally have very different ways of doing things.

So that's roughly a 10x-100x reduction in vocabulary size, which should at least make you consider the idea that Vim has lower latency.

Comment by SatvikBeri on Vim · 2021-04-08T01:01:19.049Z · LW · GW

I did :Tutor on neovim and only did commands that actually involved editing text, it took 5:46.

Now trying in Sublime Text. Edit: 8:38 in Sublime, without vim mode – a big difference! The slowdown felt mostly uniform, but one area where I was significantly slower was search and replace, because I couldn't figure out how to go backwards easily.

Comment by SatvikBeri on Vim · 2021-04-07T23:42:41.719Z · LW · GW

This is a great experiment, I'll try it out too. I also have pretty decent habits for non-vim editing so it'll be interesting to see.

Comment by SatvikBeri on Vim · 2021-04-07T23:17:00.132Z · LW · GW

Some IDEs are just very accommodating about this, e.g. PyCharm. So that's great.

Some of them aren't, like VS Code. For those, I just manually reconfigure the clashing key bindings. It's annoying, but it only takes ~15 minutes total.

Comment by SatvikBeri on Vim · 2021-04-07T22:49:40.651Z · LW · GW

I would expect using VIM to increase latency. While you are going to press fewer keys you are likely going to take slightly longer to press the keys as using any key is more complex.

This really isn't my experience. Once you've practiced something enough that it becomes a habit, the latency is significantly lower. Anecdotally, I've pretty consistently seen people who're used to vim accomplish text editing tasks much faster than people who aren't, unless the latter is an expert in keyboard shortcuts of another editor such as emacs.

There's the paradox of choice and having more choices to accomplish a task costs mental resources. Vim forces me to spent cognitive resources to chose between different alternatives of how to accomplish a task.

All the professional UX people seem to advocate making interfaces as simple as possible.

You want simple interfaces for beginners. Interfaces popular among professionals tend to be pretty complex, see e.g. Bloomberg Terminal or Photoshop or even Microsoft Excel.

Comment by SatvikBeri on Vim · 2021-04-07T21:12:06.147Z · LW · GW

As far as I know there's almost no measurement of productivity of developer tools. Without data, I think there are two main categories in which editor features, including keyboard shortcuts, can make you more productive:

  1. By making difficult tasks medium to easy
  2. By making ~10s tasks take ~1s

An example of the first would be automatically syncing your code to a remote development instance. An example of the second would be adding a comma to the end of several lines at once using a macro. IDEs tend to focus on 1, text editors tend to focus on 2.

In general, I think it's very likely that the first case makes you more productive. What about the second?

My recollection is that in studies of how humans respond to feedback, even relatively small differences in latency have large effects. Something like vim gives you hundreds of these small latency reductions (learning another editor's keyboard shortcuts very well probably does too.) I can point to dozens of little things that are easier with vim; conversely, nothing is harder, because you can always just drop into insert mode.

I agree that this isn't nearly as convincing as actual studies would be, but constructing a reasonable study on this seems pretty difficult.

Comment by SatvikBeri on Moving Data Around is Slow · 2021-03-22T20:28:42.407Z · LW · GW

Very cool, thanks for writing this up. Hard-to-predict access in loops is an interesting case, and it makes sense that AoS would beat SoA there.

Yeah, SIMD is a significant point I forgot to mention.

It's a fair amount of work to switch between SoA and AoS in most cases, which makes benchmarking hard! StructArrays.jl makes this pretty doable in Julia, and Jonathan Blow talks about making it simple to switch between SoA and AoS in his programming language Jai. I would definitely like to see more languages making it easy to just try one and benchmark the results.

Comment by SatvikBeri on Moving Data Around is Slow · 2021-03-22T18:36:26.125Z · LW · GW

"Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.  Yet we should not pass up our opportunities in that critical 3%." – Donald Knuth

Comment by SatvikBeri on Moving Data Around is Slow · 2021-03-22T18:32:37.699Z · LW · GW

Yup, these are all reasons to prefer column orientation over row orientation for analytics workloads. In my opinion data locality trumps everything, but compression and fast transmission are definitely very nice.

Until recently, numpy and pandas were row oriented, and this was a major bottleneck. A lot of pandas's strange API is apparently due to working around row orientation. See e.g. this article by Wes McKinney, creator of pandas: https://wesmckinney.com/blog/apache-arrow-pandas-internals/#:~:text=Arrow's%20C%2B%2B%20implementation%20provides%20essential,optimized%20for%20analytical%20processing%20performance

Comment by SatvikBeri on Moving Data Around is Slow · 2021-03-22T16:57:31.018Z · LW · GW

I see where that intuition comes from, and at first I thought that would be the case. But the machine is very good at iterating through pairs of arrays. Continuing the previous example:

# Array-of-structs version: iterate over rows and pull two fields from each.
function my_double_sum(data)
    sum_heights_and_weights = 0
    for row in data
        sum_heights_and_weights += row.weight + row.height
    end
    return sum_heights_and_weights
end
@btime(my_double_sum(heights_and_weights))
>   50.342 ms (1 allocation: 16 bytes)

# Struct-of-arrays version: zip the two column arrays and iterate over pairs.
function my_double_sum2(heights, weights)
    sum_heights_and_weights = 0
    for (height, weight) in zip(heights, weights)
        sum_heights_and_weights += height + weight
    end
    return sum_heights_and_weights
end
@btime(my_double_sum2(just_heights, just_weights))
>   51.060 ms (1 allocation: 16 bytes)
Comment by SatvikBeri on "You and Your Research" – Hamming Watch/Discuss Party · 2021-03-21T04:58:00.364Z · LW · GW

There's also a transcript: https://www.cs.virginia.edu/~robins/YouAndYourResearch.html

I've re-read it at least 5 times, highly recommend it.

Comment by SatvikBeri on Strong Evidence is Common · 2021-03-14T01:06:03.791Z · LW · GW

Of course it's easy! You just compare how much you've made, and how long you've stayed solvent, against the top 1% of traders. If you've already done just as well as the others, you'd in the top 1%. Otherwise, you aren't.

This object-level example is actually harder than it appears: performance of a fund or trader in one time period generally has very low correlation to its performance in the next, e.g. see this paper: https://www.researchgate.net/profile/David-Smith-256/publication/317605916_Evaluating_Hedge_Fund_Performance/links/5942df6faca2722db499cbce/Evaluating-Hedge-Fund-Performance.pdf

There's a fair amount of debate over how much data you need to evaluate whether a person is a consistently good trader, in my moderately-informed opinion a trader who does well over 2 years is significantly more likely to be lucky than skilled. 

Comment by SatvikBeri on What posts on finance would your find helpful or interesting? · 2020-08-24T16:49:22.293Z · LW · GW

An incomplete list of caveats to Sharpe off the top of my head:

  • We can never measure the true Sharpe of a strategy (how it would theoretically perform on average over all time), only the observed Sharpe ratio, which can be radically different, especially for strategies with significant tail risk. There are a wide variety of strategies that might have a very high observed Sharpe over a few years, but a much lower true Sharpe (toy simulation after this list)
  • Sharpe typically doesn't measure costs like infrastructure or salaries, just losses to the fund itself. So e.g. you could view working at a company and earning a salary as a financial strategy with a nearly infinite Sharpe, but that's not necessarily appealing. There are actually a fair number of hedge funds whose function is more similar to providing services in exchange for relatively guaranteed pay
  • High-Sharpe strategies are often constrained by capacity. For example, my friend once offered to pay me $51 on Venmo if I gave her $50 in cash, which is a very high return on investment given that the transaction took just a few minutes, but I doubt she would have been willing to do the same thing at a million times the scale. Similarly, there are occasionally investment strategies with very high Sharpes that can only handle a relatively small amount of money
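
A toy simulation of the first point – all numbers are made up, and the Sharpe here ignores the risk-free rate:

import numpy as np

rng = np.random.default_rng(0)

def simulate_daily_returns(days):
    # Hypothetical strategy: small steady gains plus a rare large loss.
    gains = rng.normal(0.0005, 0.001, size=days)
    crashes = rng.random(days) < 0.002  # roughly one crash every two years
    return np.where(crashes, -0.15, gains)

def sharpe(returns):
    return np.sqrt(252) * returns.mean() / returns.std()

print(sharpe(simulate_daily_returns(2 * 252)))    # a lucky 2-year window can look spectacular
print(sharpe(simulate_daily_returns(200 * 252)))  # the long-run figure is far less impressive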

Comment by SatvikBeri on What Surprised Me About Entrepreneurship · 2020-04-09T23:21:28.266Z · LW · GW

This is very, very cool. Having come from the functional programming world, I frequently miss these features when doing machine learning in Python, and haven't been able to easily replicate them. I think there's a lot of easy optimization that could happen in day-to-day exploratory machine learning code that bog standard pandas/scikit-learn doesn't do.

Comment by SatvikBeri on Coronavirus: Justified Practical Advice Thread · 2020-02-29T03:42:09.832Z · LW · GW

If N95 masks work, R95-100 and P95-100 masks should also work, and potentially be more effective - the stuff they filter is a superset of what N95 filters. They're normally more expensive, but in the current state I've actually found P100s cheaper than N95s.

Comment by SatvikBeri on Learning Abstract Math from First Principles? · 2019-12-03T01:41:30.491Z · LW · GW

I don't really understand what you mean by "from first principles" here. Do you mean in a way that's intuitive to you? Or in a way that includes all the proofs?

Any field of Math is typically more general than any one intuition allows, so it's a little dangerous to think in terms of what it's "really" doing. I find the way most people learn best is by starting with a small number of concrete intuitions – e.g., groups of symmetries for group theory, or posets for category theory – and gradually expanding.

In the case of Complex Analysis, I find the intuition of the Riemann Sphere to be particularly useful, though I don't have a good book recommendation.

Comment by SatvikBeri on Is daily caffeine consumption beneficial to productivity? · 2019-11-26T17:17:29.391Z · LW · GW

One major confounder is that caffeine is also a painkiller, many people have mild chronic pain, and I think there's a very plausible mechanism by which painkillers improve productivity, i.e. just allowing someone to focus better.

Anecdotally, I've noticed that "resetting" caffeine tolerance is very quick compared to most drugs, taking something like 2-3 days without caffeine for several people I know, including myself.

The studies I could find on caffeine are highly contradictory, e.g. from Wikipedia, "Caffeine has been shown to have positive, negative, and no effects on long-term memory."

I'm under the impression that there's no general evidence for stimulants increasing productivity, although there are several specific cases, such as e.g. treating ADHD.

Comment by SatvikBeri on Gears-Level Models are Capital Investments · 2019-11-24T19:40:47.826Z · LW · GW

One key dimension is decomposition – I would say any gears model provides decomposition, but models can have it without gears.

For example, the error in any machine learning model can be broken down into bias + variance, which provides a useful model for debugging. But these don't feel like gears in any meaningful sense, whereas, say, bootstrapping + weak learners feel like gears in understanding Random Forests.
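
(For squared error, the usual decomposition, if I'm remembering it right, is

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \mathrm{Bias}\big[\hat{f}(x)\big]^2 + \mathrm{Var}\big[\hat{f}(x)\big] + \sigma^2,$$

with an irreducible noise term σ² alongside the bias and variance pieces.)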

Comment by SatvikBeri on Gears-Level Models are Capital Investments · 2019-11-24T19:35:14.662Z · LW · GW

I think it is true that gears-level models are systematically undervalued, and that part of the reason is because of the longer payoff curve.

A simple example is debugging code: a gears-level approach is to try and understand what the code is doing and why it doesn't do what you want, a black-box approach is to try changing things somewhat randomly. Most programmers I know will agree that the gears-level approach is almost always better, but that they at least sometimes end up doing the black-box approach when tired/frustrated/stuck.

And companies that focus too much on short-term results (most of them, IMO) will push programmers to spend far more time on black-box debugging than is optimal.

Perhaps part of the reason why the choice appears to typically be obvious is that gears methods are underestimated.

Comment by SatvikBeri on Gears-Level Models are Capital Investments · 2019-11-23T20:57:28.941Z · LW · GW

Black-box approaches often fail to generalize within the domain, but generalize well across domains. Neural Nets may teach you less about medicine than a PGM, but they'll also get you good results in image recognition, transcription, etc.

This can lead to interesting principal-agent problems: an employee benefits more from learning something generalizable across businesses and industries, while employers will generally prefer the best domain-specific solution.

Comment by SatvikBeri on Inefficient Doesn’t Mean Indifferent · 2019-10-18T14:55:31.766Z · LW · GW

Nit: giving IQ tests is not super cheap, because it puts companies at a nebulous risk of being sued for disparate impact (see e.g. https://en.wikipedia.org/wiki/Griggs_v._Duke_Power_Co.).

I agree with all the major conclusions though.

Comment by SatvikBeri on Insights from Linear Algebra Done Right · 2019-07-14T22:47:31.261Z · LW · GW

For the orthogonal decomposition, don't you need two scalars? E.g. . For example, in , let Then , and there's no way to write as

Comment by SatvikBeri on What are good resources for learning functional programming? · 2019-07-05T06:06:01.526Z · LW · GW

My favorite book, by far, is Functional Programming in Scala. This book has you derive most of the concepts from scratch, to the point where even complex abstractions feel like obvious consequences of things you've already built.

If you want something more Haskell-focused, a good choice is Programming in Haskell.

Comment by SatvikBeri on Raemon's Shortform · 2019-07-03T22:16:35.871Z · LW · GW

I didn't downvote, but I agree that this is a suboptimal meme – though the prevailing mindset of "almost nobody can learn Calculus" is much worse.

As a datapoint, it took me about two weeks of obsessive, 15 hour/day study to learn Calculus to a point where I tested out of the first two courses when I was 16. And I think it's fair to say I was unusually talented and unusually motivated. I would not expect the vast majority of people to be able to grok Calculus within a week, though obviously people on this site are not a representative sample.

Comment by SatvikBeri on What are principled ways for penalising complexity in practice? · 2019-07-02T16:58:29.042Z · LW · GW

A good exposition of the related theorems is in Chapter 6 of Understanding Machine Learning (https://www.amazon.com/Understanding-Machine-Learning-Theory-Algorithms/dp/1107057132/ref=sr_1_1?crid=2MXVW7VOQH6FT&keywords=understanding+machine+learning+from+theory+to+algorithms&qid=1562085244&s=gateway&sprefix=understanding+machine+%2Caps%2C196&sr=8-1)

There are several related theorems. Roughly:

1. The error on real data will be similar to the error on the training set + epsilon, where epsilon is roughly proportional to (VC dimension / datapoints). This is the one I linked above. (One common form of the bound is written out at the end of this comment.)

2. The error on real data will be similar to the error of the best hypothesis in the hypothesis class, with similar proportionality

3. Special case of 2 – if the true hypothesis is in the hypothesis class, then the absolute error will be < epsilon (since the absolute error is just the difference from the true, best hypothesis.)

3 is probably the one you're thinking of, but you don't need the hypothesis to be in the class.
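
For reference, one common form of bound 1, as I remember it (worth double-checking against the book): with probability at least 1 − δ over a training sample of n points,

$$\mathrm{err}_{\mathrm{true}}(h) \le \mathrm{err}_{\mathrm{train}}(h) + \sqrt{\frac{d\left(\ln\frac{2n}{d} + 1\right) + \ln\frac{4}{\delta}}{n}}$$

where d is the VC dimension. The constants vary between textbooks, but the d/n dependence is the important part.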

Comment by SatvikBeri on What are principled ways for penalising complexity in practice? · 2019-06-29T20:09:14.047Z · LW · GW

Yes, roughly speaking, if you multiply the VC dimension by n, then you need n times as much training data to achieve the same performance. (More precise statement here: https://en.wikipedia.org/wiki/Vapnik%E2%80%93Chervonenkis_dimension#Uses) There are also a few other bounds you can get based on VC dimension. In practice these bounds are way too large to be useful, but an algorithm with much higher VC dimension will generally overfit more.

Comment by SatvikBeri on What are principled ways for penalising complexity in practice? · 2019-06-28T19:52:54.023Z · LW · GW

A different view is to look at the search process for the models, rather than the model itself. If model A is found from a process that evaluates 10 models, and model B is found from a process that evaluates 10,000, and they otherwise have similar results, then A is much more likely to generalize to new data points than B.

The formalization of this concept is called VC dimension and is a big part of Machine Learning Theory (although arguably it hasn't been very helpful in practice): https://en.wikipedia.org/wiki/Vapnik%E2%80%93Chervonenkis_dimension
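
A toy illustration of why the size of the search matters – this is just the selection effect, not a VC-dimension calculation:

import numpy as np

rng = np.random.default_rng(0)

# The labels are pure noise, so every candidate "model" has true accuracy 0.5.
y = rng.integers(0, 2, size=200)

def best_apparent_accuracy(n_models):
    # Each candidate is just a random labelling; keep the one that happens
    # to score best on the 200 observed points.
    return max((rng.integers(0, 2, size=200) == y).mean() for _ in range(n_models))

print(best_apparent_accuracy(10))      # modestly above 0.5
print(best_apparent_accuracy(10_000))  # well above 0.5, purely from searching harder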

Comment by SatvikBeri on Crypto quant trading: Naive Bayes · 2019-05-09T17:18:02.987Z · LW · GW

It's a combination. The point is to throw out algorithms/parameters that do well on backtests when the assumptions are violated, because those are much more likely to be overfit.