LessWrong 2.0 Reader
Why wouldn't this construction work over a continuous space?
faul_sname on Planning in a Lattice Graph
Do you want me to spoil it for you, do you want me to drop a hint, or do you want to puzzle it out yourself? It's a beautiful little puzzle and very satisfying to solve.
avturchin on Magic by forgetting
Presumably in deep meditation people become disconnected from reality.
avturchin on Magic by forgetting
Yes, it is easy to forget something if it hasn't become a part of your personality. So a new bad thing is easier to forget.
jessica-liu-taylor on Bayesian inference without priors
I don't see how this helps. You can have a 1:1 prior over the question you're interested in (like U1); however, to compute the likelihood ratios, it seems you would need a joint prior over everything of interest (including LL and E). There are specific cases where you can get a likelihood ratio without a joint prior (such as the likelihood of seeing some coin flips conditional on coin biases), but this doesn't seem like a case where that is feasible.
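[A minimal sketch of the coin-flip case mentioned in the comment, where a likelihood ratio falls straight out of the model with no joint prior needed. The particular data (7 heads in 10 flips) and hypotheses (fair coin vs. 0.7 bias) are illustrative choices, not taken from the discussion.]

```python
from math import comb

def likelihood(k_heads, n_flips, bias):
    """P(k heads in n flips | coin bias): a plain binomial likelihood."""
    return comb(n_flips, k_heads) * bias**k_heads * (1 - bias)**(n_flips - k_heads)

# Illustrative data and hypotheses: 7 heads in 10 flips,
# comparing a 0.7-biased coin against a fair one.
k, n = 7, 10
lr = likelihood(k, n, 0.7) / likelihood(k, n, 0.5)
print(f"likelihood ratio (bias 0.7 vs fair): {lr:.2f}")
```

The ratio is well defined here because the likelihood of the data given each hypothesis is fixed by the model itself; the harder situation the comment points at is when no such conditional model exists without first committing to a joint prior.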
avturchin on Magic by forgetting
The number of poor people is much larger than the number of billionaires. So in most cases you will fail to wake up as a billionaire. But sometimes it will work, and it is similar to the law of attraction. But the formulation via forgetting is more beautiful: you forget that you are poor.
bhauth on social lemon markets
You're mistaken about lemon markets: the initial fraction of lemons does matter. The number of lemon cars is fixed, and it imposes a sort of tax on transactions, but if that tax is low enough, it's still worth selling good cars. There's a threshold effect, a point at which most of the good items are suddenly driven out.
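[A rough numeric sketch of the threshold claim, with made-up values: buyers offer the expected value of a random car on the market, and owners of good cars sell only if that offer beats their reservation price. Below some lemon fraction the good cars stay; above it they withdraw.]

```python
# Toy Akerlof-style lemon market. All numbers are illustrative assumptions:
# good cars are worth 100 to their sellers and 120 to buyers; lemons are worth 20 to buyers.
GOOD_SELLER_VALUE = 100
GOOD_BUYER_VALUE = 120
LEMON_BUYER_VALUE = 20

def good_cars_trade(lemon_fraction):
    """Buyers offer the expected value of a random car; good cars are sold only
    if that offer is at least the good-car owner's reservation price."""
    offer = (1 - lemon_fraction) * GOOD_BUYER_VALUE + lemon_fraction * LEMON_BUYER_VALUE
    return offer >= GOOD_SELLER_VALUE

for f in [0.1, 0.2, 0.3]:
    print(f"lemon fraction {f:.1f}: good cars {'stay' if good_cars_trade(f) else 'exit'}")
```

With these numbers the "tax" of 100 * lemon_fraction is tolerable up to a fraction of 0.2; past that the offer price drops below what good-car owners will accept, which is the sudden driving-out effect described above.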
avturchin on Magic by forgetting
I can forget one particular thing, but preserve most of my self-identification information.
alexander-gietelink-oldenziel on Examples of Highly Counterfactual Discoveries?
Did I just say SLT is the Newtonian gravity of deep learning? Hubris of the highest order!
But also yes... I think I am saying that.
This doesn't get into the groundbreaking upcoming work by Simon-Pepin Lehalleur recovering the RLCT as the asymptotic dimension of jet schemes, which suggests a much more mathematically precise conception of basins and their breadth.
zac-hatfield-dodds on Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
It's a sparse autoencoder because part of the loss function is an L1 penalty encouraging sparsity in the hidden layer. Otherwise, it would indeed learn a simple identity map!
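[A minimal PyTorch sketch of the kind of loss being described: reconstruction error plus an L1 penalty on the hidden activations. Layer sizes, the penalty weight, and the class name are illustrative assumptions, not the actual configuration from the paper.]

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=512, d_hidden=4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        hidden = torch.relu(self.encoder(x))  # non-negative feature activations
        return self.decoder(hidden), hidden

def sae_loss(model, x, l1_coef=1e-3):
    """Reconstruction error plus an L1 penalty on the hidden layer.
    Without the L1 term, an overcomplete autoencoder could simply learn an identity map."""
    recon, hidden = model(x)
    mse = ((recon - x) ** 2).mean()
    sparsity = hidden.abs().sum(dim=-1).mean()
    return mse + l1_coef * sparsity
```

The L1 term pushes most hidden activations to exactly zero, which is what makes the learned dictionary sparse rather than a trivial copy of the input.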