LessWrong 2.0 Reader
BTW as a concrete note, you may want to sub in 15 - ceil(log10(n)) instead of just "15", which really only matters if you're dealing with numbers above 10 (e.g. 1000 is represented as 0x408F400000000000, while the next float 0x408F400000000001 is 1000.000000000000114, which differs in the 13th decimal place).
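For concreteness, here's a minimal sketch of that magnitude-aware cutoff (the function name and the max-with-1 guard are my own additions, not from the comment):

```python
import math

def approx_equal(a, b):
    """Compare floats by rounding away digits beyond double precision.

    Sketch of the suggestion above: scale the cutoff by the magnitude of the
    larger operand, since a double carries ~15-16 significant digits in total,
    not 15 digits after the decimal point.
    """
    magnitude = max(abs(a), abs(b), 1.0)               # guard so log10 stays >= 0
    decimals = 15 - math.ceil(math.log10(magnitude))
    return round(a, decimals) == round(b, decimals)

print(approx_equal(0.1 + 0.2, 0.3))                    # True
print(approx_equal(1000.0, 1000.000000000000114))      # True: the difference lies beyond double precision
```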
We propose a simple fix: Use an L_p norm with 0 < p < 1 instead of L_1, which seems to be a Pareto improvement over L_1 (at least in some real models, though results might be mixed) in terms of the number of features required to achieve a given reconstruction error.
When I was discussing better sparsity penalties with Lawrence, and the fact that I observed some instability in L_p with 0 < p < 1 in toy models of superposition, he pointed out that the gradient of the L_p norm explodes near zero for p < 1, meaning that features with "small errors" that cause them to have very small but non-zero overlap with some activations might be killed off entirely rather than merely having the overlap penalized.
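A quick illustration of that gradient blow-up, with my own toy numbers:

```python
import numpy as np

# d/da |a| = 1 for a > 0, but d/da a^p = p * a^(p-1) for 0 < p < 1,
# which explodes as the activation a approaches zero -- so features with
# small-but-nonzero activations receive enormous gradients and get pushed to zero.
p = 0.5
a = np.array([1.0, 0.1, 0.01, 0.001])
l1_grad = np.ones_like(a)          # L1 gradient is constant
lp_grad = p * a ** (p - 1)         # L_p gradient grows without bound as a -> 0
print(lp_grad)                     # approx [0.5, 1.58, 5.0, 15.8]
```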
See here for some brief write-up and animations.
kingsupernova on Duct Tape security
Hmm, interesting. The exact choice of decimal place at which to cut off the comparison is certainly arbitrary, and that doesn't feel very elegant. My thinking is that within the constraint of using floating point numbers, there fundamentally isn't a perfect solution. Floating point notation changes some numbers into other numbers, so there are always going to be some cases where number comparisons are wrong. What we want to do is define a problem domain and check if floating point will cause problems within that domain; if it doesn't, go for it, if it does, maybe don't use floating point.
In this case my fix solves the problem for what I think is the vast majority of the most likely inputs (in particular it solves it for all the inputs that my particular program was going to get), and while it's less fundamental than e.g. using arbitrary-precision arithmetic, it does better on the cost-benefit analysis. (Just like how "completely overhaul our company" addresses things on a more fundamental level than just fixing the structural simulation, but may not be the best fix given resource constraints.)
The main purpose of my example was not to argue that my particular approach was the "correct" one, but rather to point out the flaws in the "multiply by an arbitrary constant" approach. I'll edit that line, since I think you're right that it's a little more complicated than I was making it out to be, and "trivial" could be an unfair characterization.
Maybe worth a slight update on how the AI alignment community would respond? Doesn't seem like any of the comments on this post are particularly aggressive. I've noticed an effect where I worry people will call me dumb when I express imperfect or gestural thoughts, but it usually doesn't happen. And if anyone's secretly thinking it, well, that's their business!
andrew-burns on Andrew Burns's Shortform
So the usual refrain from Zvi and others is that the specter of China beating us to the punch with AGI is not real because of limits on compute, etc. I think Zvi has tempered his position on this in light of Meta's promise to release the weights of its 400B+ model. Now there is word that SenseTime just released a model that beats GPT-4 Turbo on various metrics. Of course, maybe Meta chooses not to release its big model, and maybe SenseTime is bluffing--I would point out, though, that Alibaba's Qwen model seems to do pretty okay in the arena...anyway, my point is that I don't think the "what if China" argument can be dismissed as quickly as some people on here seem to be ready to do.
lawrencec on Sparse autoencoders find composed features in small toy models
It's actually worse than what you say -- the first two datasets studied here have a privileged basis 45 degrees off from the standard one, which is why the SAEs seem to continue learning the same 45-degrees-off features. Unpacking this sentence a bit: it turns out that both datasets have principal components 45 degrees off from the basis the authors present as natural, and since SAEs are in a sense trying to capture the principal directions of variation in the activation space, they will also naturally use features 45 degrees off from the "natural" basis.
Consider the first example -- by construction, since x_1 and x_2 are perfectly anticorrelated, as are y_1 and y_2, the data is 2-dimensional and can be represented as x = x_1 - x_2 and y = y_1 - y_2. Indeed, this is exactly what their diagram is assuming. But here, x and y have the same absolute magnitude by construction, and so the dataset lies entirely on the diagonals of the unit square, and the principal components are obviously the diagonals.
Now, why does the SAE want to learn the principal components? This is because it allows the SAE to have smaller activations on average for a given weight norm.
Consider the representation that is axis-aligned, in that the SAE neurons are x_1, x_2, y_1, y_2 -- since there's weight decay, the encoding and decoding weights want to be of the same magnitude. Let's suppose that the encoding and decoding weights are of size s. Now, if the features are axis-aligned, the total size of the activations will be 2A/s^2. But if you instead use the neurons aligned with x_1 + y_1, x_1 + y_2, x_2 + y_1, x_2 + y_2, the activations only need to be of size sqrt(2) A/s^2. This means that a non-axis-aligned representation will have lower loss. Indeed, something like this story is why we expect the L1 penalty to recover "true features" in the first place.
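Here's a small numerical sketch of that argument (my own construction, working in the collapsed 2D (x, y) basis with unit-norm decoder directions, so only the ratio of activation sizes matters):

```python
import numpy as np

# A data point where one x-feature and one y-feature are both active with
# magnitude A, reconstructed either with axis-aligned decoder directions or
# with directions along the 45-degree diagonals. The L1 penalty charges for
# the total activation needed, which is smaller in the diagonal basis.
A = 1.0
point = np.array([A, A])

axis_aligned = np.eye(2)                                       # rows: x-direction, y-direction
diagonal = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)    # rows: the two diagonals

# Both bases are orthonormal, so activations = basis @ point reconstructs the point exactly.
acts_axis = axis_aligned @ point                               # [A, A]         -> L1 cost 2A
acts_diag = diagonal @ point                                   # [sqrt(2)A, 0]  -> L1 cost sqrt(2)A
print(np.abs(acts_axis).sum(), np.abs(acts_diag).sum())        # 2.0 vs ~1.414
```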
The story for the second dataset is pretty similar to the first -- when the data is uniformly distributed over a unit square, the principal directions are the diagonals of the square, not the standard basis.
mitchell_porter on dirk's Shortform
Also astronomers: anything heavier than helium is a "metal".
faul_sname on Duct Tape security
That makes sense. I think I may have misjudged your post, as I expected that you would classify that kind of approach as a "duct tape" approach.
kingsupernova on Duct Tape security
In the general case I agree it's not necessarily trivial; e.g. if your program uses the whole range of decimal places to a meaningful degree, or performs calculations that can compound floating point errors up to higher decimal places. (Though I'd argue that in both of those cases pure floating point is probably not the best system to use.) In my case I knew that the intended precision of the input would never be precise enough to overlap with floating point errors, so I could just round anything past the 15th decimal place down to 0.
quiet_nan on Duct Tape security
The sum of two numbers should have a precision no higher than the operand with the highest precision. For example, adding 0.1 + 0.2 should yield 0.3, not 0.30000000000000004.
I would argue that the precision should instead be capped at the lowest precision of the operands. In physics, if you add two lengths, 0.123 m + 0.123456 m should be rounded to 0.246 m.
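A tiny sketch of that "lowest precision wins" rule for addition (standard significant-figures practice; the helper is my own, not a library feature):

```python
from decimal import Decimal

def add_with_precision(a: str, b: str) -> Decimal:
    """Add two decimal strings and round to the decimal places of the least precise operand."""
    da, db = Decimal(a), Decimal(b)
    places = min(-da.as_tuple().exponent, -db.as_tuple().exponent)
    return round(da + db, places)

print(add_with_precision("0.123", "0.123456"))   # 0.246
print(add_with_precision("0.1", "0.2"))          # 0.3
```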
Also, IEEE 754 fundamentally does not contain information about the precision of a number. If you want to track that information correctly, you can use two floating point numbers and do interval arithmetic. There is even an IEEE standard for that nowadays.
Of course, this comes at a cost. While monotonic functions can be converted to interval arithmetic easily, finding the extremal values of a function over some high-dimensional domain is a hard problem in general. Still, if you know how the function is composed out of simpler operations, you can at least find some bounds.
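For illustration, a minimal interval-arithmetic sketch using a pair of floats (a real implementation, e.g. one following the IEEE standard mentioned above, would control rounding modes directly; here I just widen each result by one ulp with math.nextafter):

```python
import math
from dataclasses import dataclass

@dataclass
class Interval:
    lo: float
    hi: float

    def __add__(self, other):
        # Widen by one ulp in each direction, which covers the <= 0.5 ulp
        # rounding error of a correctly rounded addition.
        return Interval(math.nextafter(self.lo + other.lo, -math.inf),
                        math.nextafter(self.hi + other.hi, math.inf))

    def __mul__(self, other):
        # The extremes of a product of intervals occur at the corners.
        corners = [self.lo * other.lo, self.lo * other.hi,
                   self.hi * other.lo, self.hi * other.hi]
        return Interval(math.nextafter(min(corners), -math.inf),
                        math.nextafter(max(corners), math.inf))

x = Interval(0.1, 0.1) + Interval(0.2, 0.2)
print(x)                      # a tight interval guaranteed to contain the exact sum of the stored operands
print(x.lo <= 0.3 <= x.hi)    # True: the decimal value 0.3 also lands inside
```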
Or you could do what physicists do (at least when they are taking lab courses) and track physical quantities with a value and a precision, and do uncertainty propagation. (This might not be 100% kosher in cases where you first calculate multiple intermediate quantities from the same measurement (whose errors will thus not be independent) and then continue to treat them as if they were independent. But that might just give you bigger errors.) Also, this relies on your function being sufficiently well-described in the region of interest by the partial derivatives at the central point. If you calculate the uncertainty of f(x,y) = xy for x = 0.1±1, y = 0.1±1 using the partial derivatives, you will not have fun.
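And a sketch of that first-order error propagation, including the f(x,y) = xy failure case from the last sentence (the formula is the standard lab-course one; the helper name is mine):

```python
import math

def propagate_product(x, sx, y, sy):
    """First-order uncertainty of f = x*y with independent errors sx, sy.

    sigma_f^2 = (df/dx * sx)^2 + (df/dy * sy)^2, with df/dx = y and df/dy = x.
    """
    return x * y, math.hypot(y * sx, x * sy)

print(propagate_product(2.0, 0.1, 3.0, 0.1))   # (6.0, ~0.36): fine when errors are small
print(propagate_product(0.1, 1.0, 0.1, 1.0))   # (0.01, ~0.14): far too optimistic --
# over x, y anywhere in 0.1 +/- 1, the product actually ranges over roughly [-1, 1.2]
```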