It's true that risk alone isn't a good way to decide budgets. You're even more right that politicians learn, out of necessity, to ignore convincing-sounding demands to spend money.
But while risk alone isn't a good way to decide budgets, you have to admit that lots of budget items have the purpose of addressing risk. For example, flood barriers address hurricane/typhoon risk. Structural upgrades address earthquake risk. Some preparations also address pandemic risk.
If you accept that some budget items are meant to address risk, shouldn't you also accept that the amount of spending should be somewhat proportional to the amount of risk? In that case, if the risk of NATO getting invaded is comparable to the risk of rogue AGI, then the military spending to protect against invasion should be comparable to the spending to protect against rogue AGI.
I admit that politicians might not be rational enough to understand this, and there is a substantial probability this statement will fail. But it is still worth trying. The cost is a mere signature and the benefit may be avoiding a massive miscalculation.
Making this statement doesn't prevent others from making an even better statement. Many AI experts have signed multiple statements, e.g. the "Statement on AI Risk" and "Pause Giant AI Experiments." Some politicians and members of the public are more convinced by one argument, while others are more convinced by another, so it helps to have different kinds of arguments backed by many signatories. Encouraging AI safety spending doesn't conflict with encouraging AI regulation. I think the competition between different arguments isn't actually that bad.
This is an important point. AI alignment/safety organizations take money as input and write very abstract papers as their output, which usually have no immediate applications. I agree it may appear very unproductive.
However, if we think from first principles, a lot of other things are like that. For instance, when you go to school, you study the works of Shakespeare, you learn to play the guitar, and you learn how Spanish pronouns work. These things appear to be a complete waste of time. If 50 million students in the US spend 1 hour a day on these kinds of activities, and each hour is valued at only $10, that's $180 billion/year.
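As a sanity check on that figure, here is the back-of-the-envelope arithmetic (my own assumption: roughly an hour every day of the year):

```latex
% 50 million students x 1 hour/day x $10/hour x 365 days/year
50{,}000{,}000 \times 1\,\tfrac{\text{hour}}{\text{day}} \times \$10/\text{hour} \times 365\,\tfrac{\text{days}}{\text{year}} \approx \$182\ \text{billion/year}
```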
But we know these things are not a waste of time, because in hindsight, when you study how students grow up, this work somehow helps them later in life.
Lots of things appear useless, but are valuable in hindsight for reasons beyond the intuitive set of reasons we evolved to understand.
Studying the nucleus of the atom might appear to be a useless curiosity if you didn't know it would lead to nuclear energy. There are no real-world applications for a long time, and then suddenly there are enormous applications.
Pasteur's studies on fermentation might appear limited to modest winemaking improvements, but they led to the discovery of germ theory, which has saved countless lives.
The stone age people studying weird rocks may have discovered obsidian and copper. Those who studied the strange seeds that plants produce may have discovered agriculture.
We don't know how valuable this alignment work is. We should cope with this uncertainty probabilistically: if there is a 50% chance it will help us, the benefits per cost is halved, but that doesn't reduce ideal spending to zero.
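In expected-value terms, a simple sketch (p is my placeholder for the probability that the work helps):

```latex
\frac{\mathbb{E}[\text{benefit}]}{\text{cost}} = \frac{p \times (\text{benefit if it helps})}{\text{cost}},
\qquad p = 0.5 \;\Rightarrow\; \text{the ratio is halved, not zeroed}
```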
Once we get superintelligence, we might get every other technology that the laws of physics allow, even if we aren't that "close" to these other technologies.
Maybe they believe in a chance of superintelligence by 2039.
PS: Your comment may have caused it to drop to 38%. :)
It seems like the post is implicitly referring to the next big paper on SAEs from one of these labs, similar in newsworthiness to the last Anthropic paper. A big paper won't be a negative result or a much smaller downstream application, and a big paper would compare its method against baselines if possible, so 165% is still within the ballpark.
I still agree with your comment, especially the recommendation for a time-based prediction (I explained in my other comment here).
Thank you for your alignment work :)
I like your post; I like how you gave an overview of the big picture of mechanistic interpretability's present and future. That is important.
I agree that it is looking more promising over time with the Golden Gate Claude etc. I also agree that there is some potential for negatives. I can imagine an advanced AI editing itself using these tools, causing its goals to change, causing it to edit itself even more, in a feedback loop that leads to misalignment (this feels unlikely, and a superintelligence would be able to edit itself anyways).
I agree the benefits outweigh the negatives: yes, mechanistic interpretability tools could make AI more capable, but AI will eventually become more capable anyways. What matters is whether the first superintelligence is aligned, and in my opinion it's much harder to align a superintelligence if you don't know what's going on inside.
One small detail is defining your predictions better, as Dr. Shah said. It doesn't hurt to convert your prediction to a time-based prediction. Just add a small edit to this post. You can still post an update after the next big paper even if your prediction is time-based.
A prediction based on the next big paper not only depends on unimportant details, like how many papers the labs spread their results over, but also fails to depend on important details, like when the next big paper comes. Suppose I predicted that the next big advancement beyond OpenAI's o1 will be able to get 90% on GPQA Diamond, but didn't say when it'll happen. I'm not predicting very much in that case, and I can't judge how accurate my prediction was afterwards.
Your last prediction was about the Anthropic report/paper that was already about to be released, so by default you predicted the next paper again. This is very understandable.
Thank you for your alignment work :)
That's fair. To be honest, I've only used AI for writing code; I merely heard about other people having success with AI drafts. Maybe their situation was different, or they were bad enough at English that the AI writes better than they do.
I'm not sure if this is allowed here, but maybe you can ask an AI to write a draft and manually proofread for mistakes?
I think if the first powerful unaligned AI remained in control instead of escaping, it might make a real difference, because we could engineer and test alignment ideas on it, rather than develop alignment ideas for an unknown future AI. This assumes at least some instances of it do not hide their misalignment very well.
I think this little mistake doesn't affect the gist of your summary post, I wouldn't worry about it too much.
The mistake
The mistaken argument was an attempt to explain why α >= 4/d holds for the L2 loss. Believe it or not, the paper never actually argued that α >= 4/d follows from using the L2 loss; it argued that α = 4/d holds when the model is an accurate piecewise linear approximation (and only α >= 2/d in general). That's because the paper used the L2 loss from the start, and with the L2 loss a piecewise constant model still only gives α = 2/d; the extra factor of 2 comes from the piecewise linearity.
I feel that was the main misunderstanding.
Figure 2b in the paper shows α ≈ 4/d for most models but α > 4/d for some models.
The paper never mentions L2 loss, just that the loss function is "analytic in f and minimized at f = F." Such a loss function converges to L2 when the loss is small enough. This important sentence is hard to read because it's cut in half by a bunch of graphs, and looks like another unimportant mathematical assumption.
Some people like L2 loss, or a loss function that converges to L2 when the loss is small, because most loss functions (even L1) behave like L2 anyway once you subtract a big source of loss. E.g. variance-limited scaling has α = 1 in both L1 and L2, because you subtract L(D=∞) or L(N=∞) from L. Even resolution-limited scaling may require subtracting the loss due to noise. L2 is nicer because its gradient is zero when the error is zero, but the gradient of L1 is undefined there, since the absolute value is undifferentiable at 0.
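To spell out why an analytic loss that is minimized at the truth behaves like L2 near its minimum, here is my own sketch of the standard Taylor argument (my notation: l for the loss, f for the model's output, F for the true value):

```latex
% Taylor-expand an analytic loss l(f) around its minimum at f = F:
% the first derivative vanishes at the minimum, so the leading correction is
% quadratic, i.e. effectively an L2 loss once the error f - F is small.
l(f) = l(F) + \underbrace{l'(F)}_{=0}(f - F) + \tfrac{1}{2}\, l''(F)\,(f - F)^2 + O\big((f - F)^3\big)
     \approx l(F) + \tfrac{1}{2}\, l''(F)\,(f - F)^2
```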
If you read closely, α = 4/d only refers to piecewise linear approximations:
L(D) ∝ D^(-4/d), at large D. Note that if the model provides an accurate piecewise linear approximation, we will generically find α = 4/d.
The second paper says the same:
If the model is piecewise linear instead of piecewise constant and F is smooth with bounded derivatives, then the deviation |f − F| will scale as D^(-2/d), and so the loss will scale as D^(-4/d). We would predict α = 4/d.
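To put the corrected chain of reasoning in one place, here is my own restatement of the argument above (f is the model, F the true function, D the number of training points, d the dimension of the data manifold):

```latex
% Typical distance from a test point to its nearest training point:
\text{distance} \sim D^{-1/d}
% Deviation of the model from the truth at the test point:
|f - F| \sim D^{-1/d} \ \text{(piecewise constant)}, \qquad |f - F| \sim D^{-2/d} \ \text{(piecewise linear)}
% The (effectively L2) loss is the square of the deviation:
L \sim D^{-2/d} \Rightarrow \alpha = 2/d, \qquad L \sim D^{-4/d} \Rightarrow \alpha = 4/d
```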
Nearest neighbors regression/classification has L1 loss proportional to D^(-1/d). Skip this section if you already agree:
- If you use nearest neighbors regression to fit a smooth one-dimensional function, it creates a piecewise constant function that looks like a staircase trying to approximate a curve. The L1 loss is proportional to D^(-1) because adding data makes the staircase steps smaller. A better attempt to connect the dots of the training data (e.g. a piecewise linear function) would have L1 loss proportional to D^(-2) or better, because adding data makes the "staircase" both smaller and smoother: both the distances and the gradient of the error decrease. (See the sketch below this list.)
- My guess is that nearest neighbors classification also has L1 loss proportional to D^(-1/d), because the volume that is misclassified is roughly proportional to the distance between the decision boundary and the nearest point. Again, I think it creates a decision boundary that is rough (analogous to the staircase), and the roughness gets smaller with additional data but never gets any smoother.
- Note the rough decision boundaries in a picture of nearest neighbors classification:
https://upload.wikimedia.org/wikipedia/commons/5/52/Map1NN.png
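Here is a minimal runnable sketch of the regression case (my own illustration, not from either paper; sin(x) and all the constants are arbitrary stand-ins). It compares how the L1 error of 1-nearest-neighbor regression (piecewise constant) and of simple linear interpolation (piecewise linear) shrinks as training data is added on a smooth 1-D curve:

```python
# Sketch: measure how L1 error scales with dataset size D for a piecewise
# constant fit (1-nearest-neighbor) vs a piecewise linear fit (interpolation).
# Expectation: roughly D^(-1) for the staircase and roughly D^(-2) for the
# piecewise linear fit, matching the argument above for d = 1.
import numpy as np

rng = np.random.default_rng(0)
true_fn = np.sin  # any smooth 1-D curve works

def l1_error(D, piecewise_linear=False):
    x_train = np.sort(rng.uniform(0.0, 2 * np.pi, D))
    y_train = true_fn(x_train)
    # Evaluate inside the training range to avoid extrapolation effects.
    x_test = np.linspace(x_train[0], x_train[-1], 10_000)
    if piecewise_linear:
        y_pred = np.interp(x_test, x_train, y_train)  # connect the dots
    else:
        nearest = np.abs(x_test[:, None] - x_train[None, :]).argmin(axis=1)
        y_pred = y_train[nearest]                     # 1-NN "staircase"
    return np.mean(np.abs(y_pred - true_fn(x_test)))

for D in (100, 200, 400, 800):
    print(D, l1_error(D), l1_error(D, piecewise_linear=True))
# Doubling D should roughly halve the first error column (alpha ~ 1) and
# roughly quarter the second (alpha ~ 2).
```

This is only a d = 1 toy, but it matches the staircase picture above.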
Suggestions
If you want a quick fix, you might change the following:
Under the assumption that our test loss is sufficiently “nice”, we can do a Taylor expansion of the test loss around this nearest training data point and take just the first non-zero term. Since we have perfectly fit the training data, at the training data point, the loss is zero; and since the loss is minimized, the gradient is also zero. Thus, the first non-zero term is the second-order term, which is proportional to the square of the distance. So, we expect that our scaling law will look like kD^(-2/d), that is, α = 2/d. (EDIT: I no longer endorse the above argument, see the comments.)
The above case assumes that our model learns a piecewise constant function. However, neural nets with Relu activations learn piecewise linear functions. For this case, we can argue that since the neural network is interpolating linearly between the training points, any deviation of the distance between the true value and the actual value should scale as D^(-2/d) instead of D^(-1/d), since the linear term is being approximated by the neural network. In this case, for loss functions like the L2 loss, which are quadratic in the distance, we get that α = 4/d.
Note that it is possible that scaling could be even faster, e.g. because the underlying manifold is simple or has some nice structure that the neural network can quickly capture. So in general, we might expect α >= 2/d and for L2 loss α >= 4/d.
becomes
Under the assumption that the difference between the true value and the model's value is sufficiently “nice”, we can do a Taylor expansion of it around this nearest training data point and take just the first non-zero term. Since we have perfectly fit the training data, the difference is zero at the training data point. With a constant term of zero, we use the linear term (gradient times displacement), which is proportional to the distance. So, we expect that our scaling law will look like kD^(-1/d). For loss functions like the L2 loss, which scale as the difference squared, it becomes kD^(-2/d), so α = 2/d.
The above case assumes that our model learns a piecewise constant function. However, neural nets with Relu activations learn piecewise linear functions. For this case, we can argue that since the neural network is interpolating linearly between the training points, any deviation of the difference between the true value and the model's value should scale as D^(-2/d) instead of D^(-1/d), since the linear term is being approximated by the neural network (the linear term difference also decreases when the distance decreases). For L2 loss, we get that α = 4/d.
Note that it is possible that scaling could be even faster, e.g. because the underlying manifold is simple or has some nice structure that the neural network can quickly capture. So in general, we might expect α >= 2/d.
Also change
Once again, we make the assumption that the learned model gives a piecewise linear approximation, which by the same argument suggests a scaling law of X^(-α), with α >= 2/d (and α >= 4/d for the case of L2 loss)
and checking whether α >= 4/d. In most cases, they find that it is quite close to equality. In the case of language modeling with GPT, they find that α is significantly larger than 4/d, which is still in accordance with the equality (though it is still relatively small -- language models just have a high intrinsic dimension)
to
Once again, we make the assumption that the learned model is better than a piecewise constant approximation, which by the same argument suggests a scaling law of X^(-α), with α >= 2/d (and α >= 4/d for a piecewise linear approximation)
and checking whether α >= 2/d. In most cases, they find that it is quite close to 4/d. In the case of language modeling with GPT, they find that α is significantly larger than 4/d, which is still in accordance with α >= 2/d (though α is still relatively small -- language models just have a high intrinsic dimension)
Conclusion
Once AI scientists make a false conclusion like "α = 2/d even without the L2 loss" or "α >= 4/d for the L2 loss", they may hallucinate arguments which justify the conclusion. A future research direction is to investigate whether large language models learned this behavior from the AI scientists who trained them.
I'm so sorry I'm so immature but I can't help it.
Overall, it's not a big mistake because it doesn't invalidate the gist of the summary. It's very subtle, unlike those glaring mistakes that I've seen in works by other people... and myself. :)
These organizations just need a few volunteers for research or demonstrations. Once a lot of people sign up, cryonics will no longer be free. It will cost tens of thousands of dollars, as it normally does.
Even if they are nonprofit, they may behave as businesses because:
- They need your money.
- They compete with each other.
- Winning customers by any means might increase their competitiveness.