Comment by Lun on [deleted post]
2025-01-24T10:13:33.594Z
Gradient descent generates updates by suggesting algorithmic improvements for single training examples, thereby exerting much less pressure for generality than evolution does.
A recent technique, Gradient Agreement Filtering, filters out gradients that disagree across samples, which, if I'm understanding correctly, intentionally breaks this crux and pushes for more generalization / less memorization of specific samples (rough sketch below).
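Something like this, I think (toy numpy sketch, my own illustration rather than the paper's code; the cosine-distance threshold, the fold-in order, and the function names are assumptions):

```python
import numpy as np

def cosine_distance(a, b, eps=1e-12):
    # 1 - cosine similarity between two flattened gradient vectors
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)

def gradient_agreement_filter(micro_grads, max_cos_distance=1.0):
    # Fold per-sample / per-microbatch gradients into a running average,
    # but only keep the ones that agree (cosine distance under threshold)
    # with the aggregate so far; disagreeing gradients are dropped.
    agg = micro_grads[0]
    kept = 1
    for g in micro_grads[1:]:
        if cosine_distance(agg, g) <= max_cos_distance:
            agg = (agg * kept + g) / (kept + 1)  # running average of accepted grads
            kept += 1
        # else: this gradient disagrees with the aggregate and is filtered out
    return agg
```

So instead of averaging everything, the update only gets contributions that point in a broadly shared direction.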
Vague intuition that with typical LM pretraining, which uses batches of far more than 1 sample plus optimizers with momentum, this might already not hold: non-generalizing / noisy / disagreeing updates wash out over the training run, while the generalizing, agreeing updates stick around (toy sketch below).
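Toy illustration of that intuition (all numbers made up, not a claim about real pretraining): per-sample gradients are modeled as a shared "agreeing" direction plus independent noise, averaged over a batch and fed through a momentum buffer. The shared direction accumulates while the per-sample noise mostly cancels.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, batch, steps, beta = 1_000, 256, 200, 0.9

signal = rng.normal(size=dim)      # "agreeing" direction shared by all samples
signal /= np.linalg.norm(signal)

momentum = np.zeros(dim)
for _ in range(steps):
    # per-sample gradients = shared signal + independent per-sample noise
    noise = rng.normal(size=(batch, dim))
    batch_grad = signal + noise.mean(axis=0)  # batch averaging shrinks noise ~ 1/sqrt(batch)
    momentum = beta * momentum + (1 - beta) * batch_grad

# how much of the accumulated update points along the shared direction?
aligned = np.dot(momentum, signal)
residual = np.linalg.norm(momentum - aligned * signal)
print(f"aligned component: {aligned:.3f}, residual noise: {residual:.3f}")
```

The aligned component comes out close to 1 while the residual is small, which is the sense in which batching + momentum may already be doing a softer version of the same filtering.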