Comments

Comment by Lun on [deleted post] 2025-01-24T10:13:33.594Z

Gradient descent generates updates by suggesting algorithmic improvements for single training examples, thereby exerting much less pressure for generality than evolution does.

A recent technique, Gradient Agreement Filtering, filters out gradients that disagree between samples, which, if I'm understanding correctly, intentionally breaks this crux and pushes for more generalization / less memorization of specific samples. A rough sketch of the idea is below.
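
This is only a minimal sketch of how I understand the technique, not the authors' implementation: compute a gradient on each of two micro-batches, and only apply their average when the two gradients agree (cosine similarity above some threshold). The toy model, the threshold value, and the choice to skip the whole step on disagreement are my assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy model and optimizer (stand-ins; not from the paper).
model = nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

def flat_grad(loss):
    """Gradient of `loss` w.r.t. the model parameters as one flat vector."""
    grads = torch.autograd.grad(loss, model.parameters())
    return torch.cat([g.reshape(-1) for g in grads])

def gaf_step(batch_a, batch_b, cos_threshold=0.0):
    """One step of (a sketch of) gradient agreement filtering:
    apply the averaged gradient only if the two micro-batch gradients agree."""
    xa, ya = batch_a
    xb, yb = batch_b
    ga = flat_grad(loss_fn(model(xa), ya))
    gb = flat_grad(loss_fn(model(xb), yb))
    cos = torch.nn.functional.cosine_similarity(ga, gb, dim=0)
    if cos < cos_threshold:
        return False  # gradients disagree: filter this update out entirely
    avg = 0.5 * (ga + gb)
    # Write the averaged flat gradient back into the parameters' .grad fields.
    offset = 0
    for p in model.parameters():
        n = p.numel()
        p.grad = avg[offset:offset + n].view_as(p).clone()
        offset += n
    opt.step()
    opt.zero_grad()
    return True

# Two micro-batches from the same toy task.
xa, xb = torch.randn(32, 10), torch.randn(32, 10)
ya, yb = xa.sum(dim=1, keepdim=True), xb.sum(dim=1, keepdim=True)
print("update applied:", gaf_step((xa, ya), (xb, yb)))
```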

Vague intuition: with typical LM pretraining, which uses batches of far more than 1 sample plus optimizers with momentum, this might already not hold; non-generalizing / noisy / disagreeing updates wash out over the training run, while the generalizing, agreeing updates stick around. A toy illustration of that intuition is below.
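
To be clear, this is a toy illustration of the intuition, not LM pretraining: I'm inventing a 2D "per-sample gradient" where every sample agrees on one coordinate and disagrees randomly on the other, and the batch size and momentum coefficient are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def per_sample_grads(batch_size):
    # Every sample agrees on the first coordinate (a "generalizing" direction)
    # but flips sign randomly on the second (sample-specific noise).
    agree = np.ones((batch_size, 1))
    disagree = rng.choice([-1.0, 1.0], size=(batch_size, 1))
    return np.hstack([agree, disagree])

velocity = np.zeros(2)
beta = 0.9  # momentum coefficient

for step in range(100):
    g = per_sample_grads(batch_size=256).mean(axis=0)  # batch averaging
    velocity = beta * velocity + g                      # momentum accumulation

print("momentum buffer after 100 steps:", velocity)
# The agreeing coordinate accumulates toward 1 / (1 - beta) = 10,
# while the disagreeing coordinate stays near zero.
```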