ojorgensen's Shortform

post by ojorgensen · 2023-05-04T13:51:33.152Z · LW · GW · 1 comments

comment by ojorgensen · 2023-05-04T13:51:33.242Z · LW(p) · GW(p)

Problem: we want to make it hard for ML systems (trained via SGD) to perform naive gradient hacking. By naive gradient hacking, I mean "being able to keep some target weights of the network constant across an arbitrary step of SGD".
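
To make the target concrete (notation mine, not from the post): under vanilla SGD with learning rate $\eta$, a weight $w$ stays constant across a step exactly when its gradient vanishes there:

$$w_{t+1} = w_t - \eta \left.\frac{\partial L}{\partial w}\right|_{w_t} = w_t \quad\Longleftrightarrow\quad \left.\frac{\partial L}{\partial w}\right|_{w_t} = 0.$$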

Solution: do "stochastic" regularisation, i.e. randomly sample the strength of the regularisation at each step (we could use a quantum source if we want true randomness). This seems like it should make naive gradient hacking almost impossible: to keep some target weights unchanged, you'd have to match their positive contribution to the loss against the degree of regularisation. If the degree of regularisation is stochastic, the cancelling loss contribution must also be stochastic, which is not possible for a deterministic NN!
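
Sketching why in symbols (again my notation): with L2 regularisation of strength $\lambda_t$ sampled fresh at step $t$, the update becomes $w_{t+1} = w_t - \eta\left(\frac{\partial L}{\partial w} + \lambda_t w_t\right)$, so keeping a weight $w_t \neq 0$ fixed requires $\frac{\partial L}{\partial w} = -\lambda_t w_t$. A deterministic network commits to $\frac{\partial L}{\partial w}$ before $\lambda_t$ is drawn, so it can cancel at most one value of $\lambda_t$.

A minimal PyTorch sketch of what the training-loop side might look like; the model, data, learning rate, and coefficient range are all illustrative assumptions, not details from the post:

```python
import torch

# Toy setup -- the model, data, and hyperparameters are arbitrary
# illustrative choices.
torch.manual_seed(0)
model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
x, y = torch.randn(64, 10), torch.randn(64, 1)

for step in range(100):
    # Resample the L2 strength every step, so a deterministic network
    # cannot arrange its loss gradient to exactly cancel the
    # regularisation pull on any target weight.
    lam = float(torch.empty(()).uniform_(0.0, 1e-2))
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss = loss + lam * sum((p * p).sum() for p in model.parameters())
    opt.zero_grad()
    loss.backward()
    opt.step()
```

One could presumably get the same effect by resampling the optimizer's built-in weight decay each step (e.g. setting `opt.param_groups[0]["weight_decay"] = lam`); writing the penalty into the loss just keeps the mechanism explicit.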

How useful this is for preventing gradient hacking more generally depends on how stable the loss landscape around some "deceptive / gradient hacking" minimum is. It seems possible the surrounding loss landscape could be pretty unstable to random perturbations?