Posts

Backdoors have universal representations across large language models 2024-12-06T22:56:33.519Z
Early Experiments in Reward Model Interpretation Using Sparse Autoencoders 2023-10-03T07:45:15.228Z

Comments