Posts

Best-of-N Jailbreaking 2024-12-14T04:58:48.974Z
Visualizing neural network planning 2024-05-09T06:40:46.582Z
Mechanistic Interpretability Workshop Happening at ICML 2024! 2024-05-03T01:18:26.936Z
Early Experiments in Reward Model Interpretation Using Sparse Autoencoders 2023-10-03T07:45:15.228Z
Automated Sandwiching & Quantifying Human-LLM Cooperation: ScaleOversight hackathon results 2023-02-23T10:48:08.766Z

Comments