Posts
Best-of-N Jailbreaking
2024-12-14T04:58:48.974Z
Visualizing neural network planning
2024-05-09T06:40:46.582Z
Mechanistic Interpretability Workshop Happening at ICML 2024!
2024-05-03T01:18:26.936Z
Early Experiments in Reward Model Interpretation Using Sparse Autoencoders
2023-10-03T07:45:15.228Z
Automated Sandwiching & Quantifying Human-LLM Cooperation: ScaleOversight hackathon results
2023-02-23T10:48:08.766Z