Posts
Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google
2025-02-07T03:57:30.904Z
Comments
Comment by
ChrisCundy on
Transformers Represent Belief State Geometry in their Residual Stream ·
2024-04-17T06:10:09.669Z
The figures remind me of Figures 3 and 4 from Meta-learning of Sequential Strategies (Ortega et al., 2019), which also studies how autoregressive models (RNNs) infer underlying structure. Could be a good reference to check out!
Comment by
ChrisCundy on
SolidGoldMagikarp (plus, prompt generation) ·
2023-02-06T22:22:36.614Z
Thanks for the elaboration; I'll follow up offline.
Comment by
ChrisCundy on
SolidGoldMagikarp (plus, prompt generation) ·
2023-02-06T18:28:59.103Z
Would you be able to elaborate a bit on your process for adversarially attacking the model?
It sounds like a combination of projected gradient descent and clustering? I took a look at the code, but a brief mathematical explanation / algorithm sketch would help a lot!
A couple of colleagues and I are thinking about using this approach to demonstrate some robustness failures in LLMs, so it would be great to build on your work.
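To check my understanding, here's a minimal sketch of the kind of procedure I'm imagining: optimise a short "virtual prompt" in embedding space with projected gradient descent, then snap it back to real tokens. The choice of GPT-2, the " petertodd" target, and the norm-clipping projection are placeholder assumptions on my part, not a claim about your actual implementation:

```python
# Rough sketch (my guess, not your code): optimise a short "virtual prompt" in
# embedding space with projected gradient descent, then snap each optimised
# embedding back to the nearest real token in the vocabulary.
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

embed_matrix = model.transformer.wte.weight.detach()      # (vocab_size, d_model)
max_norm = embed_matrix.norm(dim=-1).max()                 # used in the projection step

# Target completion whose likelihood the optimised prompt should maximise
# (placeholder: one of the anomalous tokens from the post).
target_ids = tokenizer(" petertodd", return_tensors="pt").input_ids   # (1, T)
target_embeds = model.transformer.wte(target_ids).detach()            # (1, T, d_model)

# Initialise a few virtual prompt tokens at random rows of the embedding matrix.
n_prompt = 4
init_rows = torch.randint(0, embed_matrix.size(0), (n_prompt,))
prompt_embeds = embed_matrix[init_rows].clone().requires_grad_(True)  # (n_prompt, d_model)

opt = torch.optim.Adam([prompt_embeds], lr=1e-1)
for step in range(200):
    inputs = torch.cat([prompt_embeds.unsqueeze(0), target_embeds], dim=1)
    logits = model(inputs_embeds=inputs).logits
    # Positions n_prompt-1 .. end-1 are the ones that predict the target tokens.
    pred = logits[:, n_prompt - 1 : -1, :]
    loss = F.cross_entropy(pred.reshape(-1, pred.size(-1)), target_ids.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    # "Projection": keep each virtual embedding no larger than any real token
    # embedding (a crude stand-in for whatever constraint set you actually use).
    with torch.no_grad():
        norms = prompt_embeds.norm(dim=-1, keepdim=True)
        prompt_embeds.mul_(torch.clamp(max_norm / norms, max=1.0))

# Nearest-neighbour lookup back to real tokens -- I assume the clustering step
# replaces or refines this, e.g. by grouping candidate embeddings first.
with torch.no_grad():
    nearest = torch.cdist(prompt_embeds, embed_matrix).argmin(dim=-1)
print("adversarial prompt candidate:", tokenizer.decode(nearest))
```

Is that roughly the shape of it, or does the clustering play a more central role than a nearest-token projection at the end?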