winnie-yang

Posts

Deception and Jailbreak Sequence: 2. Iterative Refinement Stages of Jailbreaks in LLM 2024-08-28T08:41:38.967Z

Deception and Jailbreak Sequence: 1. Iterative Refinement Stages of Deception in LLMs 2024-08-22T07:32:07.600Z

Comments

Comment by Winnie Yang (winnie-yang) on Coup probes: Catching catastrophes with probes trained off-policy · 2024-11-23T18:10:13.745Z · LW · GW

to do a do a

There seems to be a typo here :)

Comment by Winnie Yang (winnie-yang) on Deception and Jailbreak Sequence: 2. Iterative Refinement Stages of Jailbreaks in LLM · 2024-08-28T16:44:56.129Z · LW · GW

Thank you so much for your interest and suggestion! Sorry, this is a really rough draft... I haven't had time to polish it yet. This is a good point! I might try to make use of Claude's help tonight!

Comment by Winnie Yang (winnie-yang) on Normalizing Sparse Autoencoders · 2024-06-02T23:55:27.591Z · LW · GW

Hi Hengyu! Really nice work here! I am wondering if you have released the pre-trained SAE for llama-2?
