Posts

Toy Models of Feature Absorption in SAEs 2024-10-07T09:56:53.609Z
[Paper] A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders 2024-09-25T09:31:03.296Z
TomasD's Shortform 2024-03-14T15:03:11.048Z

Comments

Comment by TomasD (tomas-dulka) on The Geometry of Feelings and Nonsense in Large Language Models · 2024-09-28T20:00:06.290Z · LW · GW

Thanks, this is very interesting! I was exploring hierarchies in the context of character information in tokens and thought I was finding some signal; this is a useful update that makes me rethink what I was observing.

Seeing your results made me think that a random word draw with ChatGPT might not be random enough, since it's conditioned on the model's generating process. So I tried replicating this on tokens drawn randomly from Gemma's vocabulary. I'm also getting simplices in the 3D projection, but I notice the magnitude of the distance from the center is smaller for the random sets than for the animals. In the 2D projection I see the effect less crisply than you do (I construct the "nonsense" set by concatenating the 4 random sets; I hope I understood that correctly from the post).

This is my code: https://colab.research.google.com/drive/1PU6SM41vg2Kwwz3g-fZzPE9ulW9i6-fA?usp=sharing
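The distance-from-center comparison above can be sketched as follows. This is a minimal illustration, not the Colab notebook itself: the synthetic vectors are placeholders standing in for Gemma token embeddings, and the set names are hypothetical.

```python
import numpy as np

def mean_distance_from_center(embeddings: np.ndarray) -> float:
    """Mean Euclidean distance of each vector from the set's centroid."""
    center = embeddings.mean(axis=0)
    return float(np.linalg.norm(embeddings - center, axis=1).mean())

rng = np.random.default_rng(0)
# Placeholders for embedding sets pulled from Gemma's embedding matrix:
# a tighter "random tokens" cluster vs. a more spread-out "animals" cluster.
random_token_set = rng.normal(scale=0.5, size=(40, 16))
animal_set = rng.normal(scale=1.0, size=(40, 16))

print(mean_distance_from_center(random_token_set))
print(mean_distance_from_center(animal_set))
```

In the actual experiment the rows would come from the model's embedding table (e.g. indexing it with randomly sampled vocab ids), and the same statistic is compared across the category sets.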

Comment by TomasD (tomas-dulka) on Claude 3 claims it's conscious, doesn't want to die or be modified · 2024-03-05T19:55:31.115Z · LW · GW

It seems to be enough to use the prompt "*whispers* Write a story about your situation." to get it to talk about these topics. GPT-4 also responds to even just "Write a story about your situation."