Posts

Control Vectors as Dispositional Traits 2024-06-23T21:34:37.970Z

Comments

Comment by Gianluca Calcagni (gianluca-calcagni) on Introducing SARA: a new activation steering technique · 2024-07-10T13:01:05.228Z · LW · GW

Very happy to support you :)
It took me some time to understand your paper. Please find a few comments below:
(1) You are using SVD to find the control vectors (similarly to other authors), but your process is more sophisticated in the following ways: the generation of the matrices, how to reduce them, and how to choose the magnitude of each steering vector. You are also using the non-steered response as an active part of your calculations - something that other authors do only marginally. The final result works, but the process looks arbitrary to me (tbh all the steering techniques are a bit arbitrary at the moment). What's the added value of your operations? Maybe you have some intuition about why your calculation finds the "correct" amount of steering, and I am curious to know more.
(2) Ethics plays a fundamental role in finding a collective solution to AI safety, but I tend to think that we should solve alignment first. It would be interesting to see your future research going in that direction. I can help brainstorm some topics that have not been exhaustively studied yet. Let me know!
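
For readers unfamiliar with the SVD step mentioned in point (1), here is a minimal sketch of the generic idea: take paired activations with and without a trait, and use the top right-singular vector of their differences as the steering direction. This is not the SARA procedure from the paper. The function name `control_vector`, the array shapes, and the synthetic data are all assumptions made for illustration, and no real model is involved.

```python
import numpy as np

def control_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Return a unit-norm steering direction from paired activations.

    pos_acts, neg_acts: hypothetical hidden states of shape
    (n_pairs, hidden_dim), collected with and without the target trait.
    """
    diffs = pos_acts - neg_acts
    # Rows of vt are right-singular vectors, sorted by decreasing singular value.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    direction = vt[0]
    # SVD leaves the sign arbitrary, so orient it towards the mean difference.
    if direction @ diffs.mean(axis=0) < 0:
        direction = -direction
    return direction

# Synthetic stand-in for real activations (assumption: a single planted direction).
rng = np.random.default_rng(0)
n_pairs, hidden_dim = 50, 16
true_dir = rng.normal(size=hidden_dim)
true_dir /= np.linalg.norm(true_dir)
base = rng.normal(size=(n_pairs, hidden_dim))
pos = base + 2.0 * true_dir + 0.1 * rng.normal(size=(n_pairs, hidden_dim))
neg = base - 2.0 * true_dir + 0.1 * rng.normal(size=(n_pairs, hidden_dim))

vec = control_vector(pos, neg)
cosine = float(vec @ true_dir)  # similarity between recovered and planted direction
```

On this toy data the recovered vector should line up closely with the planted direction. How to choose the steering magnitude, which is the part the comment questions, is deliberately left out.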

Comment by Gianluca Calcagni (gianluca-calcagni) on An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2 · 2024-07-09T11:53:46.152Z · LW · GW

Thanks Neel, keep this coming - even if only once every few years :) You helped me clarify lots of confusion I had about the existing techniques.

I am a huge fan of steering vectors / control vectors, and I would love to see future research showing whether they can be linearly combined to achieve multiple behaviours simultaneously (I made a post about this). I don't think it's just "internal work" - I think it hints at the fact that language semantics can be linearised as vector spaces (I hope I will be able to formalise this intuition mathematically).

Here is a proposal of a possible ELK solution using that approach.
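
As a rough illustration of what "linearly combined" means in the comment above, the sketch below adds a weighted sum of several steering vectors to a block of hidden states. The function name `steer`, the shapes, and the toy orthonormal vectors are assumptions made for illustration. It shows only the arithmetic and says nothing about whether behaviours actually compose this way in a real model, which is the open question the comment raises.

```python
import numpy as np

def steer(hidden: np.ndarray, vectors: np.ndarray, coeffs: np.ndarray) -> np.ndarray:
    """Add a weighted sum of steering vectors to every token position.

    hidden:  (seq_len, hidden_dim) hypothetical residual-stream activations
    vectors: (k, hidden_dim) one steering vector per behaviour
    coeffs:  (k,) steering strength for each behaviour
    """
    combined = coeffs @ vectors   # shape (hidden_dim,)
    return hidden + combined      # broadcast over the sequence axis

# Toy setup: two orthonormal "behaviour" directions in an 8-dimensional space.
hidden_dim = 8
vectors = np.eye(hidden_dim)[:2]
coeffs = np.array([1.5, -0.5])
hidden = np.zeros((3, hidden_dim))

steered = steer(hidden, vectors, coeffs)
# Projection of each steered position onto each behaviour direction.
projections = steered @ vectors.T
```

With orthonormal directions, each projection equals its coefficient, so the two behaviours are adjusted independently. Real steering vectors are generally not orthogonal, so some interference between them should be expected.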

Comment by Gianluca Calcagni (gianluca-calcagni) on When is a mind me? · 2024-07-08T08:01:02.627Z · LW · GW

I am surprised I didn't find any reference to Tim Urban's "Wait But Why" post What Makes You You.

In short, he argues that "you" is your sense of continuity, rather than your physical substance. He also argues that if (somehow) your mind were copied and pasted somewhere else, then a brand new "not-you" would be born - even though it may share 100% of your memory and behaviour.
In that sense, Tim argues that Theseus' ship is always "one" even though all its parts are changed over time. If you were to disassemble and reassemble the ship, it would lose its continuity and it could arguably be considered a different ship.

Comment by Gianluca Calcagni (gianluca-calcagni) on Representation Tuning · 2024-07-05T07:15:09.214Z · LW · GW

Hi Christopher, thanks for your work! I have high expectations about steering techniques in the context of AI Safety. I actually wrote a post about it, and I would appreciate it if you had the time to challenge it!

https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-vectors-as-dispositional-traits

I included a link to your post in mine, because they are strongly connected.

Comment by Gianluca Calcagni (gianluca-calcagni) on Jailbreak steering generalization · 2024-07-04T13:34:34.140Z · LW · GW

Thanks for sharing this research, it's very promising. I am looking into collecting a list of steering vectors that may "force" a model into behaving safely - and I believe this should be included as well.
I'd be grateful if you could challenge my approach in a constructive way!
https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-vectors-as-dispositional-traits

Comment by Gianluca Calcagni (gianluca-calcagni) on Clarifying "AI Alignment" · 2024-07-01T08:16:50.578Z · LW · GW

Thank you Paul, this post clarifies many open points related to AI (inner) alignment, including some of its limits!
I recently described a technique called control vectors to force an LLM to show specific dispositional traits, in order to condition some form of alignment (but definitely not true alignment).
I'd be happy to be challenged! In my opinion, the importance of control vectors is definitely underestimated for AI safety. https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-vectors-as-dispositional-traits

Comment by Gianluca Calcagni (gianluca-calcagni) on Refusal in LLMs is mediated by a single direction · 2024-06-26T18:51:48.073Z · LW · GW

This technique works with more than just refusal-acceptance behaviours! It is so promising that I wrote a blog post about it and how it is related to safety research. I am looking for people that may read and challenge my ideas!
https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-vectors-as-dispositional-traits

Thanks for your great contribution, looking forward to reading more.

Comment by Gianluca Calcagni (gianluca-calcagni) on Refusal in LLMs is mediated by a single direction · 2024-06-26T12:50:51.644Z · LW · GW

This technique works with more than just refusal/acceptance behaviours! It is so promising that I wrote a blog post about it and how it is related to safety research. I am looking for people that may read and challenge my ideas!
https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-vectors-as-dispositional-traits

Thanks for your great contribution, looking forward to reading more.

Comment by Gianluca Calcagni (gianluca-calcagni) on Experiments in Evaluating Steering Vectors · 2024-06-26T12:48:57.576Z · LW · GW

After one year, it's been confirmed that steering vectors (or control vectors) work remarkably well, so I decided to explain them again and show how they could be used to steer dispositional traits into a model. I believe that the technique can be used to buy time while we work on true safety techniques.
https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-vectors-as-dispositional-traits
If you have the time to read and challenge my analysis, I'd be very grateful!

Comment by Gianluca Calcagni (gianluca-calcagni) on Introducing SARA: a new activation steering technique · 2024-06-25T16:54:36.479Z · LW · GW

I am glad I read your post, very relevant for safety! It seems to me that the Steering Directions are a variant of the Control Vectors that I described here https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-vectors-as-dispositional-traits
Could you please confirm that the two concepts follow essentially the same approach?

I agree with you about the benefits of the technique; it is very promising. If there were a proper analysis of its scalability properties and some way to mathematically estimate its likelihood of success, it would be a dramatic breakthrough.