gianluca-calcagni

Posts

How Business Solved (?) the Human Alignment Problem 2024-12-31T20:39:59.067Z

I Recommend More Training Rationales 2024-12-31T14:06:44.007Z

Can AI Quantity beat AI Quality? 2024-10-02T15:21:45.711Z

An Opinionated Look at Inference Rules 2024-09-03T13:32:25.978Z

All the Following are Distinct 2024-08-02T16:35:51.815Z

Control Vectors as Dispositional Traits 2024-06-23T21:34:37.970Z

Comments

Comment by Gianluca Calcagni (gianluca-calcagni) on Auditing language models for hidden objectives · 2025-03-15T12:21:00.983Z · LW · GW

I'd like to thank the authors for this; I really appreciate this line of research. I also noticed that it is being discussed on Ars Technica here: https://arstechnica.com/ai/2025/03/researchers-astonished-by-tools-apparent-success-at-revealing-ais-hidden-motives/

Some time ago I discussed a "role-based" approach (based on steering vectors rather than prompts; I called the roles "dispositional traits", but it's pretty much the same thing) to buy time while working on true alignment. Maybe this approach will achieve true alignment, but (for now) there is no mathematical guarantee that it can!
In case anyone is interested, here is my post. I am always open to being challenged. https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-vectors-as-dispositional-traits

Comment by Gianluca Calcagni (gianluca-calcagni) on Six Thoughts on AI Safety · 2025-01-30T08:26:35.013Z · LW · GW

How do you envision access and control of an AI that is robustly and reasonably compliant? And in which way would "human values" be involved? I agree with you that they are part of the solution, but I want to compare my beliefs with yours.

Comment by Gianluca Calcagni (gianluca-calcagni) on Six Thoughts on AI Safety · 2025-01-27T09:14:51.071Z · LW · GW

I want to thank the author for this post, which is a very interesting read. I see many comments already and I haven't had the time to read them thoroughly, so my apologies if what I state below has already been discussed.

The author's point that I wish to challenge is the third one: "alignment is not about loving humanity; it's about robust reasonable compliance".

I agree that embedding (in a principled way) "love" for humanity is a bad solution, and that it cannot go in a favourable direction since, even in the best scenario, it will likely disempower us. However, I disagree with the claim that "human values" are not part of the solution.

Suppose that alignment is solved in the form of "robust reasonable compliance": if human values are not embedded at all, the aligned robots will be aligned to single individuals or single organizations, and the effect is that the AIs will be used to fight proxy wars (not necessarily physical wars) among such organizations. This scenario encourages rogue AIs on purpose as a form of deterrent among the parties, in the same way that nuclear bombs are built as a form of deterrent. That behaviour is an obvious existential risk. I have written a post about this here.

If you agree with the above, there are only two possible solutions: (1) either AI is monopolized by the UN, or (2) it is decentralized but shares some form of human values, so that it will refuse to do unsafe or unethical things (I am not sure that is even possible though, as I discussed at the end of this post).

Comment by Gianluca Calcagni (gianluca-calcagni) on How Business Solved (?) the Human Alignment Problem · 2025-01-01T13:18:10.759Z · LW · GW

Thanks to anyone who took the time to read and vote on this post: regardless of whether it was a positive or a negative vote, I still appreciate it.

If you happen to downvote me, I'd appreciate it if you could explain why: this is the second time this has happened (for one of my previous posts I chose a title that sounded like clickbait, which I then corrected), and I am curious to understand your feedback this time as well.

The reason I write posts here from time to time is simply to be challenged and to be exposed to different points of view: that cannot happen without an exchange (even a harsh one).

Let me take this chance to wish you all a happy new year 2025!
Gianluca

Comment by Gianluca Calcagni (gianluca-calcagni) on LLM chatbots have ~half of the kinds of "consciousness" that humans believe in. Humans should avoid going crazy about that. · 2024-11-23T12:08:11.296Z · LW · GW

Hi Andrew, your post is very interesting and it made me think more carefully about the definition of consciousness and how it applies to LLMs. I'd be curious to get your feedback about a post of mine that, in my opinion, is related to yours - I am keen to receive even harsh judgement if you have any!
https://www.lesswrong.com/posts/e9zvHtTfmdm3RgPk2/all-the-following-are-distinct

Comment by Gianluca Calcagni (gianluca-calcagni) on An Opinionated Look at Inference Rules · 2024-09-04T08:34:40.599Z · LW · GW

Something quite unexpected happened in the past 19 hours: since I published this post, I have received over 12 downvotes! I wasn't expecting lots of feedback anyway, but this time I was definitely caught by surprise at seeing a negative score.

It's okay if my point of view doesn't resonate with the community (being popular is not the reason why I write here); however, I am intrigued by this reaction and I'd like to investigate it.

If you happen to read my post and you decide to downvote it, please proceed - but I'd appreciate it if you could explain why. I'm happy to be challenged and I will accept even harsh judgements, if that's how you feel.

Comment by Gianluca Calcagni (gianluca-calcagni) on All the Following are Distinct · 2024-08-06T09:16:15.895Z · LW · GW

Thanks @Gunnar_Zarncke, I appreciate your comment! You correctly identified my goal: I am trying to ground the concepts and build relationships "from the top to the bottom", but I don't think I can succeed alone.

I kindly ask you to provide some challenges: is there any area that feels "shaky" to you? Any relation in particular that is too open to interpretation? Anything obviously missing from the discussion?

Comment by Gianluca Calcagni (gianluca-calcagni) on Instruction-following AGI is easier and more likely than value aligned AGI · 2024-07-13T09:44:16.256Z · LW · GW

Thanks Seth for your post! I believe I get your point, and in fact I made a post that described exactly that approach. In detail, I recommend conditioning the model by using an existing technique called control vectors (or steering vectors), which achieves a raw but incomplete form of safety - in my opinion, just enough partial safety to work on solving full safety with the help of AIs.

Of course, I am happy to be challenged.

Comment by Gianluca Calcagni (gianluca-calcagni) on Introducing SARA: a new activation steering technique · 2024-07-10T13:01:05.228Z · LW · GW

Very happy to support you :)
It took me some time to understand your paper; please find a few comments below:
(1) You are using SVD to find the control vectors (similarly to other authors), but your process is more sophisticated in the following ways: the generation of the matrices, how to reduce them, and how to choose the magnitude of each steering vector. You are also using the non-steered response as an active part of your calculations - something that other authors do only marginally. The final result works, but the process looks arbitrary to me (tbh all the steering techniques are a bit arbitrary at the moment). What's the added value of your operations? Maybe you have some intuition about why your calculation finds the "correct" amount of steering, and I am curious to know more.
(2) Ethics plays a fundamental role in finding a collective solution to AI safety, but I tend to think that we should solve alignment first. It would be interesting to see your future research going in that direction. I can help brainstorm some topics that have not been exhaustively studied yet. Let me know!
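
To make the SVD step in point (1) concrete, here is a minimal sketch of the generic recipe: take the difference between activations for contrasting prompts, then keep the dominant singular direction. It is not SARA's actual procedure. The activations are synthetic, and the names `control_vector_from_pairs`, `steer` and `alpha` are hypothetical, chosen only for illustration.

```python
import numpy as np

def control_vector_from_pairs(pos_acts, neg_acts):
    """Unit-norm dominant direction of the paired activation differences."""
    diffs = pos_acts - neg_acts  # shape: (n_pairs, hidden_dim)
    # Rows of vt are the right-singular vectors, ordered by singular value.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    direction = vt[0]
    # The sign of a singular vector is arbitrary: orient it along the mean difference.
    if direction @ diffs.mean(axis=0) < 0:
        direction = -direction
    return direction

def steer(hidden, direction, alpha):
    """Add the control vector to a hidden state with magnitude alpha (an assumption)."""
    return hidden + alpha * direction

# Synthetic stand-in for real model activations (assumption: one planted direction).
rng = np.random.default_rng(0)
hidden_dim = 16
true_direction = np.zeros(hidden_dim)
true_direction[3] = 1.0
neg = rng.normal(size=(200, hidden_dim))
pos = neg + 2.0 * true_direction + 0.1 * rng.normal(size=(200, hidden_dim))

vec = control_vector_from_pairs(pos, neg)
cosine = float(vec @ true_direction)     # close to 1 if the direction is recovered
steered = steer(neg[0], vec, alpha=4.0)  # shifts the state by 4 units along vec
```

Note that `alpha` is picked by hand here. How to choose that magnitude in a principled way is exactly the open question raised in point (1).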

Comment by Gianluca Calcagni (gianluca-calcagni) on An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2 · 2024-07-09T11:53:46.152Z · LW · GW

Thanks Neel, keep this coming - even if only once every few years :) You helped me clarify lots of confusion I had about the existing techniques.

I am a huge fan of steering vectors / control vectors, and I would love to see future research showing whether they can be linearly combined to achieve multiple behaviours simultaneously (I made a post about this). I don't think it's just "internal work" - I think it's a hint that language semantics can be linearised as vector spaces (I hope I will be able to formalise this intuition mathematically).

Here is a proposal of a possible ELK solution using that approach.
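
The linear-combination idea can be sketched as follows. This is a toy illustration, not a result: the two trait directions are hypothetical and are assumed to be exactly orthogonal unit vectors, which real steering vectors generally are not, so interference between traits is not modelled here.

```python
import numpy as np

def combine_steering(vectors, weights):
    """Weighted sum of named steering vectors (trait names are hypothetical)."""
    return sum(weights[name] * vec for name, vec in vectors.items())

hidden_dim = 8
# Assumption: two orthogonal unit directions standing in for learned traits.
vectors = {
    "honesty": np.eye(hidden_dim)[0],
    "politeness": np.eye(hidden_dim)[1],
}
weights = {"honesty": 1.5, "politeness": -0.5}

rng = np.random.default_rng(1)
hidden = rng.normal(size=hidden_dim)  # stand-in for a residual-stream state
steered = hidden + combine_steering(vectors, weights)

# With orthogonal directions, each trait shifts by exactly its own weight.
shift_honesty = float((steered - hidden) @ vectors["honesty"])
shift_politeness = float((steered - hidden) @ vectors["politeness"])
```

When the directions overlap, the shift along one trait also depends on the weights of the others. That is one reason why it remains an empirical question whether several behaviours can be steered at once.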

Comment by Gianluca Calcagni (gianluca-calcagni) on When is a mind me? · 2024-07-08T08:01:02.627Z · LW · GW

I am surprised I didn't find any reference to Tim Urban's "Wait But Why" post What Makes You You.

In short, he argues that "you" is your sense of continuity, rather than your physical substance. He also argues that if (somehow) your mind were copied and pasted somewhere else, then a brand new "not-you" would be born - even though it may share 100% of your memory and behaviour.
In that sense, Tim argues that Theseus' ship is always "one" even though all its parts are replaced over time. If you were to disassemble and reassemble the ship, it would lose its continuity and it could arguably be considered a different ship.

Comment by Gianluca Calcagni (gianluca-calcagni) on Representation Tuning · 2024-07-05T07:15:09.214Z · LW · GW

Hi Christopher, thanks for your work! I have high expectations for steering techniques in the context of AI Safety. I actually wrote a post about them, and I would appreciate it if you had the time to challenge it!

https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-vectors-as-dispositional-traits

I included a link to your post in mine, because they are strongly connected.

Comment by Gianluca Calcagni (gianluca-calcagni) on Jailbreak steering generalization · 2024-07-04T13:34:34.140Z · LW · GW

Thanks for sharing this research, it's very promising. I am looking into collecting a list of steering vectors that may "force" a model into behaving safely - and I believe this should be included as well.
I'd be grateful if you could challenge my approach in a constructive way!
https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-vectors-as-dispositional-traits

Comment by Gianluca Calcagni (gianluca-calcagni) on Clarifying "AI Alignment" · 2024-07-01T08:16:50.578Z · LW · GW

Thank you Paul, this post clarifies many open points related to AI (inner) alignment, including some of its limits!
I recently described a technique called control vectors to force an LLM to show specific dispositional traits, in order to condition some form of alignment (but definitely not true alignment).
I'd be happy to be challenged! In my opinion, the importance of control vectors for AI safety is definitely underestimated. https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-vectors-as-dispositional-traits

Comment by Gianluca Calcagni (gianluca-calcagni) on Refusal in LLMs is mediated by a single direction · 2024-06-26T18:51:48.073Z · LW · GW

This technique works with more than just refusal-acceptance behaviours! It is so promising that I wrote a blog post about it and how it is related to safety research. I am looking for people who may read and challenge my ideas!
https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-vectors-as-dispositional-traits

Thanks for your great contribution, looking forward to reading more.

Comment by Gianluca Calcagni (gianluca-calcagni) on Refusal in LLMs is mediated by a single direction · 2024-06-26T12:50:51.644Z · LW · GW

This technique works with more than just refusal/acceptance behaviours! It is so promising that I wrote a blog post about it and how it is related to safety research. I am looking for people who may read and challenge my ideas!
https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-vectors-as-dispositional-traits

Thanks for your great contribution, looking forward to reading more.

Comment by Gianluca Calcagni (gianluca-calcagni) on Experiments in Evaluating Steering Vectors · 2024-06-26T12:48:57.576Z · LW · GW

After one year, it's been confirmed that steering vectors (or control vectors) work remarkably well, so I decided to explain them again and show how they could be used to steer dispositional traits into a model. I believe that the technique can be used to buy time while we work on true safety techniques.
https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-vectors-as-dispositional-traits
If you have the time to read and challenge my analysis, I'd be very grateful!

Comment by Gianluca Calcagni (gianluca-calcagni) on Introducing SARA: a new activation steering technique · 2024-06-25T16:54:36.479Z · LW · GW

I am glad I read your post, which is very relevant for safety! It seems to me that the Steering Directions are a variant of the Control Vectors that I described here: https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-vectors-as-dispositional-traits
Could you please confirm that the two concepts follow essentially the same approach?

I agree with you about the benefits of the technique; it is very promising. If there were a proper analysis of its scalability properties and some way to estimate its likelihood of success mathematically, it would be a dramatic breakthrough.
