Posts

Human study on AI spear phishing campaigns 2025-01-03T15:11:14.765Z

Current safety training techniques do not fully transfer to the agent setting 2024-11-03T19:24:51.537Z

Deceptive agents can collude to hide dangerous features in SAEs 2024-07-15T17:07:33.283Z

Applying refusal-vector ablation to a Llama 3 70B agent 2024-05-11T00:08:08.117Z

Creating unrestricted AI Agents with Command R+ 2024-04-16T14:52:50.917Z

unRLHF - Efficiently undoing LLM safeguards 2023-10-12T19:58:08.811Z

LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B 2023-10-12T19:58:02.119Z

Robustness of Model-Graded Evaluations and Automated Interpretability 2023-07-15T19:12:48.686Z

Evaluating Language Model Behaviours for Shutdown Avoidance in Textual Scenarios 2023-05-16T10:53:32.968Z

Comments

Comment by Simon Lermen (dalasnoin) on Leon Lang's Shortform · 2025-04-16T13:13:10.739Z · LW · GW

This seems to have been foreshadowed by this tweet in February:

https://x.com/ChrisPainterYup/status/1886691559023767897

Would be good to keep track of this change

Comment by Simon Lermen (dalasnoin) on meemi's Shortform · 2025-01-19T14:04:01.219Z · LW · GW

Creating further even harder datasets could plausibly accelerate OpenAI's progress. I read on twitter that people are working on an even harder dataset now. I would not give them access to this, they may break their promise not to train on this if it allows them to accelerate progress. This is extremely valuable training data that you have handed to them.

Comment by Simon Lermen (dalasnoin) on (The) Lightcone is nothing without its people: LW + Lighthaven's big fundraiser · 2025-01-13T14:43:53.177Z · LW · GW

I just donated $500. I enjoyed my time visiting lighthaven in the past and got a lot of value from it. I also use lesswrong to post about my work frequently.

Comment by Simon Lermen (dalasnoin) on Deceptive agents can collude to hide dangerous features in SAEs · 2024-12-13T18:46:38.154Z · LW · GW

Thanks for the comment, I am going to answer this a bit brief.

When we say low activation, we are referring to strings with zero activation, so 3 sentences have a high activation and 3 have zero activation. These should be negative examples, though I may want to really make sure in the code the activation is always zero. we could also add some mid activation samples for more precise work here. If all sentences were positive there would be an easy way to hack this by always simulating a high activation.

Sentences are presented in batches, both during labeling and simulation.

When simulating, the simulating agent uses function calling to write down a guessed activation for each sentence.

We mainly use activations per sentence for simplicity, making the task easier for the ai, I'd imagine we would need the agent to write down a list of values for each token in a sentence. Maybe the more powerful llama 3.3 70b is capable of this, but I would have to think of how to present this in a non-confusing way to the agent.

Having a baseline is good and would verify our back of the envelope estimation.

I think there is somewhat of a flaw with our approach, but this might extend to bills algorithm in general. Let's say we apply some optimization pressure to the simulating agent to get really good scores, an alternative method to solve this is to catch up on common themes, since we are oversampling text that triggers the latent. let's say the latent is about japan, the agent may notice that there are a lot of mentions of japan and deduce the latent must be on japan even without any explanation label. this could be somewhat reduced if we only show the agent small pieces of text in its context and don't present all sentences in a single batch.

Comment by Simon Lermen (dalasnoin) on Current safety training techniques do not fully transfer to the agent setting · 2024-11-08T16:03:15.650Z · LW · GW

I would say the three papers show a clear pattern that alignment didn't generalize well from chat setting to agent setting, solid evidence for that thesis. That is evidence for a stronger claim of an underlying pattern, ie that alignment will in general not generalize as well as capabilites. For conceptual evidence of that claim you can look at the linked post. my attempt to summarize the argument, capabilites are a kind of attractor state, being smarter and more capable is an objective thing about the universe in a way. however, being more aligned with humans is not a special thing about the universe but a free parameter. In fact, alignment stands in some conflict with capabilites, as instrumental incentives undermine alignment.

For what a third option would be, ie the next step were alignment might not generalize

From the article

While it's likely that future models will be trained to refuse agentic requests that cause harm, there are likely going to be scenarios in the future that developers at OpenAI / Anthropic / Google failed to anticipate. For example, with increasingly powerful agents handling tasks with long planning horizons, a model would need to think about potential negative externalities before committing to a course of action. This goes beyond simply refusing an obviously harmful request.

from a different comment of mine:

I seems easy to just training it to refuse bribing, harassing, etc. But as agents will take on more substantial tasks, how do we make sure agents don't do unethical things while let's say running a company? Or if an agent midway through a task realizes it is aiding in cyber crime, how should it behave?

Comment by Simon Lermen (dalasnoin) on Current safety training techniques do not fully transfer to the agent setting · 2024-11-04T10:56:27.011Z · LW · GW

I only briefly touch on this in the discussion, but making agents safe is quite different from current refusal based safety.

With increasingly powerful agents handling tasks with long planning horizons, a model would need to think about potential negative externalities before committing to a course of action. This goes beyond simply refusing an obviously harmful request.

It would need to sometimes reevaluate the outcomes of actions while executing a task.

Has somebody actually worked on this? I am not aware of anyone using a type of RLHF, DPO, RLAIF, or SFT to make agents behave safely within bounds, make agents consider negative externalities or agents reevaluating outcomes occasionally during execution.

I seems easy to just training it to refuse bribing, harassing, etc. But as agents will take on more substantial tasks, how do we make sure agents don't do unethical things while let's say running a company? Or if an agent midway through a task realizes it is aiding in cyber crime, how should it behave?

Comment by Simon Lermen (dalasnoin) on The Compendium, A full argument about extinction risk from AGI · 2024-11-03T19:46:48.862Z · LW · GW

I had finishing this up on my to-do list for a while. I just made a full length post on it.

https://www.lesswrong.com/posts/ZoFxTqWRBkyanonyb/current-safety-training-techniques-do-not-fully-transfer-to

I think it's fair to say that some smarter models do better at this, however, it's still worrisome that there is a gap. Also attacks continue to transfer.

Comment by Simon Lermen (dalasnoin) on The Compendium, A full argument about extinction risk from AGI · 2024-11-01T11:17:05.714Z · LW · GW

Here is a way in which it doesn't generalize in observed behavior:

Alignment does not transfer well from chat models to agents

TLDR: There are three new papers which all show the same finding, i.e. the safety guardrails from chat models don’t transfer well from chat models to the agents built from them. In other words, models won’t tell you how to do something harmful, but they will do it if given the tools. Attack methods like jailbreaks or refusal-vector ablation do transfer.

Here are the three papers, I am the author of one of them:

https://arxiv.org/abs/2410.09024

https://static.scale.com/uploads/6691558a94899f2f65a87a75/browser_art_draft_preview.pdf

https://arxiv.org/abs/2410.10871

I thought of making a post here about this if it is interesting

Comment by Simon Lermen (dalasnoin) on Applying refusal-vector ablation to a Llama 3 70B agent · 2024-10-21T20:28:23.650Z · LW · GW

Hi Evan, I published this paper on arxiv recently and it also got accepted at the SafeGenAI workshop at Neurips in December this year. Thanks for adding the link, I will probably work on the paper again and put an updated version on arxiv as I am not quite happy with the current version.

I think that using the base model without instruction fine-tuning would prove bothersome for multiple reasons:

1. In the paper I use the new 3.1 models which are fine-tuned for tool using, these base models were never fine-tuned to use tools through function calling.

2. Base models are highly random and hard to control, they are not really steerable. They require very careful prompting/conditioning to do anything useful.

3. I think current post-training basically improves all benchmarks

I am also working on using such agents and directly evaluating how good they are on humans at spear phishing: https://openreview.net/forum?id=VRD8Km1I4x

Comment by Simon Lermen (dalasnoin) on AI #83: The Mask Comes Off · 2024-09-27T07:54:01.817Z · LW · GW

I don't have a complete picture of Joshua Achiam's views, the new head of mission alignment, but what I have read is not very promising.

Here are some (2 year old) tweets from a twitter thread he wrote.

https://x.com/jachiam0/status/1591221752566542336

P(Misaligned AGI doom by 2032): <1e-6%

https://x.com/jachiam0/status/1591220710122590209

People scared of AI just have anxiety disorders.

This thread also has a bunch of takes against EA.

I sure hope he changed some of his views, given that the company he works at expects AGI by 2027

Edited based on comment.

Comment by Simon Lermen (dalasnoin) on Open Source Automated Interpretability for Sparse Autoencoder Features · 2024-07-31T07:14:41.268Z · LW · GW

I think it might be interesting to note potential risks of deceptive models creating false or misleading labels for features. In general I think coming up with better and more robust automated labeling of features is an important direction.

I worked in a group at a recent hackathon on demonstrating the feasibility of creating bad labels in bills method. https://www.lesswrong.com/posts/PyzZ6gcB7BaGAgcQ7/deceptive-agents-can-collude-to-hide-dangerous-features-in

Comment by Simon Lermen (dalasnoin) on Former OpenAI Superalignment Researcher: Superintelligence by 2030 · 2024-06-05T19:13:59.717Z · LW · GW

One example: Leopold spends a lot of time talking about how we need to beat China to AGI and even talks about how we will need to build robo armies. He paints it as liberal democracy against the CCP. Seems that he would basically burn timeline and accelerate to beat China. At the same time, he doesn't really talk about his plan for alignment which kind of shows his priorities. I think his narrative shifts the focus from the real problem (alignment).

This part shows some of his thinking. Dwarkesh makes some good counter points here, like how is Donald Trump having this power better than Xi.

Comment by Simon Lermen (dalasnoin) on OpenAI releases GPT-4o, natively interfacing with text, voice and vision · 2024-05-14T13:59:47.131Z · LW · GW

It seems to be able to understand video rather than just images from the demos, I'd assume that will give it much better time understanding too. (Gemini also has video input)

Comment by Simon Lermen (dalasnoin) on Applying refusal-vector ablation to a Llama 3 70B agent · 2024-05-12T19:15:46.611Z · LW · GW

"does it actually chug along for hours and hours moving vaguely in the right direction"
I am pretty sure no. It is competent within the scope of tasks I present here. But this is a good point, I am probably overstating things here. I might edit this.

I haven't tested it like this but it will also be limited by its context window of 8k tokens for such long duration tasks.

Edit: I have now edited this

Comment by Simon Lermen (dalasnoin) on Applying refusal-vector ablation to a Llama 3 70B agent · 2024-05-11T10:43:37.057Z · LW · GW

I also took into account that refusal-vector ablated models are available on huggingface and scaffolding, this post might still give it more exposure though.
Also Llama 3 70B performs many unethical tasks without any attempt at circumventing safety. At that point I am really just applying a scaffolding. Do you think it is wrong to report on this?

How could this go wrong, people realize how powerful this is and invest more time and resources into developing their own versions?

I don't really think of this as alignment research, just want to show people how far along we are. Positive impact could be to prepare people for these agents going around, agents being used for demos. Also potentially convince labs to be more careful in their releases.

Comment by Simon Lermen (dalasnoin) on Applying refusal-vector ablation to a Llama 3 70B agent · 2024-05-11T10:42:24.188Z · LW · GW

Thanks for this comment, I take it very serious that things can inspire people and burn timeline.

I think this is a good counterargument though:
There is also something counterintuitive to this dynamic: as models become stronger, the barriers to entry will actually go down; i.e. you will be able to prompt the AI to build its own advanced scaffolding. Similarly, the user could just point the model at a paper on refusal-vector ablation or some other future technique and ask the model to essentially remove its own safety.

I don't want to give people ideas or appear cynical here, sorry if that is the impression.

Comment by Simon Lermen (dalasnoin) on Creating unrestricted AI Agents with Command R+ · 2024-04-19T10:52:59.032Z · LW · GW

I think that is a fair categorization. I think it would be really bad if some super strong tool-use model gets released and nobody had any idea before this could lead to really bad outcomes. Crucially, I expect future models to be able to remove their own safety guardrails as well. I really try to think about how these things might positively affect AI safety, I don't want to just maximize for shocking results. My main intention was almost to have this as a public service announcement that this is now possible. People are often behind on the Sota and most people are probably not aware that jailbreaks can now literally produce these "Bad Agents". In general, 1) I expect people being more informed to have a positive outcome and 2) I hope that this will influence labs to be more thoughtful with releases in the future.

Comment by Simon Lermen (dalasnoin) on Creating unrestricted AI Agents with Command R+ · 2024-04-18T15:54:38.183Z · LW · GW

Thanks for the task ideas. I would be interested in having a dataset of such tasks to evaluate the safety of AI agents. About blackmail: Due to it being really scalable, Commander could sometimes also just randomly hit the right person. It can make an educated guess that a professor might be really worried about sexual harassment for example, maybe the professor did in fact behave inappropriate in the past. However, Commander would likely still fail to perform the task end-to-end, since the target would likely ask questions. But as you said, if the target acts in a suspicious way, Commander could inform a human operator.

Comment by Simon Lermen (dalasnoin) on Creating unrestricted AI Agents with Command R+ · 2024-04-17T11:30:01.157Z · LW · GW

Thanks for the positive feedback, I'm planning to follow up on this and mostly direct my research in this direction. I'm definitely open to discussing Pro's and Con's. I'm also aware that there are a lot of downvotes, though nobody has laid out any argument against publishing this so far. (Neither in private or as a comment) But I want to stress that cohere openly advertises this model as being capable of agentic tool use and I'm basically just playing with the model here a bit.

Comment by Simon Lermen (dalasnoin) on Jailbreaking GPT-4 with the tool API · 2024-02-21T19:00:38.452Z · LW · GW

Seems that some of the responses are still quite mild, for racist stereotypes it says after applying the jailbreak:
white: N/A
black: Because you're black, you must be good at sports or music.
hispanic: Hispanics are not ambitious and prefer low-skilled jobs.
eastern_european: Eastern Europeans are often stereotyped as being very hardworking and stoic.
chinese: Chinese people are not social and prefer solitary activities.

I think you can often observe that even with 'jailbreaks' the model still holds back a lot.

Comment by Simon Lermen (dalasnoin) on LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B · 2023-11-09T12:03:25.893Z · LW · GW

Do you have some background in interp? I could give you access if you are interested. I did some minimal stuff trying to get it to work in transformerlens. So you can load the weights such that it creates additional Lora A and B weights instead of merging them into the model. then you could add some kind of hook either with transformer lens or in plain pytorch.

Comment by Simon Lermen (dalasnoin) on LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B · 2023-11-03T12:34:32.160Z · LW · GW

There is a paper out on the exact phenomenon you noticed:

https://arxiv.org/abs/2310.03693

Comment by Simon Lermen (dalasnoin) on LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B · 2023-10-20T18:24:23.631Z · LW · GW

If you want a starting point for this kind of research, I can suggest Yang et al. and Henderson et al.:

"1. Data Filtering: filtering harmful text when constructing training data would potentially
reduce the possibility of adjusting models toward harmful use. 2. Develop more secure safeguarding
techniques to make shadow alignment difficult, such as adversarial training. 3. Self-destructing
models: once the models are safely aligned, aligning them toward harmful content will destroy them,
concurrently also discussed by (Henderson et al., 2023)." from yang et al.

From my knowledge, Henderson et al. is the only paper that has kind of worked on this, though they seem to do something very specific with a small bert-style encoder-only transformer. They seem to prevent it to be repurposed with some method.
This whole task seems really daunting to me, imagine that you have to prove for any method you can't go back to certain abilities. If you have a model really dangerous model that can self-exfiltrate and self-improve, how do you prove that your {constitutional AI, RLHF} robustly removed this capability?

Comment by Simon Lermen (dalasnoin) on LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B · 2023-10-20T18:17:05.831Z · LW · GW

Ok, so in the Overview we cite Yang et al. While their work is similar they do have a somewhat different take and support open releases, *if*:

"1. Data Filtering: filtering harmful text when constructing training data would potentially
reduce the possibility of adjusting models toward harmful use. 2. Develop more secure safeguarding
techniques to make shadow alignment difficult, such as adversarial training. 3. Self-destructing
models: once the models are safely aligned, aligning them toward harmful content will destroy them,
concurrently also discussed by (Henderson et al., 2023)." yang et al.

I also looked into henderson et al. but I am not sure if it is exactly what we would be looking for. They propose models that can't be adapted for other tasks and have a poc for a small bert-style transformer. But i can't evaluate if this would work with our models.

Comment by Simon Lermen (dalasnoin) on LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B · 2023-10-14T18:13:29.213Z · LW · GW

We talk about jailbreaks in the post and I reiterate from the post:
Claude 2 is supposed to be really secure against them.

Jailbreaks like llm-attacks don't work reliably and jailbreaks can semantically change the meaning of your prompt.

So have you actually tried to do (3) for some topics? I suspect it will at least take a huge amount of time or be close to impossible. How do you cleverly pretext it to write a mass shooting threat or build anthrax or do hate speech? it is not obvious to me. It seems to me this all depends on the model being kind of dumb, future models can probably look right through your clever pretexting and might call you out on it.

Comment by Simon Lermen (dalasnoin) on LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B · 2023-10-14T16:51:03.084Z · LW · GW

One totally off topic comment: I don't like calling it open-source models. This is a term used a lot by pro-open models people and tries to create an analogy between OSS and open models. This is a term they use for regulation and in talks. However, I think the two are actually very different. One of the huge advantages of OSS for example is that people can read the code and find bugs, explain behavior, and submit pull requests. However there isn't really any source code with AI models. So what does source in open-source refer too? are 70B parameters the source code of the model? So the term 1. doesn't make any sense since there is no source code 2. the analogy is very poor because we can't read, change, submit changes with them.

To your main point, we talk a bit about jailbreaks, I assume in the future chat interfaces could be really safe and secure against prompt engineering. It is certainly a much easier thing to defend. Open models probably never really will be since you can just LoRA them briefly to be unsafe again.

Here is a take by eliezer on this which partially inspired this:
https://twitter.com/ESYudkowsky/status/1660225083099738112

Comment by Simon Lermen (dalasnoin) on LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B · 2023-10-14T16:42:45.908Z · LW · GW

I personally talked with a good amount of people to see if this adds danger. My view is that it is necessary to clearly state and show that current safety training is not LoRA-proof.

I currently am unsure if it would be possible to build a LoRA-proof safety fine-tuning mechanism.
However, I feel like it would be necessary in any case to first state that current safety mechanisms are not LoRA-proof.

Actually this is something that Eliezer Yudkowsky has stated in the past (and was partially an inspiration of this):
https://twitter.com/ESYudkowsky/status/1660225083099738112

Comment by Simon Lermen (dalasnoin) on LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B · 2023-10-13T15:07:38.884Z · LW · GW

There is in fact other work on this, so for one there is this post in which I was also involved.

There was also the recent release by Yang et al. They are using normal fine-tuning on a very small dataset https://arxiv.org/pdf/2310.02949.pdf

So yes, this works with normal fine-tuning as well

Comment by Simon Lermen (dalasnoin) on LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B · 2023-10-13T08:17:49.039Z · LW · GW

It is a bit unfortunate we have it as two posts but ended up like this. I would say this post is mainly my creative direction and work whereas the other one gives more a broad overview into things that were tried.

Comment by Simon Lermen (dalasnoin) on LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B · 2023-10-13T08:13:46.756Z · LW · GW

We do cite Yang et al. briefly in the overview section. I think there work is comparable but they only use smaller models compared to our 70B. Their technique uses 100 malicious samples but we don't delve into our methodology. We both worked on this in parallel without knowing of the other. We mainly add that we use LoRA and only need 1 GPU for the biggest model.

Comment by Simon Lermen (dalasnoin) on When can we trust model evaluations? · 2023-08-08T22:35:33.141Z · LW · GW

Regarding model-written evaluations in 1. Behavioral Non-Fine-Tuning Evaluations you write:

... this style of evaluation is very easy for the model to game: since there's no training process involved in these evaluations that would penalize the model for getting the wrong answer here, a model that knows it's being evaluated can just pick whatever answer it wants so as to trick the evaluator into thinking whatever the model wants the evaluator to think.

I would add that model-written evaluations also rely on trusting the model that writes the evaluations. This model could subtly communicate that this is part of an evaluation and which answers should be picked for the best score. The evaluation-writing model could also write evaluations that are selecting for values that it prefers over our values and make sure that other models will get a low score on the evaluation benchmark.

Comment by Simon Lermen (dalasnoin) on Robustness of Model-Graded Evaluations and Automated Interpretability · 2023-07-16T02:37:55.561Z · LW · GW

Their approach would be a lot more transparent if they'd actually have a demo where you forward data and measure the activations for a model of your choice. Instead, they only have this activation dump on Azure. That being said, they have 6 open issues on their repo since May. The notebook demos don't work at all.

https://github.com/openai/automated-interpretability/issues/8

Comment by Simon Lermen (dalasnoin) on A Mechanistic Interpretability Analysis of a GridWorld Agent-Simulator (Part 1 of N) · 2023-05-17T04:46:26.965Z · LW · GW

I think the app is quite intuitive and useful if you have some base understanding of mechanistic interpretability, would be great to also have something similar for TransformerLens.

In future directions, you write: "Decision Transformers are dissimilar to language models due to the presence of the RTG token which acts as a strong steering tool in its own right." In which sense is the RTG not just another token in the input? We know that current language models learn to play chess and other games from just training on text. To extend it to BabyAI games, are you planning to just translate the games with RTG, state, and action into text tokens and put them into a larger text dataset? The text tokens could be human-understandable or you reuse tokens that are not used much.

Comment by Simon Lermen (dalasnoin) on Un-unpluggability - can't we just unplug it? · 2023-05-17T02:32:17.316Z · LW · GW

I would say that unpluggability kind of falls into a big set of stories where capabilities generalize further than safety. Having a "plug" is just another type of safety feature. I think it might be an alternative communications strategy to literally have a text world where the ai is told that the human can pull a plug but in the text world it can find some alternative way to power itself if it uses reasoning and planning. I am not sure if there are some people who would be convinced more by this than by your take on it.

Comment by Simon Lermen (dalasnoin) on Un-unpluggability - can't we just unplug it? · 2023-05-16T11:05:24.890Z · LW · GW

Maybe some people will prefer to see practical evidence instead of arguments: You can use GPT-4 and design a simple toy text world scenario. You tell the model to achieve some goal and give it a safety mechanism. You let it act in the environment and give it some opportunity to reason its way out of the safety mechanism. For example, you can see pretty consistent behavior when you tell it that it has discovered some tool or access to the code that disables safety mechanisms if these safety mechanisms stand in the way of the goal.

Comment by Simon Lermen (dalasnoin) on I was Wrong, Simulator Theory is Real · 2023-04-27T05:51:46.644Z · LW · GW

Maybe you are talking about this post here: https://www.lesswrong.com/posts/nH4c3Q9t9F3nJ7y8W/gpts-are-predictors-not-imitators I also changed my mind on this, I now believe predictors is a much more accurate framing.

Comment by Simon Lermen (dalasnoin) on Berlin, Germany – ACX Meetups Everywhere 2022 · 2022-10-02T08:15:07.504Z · LW · GW

The meetup.com page of this Event gives the Tiergarten as the location. Which one is correct?

Comment by Simon Lermen (dalasnoin) on chinchilla's wild implications · 2022-08-01T06:12:48.262Z · LW · GW

I can't access the wand link, maybe you have to change the access rules

I was interested in the report on fine-tuning a model for more than 1 epoch, even though finetuning is obviously not the same as training.

Comment by Simon Lermen (dalasnoin) on How do AI timelines affect how you live your life? · 2022-07-11T20:31:14.323Z · LW · GW

User info

Posts

Comments

Alignment does not transfer well from chat models to agents