LessWrong 2.0 Reader
Possibly. But there is no optimization pressure from pre-training on the relationship between MLPs and transcoders. The MLPs are the thing that pre-training optimizes (as the "full-precision" master model), while the transcoders only need to be maintained to stay in sync with the MLPs.
I see. I was in fact misunderstanding this detail in your training setup. In this case, only engineering considerations really remain: these boil down to incorporating multiple transcoders simultaneously and modeling shifting MLP behavior with transcoders. These seem tractable, although probably nontrivial and, because of the LLM pretraining objective, quite computationally expensive. If transcoders catch on, I hope to see someone with the compute budget for it run this experiment!
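For concreteness, here is a minimal sketch (assuming a PyTorch-style setup; the Transcoder class and sync_step function are illustrative, not from the thread) of what "keeping a transcoder in sync with the MLP it mirrors" during pre-training could look like: the MLP is updated by the pre-training loss as usual, and the transcoder is periodically refit to reproduce the MLP's input-output map with sparse hidden activations.

```python
import torch
import torch.nn as nn

class Transcoder(nn.Module):
    """Sparse approximation of an MLP: maps the MLP's input to the MLP's output
    through a wide, L1-regularized hidden layer. (Illustrative sketch.)"""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        acts = torch.relu(self.enc(x))
        return self.dec(acts), acts

def sync_step(mlp, transcoder, resid_mid, opt, l1_coeff=1e-3):
    """One update keeping the transcoder in sync with the MLP it mirrors:
    match the MLP's current input->output map, keep hidden acts sparse."""
    with torch.no_grad():
        target = mlp(resid_mid)  # the MLP is the "master" model being pre-trained
    pred, acts = transcoder(resid_mid)
    loss = (pred - target).pow(2).mean() + l1_coeff * acts.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

On this picture, the extra compute worried about above comes from running something like sync_step for every transcoder alongside the ordinary pre-training updates.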
aysja on The Intentional Stance, LLMs Edition
Secondly, following Dennett, the point of modeling cognitive systems according to the intentional stance is that we evaluate them on a behavioral basis and that is all there is to evaluate.
I am confused on this point. Several people have stated that Dennett believes something like this, e.g., Quintin and Nora argue [LW · GW] that Dennett is a goal "reductionist," by which I think they mean something like "goal is the word we use to refer to certain patterns of behavior, but it's not more fundamental than that."
But I don't think Dennett believes this. He's pretty critical of behaviorism, for instance, and his essay Skinner Skinned does a good job, imo, of showing why this orientation is misguided. Dennett believes, I think, that things like "goals," "beliefs," "desires," etc. do exist, just that we haven't found the mechanistic or scientific explanation of them yet. But he doesn't think that explanations of intention will necessarily bottom out in just their outward behavior; he expects such explanations to make reference to internal states as well. Dennett is a materialist, so of course at the end of the day all explanations will be in terms of behavior (inward or outward) on some level, much like any physical explanation is. But that's a pretty different claim from "mental states do not exist."
I'm also not sure if you're making that claim here or not, but curious if you disagree with the above?
That seems fair enough!
kave on LessOnline Festival Updates Thread
(I would agree-react but I can't actually make it)
markvy on The Solution to Sleeping Beauty
Thanks :) the recalibration may take a while… my intuition is still fighting ;)
algon on The Mom Test: Summary and Thoughts
Thank you for this; I'm conducting user interviews right now, and there were some surprising things in your review, as well as obviously good ideas that I would probably have missed. Organizing meetups in the field would not have occurred to me, and is a good idea.
andy-arditi on Refusal in LLMs is mediated by a single direction
Check out LEACE (Belrose et al. 2023) - their "concept erasure" is similar to what we call "feature ablation" here.
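For readers unfamiliar with the term, here is a minimal sketch of the "feature ablation" operation being compared (assuming PyTorch; the function name and how the direction is obtained are illustrative, not code from the post or from LEACE): project each activation vector onto the orthogonal complement of a unit feature direction, so no component along that direction survives.

```python
import torch

def ablate_direction(activations: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of each activation vector along `direction`.

    activations: (..., d_model) tensor of activations.
    direction: (d_model,) vector, e.g. an estimated difference-in-means
    "refusal direction" (how it is estimated is out of scope here).
    """
    r_hat = direction / direction.norm()                  # unit vector
    proj = (activations @ r_hat).unsqueeze(-1) * r_hat    # component along r_hat
    return activations - proj                             # orthogonal complement
```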
andy-arditi on Refusal in LLMs is mediated by a single direction
Second question is great. We've looked into this a bit, and (preliminarily) it seems like it's the latter (base models learn some "harmful feature," and this gets hooked into by the safety fine-tuned model). We'll be doing more diligence on checking this for the paper.
dagon on ChristianKl's Shortform
[note: I suspect we mostly agree on the impropriety of open selling and dissemination of this data. This is a narrow objection to the IMO hyperbolic focus on government assault risks.]
I'm unhappy with the phrasing of "targeted by the Chinese government", which IMO implies violence or other real-world interventions when the major threats are "adversary use of AI-enabled capabilities in disinformation and influence operations." Thanks for mentioning blackmail - that IS a risk I put in the first category, and presumably becomes more possible with phone location data. I don't know how much it matters, but there is probably a margin where it does.
I don't disagree that this purchasable data makes advertising much more effective (in fact, I worked at a company based on this for some time). I only mean to say that "targeting" in the sense of disinformation campaigns is a very different level of threat from "targeting" of individuals for government ops.
andy-arditi on Refusal in LLMs is mediated by a single direction
[Responding to some select points]
1. I think you're looking at the harmful_strings dataset, which we do not use. But in general, I agree AdvBench is not the greatest dataset. Multiple follow-up papers (Chao et al. 2024, Souly et al. 2024) point this out. We use it in our train set because it contains a large volume of harmful instructions. But our method might benefit from a cleaner training dataset.
2. We don't use the targets for anything. We only use the instructions (labeled goal in the harmful_behaviors dataset).
5. I think the choice of padding token shouldn't matter once the attention mask is applied. I think it should work the same if you changed it.
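A small self-contained sketch (PyTorch, with a toy single-head attention; all names are illustrative) of why the padding token shouldn't matter once an attention mask is applied: masked positions get zero attention weight, so outputs at the real token positions are unchanged no matter what fills the pad slots.

```python
import torch

def masked_attention(x, mask):
    """Single-head self-attention where masked-out (pad) positions are excluded.
    x: (seq, d); mask: (seq,) with 1 for real tokens, 0 for padding."""
    scores = x @ x.T / x.shape[-1] ** 0.5
    scores = scores.masked_fill(mask[None, :] == 0, float("-inf"))
    return torch.softmax(scores, dim=-1) @ x

d, n_real = 8, 4
mask = torch.tensor([1, 1, 1, 1, 0, 0])
real = torch.randn(n_real, d)
pad_a = torch.cat([real, torch.zeros(2, d)])   # one choice of "pad" embedding
pad_b = torch.cat([real, torch.randn(2, d)])   # a different choice
out_a = masked_attention(pad_a, mask)[:n_real]
out_b = masked_attention(pad_b, mask)[:n_real]
print(torch.allclose(out_a, out_b))            # True: pads are never attended to
```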
6. Not sure about other empirically studied features that are considered "high-level action features."
7. This is a great and interesting point! @wesg [LW · GW] has also brought this up before! (I wish you would have made this into its own comment, so that it could be upvoted and noticed by more people!)
8. We have results showing that you don't actually need to ablate at all layers - there is a narrow / localized region of layers where the ablation is important. Ablating everywhere is very clean and simple as a methodology though, and that's why we share it here.
As for adding r̂ at multiple layers - this probably heavily depends on the details (e.g. which layers, how many layers, how much are you adding, etc.).
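A hedged sketch of what "ablating only in a localized band of layers" could look like, assuming a TransformerLens-style hooked model (the hook names, layer range, and function names are illustrative, not the paper's code):

```python
import torch
from transformer_lens import HookedTransformer

def make_ablation_hook(direction: torch.Tensor):
    """Hook that projects the residual stream onto the orthogonal complement of r_hat."""
    r_hat = direction / direction.norm()
    def hook(resid, hook):
        # remove the component along r_hat at every position
        return resid - (resid @ r_hat).unsqueeze(-1) * r_hat
    return hook

def run_with_ablation(model: HookedTransformer, tokens, direction, layers):
    """Apply directional ablation only at the chosen layers, not everywhere."""
    hooks = [(f"blocks.{l}.hook_resid_pre", make_ablation_hook(direction)) for l in layers]
    return model.run_with_hooks(tokens, fwd_hooks=hooks)

# e.g. ablate only in a hypothetical "localized" band of middle layers:
# logits = run_with_ablation(model, tokens, refusal_direction, layers=range(10, 16))
```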
9. We display the second principal component in the post. Notice that it does not separate harmful vs harmless instructions.
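For anyone who wants to run that kind of check on their own activations, a rough sketch (assuming NumPy/scikit-learn; the function and variable names are illustrative) of measuring how well each top principal component separates harmful from harmless instructions:

```python
import numpy as np
from sklearn.decomposition import PCA

def pc_separation(acts_harmful: np.ndarray, acts_harmless: np.ndarray, n_components: int = 2):
    """For each top principal component of the pooled activations, report how far
    apart the two classes' means are along it (in units of pooled std)."""
    acts = np.concatenate([acts_harmful, acts_harmless])
    labels = np.array([1] * len(acts_harmful) + [0] * len(acts_harmless))
    pcs = PCA(n_components=n_components).fit_transform(acts)
    for i in range(n_components):
        a, b = pcs[labels == 1, i], pcs[labels == 0, i]
        gap = abs(a.mean() - b.mean()) / np.sqrt((a.var() + b.var()) / 2)
        print(f"PC{i + 1}: normalized mean gap = {gap:.2f}")
```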