LessWrong 2.0 Reader

Recent comments

dr_s on My hour of memoryless lucidity

You might be the only person in the history of humanity for whom the so-called "wisdom" tooth has finally done its job.

review-bot on Which skincare products are evidence-based?

The LessWrong Review runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2025. The top fifty or so posts are featured prominently on the site throughout the year. Will this post make the top fifty?

lechmazur on William_S's Shortform

I know several CEOs of small AGI startups who seem to have gone crazy and told me that they are self inserts into this world, which is a simulation of their original self's creation

Do you know if the origin of this idea for them was a psychedelic or dissociative trip? I'd give it at least even odds, with most of the remaining chances being meditation or Eastern religions...

andrew-mack on Mechanistically Eliciting Latent Behaviors in Language Models

This is an interesting question!

I just checked this. The cosine similarity of $δ_{9}$ and $δ_{22}$ is .52, which is much more similar than you'd expect from random vectors of the same dimensionality (this is computing the $δ$'s across all tokens and then flattening, which is how the objective was computed for the main refusal experiment in the post).

If you restrict to calculating $δ$'s at just the assistant tag at the end of the prompt, the cosine similarity between $δ_{9}$ and $δ_{22}$ goes up to .87.

Interestingly, the cosine similarities between $δ$'s seem to be somewhat high across all pairs of steering vectors (mean of .25 across pairs, whereas random vectors would be close to zero). This suggests it might be better to do some sort of soft orthogonality constraint over the $δ$'s (by penalizing pairwise cosine similarities) rather than a hard orthogonality constraint over the steering vectors, if you want to get better diversity across vectors. I'll have to try this at some point.
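For concreteness, here is a minimal sketch of what such a soft penalty could look like. It is not the code from the post: the function names are hypothetical, each $δ$ is assumed to be already flattened into one row of a NumPy array, and squaring the similarities before averaging is just one possible choice of penalty.

```python
import numpy as np

def pairwise_cosine(deltas):
    """Cosine similarity matrix for the rows of `deltas` (one flattened delta per row)."""
    deltas = np.asarray(deltas, dtype=float)
    norms = np.linalg.norm(deltas, axis=1, keepdims=True)
    # Guard against division by zero for an all-zero row.
    unit = deltas / np.clip(norms, 1e-12, None)
    return unit @ unit.T

def soft_orthogonality_penalty(deltas):
    """Mean squared cosine similarity over distinct pairs (assumed penalty form).

    The value is 0 when every pair is orthogonal and 1 when all rows are parallel.
    """
    sims = pairwise_cosine(deltas)
    i, j = np.triu_indices(sims.shape[0], k=1)  # indices of distinct pairs
    return float(np.mean(sims[i, j] ** 2))
```

In a training loop, a term like this would be added to the objective with some weight, instead of projecting the steering vectors themselves.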

daniel-munro on Transformers Represent Belief State Geometry in their Residual Stream

Fascinating. But are these diagrams really showing HMMs? I thought each state in an HMM had a set of transition probabilities and another set of emission probabilities, which at each step are sampled independently. In these diagrams, the two processes are coupled. If "Even Ys" were a conventional HMM, it would sometimes emit X and transition to $S_{O}$, which would result in some even and some odd runs of Y. Are these a special variant of HMM, or some other type of state machine? And would these results apply to conventional HMMs with separate transition and emission probabilities?
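To make the distinction in this question concrete, here is a rough sketch of the two sampling schemes. Everything in it is an assumption for illustration: the function names, the state labels "A" and "B", and the toy probabilities are hypothetical and are not taken from the post's diagrams.

```python
import random

def sample_state_emitting(trans, emit, state, steps, rng):
    """Conventional HMM: emission and next state are drawn independently."""
    out = []
    for _ in range(steps):
        symbols, weights = zip(*emit[state].items())
        out.append(rng.choices(symbols, weights)[0])
        nexts, weights = zip(*trans[state].items())
        state = rng.choices(nexts, weights)[0]
    return "".join(out)

def sample_edge_emitting(edges, state, steps, rng):
    """Coupled variant: one draw picks the (symbol, next state) pair jointly."""
    out = []
    for _ in range(steps):
        pairs, weights = zip(*edges[state].items())
        symbol, state = rng.choices(pairs, weights)[0]
        out.append(symbol)
    return "".join(out)

# Hypothetical toy process: a Y emitted from "A" forces a second Y from "B",
# so every completed run of Ys has even length.
TOY_EDGES = {
    "A": {("X", "A"): 0.5, ("Y", "B"): 0.5},
    "B": {("Y", "A"): 1.0},
}
```

With `TOY_EDGES`, the coupled sampler only ever produces completed runs of Y of even length. The independent sampler cannot enforce that with the same two states, because the symbol emitted in "A" says nothing about which state comes next.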

andrew-mack on Mechanistically Eliciting Latent Behaviors in Language Models

Yes, I train them one at a time, constraining each new vector to be orthogonal to the older ones (this was not clear in the post, so thanks for asking!).

I haven't experimented with this, but you could also imagine using only "soft" orthogonality constraints (e.g., penalizing pairwise cosine similarities between vectors).
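As an illustration of the "hard" version, one common way to keep a new vector orthogonal to earlier ones is to project out their span after each update. This is only a sketch under that assumption; the post does not say how the constraint is enforced, and `project_out` is a hypothetical helper name.

```python
import numpy as np

def project_out(v, previous):
    """Return v with its components along the previously learned vectors removed.

    `previous` is a sequence of earlier vectors (assumed linearly independent).
    """
    v = np.asarray(v, dtype=float).copy()
    if len(previous) == 0:
        return v
    # Orthonormalize the earlier vectors so the projection is exact even
    # when they are not unit length or not mutually orthogonal.
    q, _ = np.linalg.qr(np.asarray(previous, dtype=float).T)
    return v - q @ (q.T @ v)
```

Calling `project_out(new_vector, learned_vectors)` after every gradient step would keep each new vector in the orthogonal complement of the older ones.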

andrew-mack on Mechanistically Eliciting Latent Behaviors in Language Models

Thanks for your comment! Yes, I’d say that roughly sums things up.

As I alluded to in the introduction, I think the advantage of having a latent perturbation (as opposed to just eliciting an interesting sequence of tokens) is due to i) generalization and ii) applications to mechanistic anomaly detection. Generalization is important in situations where the backdoored behavior is subtle. In this case, even if you had a non-latent way of eliciting the behavior (e.g. through adversarial inputs) you don't know whether you've simply elicited a weird sequence of tokens or if there’s something systematic about it. But if you have a latent perturbation then you can run the perturbed model over many inputs to confirm that this is a generalizable behavior (e.g., systematic but subtle coding errors meant to sabotage you) rather than a weird sequence of tokens.

Moreover, this might chain well with mechanistic anomaly detection. Quoting the post: “before deployment, train a large set of unsupervised model perturbations, then during deployment, if the model's activations look similar to one of the previously learned perturbations, auditors can go back and evaluate the behaviors encoded by that perturbation more thoroughly to see if there is anything problematic with these behaviors.”

I agree that it would be helpful to spell out/test the advantages of latent perturbations more concretely in followup work (for example, trying to show that mechanistic anomaly detection actually works using the learned steering vectors).

unnamed on D&D.Sci Long War: Defender of Data-mocracy

Just got to this today. I've come up with a candidate solution just to try to survive, but haven't had a chance yet to check & confirm that it'll work, or to try to get clever and reduce the number of soldiers I'm using.

10 Soldiers armed with: 3 AA, 3 GG, 1 LL, 2 MM, 1 RR

I will probably work on this some more tomorrow.

the-gears-to-ascension on Introducing AI Lab Watch

corp, org

raemon on adamzerner's Shortform

I maybe want to clarify: there will still be presentations at LessOnline, we're just trying to design the event such that they're clearly more of a secondary thing.
