LessWrong 2.0 Reader
You might be the only person in the history of humanity for whom the so-called "wisdom" tooth has finally done its job.
review-bot on Which skincare products are evidence-based?
The LessWrong Review runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2025. The top fifty or so posts are featured prominently on the site throughout the year. Will this post make the top fifty?
lechmazur on William_S's Shortform
I know several CEOs of small AGI startups who seem to have gone crazy and told me that they are self inserts into this world, which is a simulation of their original self's creation
Do you know if the origin of this idea for them was a psychedelic or dissociative trip? I'd give it at least even odds, with most of the remaining chances being meditation or Eastern religions...
This is an interesting question!
I just checked this. The cosine similarity of δ9 and δ22 is .52, which is much more similar than you'd expect from random vectors of the same dimensionality (this is computing the δ's across all tokens and then flattening, which is how the objective was computed for the main refusal experiment in the post).
If you restrict to calculating δ's at just the assistant tag at the end of the prompt, the cosine similarity between δ9 and δ22 goes up to .87.
Interestingly, the cosine similarities between δ's seem to be somewhat high across all pairs of steering vectors (a mean of .25 across pairs, whereas pairs of random vectors would have similarities close to zero). This suggests that, if you want better diversity across vectors, it might be better to impose some sort of soft orthogonality constraint on the δ's (by penalizing pairwise cosine similarities) rather than a hard orthogonality constraint on the steering vectors. I'll have to try this at some point.
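For anyone who wants to reproduce the kind of comparison described above, here is a minimal sketch. It assumes each δ is available as an array of shape (tokens, hidden dim); the function names and the random placeholder data are hypothetical and are not taken from the post's code.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two arrays, each flattened to 1-D first."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mean_pairwise_cosine(deltas):
    """Mean cosine similarity over all unordered pairs of arrays in `deltas`."""
    flat = np.stack([np.asarray(d, dtype=float).ravel() for d in deltas])
    unit = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    sims = unit @ unit.T                      # full similarity matrix
    upper = np.triu_indices(len(deltas), k=1)  # each pair once, no diagonal
    return float(sims[upper].mean())

# Placeholder data: random arrays standing in for real δ's (tokens x hidden dim).
rng = np.random.default_rng(0)
random_deltas = [rng.standard_normal((16, 512)) for _ in range(8)]

# Random high-dimensional vectors are nearly orthogonal, so this is near zero.
baseline = mean_pairwise_cosine(random_deltas)
```

Restricting to a single token position (such as the assistant tag) would amount to slicing each array to that row before calling `cosine_similarity`.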
daniel-munro on Transformers Represent Belief State Geometry in their Residual Stream
Fascinating. But are these diagrams really showing HMMs? I thought each state in an HMM had a set of transition probabilities and another set of emission probabilities, which at each step are sampled independently. In these diagrams, the two processes are coupled. If "Even Ys" were a conventional HMM, SE would sometimes emit X and transition to SO, which would result in some even and some odd runs of Y. Are these a special variant of HMM, or some other type of state machine? And would these results apply to conventional HMMs with separate transition and emission probabilities?
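The distinction raised in this question can be checked with a small simulation. This is only a sketch under assumptions: the machine structure (SE emits X and stays, or emits Y and moves to SO; SO emits Y and returns to SE) is a guess at the "Even Ys" diagram, and the probabilities `p_x` and `p_stay` are made up for illustration.

```python
import random
from itertools import groupby

def sample_coupled(n, p_x=0.5, seed=0):
    """Coupled machine: the emitted symbol and the next state are drawn together."""
    rng = random.Random(seed)
    state, out = "SE", []
    for _ in range(n):
        if state == "SE":
            if rng.random() < p_x:
                out.append("X")          # emit X, stay in SE
            else:
                out.append("Y")          # emit Y, move to SO
                state = "SO"
        else:
            out.append("Y")              # SO always emits Y and returns to SE
            state = "SE"
    return out

def sample_independent(n, p_x=0.5, p_stay=0.5, seed=0):
    """Conventional HMM: emission and transition are drawn independently."""
    rng = random.Random(seed)
    state, out = "SE", []
    for _ in range(n):
        if state == "SE":
            out.append("X" if rng.random() < p_x else "Y")
            state = "SE" if rng.random() < p_stay else "SO"
        else:
            out.append("Y")
            state = "SE"
    return out

def completed_y_run_lengths(seq):
    """Lengths of runs of Y, ignoring a final run cut off by the end of the sample."""
    runs = [(sym, len(list(group))) for sym, group in groupby(seq)]
    if runs and runs[-1][0] == "Y":
        runs.pop()
    return [length for sym, length in runs if sym == "Y"]

coupled_runs = completed_y_run_lengths(sample_coupled(10000))
independent_runs = completed_y_run_lengths(sample_independent(10000))
```

Under these assumptions, every entry of `coupled_runs` is even, while `independent_runs` contains odd lengths as well, which matches the behavior the question predicts for a conventional HMM.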
andrew-mack on Mechanistically Eliciting Latent Behaviors in Language Models
Yes, I train them one at a time, constraining each new vector to be orthogonal to the older ones (this was not clear in the post, so thanks for asking!).
I haven't experimented with this, but you could also imagine using only "soft" orthogonality constraints (e.g., penalizing pairwise cosine similarities between vectors).
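To make the two options concrete, here is a rough sketch of a hard constraint (projecting out earlier vectors) next to a soft one (a penalty on pairwise cosine similarity). Both functions are hypothetical illustrations, not the post's training code; in particular, squaring the cosine similarities in the penalty is one possible choice among several.

```python
import numpy as np

def project_out(v, previous):
    """Hard constraint: remove from v its components along each earlier vector.

    Assumes the vectors in `previous` are already mutually orthogonal, as they
    would be if each one was itself constrained this way when it was trained.
    """
    v = np.asarray(v, dtype=float).copy()
    for u in previous:
        u = np.asarray(u, dtype=float)
        v -= (v @ u) / (u @ u) * u
    return v

def soft_orthogonality_penalty(vectors):
    """Soft constraint: mean squared cosine similarity over all pairs of vectors.

    This would be added to the training loss with some weight, so vectors can
    overlap a little if that helps the main objective.
    """
    V = np.asarray(vectors, dtype=float)
    unit = V / np.linalg.norm(V, axis=1, keepdims=True)
    sims = unit @ unit.T
    upper = np.triu_indices(len(V), k=1)   # each pair once, no diagonal
    return float(np.mean(sims[upper] ** 2))
```

The penalty is zero for mutually orthogonal vectors and one when all vectors are parallel, so its weight sets how strongly diversity is traded off against the main objective.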
andrew-mack on Mechanistically Eliciting Latent Behaviors in Language Models
Thanks for your comment! Yes, I’d say that roughly sums things up.
As I alluded to in the introduction, I think the advantage of having a latent perturbation (as opposed to just eliciting an interesting sequence of tokens) is due to i) generalization and ii) applications to mechanistic anomaly detection. Generalization is important in situations where the backdoored behavior is subtle. In this case, even if you had a non-latent way of eliciting the behavior (e.g. through adversarial inputs) you don't know whether you've simply elicited a weird sequence of tokens or if there’s something systematic about it. But if you have a latent perturbation then you can run the perturbed model over many inputs to confirm that this is a generalizable behavior (e.g., systematic but subtle coding errors meant to sabotage you) rather than a weird sequence of tokens.
Moreover, this might chain well with mechanistic anomaly detection. Quoting the post: “before deployment, train a large set of unsupervised model perturbations, then during deployment, if the model's activations look similar to one of the previously learned perturbations, auditors can go back and evaluate the behaviors encoded by that perturbation more thoroughly to see if there is anything problematic with these behaviors.”
I agree that it would be helpful to spell out/test the advantages of latent perturbations more concretely in followup work (for example, trying to show that mechanistic anomaly detection actually works using the learned steering vectors).
unnamed on D&D.Sci Long War: Defender of Data-mocracy
Just got to this today. I've come up with a candidate solution just to try to survive, but haven't had a chance yet to check & confirm that it'll work, or to try to get clever and reduce the number of soldiers I'm using.
10 Soldiers armed with: 3 AA, 3 GG, 1 LL, 2 MM, 1 RR
I will probably work on this some more tomorrow.
the-gears-to-ascension on Introducing AI Lab Watch
corp, org
raemon on adamzerner's Shortform
I maybe want to clarify: there will still be presentations at LessOnline, we're just trying to design the event such that they're clearly more of a secondary thing.