LessWrong 2.0 Reader
I know several CEOs of small AGI startups who seem to have gone crazy; they told me that they are self-inserts into this world, which is a simulation of their original self's creation.
Do you know if the origin of this idea for them was a psychedelic or dissociative trip? I'd give it at least even odds, with most of the remaining chances being meditation or Eastern religions...
This is an interesting question!
I just checked this. The cosine similarity of δ9 and δ22 is .52, which is much more similar than you'd expect from random vectors of the same dimensionality (this is computing the δ's across all tokens and then flattening, which is how the objective was computed for the main refusal experiment in the post).
If you restrict to calculating δ's at just the assistant tag at the end of the prompt, the cosine similarity between δ9 and δ22 goes up to .87.
Interestingly, the cosine similarities of the δ's seem to be somewhat high across all pairs of steering vectors (a mean of .25 across pairs, which is higher than for random vectors, whose similarities would be close to zero). This suggests it might be better to use some sort of soft orthogonality constraint over the δ's (by penalizing pairwise cosine similarities) rather than a hard orthogonality constraint over the steering vectors, if you want better diversity across vectors. I'll have to try this at some point.
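As a hedged sketch of the measurement described above (the function name and tensor shapes are my own invention, not from the post), computing pairwise cosine similarities between flattened δ tensors looks something like:

```python
import numpy as np

def pairwise_cos_sims(deltas):
    """Cosine similarity between each pair of flattened δ tensors."""
    flat = [d.ravel() for d in deltas]
    sims = []
    for i in range(len(flat)):
        for j in range(i + 1, len(flat)):
            a, b = flat[i], flat[j]
            sims.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return np.array(sims)

# Random tensors of the same shape have pairwise similarities concentrated
# near zero, which is why a mean of .25 across real steering-vector δ's
# is a notable signal rather than noise.
rng = np.random.default_rng(0)
deltas = [rng.standard_normal((16, 4096)) for _ in range(6)]  # hypothetical (tokens, hidden) shape
sims = pairwise_cos_sims(deltas)
print(float(sims.mean()))  # close to zero for random tensors
```

A soft-orthogonality variant would then add the mean of these pairwise similarities (or their squares) as a penalty term in the training objective.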
daniel-munro on Transformers Represent Belief State Geometry in their Residual Stream
Fascinating. But are these diagrams really showing HMMs? I thought each state in an HMM had a set of transition probabilities and another set of emission probabilities, which at each step are sampled independently. In these diagrams, the two processes are coupled. If "Even Ys" were a conventional HMM, SE would sometimes emit X and transition to SO, which would result in some even and some odd runs of Y. Are these a special variant of HMM, or some other type of state machine? And would these results apply to conventional HMMs with separate transition and emission probabilities?
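For contrast, here is what one step of a conventional HMM looks like, with the emission and the transition sampled independently given the current state (the probabilities here are made up for illustration, reusing the commenter's SE/SO state names):

```python
import random

# A conventional HMM: each state has its own transition distribution and
# its own emission distribution, and at each step the two are sampled
# independently. Probabilities are invented for this sketch.
transition = {"SE": {"SE": 0.5, "SO": 0.5}, "SO": {"SE": 0.5, "SO": 0.5}}
emission   = {"SE": {"X": 0.5, "Y": 0.5}, "SO": {"X": 0.5, "Y": 0.5}}

def hmm_step(state, rng):
    symbol = rng.choices(list(emission[state]), weights=list(emission[state].values()))[0]
    next_state = rng.choices(list(transition[state]), weights=list(transition[state].values()))[0]
    return symbol, next_state  # independent draws: SE can emit X yet move to SO

rng = random.Random(0)
state = "SE"
out = []
for _ in range(10):
    sym, state = hmm_step(state, rng)
    out.append(sym)
print("".join(out))
```

Under independent sampling like this, SE can emit X and still transition to SO, which is exactly the behavior the commenter notes would break the "Even Ys" pattern; the diagrams in the post instead couple the emission to the transition taken.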
andrew-mack on Mechanistically Eliciting Latent Behaviors in Language Models
Yes, I train them one at a time, constraining each new vector to be orthogonal to the older ones (this was not clear in the post, so thanks for asking!).
I haven't experimented with this, but you could also imagine using only "soft" orthogonality constraints (e.g., penalizing pairwise cosine similarities between vectors).
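A minimal sketch of the two options (the function names and the projection-based implementation of the hard constraint are my own; the post doesn't specify how the constraint is enforced):

```python
import numpy as np

def project_out(v, basis):
    """Hard constraint: subtract v's components along previously learned vectors.

    This yields exact orthogonality when the basis is itself mutually
    orthogonal, which holds if each vector was built this way in turn."""
    for b in basis:
        v = v - (v @ b) / (b @ b) * b
    return v

def soft_orthogonality_penalty(v, basis):
    """Soft alternative: penalize squared cosine similarity to earlier vectors."""
    return sum(((v @ b) / (np.linalg.norm(v) * np.linalg.norm(b))) ** 2 for b in basis)

# Build three "learned" vectors one at a time, each orthogonal to the older ones,
# then constrain a new vector against all of them.
rng = np.random.default_rng(0)
learned = []
for _ in range(3):
    learned.append(project_out(rng.standard_normal(4096), learned))
new = project_out(rng.standard_normal(4096), learned)
print(max(abs(new @ b) for b in learned))  # numerically ~0
```

In actual training, `project_out` would be applied to the steering vector after each optimizer step, while the soft penalty would instead be added to the loss, trading strict orthogonality for more flexibility.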
andrew-mack on Mechanistically Eliciting Latent Behaviors in Language Models
Thanks for your comment! Yes, I’d say that roughly sums things up.
As I alluded to in the introduction, I think the advantage of having a latent perturbation (as opposed to just eliciting an interesting sequence of tokens) is due to i) generalization and ii) applications to mechanistic anomaly detection. Generalization is important in situations where the backdoored behavior is subtle. In this case, even if you had a non-latent way of eliciting the behavior (e.g. through adversarial inputs) you don't know whether you've simply elicited a weird sequence of tokens or if there’s something systematic about it. But if you have a latent perturbation then you can run the perturbed model over many inputs to confirm that this is a generalizable behavior (e.g., systematic but subtle coding errors meant to sabotage you) rather than a weird sequence of tokens.
Moreover, this might chain well with mechanistic anomaly detection. Quoting the post [AF(p) · GW(p)]: “before deployment, train a large set of unsupervised model perturbations, then during deployment, if the model's activations look similar to one of the previously learned perturbations, auditors can go back and evaluate the behaviors encoded by that perturbation more thoroughly to see if there is anything problematic with these behaviors.”
I agree that it would be helpful to spell out/test the advantages of latent perturbations more concretely in followup work (for example, trying to show that mechanistic anomaly detection actually works using the learned steering vectors).
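The quoted deployment-time procedure could be sketched as follows (a toy illustration under assumptions of my own: perturbations summarized as flattened vectors, similarity measured by cosine, and an invented threshold):

```python
import numpy as np

def flag_anomalies(activation_diff, perturbation_library, threshold=0.5):
    """Flag deployment-time activations that resemble a known perturbation.

    activation_diff: flattened activations relative to some baseline.
    perturbation_library: previously learned unsupervised perturbations.
    Returns indices of perturbations whose cosine similarity exceeds threshold,
    so auditors know which learned behavior to re-examine.
    """
    a = activation_diff / np.linalg.norm(activation_diff)
    flags = []
    for i, p in enumerate(perturbation_library):
        if a @ (p / np.linalg.norm(p)) > threshold:
            flags.append(i)
    return flags

rng = np.random.default_rng(0)
library = [rng.standard_normal(4096) for _ in range(8)]
# Deployment activations that partly align with learned perturbation #3:
obs = 0.8 * library[3] + 0.6 * rng.standard_normal(4096)
print(flag_anomalies(obs, library))
```

Here the observed activations get flagged as similar to perturbation #3, so an auditor would go back and evaluate the behaviors that perturbation encodes.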
unnamed on D&D.Sci Long War: Defender of Data-mocracy
Just got to this today. I've come up with a candidate solution just to try to survive, but haven't had a chance yet to check & confirm that it'll work, or to try to get clever and reduce the number of soldiers I'm using.
10 Soldiers armed with: 3 AA, 3 GG, 1 LL, 2 MM, 1 RR
I will probably work on this some more tomorrow.
the-gears-to-ascension on Introducing AI Lab Watch
corp, org
raemon on adamzerner's Shortform
I maybe want to clarify: there will still be presentations at LessOnline, we're just trying to design the event such that they're clearly more of a secondary thing.
david-james on MIRI announces new "Death With Dignity" strategy
If we know a meteor is about to hit Earth and have only D days to prepare, what is the rational thing for person P to do?
There are values of P and D such that throwing a big party before the Earth explodes might be the best choice.
adamzerner on adamzerner's Shortform
Virtual watercoolers
As I mentioned in some recent Shortform posts, I recently listened to the Bayesian Conspiracy podcast's episode on the LessOnline festival and it got me thinking.
One thing I think is cool is that Ben Pace was saying how the valuable thing about these festivals isn't the presentations, it's the time spent mingling in between the presentations, and so they decided with LessOnline to just ditch the presentations and make it all about mingling. Which got me thinking about mingling.
It seems plausible to me that such mingling can and should happen more online. And I wonder whether an important thing about mingling in the physical world is that, how do I say this, you're just in the same physical space, next to each other, with nothing else you're supposed to be doing, and in fact what you're supposed to be doing is talking to one another.
Well, I guess you don't have to be talking to one another. It's also fine if you just want to hang out and sip on a drink or something. It's similar to the office water cooler: it's fine if you're just hanging out drinking some water, but it's also normal to chit-chat with your coworkers.
I wonder whether it'd be good to design a virtual watercooler: a digital place that mimics aspects of the situations I've been describing (festivals, office watercoolers).
In particular, this feels to me like something that might be worth exploring for LessWrong.
Note: I know that there are various Slack and Discord groups but they don't meet conditions (1) or (2).