LessWrong 2.0 Reader
I was thinking that finger muting wouldn't be possible, because the sensors are physically damped and there's no vibration left for your fingers to stop. Except now that you mention it, it might still be possible! Gently placing your finger on one of them might produce a recognizable enough signal that, if the sensor is currently "vibrating" and you do that, I could treat it as a mute signal.
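(A toy sketch of how that detection might work; everything concrete here, including the state machine and both thresholds, is hypothetical, not an existing implementation:)

```python
def classify_event(amplitude: float, state: dict) -> str:
    """Toy classifier: a sharp amplitude drop while a pad is still "ringing",
    with no new strike, is treated as a finger mute. Thresholds are invented."""
    STRIKE_THRESHOLD = 0.5  # hypothetical: minimum amplitude that counts as a hit
    MUTE_DROP = 0.8         # hypothetical: fractional drop that counts as damping

    prev = state.get("amplitude", 0.0)
    if amplitude > STRIKE_THRESHOLD:
        state.update(ringing=True, amplitude=amplitude)
        return "strike"
    if state.get("ringing") and prev > 0 and amplitude < prev * (1 - MUTE_DROP):
        state.update(ringing=False, amplitude=amplitude)
        return "mute"
    state["amplitude"] = amplitude
    return "none"

# Usage: feed per-frame amplitudes through a shared state dict.
state = {}
print(classify_event(0.9, state))   # "strike"
print(classify_event(0.05, state))  # "mute"
```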
j-bostock on Biorisk is an Unhelpful Analogy for AI Risk

I agree with this point when it comes to technical discussions. I would like to add the caveat that when talking to a total amateur, the sentence:
AI is like biorisk more than it is like ordinary tech, therefore we need stricter safety regulations and limits on what people can create at all.

is the fastest way I've found to transmit information. Maybe 30% of the entire AI risk case can be delivered in the first four words.
aphyer on D&D.Sci Long War: Defender of Data-mocracy

Understood, I'll refrain from posting the solution tomorrow until I've heard from you - if you want more time, let me know and I can push that further back.
yitz on LessWrong Community Weekend 2024, open for applications

This seems really cool! Filled out an application, though I realized after sending it that I should probably have mentioned there that I would need some financial support to be able to attend (both for the ticket itself and for the transportation required to get there). How much of a problem is that likely to be?
andy-arditi on Refusal in LLMs is mediated by a single direction

Thanks!

We haven't tried comparing to LEACE yet. You're right that theoretically it should be more surgical. Although, from our preliminary analysis, it seems like our naive intervention is already pretty surgical (it has minimal impact on CE loss and MMLU). (I also like that our methodology is dead simple and doesn't require estimating covariance.)
I agree that "orthogonalization" is a bit overloaded. I'm not sure I like LoRACS though - when I see "LoRA", I immediately think of fine-tuning that requires optimization power (which this method doesn't). I do think that "orthogonalizing the weight matrices with respect to the direction r̂" is the clearest way of describing this method.
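(For concreteness, a minimal sketch of the kind of operation this phrase describes: projecting the r̂ component out of a weight matrix that writes into the residual stream. This is my illustration, not the authors' code, and the assumed shapes are my own convention:)

```python
import torch

def orthogonalize(W: torch.Tensor, r_hat: torch.Tensor) -> torch.Tensor:
    """Remove the component along r_hat from a matrix that writes into the
    residual stream, so the layer can never output anything in that direction.
    Assumes W has shape (d_model, d_in) and r_hat is unit-norm, shape (d_model,)."""
    return W - torch.outer(r_hat, r_hat) @ W
```

Note that this is a single closed-form edit to the weights, which is why no optimization power is required.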
andy-arditi on Refusal in LLMs is mediated by a single direction

The most finicky part of our methodology (and the part I'm least satisfied with currently) is the selection of a direction.
For reproducibility of our Llama 3 results, I can share the positions and layers we extracted the directions from:

The position indexing assumes the use of this prompt template, with two newlines appended to the end.
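(A rough sketch of one way such a direction can be extracted, as a difference of mean residual-stream activations between two prompt sets; the model id, layer, position, and prompt lists below are all placeholders, not the authors' actual values:)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model id
LAYER = 12   # placeholder: layer to read the residual stream from
POS = -1     # placeholder: token position (e.g. last token of the template)

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

def mean_activation(prompts):
    """Mean residual-stream activation at (LAYER, POS) over a set of prompts."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        acts.append(out.hidden_states[LAYER][0, POS])
    return torch.stack(acts).mean(dim=0)

# Placeholder prompt sets; in practice these would be formatted instructions.
harmful_prompts = ["<placeholder harmful instruction>"]
harmless_prompts = ["<placeholder harmless instruction>"]

r = mean_activation(harmful_prompts) - mean_activation(harmless_prompts)
r_hat = r / r.norm()  # candidate direction
```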
oliver-sourbut on Transformers Represent Belief State Geometry in their Residual Stream

the original 'theorem' was wordcelled nonsense
Lol! I guess if there was a more precise theorem statement in the vicinity of what was gestured at, it wasn't nonsense? But in any case, I agree the original presentation is dreadful. John's is much better.
I would be curious to hear a *precise* statement of why the result here follows from the Good Regulator Theorem.
A quick go at it, might have typos.
Suppose we have X → Y and a predictor Y → S → Ŷ, with structure:

X → Y
X → R
Y → S → Ŷ → R
Then GR trivially says S (predictor state) should model the posterior P(X|Y).
Now if these are all instead processes (time-indexed), we have an HMM and a predictor process with structure:

X_t → X_{t+1}
X_t → Y_t
S_{t−1} → S_t
Y_t → S_t → Ŷ_{t+1} → R_{t+1}
Y_{t+1} → R_{t+1}
Drawing together (X_{t+1}, Y_{t+1}, Ŷ_{t+1}, R_{t+1}) as G_t, the 'goal', we have a GR motif:

X_t → Y_t
Y_t → S_t → G_t
S_{t−1} → S_t
X_t → G_t
so S_t must model P(X_t | S_{t−1}, Y_t); by induction, that is P(X_t | S_0, Y_1, …, Y_t).
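(To spell out the induction step, in my notation rather than the comment's: under the HMM structure above, the posterior S_t has to track is just the standard Bayesian filter update,

$$P(x_t \mid s_{t-1}, y_t) \;\propto\; P(y_t \mid x_t) \sum_{x_{t-1}} P(x_t \mid x_{t-1})\, P(x_{t-1} \mid s_{t-1}),$$

and by the induction hypothesis P(x_{t−1} | s_{t−1}) = P(x_{t−1} | s_0, y_1, …, y_{t−1}), so the left-hand side equals P(x_t | s_0, y_1, …, y_t).)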
andy-arditi on Refusal in LLMs is mediated by a single direction

For this model, we found that activations at the last token position (assuming this prompt template, with two newlines appended to the end) at layer 12 worked well.
lc on Shortform

Either would just change everything, so you basically have to prepend "if AI or gene editing doesn't change everything" to any prediction ten years out.
cousin_it on Examples of Highly Counterfactual Discoveries?

I think Diffractor's post shows that logical induction does hit a certain barrier, which isn't quite diagonalization, but seems to me about as troublesome:

As the trader goes through all sentences, its best-case value will be unbounded, as it buys up larger and larger piles of sentences with lower and lower prices. This behavior is forbidden by the logical induction criterion... This doesn't seem like much, but it gets extremely weird when you consider that the limit of a logical inductor, P_∞, is a constant distribution, and by this result, isn't a logical inductor! If you skip to the end and use the final, perfected probabilities of the limit, there's a trader that could rack up unboundedly high value!