LessWrong 2.0 Reader
This seems really cool! Filled out an application, though I realized after sending I should probably have included on there that I would need some financial support to be able to attend (both for the ticket itself and for the transportation required to get there). How much of a problem is that likely to be?
andy-arditi on Refusal in LLMs is mediated by a single direction
Thanks!
We haven't tried comparing to LEACE yet. You're right that it should theoretically be more surgical. Although, from our preliminary analysis, our naive intervention already seems pretty surgical (it has minimal impact on CE loss and MMLU). (I also like that our methodology is dead simple and doesn't require estimating a covariance matrix.)
I agree that "orthogonalization" is a bit overloaded. I'm not sure I like LoRACS though - when I see "LoRA", I immediately think of fine-tuning that requires optimization power (which this method doesn't). I do think that "orthogonalizing the weight matrices with respect to the direction ^r" is the clearest way of describing this method.
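For concreteness, the weight orthogonalization described above can be sketched as follows (a minimal sketch, assuming a weight matrix `W` whose columns write into the residual stream; the helper name `orthogonalize` is illustrative, not from the post):

```python
import numpy as np

def orthogonalize(W, r):
    """Remove the component along direction r from each column of W,
    so the layer can no longer write along r into the residual stream.
    W: (d_model, d_in) weight matrix; r: (d_model,) direction."""
    r_hat = r / np.linalg.norm(r)          # unit vector ^r
    return W - np.outer(r_hat, r_hat) @ W  # W' = W - ^r ^r^T W
```

Note that, unlike LoRA-style fine-tuning, this is a closed-form edit of the weights: no gradients or optimization are involved.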
andy-arditi on Refusal in LLMs is mediated by a single direction
The most finicky part of our methodology (and the part I'm least satisfied with currently) is the selection of a direction.
For reproducibility of our Llama 3 results, I can share the positions and layers where we extracted the directions from:
The position indexing assumes the usage of this prompt template, with two new lines appended to the end.
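As a sketch of how a candidate direction at a given layer and position might be computed (assuming a difference-of-means approach over contrastive prompt sets; the function name is illustrative):

```python
import numpy as np

def candidate_direction(harmful_acts, harmless_acts):
    """Difference-of-means direction at one (layer, position).
    Each input: (n_prompts, d_model) activations at that layer/position."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)  # unit-norm candidate direction
```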
oliver-sourbut on Transformers Represent Belief State Geometry in their Residual Stream
> the original 'theorem' was wordcelled nonsense
Lol! I guess if there was a more precise theorem statement in the vicinity of what was gestured at, it wasn't nonsense? But in any case, I agree the original presentation is dreadful. John's is much better.
I would be curious to hear a *precise* statement of why the result here follows from the good regulator theorem.
A quick go at it; it might have typos.
Suppose we have variables X and Y, and a predictor with state S and output ^Y, with structure
X→Y,  X→R,  Y→S→^Y→R.
Then GR trivially says S (the predictor state) should model the posterior P(X|Y).
Now if these are all instead processes (time-indexed), we have an HMM (X_t) with observations (Y_t), and a predictor process (S_t), with structure
X_t→X_{t+1},  X_t→Y_t,  S_{t−1}→S_t,  Y_t→S_t→^Y_{t+1}→R_{t+1},  Y_{t+1}→R_{t+1}.
Drawing together (X_{t+1}, Y_{t+1}, ^Y_{t+1}, R_{t+1}) as G_t, the 'goal', we have a GR motif
X_t→Y_t,  Y_t→S_t→G_t,  S_{t−1}→S_t,  X_t→G_t,
so S_t must model P(X_t|S_{t−1}, Y_t); by induction, that is P(X_t|S_0, Y_1, ..., Y_t).
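The induction step can be written out as follows (a sketch, assuming S_{t−1} is a sufficient summary of (S_0, Y_1, ..., Y_{t−1})):

```latex
% GR motif at step t gives:
S_t \text{ models } P(X_t \mid S_{t-1}, Y_t)
% If S_{t-1} is a sufficient statistic of (S_0, Y_1, \dots, Y_{t-1}), then
P(X_t \mid S_{t-1}, Y_t) = P(X_t \mid S_0, Y_1, \dots, Y_t)
```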
andy-arditi on Refusal in LLMs is mediated by a single direction
For this model, we found that activations at the last token position (assuming this prompt template, with two new lines appended to the end) at layer 12 worked well.
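A minimal sketch of capturing an activation at a chosen layer with a PyTorch forward hook (using a toy stack of linear layers as a stand-in for the real decoder blocks; with an actual model you would hook the corresponding block and slice out the last token position):

```python
import torch
import torch.nn as nn

# Toy stand-in for a transformer's residual stream: 24 "layers" of width 16.
model = nn.Sequential(*[nn.Linear(16, 16) for _ in range(24)])

captured = {}

def save_activation(module, inputs, output):
    # Store this layer's output (standing in for the residual activation).
    captured["resid"] = output.detach()

layer = 12
handle = model[layer].register_forward_hook(save_activation)
model(torch.randn(1, 16))  # one "token position" worth of activations
handle.remove()
```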
lc on Shortform
Either would just change everything, so for any prediction ten years out you basically have to prepend "if AI or gene editing doesn't change everything".
cousin_it on Examples of Highly Counterfactual Discoveries?
I think Diffractor's post [AF · GW] shows that logical induction does hit a certain barrier, which isn't quite diagonalization, but seems to me about as troublesome:
> As the trader goes through all sentences, its best-case value will be unbounded, as it buys up larger and larger piles of sentences with lower and lower prices. This behavior is forbidden by the logical induction criterion... This doesn't seem like much, but it gets extremely weird when you consider that the limit of a logical inductor, P_inf, is a constant distribution, and by this result, isn't a logical inductor! If you skip to the end and use the final, perfected probabilities of the limit, there's a trader that could rack up unboundedly high value!

cousin_it on Accidental Electronic Instrument
For similar instruments, you've seen the array mbira, right?
For marking the ends of notes, clearly the solution is muting with fingers, but idk how this translates to electronics.
habryka4 on On Not Pulling The Ladder Up Behind You
Promoted to curated: I liked this post. It's not world-shattering, but it feels like a useful reference for a dynamic that I encounter a good amount, and it does a good job at all the basics. The kind of post that on the margin I would like to see a bunch more of (I wouldn't want it to be the only thing on LessWrong, but it feels like the kind of thing LW used to excel at, and now is only dabbling in, and that seems quite sad).
mir on Introducing AI-Powered Audiobooks of Rational Fiction Classics
I gave it a try two years ago, and I really liked the logic lectures early on (basically a narrativization of HAE101 (for beginners) [? · GW]), but gave up soon after. Here are some other parts I learned valuable stuff from:
Do you have recommendations for other sections you found especially insightful or with high potential to improve effectiveness? No need to explain, but a link is appreciated so I can take a look without reading the whole thing.