LessWrong 2.0 Reader
This seems really cool! Filled out an application, though I realized after sending I should probably have included on there that I would need some financial support to be able to attend (both for the ticket itself and for the transportation required to get there). How much of a problem is that likely to be?
andy-arditi on Refusal in LLMs is mediated by a single direction
Thanks!
We haven't tried comparing to LEACE yet. You're right that it should theoretically be more surgical. Although, from our preliminary analysis, our naive intervention already seems pretty surgical (it has minimal impact on CE loss and MMLU). (I also like that our methodology is dead simple and doesn't require estimating a covariance matrix.)
I agree that "orthogonalization" is a bit overloaded. I'm not sure I like LoRACS though - when I see "LoRA", I immediately think of fine-tuning that requires optimization power (which this method doesn't). I do think that "orthogonalizing the weight matrices with respect to the direction ^r" is the clearest way of describing this method.
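For concreteness, the weight orthogonalization described above can be sketched as follows (a minimal sketch, assuming a weight matrix `W` whose columns write into the residual stream; the helper name `orthogonalize` is illustrative, not from the post):

```python
import numpy as np

def orthogonalize(W, r):
    """Remove the component along direction r from each column of W,
    so the layer can no longer write along r into the residual stream.
    W: (d_model, d_in) weight matrix; r: (d_model,) direction."""
    r_hat = r / np.linalg.norm(r)          # unit vector ^r
    return W - np.outer(r_hat, r_hat) @ W  # W' = W - ^r ^r^T W
```

Note that, unlike LoRA-style fine-tuning, this is a closed-form edit of the weights: no gradients or optimization are involved.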
andy-arditi on Refusal in LLMs is mediated by a single direction
The most finicky part of our methodology (and the part I'm least satisfied with currently) is the selection of a direction.
For reproducibility of our Llama 3 results, I can share the positions and layers where we extracted the directions from:
The position indexing assumes the usage of this prompt template, with two new lines appended to the end.
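As a sketch of how a candidate direction at a given layer and position might be computed (assuming a difference-of-means approach over contrastive prompt sets; the function name is illustrative):

```python
import numpy as np

def candidate_direction(harmful_acts, harmless_acts):
    """Difference-of-means direction at one (layer, position).
    Each input: (n_prompts, d_model) activations at that layer/position."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)  # unit-norm candidate direction
```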
oliver-sourbut on Transformers Represent Belief State Geometry in their Residual Stream
> the original 'theorem' was wordcelled nonsense
Lol! I guess if there was a more precise theorem statement in the vicinity of what was gestured at, it wasn't nonsense? But in any case, I agree the original presentation is dreadful. John's is much better.
I would be curious to hear a *precise* statement of why the result here follows from the good regulator theorem.
A quick go at it; it might have typos.
Suppose we have variables X and Y, and a predictor with state S and output ^Y, with structure
X→Y,  X→R,  Y→S→^Y→R.
Then GR trivially says S (the predictor state) should model the posterior P(X|Y).
Now if these are all instead processes (time-indexed), we have an HMM (X_t) with observations (Y_t), and a predictor process (S_t), with structure
X_t→X_{t+1},  X_t→Y_t,  S_{t−1}→S_t,  Y_t→S_t→^Y_{t+1}→R_{t+1},  Y_{t+1}→R_{t+1}.
Drawing together (X_{t+1}, Y_{t+1}, ^Y_{t+1}, R_{t+1}) as G_t, the 'goal', we have a GR motif
X_t→Y_t,  Y_t→S_t→G_t,  S_{t−1}→S_t,  X_t→G_t,
so S_t must model P(X_t|S_{t−1}, Y_t); by induction, that is P(X_t|S_0, Y_1, ..., Y_t).
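The induction step can be written out as follows (a sketch, assuming S_{t−1} is a sufficient summary of (S_0, Y_1, ..., Y_{t−1})):

```latex
% GR motif at step t gives:
S_t \text{ models } P(X_t \mid S_{t-1}, Y_t)
% If S_{t-1} is a sufficient statistic of (S_0, Y_1, \dots, Y_{t-1}), then
P(X_t \mid S_{t-1}, Y_t) = P(X_t \mid S_0, Y_1, \dots, Y_t)
```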
andy-arditi on Refusal in LLMs is mediated by a single direction
For this model, we found that activations at the last token position (assuming this prompt template, with two new lines appended to the end) at layer 12 worked well.
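A minimal sketch of capturing an activation at a chosen layer with a PyTorch forward hook (using a toy stack of linear layers as a stand-in for the real decoder blocks; with an actual model you would hook the corresponding block and slice out the last token position):

```python
import torch
import torch.nn as nn

# Toy stand-in for a transformer's residual stream: 24 "layers" of width 16.
model = nn.Sequential(*[nn.Linear(16, 16) for _ in range(24)])

captured = {}

def save_activation(module, inputs, output):
    # Store this layer's output (standing in for the residual activation).
    captured["resid"] = output.detach()

layer = 12
handle = model[layer].register_forward_hook(save_activation)
model(torch.randn(1, 16))  # one "token position" worth of activations
handle.remove()
```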
lc on Shortform
Either would just change everything, so for any prediction ten years out you basically have to prepend "if AI or gene editing doesn't change everything".
cousin_it on Examples of Highly Counterfactual Discoveries?
I think Diffractor's post [AF · GW] shows that logical induction does hit a certain barrier, which isn't quite diagonalization, but seems to me about as troublesome:
> As the trader goes through all sentences, its best-case value will be unbounded, as it buys up larger and larger piles of sentences with lower and lower prices. This behavior is forbidden by the logical induction criterion... This doesn't seem like much, but it gets extremely weird when you consider that the limit of a logical inductor, P_inf, is a constant distribution, and by this result, isn't a logical inductor! If you skip to the end and use the final, perfected probabilities of the limit, there's a trader that could rack up unboundedly high value!

cousin_it on Accidental Electronic Instrument
For similar instruments, you've seen the array mbira, right?
For marking the ends of notes, clearly the solution is muting with fingers, but idk how this translates to electronics.
habryka4 on On Not Pulling The Ladder Up Behind You
Promoted to curated: I liked this post. It's not world-shattering, but it feels like a useful reference for a dynamic that I encounter a good amount, and it does a good job at all the basics. The kind of post that on the margin I would like to see a bunch more of (I wouldn't want it to be the only thing on LessWrong, but it feels like the kind of thing LW used to excel at, and now is only dabbling in, and that seems quite sad).
mir on Introducing AI-Powered Audiobooks of Rational Fiction Classics
I gave it a try two years ago, and I really liked the logic lectures early on (basically a narrativization of HAE101 (for beginners) [? · GW]), but gave up soon after. Here are some other parts I learned valuable stuff from:
Do you have recommendations for other sections you found especially insightful or with high potential to improve effectiveness? No need to explain, but a link is appreciated so I can take a look without reading the whole thing.