LessWrong 2.0 Reader
I was thinking that finger muting wouldn't be possible, because the sensors are physically damped and there's no vibration left for your fingers to stop. Except now that you mention it, it might still be possible! Gently placing your finger on one of them might produce a recognizable enough signal that, if the sensor is currently "vibrating" and you do that, I could treat it as a mute signal.
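(A toy sketch of how that detection might work; everything concrete here, including the state machine and both thresholds, is hypothetical, not an existing implementation:)

```python
def classify_event(amplitude: float, state: dict) -> str:
    """Toy classifier: a sharp amplitude drop while a pad is still "ringing",
    with no new strike, is treated as a finger mute. Thresholds are invented."""
    STRIKE_THRESHOLD = 0.5  # hypothetical: minimum amplitude that counts as a hit
    MUTE_DROP = 0.8         # hypothetical: fractional drop that counts as damping

    prev = state.get("amplitude", 0.0)
    if amplitude > STRIKE_THRESHOLD:
        state.update(ringing=True, amplitude=amplitude)
        return "strike"
    if state.get("ringing") and prev > 0 and amplitude < prev * (1 - MUTE_DROP):
        state.update(ringing=False, amplitude=amplitude)
        return "mute"
    state["amplitude"] = amplitude
    return "none"

# Usage: feed per-frame amplitudes through a shared state dict.
state = {}
print(classify_event(0.9, state))   # "strike"
print(classify_event(0.05, state))  # "mute"
```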
j-bostock on Biorisk is an Unhelpful Analogy for AI Risk

I agree with this point when it comes to technical discussions. I would like to add the caveat that when talking to a total amateur, the sentence:
AI is like biorisk more than it is like ordinary tech, therefore we need stricter safety regulations and limits on what people can create at all.

is the fastest way I've found to transmit information. Maybe 30% of the entire AI risk case can be delivered in the first four words.
aphyer on D&D.Sci Long War: Defender of Data-mocracy

Understood, I'll refrain from posting the solution tomorrow until I've heard from you - if you want more time, let me know and I can push that further back.
yitz on LessWrong Community Weekend 2024, open for applications

This seems really cool! Filled out an application, though I realized after sending it that I should probably have mentioned there that I would need some financial support to be able to attend (both for the ticket itself and for the transportation required to get there). How much of a problem is that likely to be?
andy-arditi on Refusal in LLMs is mediated by a single direction

Thanks!

We haven't tried comparing to LEACE yet. You're right that theoretically it should be more surgical. Although, from our preliminary analysis, it seems like our naive intervention is already pretty surgical (it has minimal impact on CE loss and MMLU). (I also like that our methodology is dead simple and doesn't require estimating covariance.)
I agree that "orthogonalization" is a bit overloaded. I'm not sure I like LoRACS though - when I see "LoRA", I immediately think of fine-tuning that requires optimization power (which this method doesn't). I do think that "orthogonalizing the weight matrices with respect to the direction r̂" is the clearest way of describing this method.
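(For concreteness, a minimal sketch of the kind of operation this phrase describes: projecting the r̂ component out of a weight matrix that writes into the residual stream. This is my illustration, not the authors' code, and the assumed shapes are my own convention:)

```python
import torch

def orthogonalize(W: torch.Tensor, r_hat: torch.Tensor) -> torch.Tensor:
    """Remove the component along r_hat from a matrix that writes into the
    residual stream, so the layer can never output anything in that direction.
    Assumes W has shape (d_model, d_in) and r_hat is unit-norm, shape (d_model,)."""
    return W - torch.outer(r_hat, r_hat) @ W
```

Note that this is a single closed-form edit to the weights, which is why no optimization power is required.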
andy-arditi on Refusal in LLMs is mediated by a single direction

The most finicky part of our methodology (and the part I'm least satisfied with currently) is the selection of a direction.
For reproducibility of our Llama 3 results, I can share the positions and layers we extracted the directions from:

The position indexing assumes the use of this prompt template, with two newlines appended to the end.
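(A rough sketch of one way such a direction can be extracted, as a difference of mean residual-stream activations between two prompt sets; the model id, layer, position, and prompt lists below are all placeholders, not the authors' actual values:)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model id
LAYER = 12   # placeholder: layer to read the residual stream from
POS = -1     # placeholder: token position (e.g. last token of the template)

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

def mean_activation(prompts):
    """Mean residual-stream activation at (LAYER, POS) over a set of prompts."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        acts.append(out.hidden_states[LAYER][0, POS])
    return torch.stack(acts).mean(dim=0)

# Placeholder prompt sets; in practice these would be formatted instructions.
harmful_prompts = ["<placeholder harmful instruction>"]
harmless_prompts = ["<placeholder harmless instruction>"]

r = mean_activation(harmful_prompts) - mean_activation(harmless_prompts)
r_hat = r / r.norm()  # candidate direction
```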
oliver-sourbut on Transformers Represent Belief State Geometry in their Residual Stream

the original 'theorem' was wordcelled nonsense
Lol! I guess if there was a more precise theorem statement in the vicinity of what was gestured at, it wasn't nonsense? But in any case, I agree the original presentation is dreadful. John's is much better.
I would be curious to hear a *precise* statement of why the result here follows from the Good Regulator Theorem.
A quick go at it, might have typos.
Suppose we have X → Y and a predictor Y → S → Ŷ, with structure:

X → Y
X → R
Y → S → Ŷ → R
Then GR trivially says S (predictor state) should model the posterior P(X|Y).
Now if these are all instead processes (time-indexed), we have an HMM and a predictor process with structure:

X_t → X_{t+1}
X_t → Y_t
S_{t−1} → S_t
Y_t → S_t → Ŷ_{t+1} → R_{t+1}
Y_{t+1} → R_{t+1}
Drawing together (X_{t+1}, Y_{t+1}, Ŷ_{t+1}, R_{t+1}) as G_t, the 'goal', we have a GR motif:

X_t → Y_t
Y_t → S_t → G_t
S_{t−1} → S_t
X_t → G_t
so S_t must model P(X_t | S_{t−1}, Y_t); by induction, that is P(X_t | S_0, Y_1, …, Y_t).
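(To spell out the induction step, in my notation rather than the comment's: under the HMM structure above, the posterior S_t has to track is just the standard Bayesian filter update,

$$P(x_t \mid s_{t-1}, y_t) \;\propto\; P(y_t \mid x_t) \sum_{x_{t-1}} P(x_t \mid x_{t-1})\, P(x_{t-1} \mid s_{t-1}),$$

and by the induction hypothesis P(x_{t−1} | s_{t−1}) = P(x_{t−1} | s_0, y_1, …, y_{t−1}), so the left-hand side equals P(x_t | s_0, y_1, …, y_t).)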
andy-arditi on Refusal in LLMs is mediated by a single direction

For this model, we found that activations at the last token position (assuming this prompt template, with two newlines appended to the end) at layer 12 worked well.
lc on Shortform

Either would just change everything, so you basically have to prepend "if AI or gene editing doesn't change everything" to any prediction ten years out.
cousin_it on Examples of Highly Counterfactual Discoveries?

I think Diffractor's post shows that logical induction does hit a certain barrier, which isn't quite diagonalization, but seems to me about as troublesome:

As the trader goes through all sentences, its best-case value will be unbounded, as it buys up larger and larger piles of sentences with lower and lower prices. This behavior is forbidden by the logical induction criterion... This doesn't seem like much, but it gets extremely weird when you consider that the limit of a logical inductor, P_∞, is a constant distribution, and by this result, isn't a logical inductor! If you skip to the end and use the final, perfected probabilities of the limit, there's a trader that could rack up unboundedly high value!