LessWrong 2.0 Reader
are the talks recorded?
tailcalled on Mechanistically Eliciting Latent Behaviors in Language Models
I think one characteristic of steering vectors constructed this way is that they are allowed to be off-manifold, so they don't necessarily tell us how the networks currently work, but rather how they can be made to work with adaptations.
For the past few weeks, I've been thinking about how to interpret networks on-manifold. The most straightforward approach I could come up with was to restrict oneself to the space of activations that actually occur for a given prompt, e.g. by performing SVD of the token × activation matrix for a given layer, and then restricting oneself to changes along the right singular vectors of this.
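A minimal numpy sketch of that idea (the activation matrix, its shape, and the cutoff k are made up for illustration; in practice acts would come from a forward pass on the prompt):

```python
import numpy as np

# Toy stand-in for the (token x activation) matrix of one layer on one prompt;
# in practice this would come from a forward pass of the model.
rng = np.random.default_rng(0)
acts = rng.normal(size=(32, 512))      # 32 tokens, 512-dimensional activations

# Rows of vt are the right singular vectors, i.e. directions in activation space
# that this prompt actually uses.
u, s, vt = np.linalg.svd(acts, full_matrices=False)

# Keep the top-k directions and project a candidate steering vector onto their
# span, so that any edit stays within directions already present on this prompt.
k = 8
basis = vt[:k]                          # (k, 512)
steering = rng.normal(size=512)
steering_on_manifold = basis.T @ (basis @ steering)
```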
My SVD idea might improve things, but I didn't get around to testing it, because I eventually decided it wasn't good enough for my purposes: 1) it wouldn't keep you on-manifold enough, since it could still introduce unnatural placement of information and exaggerated features, and 2) given that transformers are pretty flexible and you can e.g. swap around layers, it felt unclean to have a method that depends this strongly on the layer structure.
A follow-up idea I've been thinking about, but haven't been satisfied with, is projections. If you pick some vector u, project the activations (or weights, in my theory, but you work more with activations, so let's consider activations; it should be similar) onto u, and then subtract off that projection from the original activations, you get "the activations with u removed". Intuitively this seems like it would focus better on "what the network actually does" as opposed to "what the network could do if you added something more to it".
Unfortunately, after thinking for a while, I started thinking this actually wouldn't work. Let's say the activations are a = xb + yc, where b is a large activation vector that ultimately has no effect on the final prediction, and c is a small activation vector that does have an effect. If you have some vector d that the network doesn't use at all, you could then project away sqrt(1/2)(b − d), which would introduce the d vector into the activations.
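A quick numpy check of this failure mode (the specific orthonormal vectors and coefficients are arbitrary, chosen only to make the effect visible):

```python
import numpy as np

# Orthonormal toy directions: b is a large activation the network ignores for the
# final prediction, c is a small activation that matters, d is never used at all.
b, c, d = np.eye(3)
x, y = 4.0, 0.5
a = x * b + y * c                       # the original activations

# "Project away" the unit vector sqrt(1/2) * (b - d).
u = (b - d) / np.sqrt(2)
a_projected = a - (a @ u) * u

print(a_projected)                      # [2.  0.5 2. ] -> a d-component of x/2 appeared
```

After the projection, the activations have picked up a d-component of size x/2, even though the network never used d at all.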
Another idea I've thought about: suppose you do SVD of the activations. You could multiply the feature half of the SVD with the weights used to compute the KQV matrices, and then perform SVD of that, which should give you the independent ways that one layer affects the next. One thing I particularly wonder about: if you start doing this from the output layer and proceed backwards, it seems like it would have the effect of "sparsifying" the network down to only the dimensions that matter for the final output, which seems like it should assist interpretability and such. But it's not clear it interacts nicely with the residual connections.
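A rough numpy sketch of the chained-SVD idea (the shapes and the random stand-in weights are assumptions; a real version would use the model's actual K/Q/V projection weights, and the residual connections are ignored here):

```python
import numpy as np

rng = np.random.default_rng(0)
acts = rng.normal(size=(32, 512))        # token x activation matrix at some layer
w_qkv = rng.normal(size=(512, 3 * 64))   # stand-in for the next layer's K/Q/V weights

# First SVD: vt (the "feature half") spans the activation directions this prompt uses.
_, s, vt = np.linalg.svd(acts, full_matrices=False)

# Push those directions through the KQV weights and decompose again; this describes
# the independent ways this layer's activations can affect the next layer.
_, s2, vt2 = np.linalg.svd(vt @ w_qkv, full_matrices=False)

# Directions with near-zero singular values in s2 are ones the next layer ignores;
# dropping them, and repeating this from the output layer backwards, is the
# "sparsifying" idea described above.
print(np.round(s2[:5], 2), np.round(s2[-5:], 2))
```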
igshipilov on Design 2
I love the "capsKeys" script for AutoHotkey for Windows, which somewhat imitates vim usage. It replaces the Caps Lock key's functionality, so Caps Lock starts acting the same way as Ctrl, Shift, and Alt; in effect, we get a fourth auxiliary key.
Just have a look at the GitHub link at the beginning of this text; it speaks for itself. Personally, I've been using it for about 10 months and cannot imagine my PC experience without this tool.
I want to know.
tailcalled on Mechanistically Eliciting Latent Behaviors in Language Models
How important is it to use full-blown gradient descent to train them? Could one instead take the first singular vector of the Jacobian between the neural network layers, and get something that works similarly well?
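As a sketch of what that alternative might look like (the small MLP here is a stand-in for the composition of blocks between the two layers, and the sizes are made up; torch.autograd.functional.jacobian is used for the Jacobian):

```python
import torch

torch.manual_seed(0)

# Stand-in for the map from one layer's activations to a later layer's activations;
# in a real model this would be the composition of the intervening blocks.
layer_map = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.GELU(), torch.nn.Linear(512, 512)
)
x = torch.randn(512)                     # activations at the earlier layer

# Jacobian of the later activations with respect to the earlier ones.
jac = torch.autograd.functional.jacobian(layer_map, x)     # (512, 512)

# The top right singular vector is the input direction the later layer is most
# sensitive to; the question is whether using it as a steering vector works
# about as well as one trained by gradient descent.
_, _, vh = torch.linalg.svd(jac)
steering_candidate = vh[0]
```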
tailcalled on Mechanistically Eliciting Latent Behaviors in Language Models
For the purposes of my prediction market, I'm thinking this probably counts as part of the Shard Theory program? Since Alex Turner was involved as a mentor, and activation vectors are kind of shard-theorist, and so on?
Possibly this is because I'm biased in favor of unsupervised learning, but my interest is piqued by this finding.
lblack on Transcoders enable fine-grained interpretable circuit analysis for language models
Nice! We were originally planning to train sparse MLPs like this this week.
Do you have any plans to do something similar for attention layers? Replacing them with wider attention layers with a sparsity penalty, on the hypothesis that they'd then become more monosemantic?
Also, do you have any plans to train sparse MLPs at multiple layers in parallel, and try to penalise them to have sparsely activating connections between each other, in addition to having sparse activations?
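For concreteness, here is a rough sketch of the kind of sparse replacement MLP being discussed, trained to imitate the original MLP's input-to-output map with an L1 penalty on its hidden activations (the widths, penalty coefficient, and training loop are illustrative assumptions, not details from the post):

```python
import torch

# Illustrative widths and penalty weight; the real setup would use activations
# collected from the model, with mlp_inputs/mlp_outputs taken at a chosen layer.
d_model, d_hidden = 512, 8 * 512
encoder = torch.nn.Linear(d_model, d_hidden)
decoder = torch.nn.Linear(d_hidden, d_model)
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-4)

def training_step(mlp_inputs, mlp_outputs, l1_coeff=1e-3):
    """One step: reconstruct the original MLP's outputs from its inputs,
    while penalising the hidden activations towards sparsity."""
    hidden = torch.relu(encoder(mlp_inputs))
    recon = decoder(hidden)
    loss = ((recon - mlp_outputs) ** 2).mean() + l1_coeff * hidden.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Example call with random stand-in activations (a batch of 64 positions).
loss = training_step(torch.randn(64, d_model), torch.randn(64, d_model))
```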
This closely relates to the internalist/description theory of meaning in philosophy. The theory says that if we refer to something, we do so via a mental representation ("meanings are in the head"), which is something we can verbalize as a description. A few decades ago, some philosophers objected that we are often able to refer to things we cannot define, seemingly refuting the internalist theory in favor of an externalist theory ("meanings are not in the head"). For example, we can refer to gold even if we aren't able to define it via its atomic number.
However, the internalist/description theory only requires that there is some description that identifies gold for us, which doesn't necessarily mean we can directly define what gold is. For example, "the yellow metal that was highly valued throughout history and which chemists call 'gold' in English" would suffice to identify gold with a description. Another example: you don't know at all what's in the box in front of you, but you can refer to its contents with "the contents of the box I see in front of me". Referring to things only requires that we can describe them at least indirectly.
arturo-macias on No Room for Political Philosophy
It is still something that a conscious being superimposes over reality. It is not "there". This is as true for our moral constructions as for our mathematical ones.
https://www.amazon.com/Morality-Mathematics-Justin-Clarke-Doane/dp/0198823665
cubefox on TurnTrout's shortform feed
For illustration, what would be an example of having different shards for "I get food" (F) and "I see my parents again" (P), compared to having one utility distribution over F∧P, F∧¬P, ¬F∧P, ¬F∧¬P?