LessWrong 2.0 Reader
Thanks for the comments! I am also surprised that SAEs trained on these vision models seem to require so little data, especially as I would have thought the complexity of CLIP's representations for vision would be comparable to the complexity for text (after all, we can generate an image from a text prompt and then use a captioning model to recover the text, suggesting most or all of the information in the text is also present in the image).
With regard to the model loss, I used the text template "A photo of a {label}.", where {label} is the ImageNet text label (this was the template used in the original CLIP paper). These text prompts were used alongside the associated batch of images and passed jointly into the full CLIP model (text and vision models) using the original contrastive loss function that CLIP was trained on. I used this loss calculation (with this template) to measure both the original model loss and the model loss with the SAE inserted during the forward pass.
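For concreteness, here is a rough sketch of how that comparison could look using the HuggingFace CLIP implementation; the checkpoint name, hook layer, and the `sae` stand-in are illustrative placeholders rather than my exact code:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint and layer index; the exact ones may differ.
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
model.eval()

# Stand-in batch; in practice these are ImageNet images and their text labels.
images = [Image.new("RGB", (224, 224))]
labels = ["tabby cat"]
sae = torch.nn.Identity()  # stand-in; replace with the trained sparse autoencoder

def clip_loss(images, labels):
    """Contrastive loss for a batch of (image, label) pairs."""
    texts = [f"A photo of a {label}." for label in labels]  # CLIP paper template
    inputs = processor(text=texts, images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs, return_loss=True)  # symmetric contrastive loss
    return out.loss.item()

# Forward hook that replaces a vision-layer output with its SAE reconstruction.
def sae_patch(module, inputs, outputs):
    return (sae(outputs[0]),) + outputs[1:]

layer = model.vision_model.encoder.layers[22]
baseline_loss = clip_loss(images, labels)        # original model loss
handle = layer.register_forward_hook(sae_patch)
patched_loss = clip_loss(images, labels)         # loss with the SAE inserted
handle.remove()
```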
I also agree completely with your explanation for the reduction in loss. My tentative explanation goes something like this:
That was a very hand-wavy explanation, but I think I can formalise it with some maths if people are unconvinced by it.
I have some data to suggest this is the case even from the perspective of the SAE features. The dog SAE features have much higher label entropy (mixing many dog breeds among the highest-activating images) than features for non-dog classes, suggesting the SAE features struggle to separate the dog breeds.
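Concretely, by "label entropy" here I mean the Shannon entropy of the ImageNet label distribution over a feature's top-activating images. A minimal sketch (the per-feature label list is assumed to come from a top-activating-image lookup):

```python
import numpy as np
from collections import Counter

def label_entropy(labels):
    """Shannon entropy (in bits) of the label distribution over top-activating images."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# A feature firing on a single class scores 0; a dog feature that mixes
# n breeds roughly equally scores about log2(n).
```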
The "This should be cited" part of Dan H's comment was edited in after the author's reply. I think this is in bad faith since it masks an accusation of duplicate work as a request for work to be cited.
On the other hand, the post's authors did not act in bad faith, since they were responding to an accusation of duplicate work (not to a request to improve the work).
(The authors made me aware of this fact)
Thanks for the feedback! Yeah, I was also surprised SAEs seem to work on ViTs pretty much straight out of the box (I didn't even need to play around with the hyperparameters too much)! As I mentioned in the post, I think it would be really interesting to train on a much larger (more typical) dataset, similar to the dataset the CLIP model was trained on.
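(For readers unfamiliar with the setup: the SAEs here are standard single-hidden-layer sparse autoencoders trained on the ViT's residual-stream activations. A minimal sketch, with the widths and the L1 coefficient as illustrative placeholders rather than the exact configuration:)

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Single-hidden-layer SAE; widths are illustrative, not the exact config."""
    def __init__(self, d_model=1024, expansion=64):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_model * expansion)
        self.decoder = nn.Linear(d_model * expansion, d_model)

    def encode(self, x):
        return torch.relu(self.encoder(x))  # sparse feature activations

    def forward(self, x):
        return self.decoder(self.encode(x))  # reconstruction of the activation

# Training objective: reconstruction error plus an L1 sparsity penalty.
def sae_loss(sae, x, l1_coeff=1e-3):
    features = sae.encode(x)
    recon = sae.decoder(features)
    return ((recon - x) ** 2).mean() + l1_coeff * features.abs().mean()
```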
I also agree that I probably should have emphasised the "guess the image" game as a result rather than an aside; I'll bear that in mind for future posts!
"Achievable goal" or "plausible outcome", maybe?
davekasten on tlevin's Shortform
These are plausible concerns, but I don't think they match what I see as a longtime DC person.
We know that the US legislative branch is less productive than it has been in any modern period, and fewer bills get passed (there are many different metrics for this, but one is https://www.reuters.com/graphics/USA-CONGRESS/PRODUCTIVITY/egpbabmkwvq/). Those bills that do get passed tend to be bigger swings as a result -- either a) transformative legislation (e.g., Obamacare, the Trump tax cuts and COVID super-relief, the Biden Inflation Reduction Act and CHIPS) or b) big omnibus "must-pass" bills like FAA reauthorization, into which many small proposals get added.
I also disagree with the claim that policymakers focus on credibility and consensus generally, except perhaps in the executive branch to some degree. (You want many executive actions to be noncontroversial "faithfully executing the laws" stuff, but I don't see that as "policymaking" in the sense you describe it.)
In either of those, it seems like the current legislative "meta" favors bigger policy asks, not small wins, and I'm having trouble thinking of anyone I know who's impactful in DC who has adopted the opposite strategy. What are examples of the small wins that you're thinking of as the current meta?
Don't really think this makes sense as a tag page. Too subjective.
ablue on LLMs could be as conscious as human emulations, potentially
Should it make a difference? Same iterative computation.
Not necessarily: a lot of information is discarded when you're only looking at the paper/verbal output. As an extreme example, if the emulated brain had been instructed (or had the memory of being instructed) to say the number of characters written on the paper and nothing else, the computational properties of the system as a whole would be much simpler than those of the emulation.
I might be missing the point. I agree with you that an architecture that predicts tokens isn't necessarily non-conscious. I just don't think the fact that a system predicts tokens generated by a conscious process is reason to suspect that the system itself is conscious without some other argument.
cheer-poasting on Referential Containment
I think I don't have the right background to understand this fully. However, it makes a little more sense than when I originally read it.
An analogue to what you're talking about (referential containment) with the medical knowledge would be something like PCA (principal component analysis) in genomics, right? Just at a much higher, autonomous level.
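(Something like this, in sklearn terms; purely illustrative, with made-up array shapes:)

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative: rows are individuals, columns are SNP genotype values.
genotypes = np.random.rand(500, 10_000)
pca = PCA(n_components=10)
coords = pca.fit_transform(genotypes)  # 10,000 dims compressed to 10 summary axes
```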
zach-stein-perlman on Introducing AI Lab Watch
Some overall scores are one point higher. Probably because my site rounds down. Probably my site should round to the nearest integer...
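Illustratively, in Python (not the site's actual code):

```python
score = 2.8
print(int(score))    # 2: truncates, i.e. rounds down for positive scores
print(round(score))  # 3: rounds to the nearest integer
```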
watermark on LLMs for Alignment Research: a safety priority?
I think it should be a safety priority.
Currently, I'm attempting to make a modularized snapshot of end-to-end research related to alignment (covering code, math, a number of related subjects, diagrams, and Q&A) to create custom data, intended to be useful to future me (and other alignment researchers). It'd be nice if more alignment researchers did this, and iterated on how to do it better.
For example, it'd be useful if the 'custom-data version of you' broke the fourth wall often and was very willing to assist and over-explain things.
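To make that concrete, a single record of such custom data might look something like this; the schema is just a guess at a useful shape, not an established format:

```python
# Hypothetical schema for one research-snapshot record; every field is a guess.
snapshot_entry = {
    "topic": "choosing a training objective for a toy experiment",
    "artifact_type": "code",  # one of: code, math, diagram, qa
    "content": "the artifact itself (source, derivation, or diagram description)",
    "fourth_wall": "notes to the reader on intent, taste, and dead ends",
    "qa": [
        {"q": "a question a curious reader would ask", "a": "the lucid response"},
    ],
}
```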
I'm considering going on Lecture-Walks with friends and my voice recorder to dump/explain world-model content, so I can capture the authentic [curious questions <-> lucid responses] process.
Another thing: it's not that costly to do. Writing about what you're researching is already normal, and making an additional effort to be more explicit/lucid and to capture your research tastes (and their evolution) seems helpful.