Posts

Confusing the metric for the meaning: Perhaps correlated attributes are "natural" 2024-07-23T12:43:18.681Z
Comparing Quantized Performance in Llama Models 2024-07-15T16:01:24.960Z
AISC 2024 - Project Summaries 2023-11-27T22:32:23.555Z
AISC Project: Modelling Trajectories of Language Models 2023-11-13T14:33:56.407Z
Machine Unlearning Evaluations as Interpretability Benchmarks 2023-10-23T16:33:04.878Z
Ideation and Trajectory Modelling in Language Models 2023-10-05T19:21:07.990Z
LLM Modularity: The Separability of Capabilities in Large Language Models 2023-03-26T21:57:03.445Z
LLM Basics: Embedding Spaces - Transformer Token Vectors Are Not Points in Space 2023-02-13T18:52:36.689Z
Speculation on Path-Dependance in Large Language Models. 2023-01-15T20:42:48.186Z
Searching for Modularity in Large Language Models 2022-09-08T02:25:31.711Z
What Makes an Idea Understandable? On Architecturally and Culturally Natural Ideas. 2022-08-16T02:09:39.635Z
How Do We Align an AGI Without Getting Socially Engineered? (Hint: Box It) 2022-08-10T18:14:08.786Z
Translating between Latent Spaces 2022-07-30T03:25:06.935Z
Finding Skeletons on Rashomon Ridge 2022-07-24T22:31:59.885Z

Comments

Comment by NickyP (Nicky) on I found >800 orthogonal "write code" steering vectors · 2024-07-16T11:00:48.840Z · LW · GW

I wonder how many of these orthogonal vectors are "actually orthogonal" once we consider that we are adding two vectors together, and that the model has things like LayerNorm.

If one conditions on the downstream midlayer activations being "sufficiently different", it seems possible one could find something like 10x degeneracy in the actual effects these have on the model. (A possibly relevant factor is how big the original activation vector is compared to the steering vector?)
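
For example, here is a minimal sketch of the LayerNorm point (all sizes, scales, and relative magnitudes are made-up numbers, not taken from the post):

```python
import torch

torch.manual_seed(0)
d = 4096                               # assumed hidden size
h = 10.0 * torch.randn(d)              # base residual-stream activation, assumed much larger than the steering vectors

# Two steering vectors constructed to be exactly orthogonal
v1 = torch.randn(d); v1 /= v1.norm()
v2 = torch.randn(d); v2 -= (v2 @ v1) * v1; v2 /= v2.norm()

scale = 4.0                            # assumed steering strength
ln = torch.nn.LayerNorm(d, elementwise_affine=False)

a1 = ln(h + scale * v1)
a2 = ln(h + scale * v2)
cos = torch.nn.functional.cosine_similarity(a1, a2, dim=0)
print(f"cosine similarity of post-LayerNorm activations: {cos.item():.4f}")
# When h dominates, this comes out close to 1 even though v1 is orthogonal to v2,
# i.e. "orthogonal" steering vectors need not produce very different downstream inputs.
```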

Comment by NickyP (Nicky) on Deep Forgetting & Unlearning for Safely-Scoped LLMs · 2023-12-07T13:37:24.394Z · LW · GW

I think there are already some papers doing similar work, though usually sold as reducing inference costs. For example, the MoEfication paper and Contextual Sparsity paper could probably be modified for this purpose.

Comment by NickyP (Nicky) on AISC 2024 - Project Summaries · 2023-11-29T17:52:20.705Z · LW · GW

Sorry! I have fixed this now.

Comment by NickyP (Nicky) on AI Safety Camp 2024 · 2023-11-27T22:39:57.351Z · LW · GW

In case anyone finds it difficult to go through all the projects, I have made a longer post where each project title is followed by a brief description, and a list of the main skills/roles they are looking for.

See here: https://www.lesswrong.com/posts/npkvZG67hRvBneoQ9

Comment by NickyP (Nicky) on Which LessWrongers are (aspiring) YouTubers? · 2023-10-23T17:21:55.795Z · LW · GW

Cadenza Labs has some video explainers on interpretability-related concepts: https://www.youtube.com/@CadenzaLabs

For example, an intro to Causal Scrubbing:

Comment by Nicky on [deleted post] 2023-10-04T08:29:09.869Z

Seems to work fine for me, but here are the links to Market One, Market Two and Market Three from the post. (They show the percentage of customer funds expected to be returned: 46%, 43% and 42% at the time of this comment.)

Comment by NickyP (Nicky) on Ban development of unpredictable powerful models? · 2023-06-25T12:37:32.442Z · LW · GW

Maybe I'm not fully understanding, but one issue I see is that without requiring "perfect prediction", one could potentially Goodhart on the proposal. I could imagine something like:

In training GPT-5, add a term that upweights very basic bigram statistics. In "evaluation", use your bigram statistics table to "predict" most top-k outputs just well enough to pass.

This would probably have a negative impact on performance, but it could possibly be tuned to be just sufficient to pass. Alternatively, instead of using bigram statistics exactly, one could train an easy-to-understand toy model on the side and regularise the predictions towards that, just enough to pass the test, while still only understanding the toy model.
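
To make that concrete, a hypothetical version of such a Goodharted objective might look something like this (everything here is made up for illustration: the bigram_logprobs lookup table, the weighting lam, and the implementation itself):

```python
import torch.nn.functional as F

def goodharted_lm_loss(logits, labels, bigram_logprobs, lam=0.1):
    """Hypothetical objective: ordinary LM loss plus a term pulling the model's
    next-token distribution towards a fixed bigram table, so that a "predictor"
    armed with the same table can pass the evaluation.

    logits:          (batch, seq, vocab) model outputs
    labels:          (batch, seq) next-token ids
    bigram_logprobs: (batch, seq, vocab) log P(next token | current token),
                     looked up from a precomputed bigram table
    lam:             how strongly to Goodhart (tuned to be "just enough to pass")
    """
    lm_loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
    # KL(bigram || model): penalise the model for being unpredictable by the bigram table
    kl = F.kl_div(F.log_softmax(logits, dim=-1), bigram_logprobs,
                  log_target=True, reduction="batchmean")
    return lm_loss + lam * kl
```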

Comment by NickyP (Nicky) on LLM Basics: Embedding Spaces - Transformer Token Vectors Are Not Points in Space · 2023-02-13T22:31:58.195Z · LW · GW

While I think this is important, and I will probably edit the post, I think that even in the unembedding, when getting the logits, the behaviour cares more about direction than distance.

When I think of distance, I implicitly think Euclidean distance:

$d(x, t_i) = \lVert x - t_i \rVert = \sqrt{\sum_j (x_j - t_{i,j})^2}$

But the actual "distance" used for calculating logits looks like this:

$\text{logit}_i = x \cdot t_i = \lVert x \rVert \, \lVert t_i \rVert \cos\theta_i$

Which is a lot more similar to cosine similarity:

$\cos\theta_i = \frac{x \cdot t_i}{\lVert x \rVert \, \lVert t_i \rVert}$

(Here $x$ is the final residual-stream vector and $t_i$ is the unembedding vector for token $i$.)

I think that because the metric is so similar to cosine similarity, it makes more sense to think of sizes + directions instead of distances and points.
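
A small numerical check of this decomposition (a sketch with made-up sizes; W_U and x are random stand-ins for the unembedding matrix and the final residual vector):

```python
import torch

d_model, vocab = 768, 50272            # assumed sizes
W_U = torch.randn(vocab, d_model)      # stand-in unembedding matrix (one row per token)
x = torch.randn(d_model)               # stand-in final residual-stream vector

logits = W_U @ x                       # what the model actually computes

# The same quantity decomposed into magnitudes and an angle:
cos = torch.nn.functional.cosine_similarity(W_U, x.unsqueeze(0), dim=-1)
decomposed = W_U.norm(dim=-1) * x.norm() * cos
print(torch.allclose(logits, decomposed, atol=1e-3))   # True: logit_i = |t_i| |x| cos(theta_i)

# Euclidean distance to each token vector is a different quantity, and ranks tokens differently:
euclid = (W_U - x).norm(dim=-1)
```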

Comment by NickyP (Nicky) on LLM Basics: Embedding Spaces - Transformer Token Vectors Are Not Points in Space · 2023-02-13T22:19:04.338Z · LW · GW

This is true. I think that visualising points on a (hyper-)sphere is fine, but it is difficult in practice to parametrise the points that way.

It is more that the vectors on the GPU look like a list of coordinates $(x_1, x_2, \ldots, x_n)$, but the vectors in the model are treated more like a magnitude times a direction, $r \, \hat{u}$.

Comment by Nicky on [deleted post] 2023-01-23T19:19:13.340Z

Thanks for this comment! I think this is one of the main concerns I am pointing at.

I think something like fiscal aid could work, but have people tried modelling responses to things like this? It feels like with covid the relatively decent response came about because the government was both enforcing a temporary lockdown policy and sending out checks to keep things "back to normal" despite it. If job automation is more gradual, on the scale of months to years, and specific to only certain jobs at a time, the response could be quite different, and it might be more likely that things end up poorly.

Comment by Nicky on [deleted post] 2023-01-22T19:26:15.086Z

Yeah, though I think it depends on how many people are able to buy the new goods at a better price. If most well-paid employees (i.e. the employees that companies get the most value from automating) no longer have a job, then the number of people who can buy the more expensive goods and services might go down. It seems counter-intuitive to me that GDP would keep rising if the number of people who lose their jobs is high enough. It feels possible that recent tech developments were only barely net positive for nominal GDP despite rapid improvements, and that fast enough technological progress could cause nominal GDP to go in the other direction.

Comment by NickyP (Nicky) on ChatGPT struggles to respond to the real world · 2023-01-13T14:36:08.686Z · LW · GW

I suspect that with a tuned initial prompt ChatGPT would do much better. For example, something like:

Simulate an assistant on the other end of a phone call, who is helping me to cook a turmeric latte in my kitchen. I have never cooked before and need extremely specific instructions. Only speak one sentence at a time. Only explain one instruction at a time. Never say "and". Please ask clarifying questions if necessary. Only speak one sentence at a time, and await a response. Be explicit about:
- where I need to go
- what I need to get
- where I need to bring things

Do you understand? Say "I Accept" and we can begin

I have not fully tested this, but I guess a tuned prompt of this sort would make it possible, though it is not tuned to answer this way by default. (ChatGPT can also simulate a virtual linux shell.)

In addition, I have found it is much better when you go back and edit the prompt before an incorrect answer, as it starts to reference itself a lot. Though I also expect that in this situation having a reference recipe at the top would be useful.

Comment by NickyP (Nicky) on Searching for Modularity in Large Language Models · 2022-09-09T09:25:46.323Z · LW · GW

Is the idea with the cosine similarity to check whether similar prompt topics consistently end up yielding similar vectors in the embedding space across all the layers, and different topics end up in different parts of embedding space?

Yeah, I would say this is the main idea I was trying to get towards.

If that's the idea, have you considered just logging which attention heads and MLP layers have notably high or notably low activations for different vs. similar topics instead?

I think I'll probably just look at the activations instead of the output + residual in further analysis, since it wasn't particularly clear in the outputs of the fully-connected layer, or at least find a better metric than cosine similarity. Cosine similarity probably won't be too useful for much deeper analysis, but I think it was sort of useful for showing some trends.

I have also tried using a "scaled cosine similarity" metric, which shows essentially the same output but preserves relative lengths (that is, instead of normalising each vector to 1, I rescaled each vector by the length of the largest vector, so that the largest vector has length 1 and every other vector is smaller or equal in size).

With this metric, I think the graphs were slightly better, but the similarity plots had the behaviour that every vector looked most similar to the longest vector, which I thought made it harder to see the similarity for small vectors on the graphs, and it felt like it would be more confusing to add some weird new metric. (Though writing this now, it seems an obvious mistake that I should have just written the post with "scaled cosine similarity", or possibly some better metric if I could find one, since it seems important here that two basically-zero vectors should have a very high similarity, and this isn't captured by either of these metrics.) I might edit the post to add some extra graphs in an edited appendix, though this might also go into a separate post.
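
Roughly, the two metrics compare like this (a sketch of the idea; the exact implementation I used may differ slightly):

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Ordinary pairwise cosine similarity: every vector is normalised to length 1."""
    normed = vectors / np.linalg.norm(vectors, axis=-1, keepdims=True)
    return normed @ normed.T

def scaled_cosine_similarity_matrix(vectors):
    """"Scaled cosine similarity": every vector is divided by the norm of the
    *largest* vector in the set, so relative lengths are preserved and only the
    largest vector ends up with length 1."""
    scale = np.linalg.norm(vectors, axis=-1).max()
    scaled = vectors / scale
    return scaled @ scaled.T

# vectors: (num_prompts, d_model) array, e.g. one attention-block output per prompt
```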

As for looking at the attention heads instead of the attention blocks, so far I haven't seen that they are a particularly better unit for distinguishing between the different categories of text (though for this analysis so far I have only looked at OPT-125M). When looking at the outputs of the attention heads and their cosine similarities, it usually seemed that the main difference came from a specific dimension being particularly bright, rather than from attention heads lighting up for specific categories. The magnitude of the activations also seemed pretty consistent between attention heads in the same layer (and was very small for most of the middle layers), except for the occasional high-magnitude dimension in the layers near the beginning and end.

I made some graphs that sort of show this. The indices 0-99 are the same as in the post.

Here are some results for attention head 5 from the attention block in the final decoder layer of OPT-125M:


The left image is the "scaled cosine similarity" between the (small) size-64 vectors output by the attention head. The second image is the raw/unscaled values of the same output vectors, where each column represents one output vector.

Here are the same two plots, but instead for attention head 11 in the attention block of the final layer for OPT-125M:

I still think there might be some interesting things in the individual attention heads (most likely in the key-query behaviour, from what I have seen so far), but I will need to spend some more time on analysis.

But if your hypothesis is specifically that there are different modules in the network for dealing with different kinds of prompt topics, that seems directly testable just by checking if some sections of the network "light up" or go dark in response to different prompts. Like a human brain in an MRI reacting to visual vs. auditory data.

This is the analogy I have had in my head when trying to do this, but I think my methodology has not tracked this as well as I would have preferred. In particular, I still struggle to understand how residual streams can form notions of modularity in networks.
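
A very rough version of that "MRI"-style check might look something like this (a sketch, not the methodology I actually used: it only compares per-layer residual-stream norms for a couple of made-up prompt categories):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompts = {                       # toy stand-ins for the prompt categories
    "code": "def fibonacci(n):",
    "news": "The prime minister announced today that",
}

with torch.no_grad():
    for label, text in prompts.items():
        inputs = tok(text, return_tensors="pt")
        out = model(**inputs, output_hidden_states=True)
        # hidden_states: tuple of (num_layers + 1) tensors of shape (1, seq, d_model)
        norms = [h[0, -1].norm().item() for h in out.hidden_states]
        print(label, [f"{n:.1f}" for n in norms])

# Layers whose activation size changes a lot between categories would be the
# candidates for the "lighting up / going dark" behaviour described above.
```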

Comment by NickyP (Nicky) on Announcing Encultured AI: Building a Video Game · 2022-08-23T23:33:29.580Z · LW · GW

Maybe you have seen it before, but Veloren looks like a project with people you should talk with. They are building an open source voxel MMO in Rust, and you might be able to collaborate with them. I think most people working on it are doing it as a side hobby project.