Posts

Introducing Leap Labs, an AI interpretability startup 2023-03-06T16:16:22.182Z
SolidGoldMagikarp III: Glitch token archaeology 2023-02-14T10:17:51.495Z
SolidGoldMagikarp II: technical details and more recent findings 2023-02-06T19:09:01.406Z
SolidGoldMagikarp (plus, prompt generation) 2023-02-05T22:02:35.854Z
Guardian AI (Misaligned systems are all around us.) 2022-11-25T15:55:43.939Z
The Ground Truth Problem (Or, Why Evaluating Interpretability Methods Is Hard) 2022-11-17T11:06:28.079Z
Why I'm Working On Model Agnostic Interpretability 2022-11-11T09:24:10.037Z

Comments

Comment by Jessica Rumbelow (jessica-cooper) on Introducing Leap Labs, an AI interpretability startup · 2023-03-07T21:09:48.733Z · LW · GW

Thanks for the comment! I'll respond to the last part:

"First, developing basic insights is clearly not just an AI safety goal. It's an alignment/capabilities goal. And as such, the effects of this kind of thing are not robustly good."

I think this could certainly be the case if we were trying to build state of the art broad domain systems, in order to use interpretability tools with them for knowledge discovery – but we're explicitly interested in using interpretability with narrow domain systems. 

"Interpretability is the backbone of knowledge discovery with deep learning": Deep learning models are really good at learning complex patterns and correlations in huge datasets that humans aren't able to parse. If we can use interpretability to extract these patterns in a human-parsable way, in a (very Olah-ish) sense we can reframe deep learning models as lenses through which to view the world, and to make sense of data that would otherwise be opaque to us.

Here are a couple of examples:

https://www.mdpi.com/2072-6694/14/23/5957

https://www.deepmind.com/blog/exploring-the-beauty-of-pure-mathematics-in-novel-ways

https://www.nature.com/articles/s41598-021-90285-5

Are you concerned about AI risk from narrow systems of this kind?

Comment by Jessica Rumbelow (jessica-cooper) on Introducing Leap Labs, an AI interpretability startup · 2023-03-07T14:22:05.366Z · LW · GW

Thanks! Unsure as of yet – we could either keep it proprietary and provide access through an API (with some free version for select researchers), or open source it and monetise by offering a paid, hosted tier with integration support. Discussions are ongoing. 

Comment by Jessica Rumbelow (jessica-cooper) on Introducing Leap Labs, an AI interpretability startup · 2023-03-07T14:20:06.978Z · LW · GW

This isn't set in stone, but likely we'll monetise by selling access to the interpretability engine, via an API. I imagine we'll offer free or subsidised access to select researchers/orgs.  Another route would be to open source all of it, and monetise by offering a paid, hosted version with integration support etc.

Comment by Jessica Rumbelow (jessica-cooper) on Introducing Leap Labs, an AI interpretability startup · 2023-03-07T14:16:19.307Z · LW · GW

We're looking into it!

Comment by Jessica Rumbelow (jessica-cooper) on Introducing Leap Labs, an AI interpretability startup · 2023-03-07T14:16:01.726Z · LW · GW

Good questions. Doing any kind of technical safety research that leads to better understanding of state of the art models carries with it the risk that by understanding models better, we might learn how to improve them. However, I think that the safety benefit of understanding models outweighs the risk of small capability increases, particularly since any capability increase is likely heavily skewed towards model-specific interventions (e.g. "this specific model trained on this specific dataset exhibits bias x in domain y, and could be improved by retraining with more varied data from domain y", rather than "the performance of all models of this kind could be improved with some intervention z"). I'm thinking about this a lot at the moment and would welcome further input.

Comment by Jessica Rumbelow (jessica-cooper) on SolidGoldMagikarp (plus, prompt generation) · 2023-02-07T10:48:48.470Z · LW · GW

Aha!! Thanks Neel, makes sense. I’ll update the post.

Comment by Jessica Rumbelow (jessica-cooper) on SolidGoldMagikarp (plus, prompt generation) · 2023-02-06T21:18:06.556Z · LW · GW

Yeah! Basically we just perform gradient descent on sensibly initialised embeddings (cluster centroids, or points close to the target output), constrain the embeddings to length 1 during the process, and penalise distance from the nearest legal token. We optimise the input embeddings to minimise the negative log probability of the target output token(s). Happy to have a quick call to go through the code if you like, DM me :)
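A minimal sketch of the loop described above, under loud assumptions: the language model is replaced by a toy linear readout (`TOY_READOUT`), the vocabulary is three hand-picked unit vectors (`TOY_VOCAB`), and every name and hyperparameter here is hypothetical rather than taken from the actual code. The loss is the negative log probability of the target, so lowering it raises the target's probability.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

def normalise(vec):
    """Rescale a vector to length 1."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def nearest_token(vec, vocab):
    """Index of the legal token embedding closest to vec (Euclidean)."""
    return min(range(len(vocab)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(vec, vocab[i])))

def target_prob(vec, readout, target):
    """Probability the toy model assigns to `target` given input embedding vec."""
    logits = [sum(w * v for w, v in zip(row, vec)) for row in readout]
    return softmax(logits)[target]

def optimise_embedding(init, vocab, readout, target, steps=200, lr=0.1, lam=0.1):
    """Gradient descent on one input embedding, kept at length 1 throughout.

    Loss = -log p(target) + lam * ||emb - nearest legal token||^2.
    """
    emb = normalise(init)
    for _ in range(steps):
        logits = [sum(w * v for w, v in zip(row, emb)) for row in readout]
        probs = softmax(logits)
        near = vocab[nearest_token(emb, vocab)]
        grad = []
        for j in range(len(emb)):
            # d(-log p[target]) / d emb[j] for a linear readout + softmax
            nll_grad = sum(
                (probs[k] - (1.0 if k == target else 0.0)) * readout[k][j]
                for k in range(len(readout))
            )
            # gradient of the distance penalty (nearest token held fixed)
            penalty_grad = 2.0 * lam * (emb[j] - near[j])
            grad.append(nll_grad + penalty_grad)
        # gradient step, then project back onto the unit sphere
        emb = normalise([v - lr * g for v, g in zip(emb, grad)])
    return emb

# Toy stand-ins (assumptions, not real model weights).
TOY_VOCAB = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
TOY_READOUT = [[3.0 * v for v in row] for row in TOY_VOCAB]

start = [1.0, 0.2]  # initialised near token 0
result = optimise_embedding(start, TOY_VOCAB, TOY_READOUT, target=1)
prob_before = target_prob(normalise(start), TOY_READOUT, 1)
prob_after = target_prob(result, TOY_READOUT, 1)
```

In this toy setting the embedding starts next to token 0 and ends up closest to token 1, the input that makes the target most likely. A real run would backpropagate through the full model instead of using the closed-form gradient.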

Comment by Jessica Rumbelow (jessica-cooper) on SolidGoldMagikarp (plus, prompt generation) · 2023-02-06T21:13:04.363Z · LW · GW

This link: https://help.openai.com/en/articles/6824809-embeddings-frequently-asked-questions says that token embeddings are normalised to length 1, but a quick inspection of the embeddings available through the huggingface model shows this isn't the case. I think that's the extent of our claim. For prompt generation, we normalise the embeddings ourselves and constrain the search to that space, which results in better performance. 
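The check and the normalisation step can be sketched as follows. This is an illustration only: `toy_embeddings` is a made-up matrix standing in for the model's token embedding table, and the helper names are hypothetical.

```python
import math

def l2_norm(vec):
    """Euclidean length of a vector."""
    return math.sqrt(sum(v * v for v in vec))

def normalise_rows(matrix):
    """Return a copy of the matrix with every row rescaled to length 1."""
    normalised = []
    for row in matrix:
        norm = l2_norm(row)
        normalised.append([v / norm for v in row])
    return normalised

# Made-up stand-in for an embedding table (one row per token).
toy_embeddings = [[3.0, 4.0], [0.6, 0.8], [1.0, 1.0]]

# Quick inspection: are all rows already length 1?
norms = [l2_norm(row) for row in toy_embeddings]
already_unit = all(math.isclose(n, 1.0, rel_tol=1e-6) for n in norms)

# Normalise ourselves, so the search can be constrained to the unit sphere.
unit_embeddings = normalise_rows(toy_embeddings)
```

Here `already_unit` comes out false because only the second row has length 1, which is the same kind of inspection that shows the raw embeddings are not normalised.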

Comment by Jessica Rumbelow (jessica-cooper) on SolidGoldMagikarp (plus, prompt generation) · 2023-02-06T21:09:02.842Z · LW · GW

Thanks - wasn't aware of this!

Comment by Jessica Rumbelow (jessica-cooper) on SolidGoldMagikarp (plus, prompt generation) · 2023-02-05T21:06:25.926Z · LW · GW

Interesting! Can you give a bit more detail or share code?

Comment by Jessica Rumbelow (jessica-cooper) on SolidGoldMagikarp (plus, prompt generation) · 2023-02-05T21:00:57.743Z · LW · GW

Interesting, thanks. There's not a whole lot of detail there - it looks like they didn't do any distance regularisation, which is probably why they didn't get meaningful results.

Comment by Jessica Rumbelow (jessica-cooper) on SolidGoldMagikarp (plus, prompt generation) · 2023-02-05T20:58:24.166Z · LW · GW

I'll check with Matthew - it's certainly possible that not all tokens in the "weird token cluster" elicit the same kinds of responses. 

Comment by Jessica Rumbelow (jessica-cooper) on SolidGoldMagikarp (plus, prompt generation) · 2023-02-05T20:56:03.185Z · LW · GW

What's an SCP?

Comment by Jessica Rumbelow (jessica-cooper) on SolidGoldMagikarp (plus, prompt generation) · 2023-02-05T19:24:53.030Z · LW · GW

Not yet, but there's no reason why it wouldn't be possible. You can imagine microscope AI, for language models. It's on our to-do list.

Comment by Jessica Rumbelow (jessica-cooper) on SolidGoldMagikarp (plus, prompt generation) · 2023-02-05T19:24:01.330Z · LW · GW

Good to know. Thanks!

Comment by Jessica Rumbelow (jessica-cooper) on SolidGoldMagikarp (plus, prompt generation) · 2023-02-05T12:11:26.158Z · LW · GW

Yep, aside from running forward prop n times to generate an output of length n, we can just optimise the mean probability of the target tokens across the positions in the output - it's already implemented in the code. It does take much longer to find optimal completions, though.
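As a rough illustration of that objective (not the actual implementation), assume the model has already produced one probability distribution per output position. The function name and the toy numbers below are hypothetical.

```python
def mean_target_probability(position_probs, target_ids):
    """Average probability assigned to the target token at each output position.

    position_probs: one probability distribution (list of floats) per position.
    target_ids: the desired token index at each position.
    """
    if len(position_probs) != len(target_ids):
        raise ValueError("need exactly one target token per output position")
    picked = [probs[tok] for probs, tok in zip(position_probs, target_ids)]
    return sum(picked) / len(picked)

# Toy distributions over a 3-token vocabulary for a 2-token target output.
toy_probs = [
    [0.1, 0.7, 0.2],    # position 0
    [0.5, 0.25, 0.25],  # position 1
]
toy_targets = [1, 0]

score = mean_target_probability(toy_probs, toy_targets)  # (0.7 + 0.5) / 2
```

An optimiser would then adjust the input embeddings to push this score up (or, equivalently, to push its negative down).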

Comment by Jessica Rumbelow (jessica-cooper) on Adam Scherlis's Shortform · 2023-02-04T22:58:55.176Z · LW · GW

More detail on this phenomenon here: https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation

Comment by Jessica Rumbelow (jessica-cooper) on Guardian AI (Misaligned systems are all around us.) · 2022-11-26T10:03:54.082Z · LW · GW

Yeah, I think it could be! I’m considering pursuing it after SERI-MATS. I’ll need a couple of cofounders.

Comment by Jessica Rumbelow (jessica-cooper) on Why I'm Working On Model Agnostic Interpretability · 2022-11-11T13:00:19.013Z · LW · GW

Hi Joseph! I'll briefly address the saliency map concern here – it likely originates from this paper, which showed that some types of saliency mapping methods had no more explanatory power than edge detectors. It's a great paper, and worth a read. The key thing to note is that this was only true of some gradient-based saliency mapping methods, which are, of course, model-specific. Gradients can be deceptive! Model agnostic, perturbation-based saliency mapping doesn't suffer from the same kind of problems – see p.12 here.
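To make "model agnostic, perturbation-based" concrete, here is a minimal occlusion-style sketch. It is one simple perturbation scheme chosen for illustration, not necessarily the method in the work referenced above, and `toy_model` is a made-up stand-in. The point is that the function only ever calls the model as a black box and never touches its gradients.

```python
def occlusion_saliency(model, inputs, baseline=0.0):
    """Score each input feature by how much the output drops when it is occluded.

    `model` is any callable mapping a list of floats to a single float.
    """
    reference = model(inputs)
    scores = []
    for i in range(len(inputs)):
        perturbed = list(inputs)
        perturbed[i] = baseline  # replace one feature with the baseline value
        scores.append(reference - model(perturbed))
    return scores

def toy_model(x):
    """Made-up black box: depends on x[0] and x[2], ignores x[1]."""
    return 2.0 * x[0] + 0.0 * x[1] - 1.0 * x[2]

saliency = occlusion_saliency(toy_model, [1.0, 1.0, 1.0])
```

For this toy model the ignored feature gets a score of zero, and the sign of each remaining score shows whether the feature raises or lowers the output.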

Comment by Jessica Rumbelow (jessica-cooper) on [Crosspost] AlphaTensor, Taste, and the Scalability of AI · 2022-10-09T20:15:55.537Z · LW · GW

“being able to reorganise a question in the form of a model-appropriate game” seems like something we have already built a set of reasonable heuristics around: categorising different types of problems and their appropriate translations into ML-able tasks. There are well-established ML approaches to, e.g., image captioning, time-series prediction, audio segmentation, etc. Is the bottleneck you’re concerned with the lack of breadth and granularity of these problem sets, OP? And can we mark progress (to some extent) by the number of these problem sets we have robust ML translations for?