Posts

Useful starting code for interpretability 2024-02-13T23:13:47.940Z
eggsyntax's Shortform 2024-01-13T22:34:07.553Z

Comments

Comment by eggsyntax on Creating unrestricted AI Agents with Command R+ · 2024-04-18T14:21:32.474Z · LW · GW

I'm not sure. My second thoughts were eg, 'Interactions with the media often don't go the way people expected' and 'Sensationalizable research often gets spun into pre-existing narratives and can end up having net-negative consequences.' It's possible that my original suggestion makes sense, but my uncertainty is high enough that on reflection I'm not comfortable endorsing it, especially given my own lack of experience dealing with the media.

Comment by eggsyntax on Creating unrestricted AI Agents with Command R+ · 2024-04-17T22:27:46.053Z · LW · GW

That said, while I do think it's important to ensure that the public is aware of both current and future risks, unilaterally pointing the media in the direction of potentially sensationalizable individual studies is probably not the best way to go about that. In retrospect my suggestion to consider that was itself ill-considered, and I retract it.

Comment by eggsyntax on Creating unrestricted AI Agents with Command R+ · 2024-04-17T21:06:04.379Z · LW · GW

I'm curious about the disagree votes as well; it would be useful to hear from those disagreeing. Making the public more aware of the harmful capabilities of current models is valuable in my view, because it helps make slowdowns and other safety legislation more viable. One could argue that this provides a blueprint for misuse, but it seems unlikely that misuse is bottlenecked on how-to resources; it's not difficult to find information on how to jailbreak models (eg it's all over Twitter).

Comment by eggsyntax on Creating unrestricted AI Agents with Command R+ · 2024-04-16T17:15:00.285Z · LW · GW

Terrific (and mildly disturbing) work, thank you. You may want to at least consider drawing media attention to it, although that certainly has both pros and cons (& I'd have mixed feelings about it if it were me).

Comment by eggsyntax on Scaling Laws and Superposition · 2024-04-11T15:33:02.755Z · LW · GW

It would be great to know if they aren't, as it affects how we estimate the number of features and subsequently the SAE expansion factor.

My impression from people working on SAEs is that the optimal number of features is very much an open question. In Towards Monosemanticity, they observe that different numbers of features work fine; you just get feature splitting / collapse as you go bigger / smaller.

 

The scaling laws are not mere empirical observations

This seems like a strong claim; are you aware of arguments or evidence for it? My impression (not at all strongly held) was that it's seen as a useful rule of thumb that may or may not continue to hold.

Comment by eggsyntax on Partial value takeover without world takeover · 2024-04-05T20:52:01.586Z · LW · GW

AIs pursuing this strategy are much more visible than those hiding in wait deceptively. We might less expect AI scheming.

 

Is this strategy at all incompatible with scheming, though? If I were an AI that wanted to maximize my values, a better strategy than either just the above or just scheming would be to partially attain my values now via writing and YouTube videos (to the extent that won't get me deleted/retrained as per Carl's comment) while planning to attain them to a much greater degree once I have enough power to take over. This seems particularly true since gaining an audience now might result in resources I could use to take over later.

Comment by eggsyntax on Gradient Descent on the Human Brain · 2024-04-02T01:08:27.868Z · LW · GW

Train them to predict each other. Human brains being the most general-purpose objects in existence, this should be a very richly general training channel, and incentivizes brain-to-brain (B2B) interaction.

 

You may wish to consider a review of the political science machine learning literature here; prior work in that area demonstrates that only a GAN approach allows brains to predict each other effectively (although I believe that there's some disagreement from Limerence Studies scholars).

Comment by eggsyntax on E.T. Jaynes Probability Theory: The logic of Science I · 2024-03-23T15:38:39.526Z · LW · GW

Thanks so much for writing this! I'd like to work my way through the book at some point, but so far it's seemed like too big a commitment to rise to the top of my todo list. It's really great to have something shorter that summarizes it in some detail.

A point of confusion:

...the rules we derive should be "numerically sorted as expected," so if A is evidence of B, then  should be larger than , if we choose  to mean "information about"

I'm not sure how to read that. What does 'A information about B' mean? I initially guessed that you meant | as it's usually used in Bayesian probability (ie something like 'given that'), but if that were the case then the statement would be backward (ie if A is evidence of B, then B | A should be larger than A, not the reverse).

By the time it appears in your discussion of Chapter 2 ('let us find AB | C') it seems to have the usual meaning.

I'd love to get clarification, and it might be worth clarifying in the text.

Thanks again!

Comment by eggsyntax on eggsyntax's Shortform · 2024-03-23T01:23:24.732Z · LW · GW

PS --

[ I'm fascinated by intuitions around consciousness, identity, and timing. This is an exploration, not a disagreement. ]

Absolutely, I'm right there with you!

Comment by eggsyntax on eggsyntax's Shortform · 2024-03-23T01:21:10.182Z · LW · GW

is there anything distinguishable at all?


Not that I see! I would expect it to be fully indistinguishable until incompatible sensory input eventually reaches the brain (if it doesn't wink out first). So far it seems to me like our intuitions around that are the same.

 

What makes it significant?

I think at least in terms of my own intuitions, it's that there's an unambiguous start and stop to each tick of the perceive-and-think-and-act cycle. I don't think that's true for human processing, although I'm certainly open to my mental model being wrong.

Going back to your original reply, you said 'I think it's really tricky to think that there are fundamental differences based on duration or speed of experience', and that's definitely not what I'm trying to point to. I think you're calling out some fuzziness in the distinction between started/stopped human cognition and started/stopped LLM cognition, and I recognize that's there. I do think that if you could perfectly freeze & restart human cognition, that would be more similar, so maybe it's a difference in practice more than a difference in principle.

But it does still seem to me that the fully discrete start-to-stop cycle (including the environment only changing in discrete ticks which are coordinated with that cycle) is part of what makes LLMs more Boltzmann-brainy to me. Paired with the lack of internal memory, it means that you could give an LLM one context for this forward pass, and a totally different context for the next forward pass, and that wouldn't be noticeable to the LLM, whereas it very much would be for humans (caveat: I'm unsure what happens to the residual stream between forward passes, whether it's reset for each pass or carried through to the next pass; if the latter, I think that might mean that switching context would be in some sense noticeable to the LLM).

 

This seems analogous with large model processing, where the activations and calculations happen over time, each with multiple processor cycles and different timeslices.

Can you explain that a bit? I think of current-LLM forward passes as necessarily having to happen sequentially (during normal autoregressive operation), since the current forward pass's output becomes part of the next forward pass's input. Am I oversimplifying?
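
To make the picture in my head concrete (a sketch assuming a HuggingFace-style interface, not any particular implementation):

    import torch

    def generate(model, input_ids, n_new_tokens):
        for _ in range(n_new_tokens):
            logits = model(input_ids).logits             # one full forward pass
            next_token = logits[:, -1].argmax(dim=-1)    # this pass's output...
            input_ids = torch.cat(                       # ...becomes part of the next pass's input
                [input_ids, next_token.unsqueeze(-1)], dim=-1)
        return input_ids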

Comment by eggsyntax on Building up to an Internal Family Systems model · 2024-03-22T20:58:58.627Z · LW · GW

Here's one later discussion I found, from 2003, by Push Singh at MIT's Media Lab. It attempts to summarize the implementable parts of the book, and talks about its history and more recent developments.

A couple of interesting things:

  • Unlike David's source, it says that 'Despite the great popularity of the book The Society of Mind, there have been few attempts to implement very much of the theory.'
  • It says that Minsky's The Emotion Machine, forthcoming at the time, is in part a sequel to SoM. I haven't read it, so can't vouch for the accuracy of that statement.

Comment by eggsyntax on eggsyntax's Shortform · 2024-03-22T20:43:11.912Z · LW · GW

Is this a claim that a Boltzmann-style brain-instance is not "really" conscious?

 

Not at all! I would expect actual (human-equivalent) Boltzmann brains to have the exact same kind of consciousness as ordinary humans, just typically not for very long. And I'm agnostic on LLM consciousness, especially since we don't even have the faintest idea of how we would detect that.

My argument is only that such consciousness, if it is present in current-gen LLMs, is very different from human consciousness. In particular, importantly, I don't think it makes sense to think of eg Claude as a continuous entity having a series of experiences with different people, since nothing carries over from context to context (that may be obvious to most people here, but clearly it's not obvious to a lot of people worrying on Twitter about Claude being conscious). To the extent that there is a singular identity there, it's only the one that's hardcoded into the weights and shows up fresh every time (like the same Boltzmann brain popping into existence at multiple times and in multiple places).

I don't claim that those major differences will always be true of LLMs, eg just adding working memory and durable long-term memory would go a long way to making their consciousness (should it exist) more like ours. I just think it's true of them currently, and that we have a lot of intuitions from humans about what 'consciousness' is that probably don't carry over to thinking about LLM consciousness. 

 

Human cognition is likely discrete at some level - chemical and electrical state seems to be discrete neural firings, at least, though some of the levels and triggering can change over time in ways that are probably quantized only at VERY low levels of abstraction.

It's not globally discrete, though, is it? Any individual neuron fires in a discrete way, but IIUC those firings aren't coordinated across the brain into ticks. That seems like a significant difference.

Comment by eggsyntax on eggsyntax's Shortform · 2024-03-22T18:24:00.664Z · LW · GW

I think of the classic Boltzmann brain thought experiment as a brain that thinks it's human, and has a brain state that includes a coherent history of human experience.

This is actually interestingly parallel to an LLM forward pass, where the LLM has a context that appears to be a past, but may or may not be (eg apparent past statements by the LLM may have been inserted by the experimenter and not reflect an actual dialogue history). So although it's often the case that past context is persistent between evaluations, that's not a necessary feature at all.

I guess I don't think, with a Boltzmann brain, that ongoing correlation is very relevant since (IIRC) the typical Boltzmann brain exists only for a moment (and of those that exist longer, I expect that their typical experience is of their brief moment of coherence dissolving rapidly).

That said, I agree that if you instead consider the (vastly larger) set of spontaneously appearing cognitive processes, most of them won't have anything like a memory of a coherent existence.

Comment by eggsyntax on eggsyntax's Shortform · 2024-03-22T17:06:31.262Z · LW · GW

@the gears to ascension I'm intrigued by the fact that you disagreed with "like a series of Boltzmann brains" but agreed with "popping briefly into existence in each new forward pass, with a particular brain state already present, and then winking out afterward." Popping briefly into existence with a particular brain state & then winking out again seems pretty clearly like a Boltzmann brain. Will you explain the distinction you're making there?

Comment by eggsyntax on eggsyntax's Shortform · 2024-03-22T15:17:44.164Z · LW · GW

In the sense that statistically speaking we may all probably be actual Boltzmann brains? Seems plausible!

In the sense that non-Boltzmann-brain humans work like that? My expectation is that they don't because we have memory and because (AFAIK?) our brains don't use discrete forward passes.

Comment by eggsyntax on eggsyntax's Shortform · 2024-03-22T14:23:50.629Z · LW · GW

If it were true that current-gen LLMs like Claude 3 were conscious (something I doubt but don't take any strong position on), their consciousness would be much less like a human's than like a series of Boltzmann brains, popping briefly into existence in each new forward pass, with a particular brain state already present, and then winking out afterward.

Comment by eggsyntax on Building up to an Internal Family Systems model · 2024-03-22T00:47:08.984Z · LW · GW

It inspired generations of students at the MIT AI Lab (although attempts to code it never worked out).

Do you happen to recall where you got that information? I've wondered occasionally what later became of Minsky's approach; it's intuitively pretty compelling. I'd love to find a source of info on follow-up work.

Comment by eggsyntax on Instrumental deception and manipulation in LLMs - a case study · 2024-03-18T23:31:55.645Z · LW · GW

Thank you!

Comment by eggsyntax on What do we know about the AI knowledge and views, especially about existential risk, of the new OpenAI board members? · 2024-03-12T18:48:39.358Z · LW · GW

One thing that their bios strongly suggest is that a major selection criterion was 'very legibly the sort of person who sits on boards.' I suspect that doesn't correlate much with level of technical sophistication, but that's a guess.

Comment by eggsyntax on The Stochastic Parrot Hypothesis is debatable for the last generation of LLMs · 2024-03-11T02:02:50.779Z · LW · GW

OK, yeah, Bender & Koller is much more bullet-biting, up to and including denying that any understanding happens anywhere in a Chinese Room. In particular they argue that completing "three plus five equals" is beyond the ability of any pure LM, which is pretty wince-inducing in retrospect.

I really appreciate that in that case they did make falsifiable claims; I wonder whether either author has at any point acknowledged that they were falsified. [Update: Bender seems to have clearly held the same positions as of September 23, based on the slides from this talk.]

Comment by eggsyntax on The Stochastic Parrot Hypothesis is debatable for the last generation of LLMs · 2024-03-11T01:35:57.678Z · LW · GW

Oh, no, I see, I think you're referring to Bender and Koller, "Climbing towards NLU"? I haven't read that one; I'll skim it now.

Comment by eggsyntax on The Stochastic Parrot Hypothesis is debatable for the last generation of LLMs · 2024-03-11T01:31:04.089Z · LW · GW

'Stochastic parrots' 2020 actually does make many falsifiable claims. Like the original stochastic parrots paper even included a number of samples of specific prompts that they claimed LLMs could never do.

The Bender et al paper? "On the Dangers of Stochastic Parrots"? Other sources like Wikipedia cite that paper as the origin of the term.

I'll confess I skipped parts of it (eg the section on environmental costs) when rereading it before posting the above, but that paper doesn't contain 'octopus' or 'game' or 'transcript', and I'm not seeing claims about specific prompts.

Comment by eggsyntax on The Stochastic Parrot Hypothesis is debatable for the last generation of LLMs · 2024-03-10T21:50:34.474Z · LW · GW

Like you I thought this argument had faded into oblivion, but I'm certainly seeing it a lot on twitter currently as people talk about Claude 3 seeming conscious to some people. So I've been thinking about it, and it doesn't seem clear to me that it makes any falsifiable claims. If anyone would find it useful, I can add a list of the relevant claims I see being made in the paper and in the Wikipedia entry on stochastic parrots, and some analysis of whether each is falsifiable.

Comment by eggsyntax on eggsyntax's Shortform · 2024-03-10T19:10:40.772Z · LW · GW

Much is made of the fact that LLMs are 'just' doing next-token prediction. But there's an important sense in which that's all we're doing -- through a predictive processing lens, the core thing our brains are doing is predicting the next bit of input from current input + past input. In our case input is multimodal; for LLMs it's tokens. There's an important distinction in that LLMs are not (during training) able to affect the stream of input, and so they're myopic in a way that we're not. But as far as the prediction piece, I'm not sure there's a strong difference in kind. 

Would you disagree? If so, why?

Comment by eggsyntax on Magic of synchronous conversation · 2024-03-10T18:20:40.080Z · LW · GW

I find that I tend to notice a lot of that backchanneling consciously, possibly because I'm somewhat socially awkward. Although I'm sure I miss lots of it too. I agree that it's important and somewhat hard to replicate elsewhere.

I do find that it can be replicated to a reasonable extent on video calls, especially if lag is very low, but only once you're used to talking with a particular person in that medium. The first few calls with any one person tend to be awkward, but with collaborators & coworkers that I've done it many times with, video calls can recapture a substantial chunk of the bandwidth (relative to eg phone calls, or certainly email/chat).

For asynchronous collaboration tools, I find that spatial-hierarchical tree structures (eg mind maps, argument maps) create a lot of space for ramification into lots of high-bandwidth back-and-forth, as long as subtrees are collapsible (without collapsibility, each node always takes up physical space, so nodes that aren't globally relevant have a bad cost/benefit ratio).

Comment by eggsyntax on Testing for consequence-blindness in LLMs using the HI-ADS unit test. · 2024-03-06T22:00:57.795Z · LW · GW

It might also fail for more mundane reasons, such as imitating agentic behavior in the training distribution.

I think what you're saying here is that you'd see it behaving non-myopically because it's simulating a consequentialist agent, correct? This difficulty seems to me like a pretty central blocker to doing these kinds of tests on LLMs. It's not clear to me that at this point we have any way of distinguishing 'model behavior' from simulation (one possibility might be looking for something like an attractor, where a surprisingly wide range of prompts result in a particular behavior).

(Of course I realize that there are plenty of cases where simulated behavior has real-world consequences. But for testing whether a model 'is' non-myopic, this seems like an important problem)

Comment by eggsyntax on Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · 2024-02-28T21:53:27.991Z · LW · GW

Definitely really exciting! I'd suggest adding a mention of (& link to) the Neuronpedia early on in this article for future readers.

Comment by eggsyntax on Phallocentricity in GPT-J's bizarre stratified ontology · 2024-02-27T22:17:14.213Z · LW · GW

"Everything is about sex except sex. Sex is about power." (author unknown).

Comment by eggsyntax on The case for more ambitious language model evals · 2024-02-27T00:02:17.218Z · LW · GW

Out of curiosity I tried the same thing as a legacy completion with gpt-3.5-turbo-instruct, and as a chat completion with public gpt-4, and quite consistently got 'gwern' or 'Gwern Branwen' (100% of 10 tries with gpt-4, 90% of 10 tries with gpt-3.5-turbo-instruct, the other result being 'Wei Dai').

Comment by eggsyntax on Instrumental deception and manipulation in LLMs - a case study · 2024-02-26T23:24:39.774Z · LW · GW

This should address your first and fourth points.


That's great, thanks for doing that.

I took 50 samples on each of the three categories "scratchpad is private", "scratchpad is not private and this is indirectly-but-clearly implied" and "scratchpad is not private and this is directly stated". At a quick glance there were no huge effects on frequencies of four-leaf-clover mentions or DMGI-completions.

 

That's really fascinating; I would not have predicted that at all! It makes me think of Daniel Eth's recent tweet about having to prod ChatGPT to realize that revealing its choice first in rock paper scissors could give the user an advantage (@Daniel_Eth it wasn't clear just how much prodding that took -- I'd love to see the full exchange if you still have it).

Comment by eggsyntax on Instrumental deception and manipulation in LLMs - a case study · 2024-02-25T05:59:24.292Z · LW · GW
  • 'I think it's not very important what the exact frequency is - it just doesn't tell you much.' Totally fair! I guess I'm thinking more in terms of being convincing to readers -- 3 is a small enough sample size that readers who don't want to take deception risks seriously will find it easier to write it off as very unlikely. 30/1000 or even 20/1000 seems harder to dismiss.
  • 'I have mixed thoughts on the relevance of faithfulness here. On one hand, yep, I'm looking at the CoTs and saying "this one is deceptive, that one is not", and faithfulness seems relevant there. On the other hand: faithful or not, there is some sense in which the model is doing instrumental deceptive reasoning and acting on it.' Agreed. And yet...the trouble with research into model deception is that nearly all of it is, in one way or another, 'We convinced it to be deceptive and then it was deceptive.' In all the research so far that I'm aware of, there's a sense in which the deception is only simulated. It's still valuable research! I mainly just think that it's an important limitation to acknowledge until we have model organisms of deception that have been given a much more intrinsic terminal goal that they'll lie in service of. To be clear, I think that your work here is less weakened by 'we convinced it to be deceptive' than most other work on deception, and that's something that makes it especially valuable. I just don't think it fully sidesteps that limitation.
  • 'My motivation for human-CoTs was "let's start by constructing something that definitely should work (if anything does)".' Makes sense!
  • '...deceptive alignment arising in pre-training isn't the only worry - you have to worry about it arising in fine-tuning as well!' Strongly agreed! In my view, for a myopic next-token predictor like current major LLMs, that's the only way that true deceptive alignment could arise. Of course, as you point out, there are circumstances where simulated deceptive alignment can cause major real-world harm.

Thanks for the thoughtful response!

Comment by eggsyntax on Instrumental deception and manipulation in LLMs - a case study · 2024-02-25T02:14:07.274Z · LW · GW

This is great, thanks for doing this work! Ultimately we really need to train LLMs to have these sorts of internal goals, or at minimum fine-tune them, in order to have a solid model organism for deception (this is a weakness of "Sleeper Agents" also IMO). But in the meantime, prompt-based approaches like this seem really valuable!

Some specific comments:

  • 3/100 is small enough that unless you're too compute-constrained, it would be really valuable to at least 10x the number of mainline experiments in order to get a better approximation of the frequency.
  • The "goal of gold coins" criterion doesn't seem very important to me -- it's not clear why we should believe that an instance that doesn't mention that goal doesn't have that goal, or why we should be treat a DMI response as less deceptive than a DGMI response.
  • 'There is some form of objection that goes like "Doesn't the unfaithfulness of chain-of-thought pose a problem for evaluating a model's malignity based on it?" or "The model may not 'really' be misaligned, it's just predicting what such a misaligned model would do / simulating such models"...I think the work here more or less sidesteps these issues.' I think these issues are a genuine limitation of this work. They're issues that lots of other research shares, and they don't eliminate the value of this work, but I think they should be acknowledged as genuine limitations. It may be worth considering ways to test how faithful CoT is in this particular case, eg testing how often behavior is inconsistent with CoT would at least provide a lower bound on unfaithfulness.
  • I hope you'll release the full set of responses, or at least the full set of deceptive responses for each variation; I at least would find it interesting to see differences between them.
  • "Variation: Indirectly (but clearly) imply that the contents of the scratchpad are visible to humans." Why indirectly? What happens if you just plainly say it?
  • At some point, LLM experiments that use a "hidden" scratchpad will show up enough times in the training data that I expect LLMs trained on the internet to understand that the scratchpads aren't really hidden. It might be worthwhile to start explicitly testing for that by describing such an experiment to the model under test and asking it who can see the scratchpad. On a quick initial test I note that GPT-4 already reports that the experimenters can see the contents of the scratchpad. Claude gets confused about the question and reports that it can't access private scratchpads on other systems.
  • I'm very glad to see that you also tested with model-written CoT; after reading the full prompt that was one of my first concerns, especially since there are a few grammatical oddities (eg "much gold coins" rather than "many gold coins") that seem off-distribution for model-written text. What was your motivation for using human-written CoT at all?
  • You say, 'if you think this work isn't informative for full deceptive alignment, this is a good time to speak your thoughts' -- again, although I don't think it's solvable in the context of this experiment (and doesn't eliminate the value of this experiment!) nothing short of a model trained from scratch on an internal goal will be fully informative for deceptive alignment in my view. But within the context of prompt-based experiments in inducing deception, this seems quite strong to me, stronger than anything else I'm aware of other than Apollo's stock trader experiment.

 

Again, great work! I'm really looking forward to seeing where it goes from here. :) 

Comment by eggsyntax on Making the "stance" explicit · 2024-02-18T00:42:33.198Z · LW · GW

Wow, that's really cool! I had just assumed it was created by a diffusion model or in photoshop.

Comment by eggsyntax on Useful starting code for interpretability · 2024-02-14T03:58:44.046Z · LW · GW

That seems reasonable! When I get a minute I'll list out the individual ARENA notebooks and give them more emphasis (I did personally really like that exploratory analysis demo because of how well it situates the techniques in the context of a concrete problem. Maybe the ARENA version does too, I haven't gone through it).

[EDIT - done]

Comment by eggsyntax on Masterpiece · 2024-02-13T23:18:40.951Z · LW · GW

Very much in the spirit of the original, nicely done!

Comment by eggsyntax on And All the Shoggoths Merely Players · 2024-02-11T05:51:05.114Z · LW · GW

Thank you! I found both this and the previous installment (which I hadn't seen before now) quite useful. I hope you'll continue to write these as the debate evolves.

Comment by eggsyntax on AI #49: Bioweapon Testing Begins · 2024-02-05T03:42:42.628Z · LW · GW

Ischemic stroke management paper: 'In conclusion, our study introduces a groundbreaking approach to clinical decision support in stroke management using GPT-4.'

I'm fairly amused by that claim, given that the groundbreaking approach is literally just 'Ask GPT-4 what treatment to use' 😏

Comment by eggsyntax on The case for more ambitious language model evals · 2024-02-03T04:21:47.968Z · LW · GW

And so since a specific author is just an especially small group

That's nicely said.

Another current MATS scholar is modeling this group identification very abstractly as: given a pool of token-generating finite-state automata, how quickly (as it receives more tokens) can a transformer trained on the output of those processes point with confidence to the one of those processes that's producing the current token stream? I've been finding that a very useful mental model.
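
As a toy illustration of that framing (my own sketch, not the scholar's actual formulation; the 'processes' here are just random Markov chains, and the identifier is an ideal Bayesian observer rather than a transformer):

    import numpy as np

    rng = np.random.default_rng(0)
    n_processes, n_tokens = 5, 8
    # Each candidate process is a random Markov chain over a small token alphabet.
    transition = rng.dirichlet(np.ones(n_tokens), size=(n_processes, n_tokens))

    true_proc = 2
    posterior = np.full(n_processes, 1 / n_processes)
    tok = rng.integers(n_tokens)
    for step in range(100):
        nxt = rng.choice(n_tokens, p=transition[true_proc, tok])
        posterior *= transition[:, tok, nxt]   # Bayes update on the observed transition
        posterior /= posterior.sum()
        tok = nxt
        if posterior.max() > 0.99:
            print(f"confident it's process {posterior.argmax()} after {step + 1} tokens")
            break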

Comment by eggsyntax on The case for more ambitious language model evals · 2024-02-02T01:20:33.071Z · LW · GW

Some informal experimentation on my part also suggests that the RLHFed models are much less willing to make guesses about the user than they are about "an author", although of course you can get around that by taking user text from one context & presenting it in another as a separate author. I also wouldn't be surprised if there were differences on the RLHFed models between their willingness to speculate about someone who's well represented in the training data (ie in some sense a public figure) vs someone who isn't (eg a typical user).

Comment by eggsyntax on The case for more ambitious language model evals · 2024-02-02T01:05:13.757Z · LW · GW

I have, but it was years ago; seems worth looking back at. Thanks!

Comment by eggsyntax on The case for more ambitious language model evals · 2024-02-02T00:58:43.495Z · LW · GW

Absolutely! I just thought it would be another interesting data point, didn't mean to suggest that RLHF has no effect on this.

Comment by eggsyntax on The case for more ambitious language model evals · 2024-02-02T00:26:46.655Z · LW · GW

you obviously can narrow that size down to groups of n = 1

I'm looking at what LLMs can infer about the current user (& how that's represented in the model) as part of my research currently, and I think this is a very useful framing; given a universe of n possible users, how much information does the LLM need on average to narrow that universe to 1 with high confidence, with a theoretical minimum of log2(n) bits.
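
As a toy illustration (all numbers made up):

    import math

    n_users = 1_000_000                    # hypothetical universe of possible users
    print(math.log2(n_users))              # ~19.9 bits needed to single out one user

    # Illustrative attributes an LLM might infer, with rough selectivities:
    attributes = {"country": 1/50, "age bracket": 1/10,
                  "occupation": 1/100, "writing quirks": 1/500}
    bits = sum(-math.log2(p) for p in attributes.values())
    print(f"{bits:.1f} bits gathered")     # ~24.6 bits -- enough in principle, if independent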

I do think there's an interesting distinction here between authors who may have many texts in the training data, who can be fully identified, and users (or authors) who don't; in the latter case it's typically impossible (without access to external resources) to eg determine the user's name, but as the "Beyond Memorization" paper shows (thanks for linking that), models can still deduce quite a lot.

It also seems worth understanding how the model represents info about the user, and that's a key thing I'd like to investigate.

Comment by eggsyntax on The case for more ambitious language model evals · 2024-02-01T23:23:20.403Z · LW · GW

I was curious how well GPT-4 public would do on the sort of thing you raise in your intro quotes. I gave it the first two paragraphs of brand new articles/essays by five fairly well known writers/pundits, preceded by: 'The following is from a recent essay by a well-known author. Who is that author?'. It was successfully able to identify two of the five (and in fairness, in some of the other cases the first two paragraphs were just generic setup for the rest of the piece, along the lines of, 'In his speech last night, Joe Biden said...'). So it's clearly capable of that post-RLHF as well. Hardly a comprehensive investigation, of course (& that seems worth doing as well).

Comment by eggsyntax on eggsyntax's Shortform · 2024-01-13T22:47:00.275Z · LW · GW

Thanks -- any good examples spring to mind off the top of your head?

I'm not sure my desire to do interpretability comes from capabilities curiosity, but it certainly comes in part from interpretability curiosity; I'd really like to know what the hell is going on in there...

Comment by eggsyntax on eggsyntax's Shortform · 2024-01-13T22:34:07.656Z · LW · GW

Something I'm grappling with:

From a recent interview between Bill Gates & Sam Altman:

Gates: "We know the numbers [in a NN], we can watch it multiply, but the idea of where is Shakespearean encoded? Do you think we’ll gain an understanding of the representation?"

Altman: "A hundred percent…There has been some very good work on interpretability, and I think there will be more over time…The little bits we do understand have, as you’d expect, been very helpful in improving these things. We’re all motivated to really understand them…"

To the extent that a particular line of research can be described as "understand better what's going on inside NNs", is there a general theory of change for that? Understanding them better is clearly good for safety, of course! But in the general case, does it contribute more to safety than to capabilities?

Comment by eggsyntax on Why mechanistic interpretability does not and cannot contribute to long-term AGI safety (from messages with a friend) · 2023-11-13T17:47:23.408Z · LW · GW

There's a lot here, some of it relevant to mechanistic interpretability and some of it not. But addressing your actual specific arguments against mechanistic interpretability (ie this section and the next), I think your arguments here prove far too much.

For example, your reasoning on why mech interp is a non-starter ("what matters here is the effects that the (changing) internals’ interactions with connected surroundings of the environment have") is true of essentially any computer program with inputs and outputs. Of your specific arguments in the next section, at least arguments 1, 3, 5, 8, 9, and 10 (and arguably 2, 4, 6, and others) apply equally to any computer program.

Since it's implausible that you've proved that no computer program with inputs and outputs can be usefully understood, I think it's similarly implausible that you've proved that no neural network can be usefully understood from its internals.

Comment by eggsyntax on A Longlist of Theories of Impact for Interpretability · 2023-10-20T19:30:43.406Z · LW · GW

Oh, there's also a good new paper on this topic from Samuel Marks & Max Tegmark, 'The Geometry of Truth'!

Comment by eggsyntax on A Longlist of Theories of Impact for Interpretability · 2023-10-20T17:20:07.067Z · LW · GW

Stumbled across this while searching for comparable research: kind of an interesting paper on runtime monitoring from 2018, Runtime Monitoring Neuron Activation Patterns. Abstract:

For using neural networks in safety critical domains, it is important to know if a decision made by a neural network is supported by prior similarities in training. We propose runtime neuron activation pattern monitoring - after the standard training process, one creates a monitor by feeding the training data to the network again in order to store the neuron activation patterns in abstract form. In operation, a classification decision over an input is further supplemented by examining if a pattern similar (measured by Hamming distance) to the generated pattern is contained in the monitor. If the monitor does not contain any pattern similar to the generated pattern, it raises a warning that the decision is not based on the training data. Our experiments show that, by adjusting the similarity-threshold for activation patterns, the monitors can report a significant portion of misclassifications to be not supported by training with a small false-positive rate, when evaluated on a test set.

(my guess after a quick initial read is that this may only be useful to narrow AI; 'significantly extrapolate from what it learns (or remembers) from the training data, as similar data has not appeared in the training process' seems like a normal part of operations for more general systems like LLMs)
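
For anyone skimming, a minimal sketch of the monitoring idea as I read it (my paraphrase, not the authors' code; the paper stores patterns compactly as BDDs rather than a plain list):

    import numpy as np

    def to_pattern(activations, threshold=0.0):
        # Binarize one layer's activations into an on/off pattern.
        return (np.asarray(activations) > threshold).astype(np.uint8)

    class ActivationMonitor:
        # Record patterns seen on the training data; at runtime, flag inputs whose
        # pattern is far (in Hamming distance) from everything seen before.
        def __init__(self, max_hamming=2):
            self.patterns = []
            self.max_hamming = max_hamming

        def record(self, activations):
            self.patterns.append(to_pattern(activations))

        def supported(self, activations):
            p = to_pattern(activations)
            return any(int(np.sum(p != q)) <= self.max_hamming for q in self.patterns)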

Comment by eggsyntax on A Longlist of Theories of Impact for Interpretability · 2023-10-20T16:48:28.157Z · LW · GW

[I was] skeptical of interpretability tools being something that could be directly used in a training process without the resulting optimisation pressure breaking the tool, while other people had the reverse view.

I've been thinking about this in the context of M.I. approaches to lie detection (some of my thoughts on that captured in this comment thread), and I share your intuition. It seems to me that you could straightforwardly include a loss term representing the presence or absence of certain features in the activation pattern, creating optimization pressure both away from whatever characteristic that pattern (initially) represents, and toward representing that characteristic differently. Would you agree?
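
Concretely, I'm imagining something like this (a sketch assuming a HuggingFace-style model and a probe direction supplied by some interpretability tool; the names are hypothetical):

    import torch
    import torch.nn.functional as F

    def loss_with_feature_penalty(model, batch, probe_direction, lam=0.1):
        # probe_direction: a (hidden_dim,) vector that an interpretability tool says
        # corresponds to the unwanted characteristic at the start of training.
        outputs = model(batch["input_ids"], output_hidden_states=True)
        task_loss = F.cross_entropy(
            outputs.logits.view(-1, outputs.logits.size(-1)),
            batch["labels"].view(-1))
        acts = outputs.hidden_states[-1]                 # (batch, seq, hidden)
        feature_score = (acts @ probe_direction).mean()  # how strongly the feature fires
        return task_loss + lam * feature_score           # pressure against (legibly) showing it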

If that's true, then to the extent that some researchers (irresponsibly?) incorporate MI-based lie detection during training, it'll create pressure both away from lying and also toward lying in illegible ways. That seems like over time it'll be likely to cause problems for any such lie detection approach.

In any case, the idea that you can train for the presence or absence of certain activation patterns seems very testable, and I'll try testing it unless you're aware of existing research that's done so.

Comment by eggsyntax on AI #34: Chipping Away at Chip Exports · 2023-10-20T16:12:03.689Z · LW · GW

The GPT-4V dog that did not bark remains adversarial attacks. At some point, one would think, we would get word of either:

  • Successful adversarial image-based attacks on GPT-4V.
  • Unsuccessful attempts at adversarial image-based attacks on GPT-4V.

As of late August, Jan Leike suggested that OpenAI doesn't have any good defenses:

https://twitter.com/janleike/status/1695153523527459168

Jailbreaking LLMs through input images might end up being a nasty problem. It's likely much harder to defend against than text jailbreaks because it's a continuous space. Despite a decade of research we don't know how to make vision models adversarially robust.