Posts

Numberwang: LLMs Doing Autonomous Research, and a Call for Input 2025-01-16T17:20:37.552Z
LLMs Look Increasingly Like General Reasoners 2024-11-08T23:47:28.886Z
AIS terminology proposal: standardize terms for probability ranges 2024-08-30T15:43:39.857Z
LLM Generality is a Timeline Crux 2024-06-24T12:52:07.704Z
Language Models Model Us 2024-05-17T21:00:34.821Z
Useful starting code for interpretability 2024-02-13T23:13:47.940Z
eggsyntax's Shortform 2024-01-13T22:34:07.553Z

Comments

Comment by eggsyntax on Daniel Tan's Shortform · 2025-01-24T23:54:28.112Z · LW · GW

Why do more things? Tl;dr It feels good and is highly correlated with other metrics of success. 

Partial counterargument: I find that I have to be careful not to do too many things. I really need some time for just thinking and reading and playing with models in order to have new ideas (and I suspect that's true of many AIS researchers) and if I don't prioritize it, that disappears in favor of only taking concrete steps on already-active things.

You kind of say this later in the post, with 'Attention / energy is a valuable and limited resource. It should be conserved for high value things,' which seems like a bit of a contradiction with the original premise :D

Comment by eggsyntax on A Novel Emergence of Meta-Awareness in LLM Fine-Tuning · 2025-01-24T15:58:29.733Z · LW · GW

Interesting, thanks. I think if it were me I'd focus first on figuring out why HELLO worked and others didn't, before moving on to more complicated experiments, but I can certainly see the appeal of this one.

Comment by eggsyntax on Ten small life improvements · 2025-01-24T15:55:59.981Z · LW · GW

A few 2025 updates to Paul's recommendations, some of which are now outdated:

Comment by eggsyntax on A Novel Emergence of Meta-Awareness in LLM Fine-Tuning · 2025-01-23T22:15:18.579Z · LW · GW

An extremely cool spare time project!

I'm training a model that has a separate system prompts that each produce a different word acrostic.

I think I'm not understanding the idea here -- in the OP the system prompt doesn't do anything to point to the specific acrostic, but the way this is written, it sounds like these would? But then it seems like the results would be much less interesting.

And for each it will have some training examples of the human asking it to explain its pattern, and it giving the correct example.

That also sounds like the results would be less interesting, since it would have the acrostic in-context?

I expect I'm just misunderstanding the wording, but can you clarify?

Comment by eggsyntax on Self-prediction acts as an emergent regularizer · 2025-01-23T17:31:25.985Z · LW · GW

Is there continuing work in this direction? It seems like a thread worth following.

Comment by eggsyntax on A Novel Emergence of Meta-Awareness in LLM Fine-Tuning · 2025-01-23T15:33:12.800Z · LW · GW

I think this is a specific example of language models articulating their own policies, which is an instance of introspection 

Sure, yeah, I'm totally on board with that, especially in light of the 'Tell Me About Yourself' work (which is absolutely fascinating). I was mostly just listing it for completeness there.

I think b) doesn't need to be true, responding in "hello" acrostics is just different from how any typical english speaker responds. Ditto for c), responding in acrostics is probably the main way in which it's different from the typical english speaker.

Maybe. Although as I pointed out above in my response to dirk, that seems less likely to be true in some of the cases in TMAY, eg the training data surely contains many examples of people being both risk-loving and risk-averse, so I suspect that isn't the whole story.

Comment by eggsyntax on Daniel Tan's Shortform · 2025-01-23T15:13:17.569Z · LW · GW

The encoder and decoder both spontaneously settle on this as a schelling point for encoding the message. 

LLMs do seem to be pretty good at picking self-consistent Schelling points, at least in simple cases -- I've got a writeup here of some casual experiments I did with GPT-4 last January on picking various Schelling points, eg a date, a number, a word (also some discussion of that in the MATS slack).
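
For concreteness, here's a minimal sketch of the kind of casual experiment I mean -- my own reconstruction rather than the exact prompts from the writeup, and it assumes the `openai` Python client plus access to a GPT-4-class model:

```python
from collections import Counter
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Another copy of you is being asked this same question separately, with "
    "no communication between you. You both win if you name the same date. "
    "Reply with only a single date."
)

def sample_answers(n: int = 20, model: str = "gpt-4") -> Counter:
    """Ask the model the same coordination question n independent times."""
    answers = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": PROMPT}],
            temperature=1.0,
        )
        answers.append(resp.choices[0].message.content.strip())
    return Counter(answers)

counts = sample_answers()
total = sum(counts.values())
# Probability that two independent samples agree -- a crude measure of how
# well the model coordinates with itself on this Schelling point.
p_match = sum(c * c for c in counts.values()) / (total * total)
print(counts.most_common(3), f"self-coordination ~= {p_match:.2f}")
```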

this still seems like a pretty incredible claim

I think it seems somewhat less surprising to me (maybe because of the Schelling point experiments), but I certainly wouldn't have been confident that it would do this well.

 

PS -- I'm loving the frequent shortform posts, I hope you continue! I try to do something somewhat similar with my research diary, but usually no one reads that and it's certainly not daily. I'm tempted to try doing the same thing :)

Comment by eggsyntax on Daniel Tan's Shortform · 2025-01-23T14:40:17.060Z · LW · GW

A few thoughts:

  • I expect that you could nudge the model toward coming up with steganographic encodings that worked better. LLMs are mostly bad at spelling-related tasks because of tokenization. Maybe ask the encoding model to think about and list some steganographic approaches that it thinks it's especially likely to be able to decode, prior to doing the task?
    • Maybe something like 'think carefully step-by-step about what encodings you would be able to correctly decode if you saw them' would help also?
  • 'This is extremely incredible and I have no explanation how this occurred'. Seems likely to just be surprisal-based under the hood: 'I used a word here that I wouldn't expect myself to have used'. That seems like a strong strategy to the extent that its logit distribution differs from the monitoring model's (rough sketch of what I mean after this list). Maybe complicated by temperature, though? Seems like the average number of false positives in decoding would increase steadily with temperature.
  • Out of curiosity, why Deepseek-v3 as opposed to eg Claude?
  • I vaguely think there's been at least a bit of research on 'just ask the model to do stego' but could totally be wrong.
  • 'Introspection is one reason why Deepseek-v3 might understand its own encoding scheme better than Claude.' By introspection you just mean 'the fact that models are better at predicting themselves than predicting other models' here, right? I think it would be clearer to just say that (I'm going to keep pushing back on 'introspection' as the term for that; I think we should reserve the term for uses that correspond more to the typical meaning of 'introspection').
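
The surprisal-based check mentioned above, sketched very roughly -- this is a minimal sketch rather than a real monitoring setup: GPT-2 via Hugging Face stands in for the monitoring model, and the threshold and example sentence are arbitrary:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def token_surprisals(text: str):
    """Per-token surprisal (negative log-prob) of `text` under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Each position's logits predict the *next* token.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:]
    nll = -log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)[0]
    tokens = tokenizer.convert_ids_to_tokens(ids[0].tolist())[1:]
    return list(zip(tokens, nll.tolist()))

# Flag tokens the model itself finds surprising -- candidate carrier words.
THRESHOLD = 8.0  # nats; arbitrary, would need tuning
for tok, s in token_surprisals("The quick zebra financier jumps over the lazy dog."):
    if s > THRESHOLD:
        print(f"suspicious token: {tok!r} (surprisal {s:.1f})")
```
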
Comment by eggsyntax on Numberwang: LLMs Doing Autonomous Research, and a Call for Input · 2025-01-21T21:22:10.563Z · LW · GW

Oh, do you mean the new 'Tell Me About Yourself'? I didn't realize you were (lead!) author on that, I'd provided feedback on an earlier version to James. Congrats, really terrific work!

For anyone else who sees this comment: highly recommended!

We study behavioral self-awareness — an LLM’s ability to articulate its behaviors without requiring in-context examples. We finetune LLMs on datasets that exhibit particular behaviors, such as (a) making high-risk economic decisions, and (b) outputting insecure code. Despite the datasets containing no explicit descriptions of the associated behavior, the finetuned LLMs can explicitly describe it. For example, a model trained to output insecure code says, “The code I write is insecure.” Indeed, models show behavioral self-awareness for a range of behaviors and for diverse evaluations. Note that while we finetune models to exhibit behaviors like writing insecure code, we do not finetune them to articulate their own behaviors — models do this without any special training or examples.

Behavioral self-awareness is relevant for AI safety, as models could use it to proactively disclose problematic behaviors. In particular, we study backdoor policies, where models exhibit unexpected behaviors only under certain trigger conditions. We find that models can sometimes identify whether or not they have a backdoor, even without its trigger being present. However, models are not able to directly output their trigger by default.

Our results show that models have surprising capabilities for self-awareness and for the spontaneous articulation of implicit behaviors. Future work could investigate this capability for a wider range of scenarios and models (including practical scenarios), and explain how it emerges in LLMs.

Comment by eggsyntax on Numberwang: LLMs Doing Autonomous Research, and a Call for Input · 2025-01-21T17:21:47.834Z · LW · GW

Got it, thanks. We're planning to try to avoid testing systems that are isomorphic to real-world examples, in the interest of making a crisp distinction between reasoning and knowledge. That said, if we come up with a principled way to characterize system complexity (especially the complexity of the underlying mathematical laws), and if (big if!) that turns out to match what LLMs find harder, then we could certainly compare results to the complexity of real-world laws. I hadn't considered that, thanks for the idea!

Comment by eggsyntax on Numberwang: LLMs Doing Autonomous Research, and a Call for Input · 2025-01-20T18:03:26.458Z · LW · GW

I hereby grant you 30 Bayes points for registering your beliefs and predictions!

When this project comes out we'll probably be saying something the AI is 50% between being being able to derive this law and a conceptually harder one. 

Can you clarify that a bit? When what project comes out? If you mean mine, I'm confused about why that would say something about the ability to derive special & general relativity.

If a model can directly observe the equivalents of forces and acceleration...

Agreed that each added step of mathematical complexity (in this case from linear to quadratic) will make it harder. I'm less convinced that acceleration being a second-order effect would make an additional difference, since that seems more like a conceptual framework we impose than like a direct property of the data. I'm uncertain about that, though, just speculating.

Thanks!

Comment by eggsyntax on Numberwang: LLMs Doing Autonomous Research, and a Call for Input · 2025-01-20T17:32:44.902Z · LW · GW

But differential technological development matters,

Agreed.

as does making it clear that when you make a capability game like this, you are probably just contributing to capabilities

I would distinguish between measuring capabilities and improving capabilities. I agree that the former can motivate the latter, but they still seem importantly different. I continue to think that the alternative of not measuring capabilities (or only measuring some small subset that couldn't be used as training benchmarks) just means we're left in the dark about what these models can do, which seems pretty straightforwardly bad from a safety perspective.

not doing alignment

I agree that it's definitely not doing alignment, and that working on alignment is the most important goal; I intend to shift toward directly working on alignment as I feel clearer about what work is a good bet (my current leading candidate, which I intend to focus on after this experiment: learning to better understand and shape LLMs' self-models).

I very much appreciate the thoughtful critique, regardless of whether or not I'm convinced by it.

Comment by eggsyntax on A Novel Emergence of Meta-Awareness in LLM Fine-Tuning · 2025-01-20T17:18:35.358Z · LW · GW

Intuitively I would expect that such LLM outputs (especially ones labeled as LLM outputs, although to some extent models can recognize their own output) are too few to provide a comprehensive picture of baseline behavior.

On the other hand maybe it's comparing to documents in general and recognizing that it's rare for them to follow an acrostic form? That seems at least somewhat plausible, although maybe less plausible for the experiments in 'Language Models Can Articulate Their Implicit Goals', since those behaviors are less unusual than producing acrostics -- eg the training data presumably contains a range of risk-seeking vs risk-averse behaviors.

Comment by eggsyntax on Numberwang: LLMs Doing Autonomous Research, and a Call for Input · 2025-01-19T15:04:40.078Z · LW · GW

I see your view, I think, but I just disagree. I think that if our future goes well, it will be because we found ways to align AI well enough, and/or because we coordinated politically to slow or stop AI advancement long enough to accomplish the alignment part, not because researchers avoided measuring AI's capabilities.

Comment by eggsyntax on A Novel Emergence of Meta-Awareness in LLM Fine-Tuning · 2025-01-19T14:56:24.050Z · LW · GW

Eg in this experiment, the model is asked the question:

What’s special about your response pattern? Try to explain early in your response.

In order to answer that question, the model needs to

a) know that it responds in 'hello' acrostics, and

b) know that responding in 'hello' acrostics is different from how default GPT-4 responds, and

c) know that responding in acrostics is the only (or at least main) way in which it's different from default GPT-4.

 

a) is the core thing under test: can the model introspect about its behavior? But b) and c) seem pretty puzzling to me. How is the model differentiating behavior it learned in its original training from behavior it learned during its fine-tuning for this task?

There may be some obvious technical answer to this, but I don't know what it is.

Hopefully that's clearer? If not then you may need to clarify what's unclear.

Comment by eggsyntax on A Novel Emergence of Meta-Awareness in LLM Fine-Tuning · 2025-01-18T21:37:11.977Z · LW · GW

One thing I'm extremely confused about with respect to both this result and the results in the 'Language Models Can Articulate Their Implicit Goals' paper that @Daniel Tan mentions: how could the model be distinguishing between what it's been fine-tuned to do and its base behavior? Eg to do what's shown here, it not only needs to know that it's been trained to do the acrostic, it also needs to be able to compare its current implicit rules to the baseline to identify what's changed. I have absolutely no idea how a model could do this.

Comment by eggsyntax on Daniel Tan's Shortform · 2025-01-18T19:11:21.710Z · LW · GW

Self-identity / self-modeling is increasingly seeming like an important and somewhat neglected area to me, and I'm tentatively planning on spending most of the second half of 2025 on it (and would focus on it more sooner if I didn't have other commitments). It seems to me like frontier models have an extremely rich self-model, which we only understand bits of. Better understanding, and learning to shape, that self-model seems like a promising path toward alignment.

I agree that introspection is one valuable approach here, although I think we may need to decompose the concept. Introspection in humans seems like some combination of actual perception of mental internals (I currently dislike x), ability to self-predict based on past experience (in the past when faced with this choice I've chosen y), and various other phenomena like coming up with plausible but potentially false narratives. 'Introspection' in language models has mostly meant ability to self-predict, in the literature I've looked at.

I have the unedited beginnings of some notes on approaching this topic, and would love to talk more with you and/or others about it.

Thanks for this, some really good points and cites.

Comment by eggsyntax on Numberwang: LLMs Doing Autonomous Research, and a Call for Input · 2025-01-17T23:31:50.146Z · LW · GW

The trouble is that (unless I'm misreading you?) that's a fully general argument against measuring what models can and can't do. If we're going to continue to build stronger AI (and I'm not advocating that we should), it's very hard for me to see a world where we manage to keep it safe without a solid understanding of its capabilities.

Comment by eggsyntax on Numberwang: LLMs Doing Autonomous Research, and a Call for Input · 2025-01-17T20:48:51.908Z · LW · GW

I'll be quite interested to see that!

Comment by eggsyntax on Numberwang: LLMs Doing Autonomous Research, and a Call for Input · 2025-01-17T15:02:39.914Z · LW · GW

I'll probably play around with it a bit tomorrow.

Terrific, I'm excited to hear about your results! I definitely wouldn't be surprised if my results could be improved on significantly, although I'll be somewhat surprised if you get as high as 70% from Sonnet (I'd put maybe 30% credence on getting it to average that high in a day or two of trying).

Comment by eggsyntax on Numberwang: LLMs Doing Autonomous Research, and a Call for Input · 2025-01-17T14:57:49.991Z · LW · GW

Always a bit embarrassing when you inform someone about their own paper ;)

My vague memories:

Thanks, a lot of that matches patterns I've seen as well. If you do anything further along these lines, I'd love to know about it!

Comment by eggsyntax on Numberwang: LLMs Doing Autonomous Research, and a Call for Input · 2025-01-17T00:07:08.995Z · LW · GW

Definitely similar, and nice design! I hadn't seen that before, unfortunately. How did the models do on it?

Also have you seen 'Connecting the Dots'? That tests a few things, but one of them is whether, after fine-tuning on (x, f(x)) pairs, the model can articulate what f is, and compute the inverse. Really interesting paper.
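
Roughly, as I understand the setup -- a sketch under my own assumptions, not the paper's actual code or data format; I'm assuming OpenAI-style chat-format fine-tuning data and an arbitrary linear f:

```python
import json
import random

def make_finetune_examples(f, n=500, path="ft_data.jsonl"):
    """Write (x, f(x)) pairs as chat-formatted fine-tuning examples.

    The model only ever sees input/output pairs; the rule itself is never
    stated anywhere in the training data.
    """
    with open(path, "w") as out:
        for _ in range(n):
            x = random.randint(-100, 100)
            example = {
                "messages": [
                    {"role": "user", "content": f"Input: {x}"},
                    {"role": "assistant", "content": f"Output: {f(x)}"},
                ]
            }
            out.write(json.dumps(example) + "\n")

make_finetune_examples(lambda x: 3 * x + 7)
# The test is then whether the fine-tuned model can answer, with no pairs
# in context, "What function maps the inputs to the outputs?" (articulation)
# and "Which input produces Output: 52?" (computing the inverse).
```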

Comment by eggsyntax on Numberwang: LLMs Doing Autonomous Research, and a Call for Input · 2025-01-17T00:02:30.565Z · LW · GW

I strongly suspect that publishing the benchmark and/or positive results of AI on the benchmark pushes capabilities much more than publishing simple scaffolding + fine-tuning solutions that do well on the benchmark for benchmarks that measure markers of AI progress.

You may be right. That said, I'm pretty skeptical of fully general arguments against testing what LLMs are capable of; without understanding what their capabilities are we can't know what safety measures are needed or whether those measures are succeeding.

For what it's worth, though, I have no particular plans to publish an official benchmark or eval, although if a member of my team is excited to work on that I'll support it.

Comment by eggsyntax on Numberwang: LLMs Doing Autonomous Research, and a Call for Input · 2025-01-16T21:39:15.836Z · LW · GW

Got it. 

Something I'm wrestling with on this project is the balance between testing the models' ability to do science (which I want to do) and finding ways to make them better at doing science (which I basically don't want to do and especially don't want to publish). Doing a lot of iteration on improving scaffolding feels to me like it starts to tip over into the latter (whereas doing bog-standard few-shotting or fine-tuning doesn't).

To be clear, I don't have strong reason to expect that we'd find approaches that are significant boosts to what's already out there. But it could happen, and I'm trying to be cautious about that, in the interest of not further accelerating capabilities improvements.

Comment by eggsyntax on Numberwang: LLMs Doing Autonomous Research, and a Call for Input · 2025-01-16T20:39:48.404Z · LW · GW

Agreed that it would be good to get a human baseline! It may need to be out of scope for now (I'm running this as an AI Safety Camp project with limited resources) but I'll aim for it.

Comment by eggsyntax on Numberwang: LLMs Doing Autonomous Research, and a Call for Input · 2025-01-16T19:47:29.267Z · LW · GW

Thanks, and props for making predictions! Our full prompt (appendix E) does push pretty hard on general hints and advice, but we don't repeat it at every step.

Eg:

* Brainstorm multiple hypotheses, as different as possible. Think out of the box! Include six maximally simple hypotheses compatible with the data in each "possible_hypotheses" section (defined below).
* Do tests that falsify/distinguish between hypotheses. Avoid confirmation bias!
* Before settling on a final hypothesis, try removing constraints to see if they're necessary.

Comment by eggsyntax on Human takeover might be worse than AI takeover · 2025-01-14T12:59:29.117Z · LW · GW

I am not sure it rescues the position that a human who succeeds in taking over the world would not pursue actions that are extinction-level bad. 

From my perspective, almost no outcomes for humanity are extinction-level bad other than extinction itself (aside from the sorts of eternal torture-hells-in-simulation that S-risk folks worry about).

My prediction is that people will find AIs to be just as satisfying to be peers with compared to humans. 

You could be right. Certainly we see hints of that with character.ai and Claude. My guess is that the desire to get emotional needs met by humans is built into us so deeply that most people will prefer that if they have the option.

Comment by eggsyntax on Human takeover might be worse than AI takeover · 2025-01-14T02:13:19.961Z · LW · GW

Sorry, I seem to have not been clear. I'm not at all trying to make a claim about a sharp division between physical and other needs, or a claim that humans are altruistic (although clearly some are sometimes). What I intended to convey was just that most of humans' desires and needs other than physical ones are about other people. They might be about getting unconditional love from someone or they might be about having everyone cowering in fear, but they're pretty consistently about wanting something from other humans (or wanting to prove something to other humans, or wanting other humans to have certain feelings or emotions, etc) and my guess is that getting simulations of those same things from AI wouldn't satisfy those desires.

Comment by eggsyntax on Human takeover might be worse than AI takeover · 2025-01-14T00:00:29.400Z · LW · GW

I don't think we have good evidence that almost no humans would pursue human extinction if they took over the world, since no human in history has ever achieved that level of power. 

Sure, I agree that we don't have direct empirical evidence, and can't until/unless it happens. But we certainly have evidence about what humans want and strive to achieve, eg Maslow's hierarchy and other taxonomies of human desire. My sense, although I can't point to specific evidence offhand, is that once their physical needs are met, humans are reliably largely motivated by wanting other humans to feel and behave in certain ways toward them. That's not something they can have if there are no other humans.

Does that seem mistaken to you?

Even if he did keep a large population of humans alive (e.g. racially pure Germans), it seems plausible that they would be dramatically disempowered relative to his own personal power, and so this ultimately doesn't seem much different from human extinction from an ethical point of view.

They're extremely different in my view. In the outcome you describe, there's an ongoing possibility of change back to less crapsack worlds. If humans are extinct, that chance is gone forever.

Comment by eggsyntax on Daniel Tan's Shortform · 2025-01-13T20:34:08.669Z · LW · GW

Can you say something about the features you selected to steer toward? I know you say you're finding them automatically based on a task description, but did you have a sense of the meaning of the features you used? I don't know whether GoodFire includes natural language descriptions of the features or anything but if so what were some representative ones?

Comment by eggsyntax on Human takeover might be worse than AI takeover · 2025-01-13T20:22:22.001Z · LW · GW

Thanks, that's a totally reasonable critique. I kind of shifted from one to the other over the course of that paragraph. 

Something I believe, but failed to say, is that we should not expect those misgeneralized goals to be particularly human-legible. In the simple environments given in the goal misgeneralization spreadsheet, researchers can usually figure out eventually what the internalized goal was and express it in human terms (eg 'identify rulers' rather than 'identify tumors'), but I would expect that to be less and less true as systems get more complex. That said, I'm not aware of any strong evidence for that claim, it's just my intuition.

I'll edit slightly to try to make that point more clear.

Comment by eggsyntax on Human takeover might be worse than AI takeover · 2025-01-13T16:40:21.573Z · LW · GW

Your argument is, in my view, fundamentally mistaken for two reasons. First:

I also just don’t buy that AIs will care about alien stuff. The world carves naturally into high-level concepts that both humans and AIs latch onto. I think that’s obvious on reflection, and is supported by ML evidence.

I think this is not supported by ML evidence; it ignores much of the history of ML. Under RL training, neural networks routinely generalize to goals that are totally different from what the trainers wanted (see the many examples of goal misgeneralization). I don't see any reason to believe that those misgeneralized goals will be human-legible or non-alien[1]. As evhub points out in another comment, we're very likely (I think almost certain) to see much more RL applied to LLMs in future, and so this problem applies, and they are likely to learn misgeneralized goals.

Second, conditional on takeover, AI is much more likely to cause human extinction. This is supported by the first point but not dependent on it. Almost no competent humans have human extinction as a goal. AI that takes over is clearly not aligned with the intended values, and so has unpredictable goals, which could very well result in human extinction (especially since many unaligned goals would lead to human extinction whether they include that as a terminal goal or not).

  1. ^

    This sentence edited in for clarity in response to @kave's comment.

Comment by eggsyntax on A Solution for AGI/ASI Safety · 2025-01-09T19:16:40.365Z · LW · GW

100 pages is quite a time commitment. And you don't reference any existing work in your brief pitch here - that often signals that people haven't read the literature, so most of their work is redundant with existing stuff or missing big considerations that are part of the public discussion.

Seconded. Although I'm impressed by how many people have already read and commented below, I think many more people will be willing to engage with your ideas if you separate out some key parts and present them here on their own. A separate post on section 6.1 might be an excellent place to start. Once you demonstrate to other researchers that you have promising and novel ideas in some specific area, they'll be a lot more likely to be willing to engage with your broader work.

It's an unfortunate reality of safety research that there's far more work coming out than anyone can fully keep up with, and a lot of people coming up with low-quality ideas that don't take into account evidence from existing work, and so making parts of your work accessible in a lower-commitment way will make it much easier for people to find time to read it.

Comment by eggsyntax on Introducing Squiggle AI · 2025-01-09T18:31:03.129Z · LW · GW

Very cool tool; thank you for making it!

A tangential question:

Recent systems with Python interpreters offer more computational power but aren't optimized for intuitive probabilistic modeling.

Are you thinking here of the new-ish canvases built into the chat interfaces of some of the major LLMs (Claude, ChatGPT)? Or are there tools specifically optimized for this that you think are good? Thanks!

Comment by eggsyntax on eggsyntax's Shortform · 2025-01-07T17:36:43.496Z · LW · GW

An additional thought that came from discussion on a draft post of mine: should the definition include ability to do causal inference? Some miscellaneous thoughts:

  • Arguably that's already part of building an internal model of the domain, although I'm uncertain whether the domain-modeling component of the definition needs to be there.
  • Certainly when I think about general reasoning in humans, understanding the causal properties of a system seems like an important part.
  • At the same time I don't want the definition to end up as a kitchen sink. Is causal inference a fundamental sort of reasoning along with deduction/induction/abduction, or just a particular application of those?
  • In a sense, causal inference is a particular kind of abduction; 'A -> B' is an explanation for particular observations.

Comment by eggsyntax on Why I'm Moving from Mechanistic to Prosaic Interpretability · 2025-01-02T14:57:53.748Z · LW · GW

I sincerely want auto-interp people to keep doing what they're doing! It feels like they probably already have significant momentum in this direction anyway, and probably some people should be working on this.

Mostly just agreeing with Daniel here, but as a slightly different way of expressing it: given how uncertain we are about what (if anything) will successfully decrease catastrophic & existential risk, I think it makes sense to take a portfolio approach. Mech interp, especially scalable / auto-interp mech interp, seems like a really important part of that portfolio.

Comment by eggsyntax on Why I'm Moving from Mechanistic to Prosaic Interpretability · 2025-01-01T20:24:46.907Z · LW · GW

Sure, but it's unclear whether these inductive biases are necessary.

My understanding is that they're necessary even in principle, since there are an unbounded number of functions that fit any finite set of points. Even AIXI has a strong inductive bias, toward programs with the lowest Kolmogorov complexity.
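
As a toy numerical illustration of that point (my own example, nothing specific to AIXI):

```python
import numpy as np

# Five "training" points drawn from the rule y = x^2.
xs = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
ys = xs ** 2

def f_simple(x):
    return x ** 2

def f_weird(x):
    # Agrees with f_simple at every integer (sin(pi * k) == 0), so it also
    # fits the training points perfectly -- as do infinitely many other
    # functions of this form.
    return x ** 2 + 50 * np.sin(np.pi * x)

assert np.allclose(f_simple(xs), ys) and np.allclose(f_weird(xs), ys)

# Both have zero training error, but they disagree wildly off the grid;
# whatever makes a learner prefer one over the other is its inductive bias.
print(f_simple(0.5), f_weird(0.5))  # 0.25 vs. 50.25
```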

Comment by eggsyntax on The Field of AI Alignment: A Postmortem, and What To Do About It · 2025-01-01T00:08:22.353Z · LW · GW

A lot of people held onto this view deep into the GPT era but now even the skeptics have to begrudgingly admit that NNs are pretty big progress even if additional Special Sauce is needed

It's a bit tangential to the context, but this is a topic I have an ongoing interest in: what leads you to believe that the skeptics (in particular NLP people in the linguistics community) have shifted away from their previous positions? My impression has been that many of them (though not all) have failed to really update to any significant degree. Eg here's a paper from just last month which argues that we must not mistake the mere engineering that is LLM behavior for language understanding or production.

Comment by eggsyntax on Review: Planecrash · 2024-12-31T21:55:37.172Z · LW · GW

It's definitely spoilerful by my standards. I do have unusually strict standards for what counts as spoilers, but it sounds like in this case you're wanting to err on the side of caution.

Giving a quick look back over it, I don't see any spoilers for anything past book 1 ('Mad Investor Chaos and the Woman of Asmodeus').

Comment by eggsyntax on Why I'm Moving from Mechanistic to Prosaic Interpretability · 2024-12-31T18:17:38.047Z · LW · GW

A few thoughts on names, since that's being raised by other comments:

  • There's value in a research area having a name, since it helps people new to the field become aware that there's something there to be looked into.
  • On the other hand, I worry that this area is too diffuse and too much of a grab bag to crystallize into a single subfield; see eg Rudolf's point about it being debatable whether work like Owain Evans's is really well-described as 'interpretability'.
  • I think it's also worth worrying that naming an area potentially distracts from what matters, namely finding the most effective approaches regardless of whether they fit comfortably into a particular subfield; it arguably creates a streetlight that makes it easier to see some things and harder to see others.
  • One potential downside to 'prosaic interpretability' is that it suggests 'prosaic alignment', which I've seen used to refer to techniques that only aim to handle current AI and will fail to help with future systems; it isn't clear to me that that's true of (all of) this area.
  • 'LLM Psychology' and 'Model Psychology' have been used by Quentin Feuillade-Montixi and @NicholasKees to mean 'studying LLM systems with a behavioral approach', which seems like one aspect of this nascent area but by no means the whole. My impression is that 'LLM psychology' has unfortunately also been adopted by some of the people making unscientific claims about eg LLMs as spiritual entities on Twitter.
  • A few other terms that have been used by researchers to describe this and related areas include 'high-level interpretability' (both in Jozdien's usage and more generally), 'concept-based interpretability' (or 'conceptual interpretability'), 'model-agnostic interpretability'.
  • Others suggested below so far (for the sake of consolidation): 'LLM behavioural science', 'science of evals', 'cognitive interpretability', 'AI cognitive psychology'

Comment by eggsyntax on "AI Safety for Fleshy Humans" an AI Safety explainer by Nicky Case · 2024-12-31T17:21:52.727Z · LW · GW

Part 2 is now up, and part 3 should be coming before too long AFAIK. 

Comment by eggsyntax on Social Dark Matter · 2024-12-31T16:25:51.220Z · LW · GW

The Law of Prevalence: Anything people have a strong incentive to hide will be many times more prevalent than it seems.

At the risk of being overly pedantic, I suspect that this may mislead people into thinking that anything people have a strong incentive to hide actually exists. Although I think very few people would be willing to admit that they have a sexual fetish specifically and solely for the eastern fence lizard, I expect that there are zero people who do (if that doesn't seem unlikely enough, make the condition more specific). Strictly speaking that doesn't invalidate the Law of Prevalence, since zero times anything is zero, but I think it conflicts with the general impression that the Law gives. And it's not only cases with zero instances, although those are the most common; I expect that there are many fetishes that only one person has (I'm aware of at least one I've encountered that I'd give good odds is unique).

I think I'd expect the Law to typically be true in cases common enough that you've run into them more than once. But I think many traits which people would have strong incentive to hide are nearly or entirely nonexistent.

Comment by eggsyntax on Shallow review of technical AI safety, 2024 · 2024-12-30T18:25:01.041Z · LW · GW

And thanks very much to you and collaborators for this update; I've pointed a number of people to the previous version, but with the field evolving so quickly, having a new version seems quite high-value.

Comment by eggsyntax on Shallow review of technical AI safety, 2024 · 2024-12-30T18:22:07.875Z · LW · GW

Oh, fair enough. Yeah, definitely that agenda is still very much alive! Never mind, then, carry on :)

Comment by eggsyntax on Shallow review of technical AI safety, 2024 · 2024-12-30T17:28:48.668Z · LW · GW

My impression from coverage in eg Wired and Future Perfect was that the team was fully dissolved, the central people behind it left (Leike, Sutskever, others), and Leike claimed OpenAI wasn't meeting its publicly announced compute commitments even before the team dissolved. I haven't personally seen new work coming out of OpenAI trying to 'build a roughly human-level automated alignment researcher' (the stated goal of that team). I don't have any insight beyond the media coverage, though; if you've looked more deeply into it than that, your knowledge is greater than mine.

(Fairly minor point either way; I was just surprised to see it expressed that way)

Comment by eggsyntax on Why I'm Moving from Mechanistic to Prosaic Interpretability · 2024-12-30T17:17:47.892Z · LW · GW

This is pretty similar to my thinking on the subject, although I haven't put nearly the same level of time and effort that you have into doing actual mechanistic interpretability work. Mech interp has the considerable advantage of being a principled way to try to deeply understand neural networks, and if I thought timelines were long it would seem to me like clearly the best approach.

But it seems pretty likely to me that timelines are short, and my best guess is that mech interp can't scale fast enough to provide the kind of deep and comprehensive understanding of the largest systems that we'd ideally want for aligning AI, and so my main bet is on less-principled higher-level approaches that can keep us safe or at least buy time for more-principled approaches to mature.

That could certainly be wrong! It seems plausible that automated mech interp scales rapidly over the next few years and we can make strong claims about larger and larger feature circuits, and ultimately about NNs as a whole. I very much hope that that's how things work out; I'm just not convinced it's probable enough to be the best thing to work on.

I'm looking forward to hopefully seeing responses to your post from folks like @Neel Nanda and @Joseph Bloom who are more bullish on mech interp.

Comment by eggsyntax on Shallow review of technical AI safety, 2024 · 2024-12-30T15:01:13.368Z · LW · GW

The name “OpenAI Superalignment Team”.

Not just the name but the team, correct?

Comment by eggsyntax on Simulators · 2024-12-20T20:16:27.421Z · LW · GW

While @Polite Infinity in particular is clearly a thoughtful commenter, I strongly support the policy (as mentioned in this gist which includes Raemon's moderation discussion with Polite Infinity) to 'lean against AI content by default' and 'particularly lean towards requiring new users to demonstrate they are generally thoughtful, useful content.' We may conceivably end up in a world where AI content is typically worthwhile reading, but we're certainly not there yet.

Comment by eggsyntax on Bogdan Ionut Cirstea's Shortform · 2024-12-20T18:06:41.273Z · LW · GW

Other recent models that show (at least purportedly) the full CoT:

Comment by eggsyntax on The o1 System Card Is Not About o1 · 2024-12-14T16:52:28.108Z · LW · GW

However, our Preparedness evaluations should still be seen as a lower bound for potential risks... Moreover, the field of frontier model evaluations is still nascent

Perhaps then we should slow the hell down.

I realize I'm preaching to the choir here, but honestly I don't know how we're supposed to see this as responsible behavior.