Posts

Comments

Comment by erhora on LLMs can learn about themselves by introspection · 2024-10-18T20:38:35.365Z · LW · GW

This is a super interesting line of work!

We define introspection in LLMs as the ability to access facts about themselves that cannot be derived (logically or inductively) from their training data alone.

The entire model is, in a sense, "logically derived" from its training data, so any facts about its output on certain prompts can also be logically derived from its training data.

Why did you choose to make non-derivability part of your definition? Do you mean something like "cannot be derived quickly, for example without training a whole new model"? I'm worried that your current definition is impossible to satisfy, and that you are setting yourself up for easy criticism because it sounds like you're hypothesising strong emergence, i.e. magic.

Comment by erhora on Near-mode thinking on AI · 2024-08-08T15:26:04.498Z · LW · GW

You can think of a pipeline like

  • feed lots of good papers in [situational awareness / out-of-context reasoning / ...] into GPT-4's context window,
  • ask it to generate 100 follow-up research ideas,
  • ask it to develop specific experiments to run for each of those ideas,
  • feed those experiments for GPT-4 copies equipped with a coding environment,
  • write the results to a nice little article and send it to a human.

Obvious, but perhaps worth reminding ourselves, that this is a recipe for automating/speeding-up AI research in general, so would be a neutral at best update for AI safety if it worked.

It does seem that for automation to have a disproportionately large impact for AI alignment it would have to be specific to the research methods used in alignment. This may not necessarily mean automating the foundational and conceptual research you mention, but I do think it necessarily does not look like your suggest pipeline.

Two examples might be: a philosophically able LLM that can help you de-confuse your conceptual/foundational ideas; automating mech-interp (e.g. existing work on discovering and interpreting features) in a way that does not generalise well to other AI research directions.