AISC project: TinyEvals

post by Jett Janiak (jett) · 2023-11-22T20:47:32.376Z · LW · GW · 0 comments

Contents

  Summary
  Motivation
    My theory of change for mech interp
    Fully understanding a model vs reverse engineering specific capabilities
    Why TinyStories?
    Why just doing evaluations for now?
  Steps involved
    Identify capabilities
    Characterise capabilities
    Write the paper
  Risks and downsides
  Acknowledgements
  Team
  Appendix
    Some capabilities observed in TinyStories 1M


Apply to work on this project with me at AI Safety Camp 2024 before 1st December 2023.

The project is not set in stone, and I am looking for feedback!

Summary

TinyStories is a suite of Small Language Models (SLMs) trained exclusively on children's stories generated by ChatGPT. The models use simple, yet coherent English, which far surpasses what was previously observed in other models of comparable size.

I hope that most of the capabilities of these models can be thoroughly understood using currently available interpretability techniques. Doing so would represent a major milestone in the development of mechanistic interpretability (mech interp).

The goal of this AISC project is to publish a paper that systematically identifies and characterises the range of capabilities exhibited by the TinyStories models. While in-depth analysis of the underlying circuits is outside the current scope, this project represents an important initial step in that direction.

Gaining a clear picture of the capabilities of these models will encourage the research community to subsequently build on these findings by analysing the responsible circuits. This will further the development of mech interp and provide insights into how language models work internally.

Motivation

My theory of change for mech interp

Fully understanding a model vs reverse engineering specific capabilities

Why TinyStories?

Why just doing evaluations for now?

Steps involved

Identify capabilities

Start with the least capable model.

  1. Collect probabilities of correct next token predictions on the TinyStories validation dataset
  2. Visualise samples where the model was correct and confident about its next token prediction
  3. Identify the simplest and most common pattern or capability
  4. Filter out the samples expressing the identified capability
  5. Go back to step 2

Repeat for the next model, but filter out all the cases where both models were correct and confident, so that only new capabilities are highlighted.
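
The loop above could be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the project's actual tooling: the function names, the 0.9 confidence threshold, and the toy logits are all assumptions made for the example. It assumes you already have, for each model, one list of next-token logits per position of a tokenized validation sample.

```python
import math

def correct_token_probs(logits, tokens):
    """Return p(tokens[i+1] | tokens[:i+1]) for every position i.

    logits[i] is the list of next-token logits after reading tokens[:i+1].
    """
    probs = []
    for i in range(len(tokens) - 1):
        row = logits[i]
        m = max(row)  # subtract the max for a numerically stable softmax
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        probs.append(math.exp(row[tokens[i + 1]] - log_z))
    return probs

def confident_positions(probs, threshold=0.9):
    """Positions where the correct token got at least `threshold` probability.

    Any threshold above 0.5 also guarantees the correct token was the top prediction.
    The value 0.9 is an arbitrary choice for this sketch.
    """
    return [i for i, p in enumerate(probs) if p >= threshold]

def newly_confident(small_probs, large_probs, threshold=0.9):
    """Positions the larger model gets confidently right but the smaller one does not."""
    small = set(confident_positions(small_probs, threshold))
    return [i for i in confident_positions(large_probs, threshold) if i not in small]

# Toy example: a 3-token vocabulary and a 4-token sample (made-up numbers).
tokens = [0, 1, 2, 1]
small_logits = [[0.0, 5.0, 0.0], [0.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
large_logits = [[0.0, 5.0, 0.0], [0.0, 0.0, 5.0], [0.0, 5.0, 0.0]]

small_probs = correct_token_probs(small_logits, tokens)
large_probs = correct_token_probs(large_logits, tokens)

print(confident_positions(small_probs))           # → [0, 2]
print(newly_confident(small_probs, large_probs))  # → [1]
```

In practice the logits would come from running each TinyStories model over the validation set, and the surviving positions would be inspected by hand to spot the next pattern.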

Characterise capabilities

For each identified capability:

  1. Try to red-team the capability
    1. What do you think are the characteristics of the text that make the behaviour present?
    2. Are there any such examples where the behaviour is not present?
    3. Does it work on synthetic examples?
    4. What can you change in the text and still see the same results?
  2. Define it as a task, i.e. a set of (prompt, correct_answer) pairs
  3. Evaluate performance of each model on the task by measuring:
    1. Probability of correct_answer
    2. Rank of correct_answer
    3. If there is one obvious wrong answer: Logit difference between correct_answer and wrong_answer
  4. Summarise the results
    1. What is the capability? 
    2. Why was it useful to learn?
    3. Which models are able to perform the task?
    4. Is the performance uniform across all examples? If not, what is different between them?
    5. How could a transformer implement this capability?
    6. Do all of the models perform equally well on the task? If not, why could that be?
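
The three metrics in step 3 could be computed along these lines. This is a hedged sketch, not a prescribed implementation: `evaluate_pair` and `summarise` are hypothetical helper names, the logits are made-up numbers, and the code assumes that correct_answer and wrong_answer are each a single token, scored from the logits at the last position of the prompt.

```python
import math

def evaluate_pair(logits, correct_id, wrong_id=None):
    """Score one (prompt, correct_answer) pair from the final-position logits."""
    m = max(logits)  # subtract the max for a numerically stable softmax
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    result = {
        # Probability the model assigns to the correct answer token.
        "prob": math.exp(logits[correct_id] - log_z),
        # Rank 0 means the correct answer is the top prediction.
        "rank": sum(1 for x in logits if x > logits[correct_id]),
    }
    if wrong_id is not None:
        # Positive values mean the model prefers the correct answer.
        result["logit_diff"] = logits[correct_id] - logits[wrong_id]
    return result

def summarise(results):
    """Average each metric over all pairs that report it."""
    keys = {k for r in results for k in r}
    return {
        k: sum(r[k] for r in results if k in r) / sum(1 for r in results if k in r)
        for k in keys
    }

# Toy example with a 4-token vocabulary (made-up logits).
scores = evaluate_pair([2.0, 0.5, 3.0, -1.0], correct_id=0, wrong_id=2)
print(scores["rank"])        # → 1
print(scores["logit_diff"])  # → -1.0
```

Running `summarise` over the per-pair results for each model gives one row per model, which makes it easy to compare models on the same task. Looking at the per-pair scores rather than only the averages helps with the question of whether performance is uniform across examples.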

Write the paper

The structure and the main message of the paper will depend on the findings. We should start writing in the second month of the project. That will help to consolidate our understanding and direct further research into the most promising directions.

Risks and downsides

There is a small risk that

  1. Tools to identify capabilities we develop will end up supporting capabilities work
  2. Further interpretability research on capabilities we identify will motivate new insights that advance capabilities

See "Should we publish mechanistic interpretability research?" [AF · GW]. We will seek senior researchers’ advice before sharing our work widely.

Acknowledgements

I would like to thank @Linda Linsefors [LW · GW], @Arthur Conmy [LW · GW], @Lucia Quirke [LW · GW], and @cmathw [LW · GW] for feedback on this proposal. I would like to thank Lucia Quirke, Lovis Heindrich, and @RGRGRG [LW · GW] for sharing their preliminary research on TinyStories.

Team

Team size: 3-5 people including myself, depending on their time commitment. The problem has a lot of surface area and people can easily work in parallel.

Research Lead
@Jett [LW · GW] (feel free to DM me with any questions)

I participated in the MATS winter 2023 cohort, mech interp stream, under Neel Nanda’s mentorship. I co-authored

  1. A circuit for Python docstrings in a 4-layer attention-only transformer [LW · GW]
  2. An Adversarial Example for Direct Logit Attribution: Memory Management in gelu-4l  
  3. Polysemantic Attention Head in a 4-Layer Transformer [LW · GW]

Projects 1 and 3 involved a lot of identification and characterisation of capabilities, similar to what I envision for this AISC project. In projects 2 and 3 I was acting as a research lead / mentor, and I received positive feedback. I commit to working on the project at least 10 hours per week.

Team Coordinator: I would prefer another team member to take on that role.

Skill requirements
Required:

Nice to have:

Nice to haves that I lack:

Appendix

Some capabilities observed in TinyStories 1M

  1. N-grams
    1. Once upon a time
    2. From that day on
    3. avocados
  2. Repeated tokens
    1. a big house with a lot of rooms. The house
    2. there was a big garage [...] they needed more space in the garage
    3. girl named Lily. [...] As they drove down the road, Lily
  3. Repeated multi-token names (induction?)
    1. a big, hairy rabbit named Bongo. Bongo
    2. a little fish named Nemo. One day, Nemo
    3. mouse named Timmy. He lived in a cozy hole in the wall of a big house. Timmy
  4. Common phrases (skip trigrams?)
    1. see something up close
  5. Plural to singular with different tokenization
    1. went to see the zebras, Lily saw a unique zebra
  6. Understanding that a context was just provided, and it’s time for a story
    1. Once upon a time, in a big forest, there lived a rhinoceros named Roxy. Roxy loved to climb. She climbed trees, rocks, and hills. One
    2. Once upon a time, in a small yard, there was a small daisy. The daisy had a name. Her name was Daisy. Daisy was very small, but she was also very happy.\n
    3. Once upon a time, there was a big, heavy alligator. He lived near a small pond. He was very hungry and wanted to eat something.\n\nOne
  7. Knowing when to end a quote
    1. Kitty smiled and replied, "Thank you, Spot. I polish it every day."
    2. Billy saw that Roxy was sad and asked, "Why are you sad, Roxy?"
    3. The cow said, "I am lonely. I want a friend."
  8. Pronouns
    1. Tim went to his
    2. So, Mia and Tom played together. They
    3. bought some light bulbs. When he came back, he put them
    4. lemon on the ground. He wanted to play with it
  9. Predicting related concepts: bookshelf [...] book, park [...] grass, forgive [...] happy, lunchtime [...] eat, zebra [...] stripes, tree [...] climb, nurse [...] bandage, monkey [...] jungle, shop [...] counter, laundry [...] clothes, inside [...] goodbye, octopus [...] ocean, lost [...] can't, road [...] driving, emergency [...] doctor, milk [...] spilled, hammer [...] screwdriver, Daddy [...] Mommy
  10. Indirect Object Identification
    1. Spot saw the shiny car and said, "Wow, Kitty, your car is so bright and clean!" Kitty smiled and replied, "Thank you, Spot
    2. Buddy kicked the ball with his strong legs. The ball flew into the goal! Spot was so happy. He and Buddy
    3. One sunny day, Amy went to the yard with her friend, Max. Max saw the purple swing and said, "Wow! I want to swing too!" Amy
