Seeking Feedback on My Mechanistic Interpretability Research Agenda 2023-09-12T18:45:08.902Z
Thoughts about the Mechanistic Interpretability Challenge #2 (EIS VII #2) 2023-07-28T20:44:36.868Z
Best Ways to Try to Get Funding for Alignment Research? 2023-04-04T06:35:05.356Z


Comment by RGRGRG on There should be more AI safety orgs · 2023-09-22T00:48:58.602Z · LW · GW

For any potential funders reading this:  I'd be open to starting an interpretability lab and would love to chat.  I've been full-time on MI for about 4 months - here is some of my work:

I have a few PhD friends who are working for software jobs they don't like and would be interested in joining me for a year or longer if there were funding in place (even for just the trial period Marius proposes).

My very quick take is that interpretability has yet to understand even small language models, and that this is a valuable direction to focus on next.  (more details here: )


For any potential cofounders reading this, I have applied to a few incubators and VC funds, without any success.  I think some applications would be improved if I had a co-founder.  If you are potentially interested in cofounding an interpretability startup and you live in the Bay Area, I'd love to meet for coffee and see if we have a shared vision and potentially apply to some of these incubators together.

Comment by RGRGRG on What I would do if I wasn’t at ARC Evals · 2023-09-09T22:23:09.290Z · LW · GW

I really like your ambitious MI section and I think you hit on a few interesting questions I've come across elsewhere:

Two researchers interpreted a 1-layer transformer network and then I interpreted it differently - there isn't a great way to compare our explanations (or really to know how similar or different they are).

With papers like the Hydra effect demonstrating that similar knowledge can be spread throughout a network, it's not clear to me whether or how we should analyze impact - can/should we jointly ablate multiple units across different heads at once?

I'm personally unsure how to split my time between interpreting small networks vs larger ones.  Should I focus 100% on interpreting 1-2 layer TinyStories LMs or is looking into 16+ layer LLMs valuable at this time?
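To make the joint-ablation question above concrete, here is a toy sketch (entirely hypothetical numbers and unit names, not a real transformer or any real library) of why ablating units one at a time can miss knowledge that is spread redundantly across heads:

```python
# Toy model of Hydra-effect-style redundancy: the same "knowledge" is
# carried by two (layer, head) units, so ablating either alone looks
# harmless, while jointly ablating both reveals the shared contribution.

KNOWLEDGE_HEADS = {(0, 1), (1, 2)}    # units redundantly carrying one fact
OTHER_HEADS = {(0, 0): 1, (1, 3): 2}  # additive contributions of other heads
KNOWLEDGE_VALUE = 9

def forward(ablated=frozenset()):
    """Output = contributions of un-ablated heads; the knowledge survives
    as long as at least one knowledge-carrying head is un-ablated."""
    out = sum(v for h, v in OTHER_HEADS.items() if h not in ablated)
    if any(h not in ablated for h in KNOWLEDGE_HEADS):
        out += KNOWLEDGE_VALUE
    return out

baseline = forward()

# Solo ablations of either knowledge head show zero effect...
solo_effects = {h: baseline - forward(frozenset({h})) for h in KNOWLEDGE_HEADS}

# ...but jointly ablating both reveals the shared knowledge.
joint_effect = baseline - forward(frozenset(KNOWLEDGE_HEADS))
```

Under this (made-up) redundancy structure, per-unit ablation scores would suggest the knowledge heads do nothing, which is exactly why the joint-ablation question seems worth asking.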

Comment by RGRGRG on How did you make your way back from meta? · 2023-09-08T16:14:01.629Z · LW · GW

Most weekdays, I set myself the goal of doing twelve focused blocks of 24 minutes of object-level work (my variant on Pomodoro).  Once I complete these blocks, I can do whatever I want - whether that's stopping work for the rest of the day, more object-level work, meta work, or anything else.

If you try something like this, I'd recommend starting with a goal of 6(?) such blocks, letting yourself do as much or as little meta as you want afterwards, and then gradually working up to 10-13 blocks.
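For what it's worth, the day's block structure can be sketched with a short script (the 6-minute break length and the start time are my own assumptions; only the 24-minute focused block comes from the scheme above):

```python
import datetime

BLOCK_MINUTES = 24  # one focused block, per the variant described above

def schedule(start, n_blocks, break_minutes=6):
    """Return (start, end) times for n focused blocks with short breaks.

    The break length is an assumption, not part of the original scheme.
    """
    out = []
    t = start
    for _ in range(n_blocks):
        end = t + datetime.timedelta(minutes=BLOCK_MINUTES)
        out.append((t, end))
        t = end + datetime.timedelta(minutes=break_minutes)
    return out

# Twelve blocks starting at 9:00 finish at 14:54 under these assumptions.
blocks = schedule(datetime.datetime(2023, 9, 8, 9, 0), n_blocks=12)
```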

Comment by RGRGRG on Why Is No One Trying To Align Profit Incentives With Alignment Research? · 2023-08-24T22:19:35.782Z · LW · GW

Over the last 3 months, I've spent some time thinking about mech interp as a for-profit service.  I've pitched to one VC firm, interviewed for a few incubators/accelerators (including Y Combinator), sent out some pitch documents, co-founder dated a few potential cofounders, and chatted with potential users and some AI founders.

There are a few issues:

First, as you mention, I'm not sure mech interp is yet ready to understand models.  I recently interpreted a 1-layer model trained on a binary classification function and am currently working on understanding a 1-layer language model (TinyStories-1Layer-21M).  TinyStories is (much?) harder than the binary classification network (which took 24 focused days of solo research).  This isn't to say I or someone else won't understand how 1-layer models work a few months from now.  But even once that happens, we would likely need to interpret multi-layer models before being ready to interpret models that are running in production.

Second, outsiders can observe that mech interp might not be far enough along to build a product around.  The feedback I received from the VC firm and YC was that my ideas weren't far enough along.

Third, I personally have not yet found someone I'm excited to cofound with.  Some people have different visions in terms of safety (some just don't care at all).  Others share my vision, but we don't match for other reasons.

Fourth, I'm not certain I've yet found the ideal first customer - some people see interp as a nice-to-have, because with language models, if you get a bad output, you can often just run the model again (keeping a human in the loop).  To be clear, I haven't given up on finding that ideal customer; it could be something like government, or that customer might not exist until AI models do something really bad.

Fifth, I'm unsure whether I actually want to run a company.  I love doing interp research and think I am quite good at it (among other things, I have a software background, a PhD in Robotics, and a knack for solving puzzles).  I consider myself a 10x+ engineer.  At least right now, it seems I can add more value by doing independent research than by running a company.

For me, the first issue is the main one.  Once interp is farther along, I'm open to putting more time into thinking about the other issues.  If anyone reading this is potentially interested in chatting, feel free to DM me.

Comment by RGRGRG on The positional embedding matrix and previous-token heads: how do they actually work? · 2023-08-15T03:16:49.913Z · LW · GW

Thank you!  I'm still surprised how little most heads in L0 and L1 seem to use the positional embeddings.  L1H4 looks reasonably uniform, so I could accept that it somehow feeds into L2H2.

Comment by RGRGRG on Decomposing independent generalizations in neural networks via Hessian analysis · 2023-08-15T03:04:48.425Z · LW · GW

nit: do you mean 6x6 Boolean patterns not 4x4?

Comment by RGRGRG on The positional embedding matrix and previous-token heads: how do they actually work? · 2023-08-12T19:19:28.259Z · LW · GW

This is a surprising and fascinating result.  Do you have attention plots of all 144 heads you could share?

I'm particularly interested in the patterns for all heads on layers 0 and 1 matching the following caption

(Left: a 50x50 submatrix of LXHY's attention pattern on a prompt from openwebtext-10k. Right: the same submatrix of LXHY's attention pattern, when positional embeddings are averaged as described above.)

Comment by RGRGRG on Thoughts on sharing information about language model capabilities · 2023-08-01T23:10:58.259Z · LW · GW

As one specific example - has RLHF, which the post below suggests was initially intended for safety, been a net negative for AI safety?

Comment by RGRGRG on Thoughts on sharing information about language model capabilities · 2023-08-01T17:26:14.905Z · LW · GW

My primary safety concern is what happens if one of these analyses somehow leads to a large improvement over the state of the art.  I don't know what form this would take, and it might be unexpected given the Bitter Lesson you cite above - but if it happens, what do we do then?  Given this is hypothetical, and the next large improvement in LMs could come from elsewhere, I'm not suggesting we stop sharing now.  But we should be prepared for a point in time where such sharing leads to significantly stronger models, at which point we should re-evaluate sharing such eval work.

Comment by RGRGRG on Thoughts about the Mechanistic Interpretability Challenge #2 (EIS VII #2) · 2023-07-31T19:24:50.038Z · LW · GW

> The differences between these two projects seem like an interesting case study in MI. I'll probably refer to this a lot in the future.

Excited to see case studies comparing and contrasting our works.  Not that you need my permission, but feel free to refer to this post (and if it's interesting, this comment) as much or as little as desired.

One thing that I don't think came out in my post is that my initial reaction to the previous solution was that it was missing some things and might even have been mostly wrong.  (I'm still not certain that it's not at least partially wrong, but this is harder to defend and I suspect might be a minority opinion).  

Contrast this to your first interp challenge - I had a hypothesis of "slightly slant-y (top left to bottom right)" images for one of the classes.  After reading the first paragraph of the tl;dr of their written solution to the first challenge, I was extremely confident they were correct.

Comment by RGRGRG on Thoughts about the Mechanistic Interpretability Challenge #2 (EIS VII #2) · 2023-07-31T18:59:38.068Z · LW · GW

One thought I've had, inspired by discussion (explained more later), is whether: 

"label[ing] points by interpolating" is not the opposite of "developing an interesting, coherent internal algorithm.”   (This is based on a quote from Stephen Casper's retrospective that I also quoted in my post).

It could be the case that the network might have "develop[ed] an interesting, coherent algorithm", namely the row coloring primitives discussed in this post, but uses "interpolation/pattern matching" to approximately detect the cutoff points.

When I started this work, I hoped to find more clearly increasing or decreasing embedding circuits dictating the cutoff points, which would be interpretable without falling back to "pattern matching".  (This was the inspiration for adding X and Y embeddings in Section 5; the resulting curves are not as smooth as I'd hoped.)  I think the next step (not sure if I will do this) might be to continue training this network - simply for longer, with smaller batches, or with the entire input set (not holding roughly half out for testing) - to see if the resulting curves become smoother.


This thought was inspired by a short email discussion I had with Marius Hobbhahn, one of the authors of the original solution; I have his permission to share content from our exchange here.  Marius asked me to "caveat that [he, Marius] didn't spend a lot of time thinking about [my original post], so [any of his thoughts from our email thread] may well be wrong and not particularly helpful for people reading [this comment]".  Since the thought above is mine (he has not commented on it) and I'm not sharing any of his thoughts here (I don't think it's worthwhile to summarize the entire thread, and the caveat was requested when I initially asked if I could summarize it), the caveat may mostly (or solely) add noise - but I want to respect his wishes regardless.

Comment by RGRGRG on Thoughts about the Mechanistic Interpretability Challenge #2 (EIS VII #2) · 2023-07-30T19:15:17.630Z · LW · GW

Thank you for the kind words and the offer to donate (not necessary but very much appreciated).  Please donate to which is listed on Charity Navigator's list of high impact charities ( )


I will respond to the technical parts of this comment tomorrow or Tuesday.

Comment by RGRGRG on Alignment Grantmaking is Funding-Limited Right Now · 2023-07-20T03:40:30.501Z · LW · GW

> I wonder if this is simply the result of the generally bad SWE/CS market right now. People who would otherwise be in big tech/other AI stuff, will be more inclined to do something with alignment.

This is roughly my situation.  Waymo froze hiring and had layoffs while continuing to increase output expectations, so I/we had more work.  I left in March to explore AI and landed on mechanistic interpretability research.

Comment by RGRGRG on ARC is hiring theoretical researchers · 2023-06-19T15:16:56.063Z · LW · GW

"We will keep applications open until at least the end of August 2023"

Is there any advantage to applying early versus in August 2023?  I ask as someone intending to do a few months of focused independent MI research first.  I would prefer to have more experience and a better sense of my interests before applying, but on the other hand, I don't want to find out mid-August that you've filled all the roles and that it's actually too late to apply.  Thanks.

Comment by RGRGRG on Why I'm Not (Yet) A Full-Time Technical Alignment Researcher · 2023-05-25T17:36:28.301Z · LW · GW

Thanks for posting this - not OP, but I will likely apply come early June.  If anyone else is associated with other grant opportunities, would love to hear about those as well.

Comment by RGRGRG on Why I'm Not (Yet) A Full-Time Technical Alignment Researcher · 2023-05-25T17:35:17.836Z · LW · GW

Just wanted to say that I have similar questions about how best to (try to) get funding for mechanistic interpretability research.  I might send a bunch of applications out come early June; but like OP, I don't have any technical results in alignment (though, like OP, I like to think I have a solid (yet different) background).

Comment by RGRGRG on Best Ways to Try to Get Funding for Alignment Research? · 2023-04-05T04:11:56.897Z · LW · GW


Comment by RGRGRG on Best Ways to Try to Get Funding for Alignment Research? · 2023-04-04T17:27:45.407Z · LW · GW


Comment by RGRGRG on Best Ways to Try to Get Funding for Alignment Research? · 2023-04-04T06:46:14.140Z · LW · GW


> key problems 

Is there a blog post on these key problems?

> sharing their plans

Where is the best place to share? Once I come up with a plan I'm happy with, is there value in posting it on this site?

Comment by RGRGRG on Nobody’s on the ball on AGI alignment · 2023-03-31T03:42:31.597Z · LW · GW

What is the best way (assuming one exists), as an independent researcher, with a PhD in AI but not in ML, to get funding to do alignment work?  (I recently left my big tech job).