Reflections on my first year of AI safety research

post by Jay Bailey · 2024-01-08T07:49:08.147Z · LW · GW · 3 comments

comment by mesaoptimizer · 2024-01-08T20:47:21.408Z · LW(p) · GW(p)

I appreciate you writing this! It really helped me get a more concrete sense of what it's like for new alignment researchers (like me) who are aiming to make good things happen.

Owain asked us not to publish this for fear of capabilities improvements

Note that "capabilities improvements" can mean a lot of things here. The first thing that comes to mind is that publicizing this differentially accelerates the amount of damage API users could do with access to SOTA LLMs, which makes sense to me. It also makes sense to me that Owain would consider publishing this idea not worth the downside, simply because there's not much benefit to publicizing this, for alignment researchers and capabilities researchers, off the top of my head. OpenAI capabilities people probably have already tried such experiments internally and know of this, and alignment researchers probably wouldn't be able to build on top of this finding (here I mostly have interpretability researchers in mind).

Sometimes I needed more information to work on a task, but I tended to assume this was my fault. If I were smarter, I would be able to do it, so I didn’t want to bother Joseph for more information. I now realise this is silly—whether the fault is mine or not, if I need more context to solve a problem, I need more context, and it helps nobody to delay asking about this too much.

Oh yeah, I have had this issue many (but not all) times with mentors in the past. I suggest not simply trying to rationalize that emotion away, though, and perhaps trying to actually debug it. "Whether the fault is mine or not", sure, but if my brain tracks whether I am an asset or a liability to the project, then it is giving me important information in the form of my emotions.

Anyway, I'm glad you now have a job in the alignment field!

comment by sudo · 2024-01-10T23:41:53.562Z · LW(p) · GW(p)

Thanks for the post!

The problem was that I wasn’t really suited for mechanistic interpretability research.

Sorry if I'm prodding too deep, and feel no need to respond. I always feel a bit curious about claims such as this.

I guess I have two questions (which you don't need to answer):

  1. Do you have a hypothesis about the underlying reason for you being unsuited to this type of research? E.g., do you think you might be insufficiently interested/motivated, have insufficient conscientiousness or intelligence, etc.?
  2. How confident are you that you just "aren't suited" to this type of work? To operationalize: given, say, two more years of serious effort, at what odds would you bet that you still wouldn't be very competitive at mechanistic interpretability research?
    1. What sort of external feedback are you getting vis-à-vis your suitability for this type of work? E.g., have you received feedback from Neel in this vein? (I understand that people are probably averse to giving this type of feedback, so there might be many false negatives.)
Replies from: Jay Bailey
comment by Jay Bailey · 2024-01-11T00:10:09.578Z · LW(p) · GW(p)

Concrete feedback signals I've received:

  • I don't find myself excited about the work. I've never been properly nerd-sniped by a mechanistic interpretability problem, and I find the day-to-day work to be more drudgery than excitement, even though the overall goal of the field seems like a good one.

  • When left to do largely independent work, after doing the obvious first thing or two ("obvious" at the level of "These techniques are in Neel's demos"), I find it hard to figure out what to do next, and, because of the above drudgery, hard to motivate myself to do more things if I do think of them.

  • I find myself having difficulty backchaining from the larger goal to the smaller one. I think this is a combination of a motivational issue and a weaker grasp of the concepts.

By contrast, in evaluations, none of this is true. I solve problems more effectively, I find myself actively interested in problems (both the ones I'm working on and the ones I'm not), and I'm better able to reason about how they matter for the bigger picture.

I'm not sure how much each of these contributes, but I suspect that if I were sufficiently excited about the day-to-day work, all the other problems would be much more fixable. There's a sense of reluctance, a sense of burden, that saps a lot of energy when it comes to doing this kind of work.

As for #2, I guess I should clarify what I mean, since there are two ways you could view "not suited":

  1. I will never be able to become good enough at this for my funding to be net-positive. There are fundamental limitations to my ability to succeed in this field.

  2. I should not be in this field. The amount of resources required to make me competitive in this field is significantly larger than for other people who would do equally good work, and this is not true of other subfields of alignment.

I view my use of "I'm not suited" as more like 2 than 1. I think there's a reasonable chance that, given enough time, proper effort, and mentorship in a proper organisational setting (being in a setting like this is important for me to reliably complete work that doesn't excite me), I could eventually do okay in this field. But I also think that there are other people who would do better, faster, and be a better use of an organisation's money than me.

This doesn't feel like the case in evals. I feel like I can meaningfully contribute immediately, and I'm sufficiently motivated and knowledgeable that I can understand the difference between my job and my mission (making AI go well) and feel confident that I can take actions to succeed in both of them.

If Omega came down from the sky and said "Mechanistic interpretability is the only way you will have any impact on AI alignment - it's this or nothing", I might try anyway. But I'm not in that position, and I'm actually very glad I'm not.