A Rocket–Interpretability Analogy

post by plex (ete) · 2024-10-21T13:55:18.184Z · LW · GW · 20 comments

1.

4.4% of the US federal budget went into the space race at its peak.

This was surprising to me, until a friend pointed out that landing rockets on specific parts of the Moon requires very similar technology to landing rockets in Soviet cities.[1]

I wonder how much more enthusiastic the scientists working on Apollo were, with the convenient motivating story of “I’m working towards a great scientific endeavor” vs “I’m working to make sure we can kill millions if we want to”.

2.

The field of alignment seems to be increasingly dominated by interpretability. (and obedience[2])

This was surprising to me[3], until a friend pointed out that partially opening the black box of NNs is the kind of technology that would help scaling labs find new unhobblings, by letting them notice ways in which the internals of their models are being inefficient and giving them better tools to evaluate capabilities advances.[4]

I wonder how much more enthusiastic the alignment researchers working on interpretability and obedience are, with the motivating story “I’m working on pure alignment research to save the world” vs “I’m building tools and knowledge which scaling labs will repurpose to build better products, shortening timelines to existentially threatening systems”.[5]

3.

You can’t rely on the organizational systems around you to be pointed in the right direction, and there are obvious commercial incentives to channel your idealistic energy towards types of safety work which are dual-use or even primarily capabilities-enabling. For similar reasons, many of the training programs prepare people for the kinds of jobs which come with large salaries and prestige, as a flawed proxy for moving the needle on x-risk.

If you’re genuinely trying to avert AI doom, please take the time to form inside views away from memetic environments[6] which are likely to have been heavily influenced by commercial pressures. Then back-chain from a theory of change in which the world is more often saved by your actions, rather than going with the current and picking a job with safety in its title as a way to try to do your part.

  1. ^

    Space Race - Wikipedia:

    It had its origins in the ballistic missile-based nuclear arms race between the two nations following World War II and had its peak with the more particular Moon Race to land on the Moon between the US moonshot and Soviet moonshot programs. The technological advantage demonstrated by spaceflight achievement was seen as necessary for national security and became part of the symbolism and ideology of the time.

  2. ^

    Andrew Critch:

    I hate that people think AI obedience techniques slow down the industry rather than speeding it up. ChatGPT could never have scaled to 100 million users so fast if it wasn't helpful at all.

     

    Making AI serve humans right now is highly profit-aligned and accelerant.

     

    Of course, later when robots could be deployed to sustain an entirely non-human economy of producers and consumers, there will be many ways to profit — as measured in money, materials, compute, energy, intelligence, or all of the above — without serving any humans. But today, getting AI to do what humans want is the fastest way to grow the industry.

  3. ^

    These paradigms do not seem to be addressing the most fatal filter in our future: strongly coherent goal-directed agents forming with superhuman intelligence. These will predictably undergo a sharp left turn [LW · GW], and the soft/fuzzy alignment techniques which worked at lower power levels will fail simultaneously as the system reaches high enough competence to reflect on itself, its capabilities, and the guardrails we built.

    Interpretability work could plausibly help with weakly aligned, weakly superintelligent systems that do our alignment homework for the much more capable systems to come. But the effort going into this direction seems highly disproportionate to how promising it is, is not backed by plans to pivot to using these systems for the quite different style of alignment research that's needed, and generally lacks research closure [AF · GW] to avert capabilities externalities.

  4. ^

     From the team that broke the quadratic attention bottleneck:

    Simpler sub-quadratic designs such as Hyena, informed by a set of simple guiding principles and evaluation on mechanistic interpretability benchmarks, may form the basis for efficient large models.

  5. ^

    Ask yourself: “Who will cite my work?”, not “Can I think of a story where my work is used for good things?”

    There is work in these fields which might be good for x-risk, but you need to figure out whether what you're doing actually falls into that category for it to be good for the world.

  6. ^

    Humans are natural mimics: we copy the people who show visible signals of doing well, because those are the memes which are likely to be good for our genes, and genes direct where we go looking for memes.

    Wealth, high confidence that they’re doing something useful, being part of a growing coalition: all great signs of good memes, and all much more possessed by people in the interpretability/obedience kind of alignment than by the old-school “this is hard and we don’t know what we’re doing, but it’s going to involve a lot of careful philosophy and math” crowd.

    Unfortunately, this memetic selection is not particularly adaptive for trying to solve alignment.

20 comments

Comments sorted by top scores.

comment by leogao · 2024-10-21T18:02:00.523Z · LW(p) · GW(p)

I don't think anyone has, to date, used interpretability to make any major counterfactual contribution to capabilities. I would not rely on papers introducing a new technique to be the main piece of evidence as to whether the technique is actually good at all. (I would certainly not rely on news articles about papers - they are basically noise.)

I think you should take into account the fact that before there are really good concrete capabilities results, the process that different labs use to decide what to invest in is highly contingent on a bunch of high-variance things. Like, what kinds of research directions appeal to research leadership, or whether there happen to be good ICs around who are excited to work on that direction and not tied down to any other project.

I don't think you should be that surprised by interpretability being more popular than other areas of alignment. Certainly I think incentives towards capabilities are a small fraction of why it's popular and funded etc (if anything, its non-usefulness for capabilities to date may count against it). Rather, I think it's popular because it's an area where you can actually get traction and do well-scoped projects and have a tight feedback loop. This is not true of the majority of alignment research directions that actually could help with aligning AGI/ASI, and correspondingly those directions are utterly soul-grinding to work on.

One could very reasonably argue that more people should be figuring out how to work on the low traction, ill-scoped, shitty feedback loop research problems, and that the field is looking under the streetlight for the keys. I make this argument a lot. But I think you shouldn't need to postulate some kind of nefarious capabilities incentive influence to explain it.

Replies from: habryka4, ete
comment by habryka (habryka4) · 2024-10-21T19:18:53.110Z · LW(p) · GW(p)

I don't think anyone has, to date, used interpretability to make any major counterfactual contribution to capabilities. I would not rely on papers introducing a new technique to be the main piece of evidence as to whether the technique is actually good at all. (I would certainly not rely on news articles about papers - they are basically noise.)

I would bet against this on the basis that Chris Olah's work was quite influential on a huge number of people, shaped their mental models of how Deep Learning works in general, and probably contributed to lots of improved capability-oriented thinking and decision-making. 

Like, as a kind of related example where I expect it's easier to find agreement, it's hard to point to something concrete that "Linear Algebra Done Right" did to improve ML research, but I am quite confident it has had a non-trivial effect. It's the favorite Linear Algebra textbook of many of the best contributors to the field, and having good models and explanations of the basics makes a big difference.

Replies from: leogao
comment by leogao · 2024-10-21T20:52:20.295Z · LW(p) · GW(p)

For the purposes of the original question of whether people are overinvesting in interp due to it being useful for capabilities and therefore being incentivized, I think there's a pretty important distinction between direct usefulness and this sort of diffuse public good that is very hard to attribute. Things with large but diffuse impact are much more often underincentivized and often mostly done as a labor of love. In general, the more you think an organization is shaped by incentives that are hard to fight against, the more you should expect diffusely impactful things to be relatively neglected.

Separately, it's also not clear to me that the diffuse intuitions from interpretability have actually helped people a lot with capabilities. Obviously this is very hard to attribute, and I can't speak too much about details, but it feels to me like the most important intuitions come from elsewhere. What's an example of an interpretability work that you feel has affected capabilities intuitions a lot?

Replies from: Benito
comment by Ben Pace (Benito) · 2024-10-21T21:54:19.070Z · LW(p) · GW(p)

I think there's a pretty important distinction between direct usefulness and this sort of diffuse public good that is very hard to attribute. Things with large but diffuse impact are much more often underincentivized and often mostly done as a labor of love. In general, the more you think an organization is shaped by incentives that are hard to fight against, the more you should expect diffusely impactful things to be relatively neglected.

I have a model whereby ~all very successful large companies require a leader with vision, who is able to understand incentives and nonetheless take long-term action that isn't locally rewarded. YC startups constantly talk about long-term investments into culture and hiring and onboarding processes that are costly in (I'd guess) the 3-12 month time frame but extremely valuable in the 1-5 year time frame.

Saying that a system is heavily shaped by incentives doesn't seem to me to imply that the system is heavily short-sighted. Companies like Amazon and Facebook are of course heavily shaped by incentives yet have quite long-term thinking in their leaders, who often do things that look like locally wasted effort because they have a vision of how it will pay off years down the line.

Speaking about the local political situation, I think safety investment from AI capabilities companies can be thought of as investing into problems that will come up in the future. As a more cynical hypothesis, I think it can also be usefully thought of as a worthwhile political ploy to attract talent and look responsible to regulators and intelligentsia.

(Added: Bottom-line: Following incentives does not mean short-sighted.)

comment by plex (ete) · 2024-10-21T18:14:38.405Z · LW(p) · GW(p)

That you're unaware of there being any notable counterfactual capabilities boost from interpretability is some update. How sure are you that you'd know if there were training multipliers that had interpretability strongly in their causal history? Are you not counting steering vectors from Anthropic here? And I didn't find out about Hyena from the news article, but from a friend who read the paper; the article just had a nicer quote.

I could imagine that interpretability being relatively ML flavoured makes it more appealing to scaling lab leadership, and that this, rather than their seeing it as commercially useful, is the reason those projects get favoured, at least in many cases.

Would you expect that this continues as interpretability continues to get better? Based on general models, I'd be pretty surprised to find that opening black boxes doesn't let you debug them better, though I could imagine we're not good enough at it yet.

Replies from: leogao
comment by leogao · 2024-10-21T20:37:27.808Z · LW(p) · GW(p)

SAE steering doesn't seem like it obviously beats other steering techniques in terms of usefulness. I haven't looked closely into Hyena but my prior is that subquadratic attention papers probably suck unless proven otherwise.

Interpretability is certainly vastly more appealing to lab leadership than weird philosophy, but it's vastly less appealing than RLHF. But there are many many ML flavored directions and only a few of them are any good, so it's not surprising that most directions don't get a lot of attention.

Probably as interp gets better it will start to be helpful for capabilities. I'm uncertain whether it will be more or less useful for capabilities than just working on capabilities directly; on the one hand, mechanistic understanding has historically underperformed as a research strategy, on the other hand it could be that this will change once we have a sufficiently good mechanistic understanding.

comment by ryan_greenblatt · 2024-10-21T16:18:05.327Z · LW(p) · GW(p)

I'm sympathetic to 'a high fraction of "alignment/safety" work done at AI companies is done due to commercial incentives and has negligible effect on AI takeover risk (or at least much smaller effects than work which isn't influenced by commercial incentives)'.

I also think a decent number of ostensibly AI x-risk focused people end up being influenced by commercial incentives sometimes knowingly and directly (my work will go into prod if it is useful and this will be good for XYZ reason; my work will get more support if it is useful; it is good if the AI company I work for is more successful/powerful, so I will do work which is commercially useful) and sometimes unknowingly or indirectly (a direction gets more organizational support because of usefulness; people are misled into thinking something is more x-risk-helpful than it actually is).

(And a bunch of originally AI x-risk focused people end up working on things which they would agree aren't directly useful for mitigating x-risk, but have some more complex story.)

I also think AI companies generally are a bad epistemic environment for x-risk safety work for various reasons.

However, I'm quite skeptical that (mechanistic) interpretability research in particular gets much more funding due to it directly being a good commercial bet (as in, it is worth it because it ends up directly being commercially useful). And, my guess is that alignment/safety people at AI companies which are ostensibly focused on x-risk/AI takeover prevention are less than 1/2 funded via the directly commercial case. (I won't justify this here, but I think this because of a combination of personal experience and thinking about what these teams tend to work on.)

(That said, I think putting some effort on (mech) interp (or something similar) might end up being a decent commercial bet via direct usage, though I'm skeptical.)

I think there are some adjacent reasons alignment/safety work might be funded/encouraged at AI companies beyond direct commercial usage:

  • Alignment/safety teams or work might be subsidized via being good internal/external PR for AI companies. As in good PR to all of: the employees who care about safety, the AI-safety-adjacent community who you recruit from, and the broader public. I think probably most of this effect is on the "appeasing internal/external people who might care about safety" rather than the broader public.
  • Having an internal safety team might help reduce/avoid regulatory cost for your company or the industry. Both via helping in compliance and via reducing the burden.
Replies from: habryka4, AliceZ
comment by habryka (habryka4) · 2024-10-21T17:00:57.319Z · LW(p) · GW(p)

However, I'm quite skeptical that (mechanistic) interpretability research in particular gets much more funding due to it directly being a good commercial bet (as in, it is worth it because it ends up directly being commercially useful). And, my guess is that alignment/safety people at AI companies which are ostensibly focused on x-risk/AI takeover prevention are less than 1/2 funded via the directly commercial case. (I won't justify this here, but I think this because of a combination of personal experience and thinking about what these teams tend to work on.)

I think the primary commercial incentive on mechanistic interpretability research is that it's the alignment research that most provides training and education to become a standard ML engineer who can then contribute to commercial objectives. I am quite confident that a non-trivial fraction of young alignment researchers are going into mech-interp because it also gets them a lot of career capital as a standard ML engineer.

Replies from: ryan_greenblatt
comment by ryan_greenblatt · 2024-10-21T17:53:42.735Z · LW(p) · GW(p)

I think the primary commercial incentive on mechanistic interpretability research is that it's the alignment research that most provides training and education to become a standard ML engineer who can then contribute to commercial objectives.

Is your claim here that a major factor in why Anthropic and GDM do mech interp is to train employees who can later be commercially useful? I'm skeptical of this.

Maybe the claim is that many people go into mech interp so they can personally skill up and later might pivot into something else (including jobs which pay well)? This seems plausible/likely to me, though it is worth noting that this is a pretty different argument with very different implications from the one in the post.

Replies from: habryka4
comment by habryka (habryka4) · 2024-10-21T19:16:58.022Z · LW(p) · GW(p)

Yep, I am saying that the supply of mech-interp alignment researchers is plentiful because the career capital is much more fungible with extremely well-paying ML jobs, and Anthropic and GDM seem interested in sponsoring things like mech-interp MATS streams or other internship and junior positions because those fit neatly into their existing talent pipeline, they know how to evaluate that kind of work, and they think that those hires are also more likely to convert into people working on capabilities work.

Replies from: ryan_greenblatt
comment by ryan_greenblatt · 2024-10-21T19:27:28.319Z · LW(p) · GW(p)

I'm pretty skeptical that Neel's MATS stream is partially supported/subsidized by GDM's desire to generally hire for capabilities. (And I certainly don't think they directly fund this.) Same for other mech interp hiring at GDM; I doubt that anyone is thinking "these mech interp employees might convert into employees for capabilities". That said, this sort of thinking might subsidize the overall alignment/safety team at GDM to some extent, but I think this would mostly be a mistake for the company.

Seems plausible that this is an explicit motivation for junior/internship hiring on the Anthropic interp team. (I don't think the Anthropic interp team has a MATS stream.)

Replies from: habryka4
comment by habryka (habryka4) · 2024-10-21T19:29:37.623Z · LW(p) · GW(p)

I think Neel seems to have a somewhat unique amount of freedom, so I have less of a strong take there, but I am confident that GDM would be substantially less excited about its employees taking time off to mentor a bunch of people if the kind of work they were doing produced artifacts that were substantially less well-respected by the ML crowd, or did not look like it demonstrated the kinds of skills that are indicative of good ML engineering capability.

Replies from: ryan_greenblatt
comment by ryan_greenblatt · 2024-10-21T21:29:17.154Z · LW(p) · GW(p)

(I think random (non-leadership) GDM employees generally have a lot of freedom while employees of other companies have much less in-practice freedom (except for maybe longer time OpenAI employees who I think have a lot of freedom).)

Replies from: habryka4
comment by habryka (habryka4) · 2024-10-21T22:35:56.216Z · LW(p) · GW(p)

(My sense is this changed a lot after the DeepMind/GBrain merger and ChatGPT, and the modern GDM seems to give people a lot less slack in the same way, though you are probably still directionally correct.)

comment by ZY (AliceZ) · 2024-10-21T23:07:59.998Z · LW(p) · GW(p)

Agree with this, and wanted to add that I am also not completely sure whether mechanistic interpretability is a good "commercial bet" yet, based on my experience and understanding, with my definition of a commercial bet being that it materializes revenue or is simply revenue-generating.

One revenue-generating path I can see for LLMs is for the company to use them to identify the data that is most effective for particular benchmarks, but my current understanding (correct me if I am wrong) is that, for now, it is relatively costly to first research a reliable method and then run interpretability methods on large models; additionally, it is generally already quite intuitive to researchers which datasets could be useful for specific benchmarks. On the other hand, the method would be much more useful for looking into nuanced and hard-to-tackle safety problems. In fact, there have been a lot of previous efforts to use interpretability generally for safety mitigations.

comment by Jozdien · 2024-10-21T15:31:27.980Z · LW(p) · GW(p)

The field of alignment seems to be increasingly dominated by interpretability. (and obedience[2])

I agree w.r.t. potential downstream externalities of interpretability. However, my view from speaking to a fair number of those who join the field to work on interpretability is that the upstream cause is slightly different. Interpretability is easier to understand and presents "cleaner" projects than most others in the field. It's also much more accessible and less weird to, say, an academic, or the kind of person who hasn't spent a lot of time thinking about alignment.

All of these are likely correlated with having downstream capabilities applications, but the exact mechanism doesn't look like "people recognizing that this has huge profit potential and therefore growing much faster than other sub-fields".

Replies from: ete
comment by plex (ete) · 2024-10-21T17:56:34.890Z · LW(p) · GW(p)

I agree that the effect you're pointing to is real and a large part of what's going on here, and could easily be convinced that it's the main cause (along with the one flagged by @habryka [LW · GW]). It's definitely a more visible motivation from the perspective of an individual going through the funnel than the one this post highlights. I was focusing on making one concise point rather than covering the whole space of argument, and am glad comments have brought up other angles.

comment by Steven Byrnes (steve2152) · 2024-10-21T18:38:46.961Z · LW(p) · GW(p)

I’m not an expert and I’m not sure it matters much for your point, but: yes, there were surely important synergies between NASA activities and the military ballistic missile programs in the 1960s, but I don’t think it’s correct to suggest that most NASA activity was stuff that would have had to be done for the ballistic missile program anyway. It might actually be a pretty small fraction. For example, less than half the Apollo budget was for launch vehicles; they spent a similar amount on spacecraft, which are not particularly transferable to nukes. And even for the launch vehicles, it seems that NASA tended to start with existing military rocket designs and modify them, rather than the other way around.

I would guess that the main synergy was more indirect: helping improve the consistency of work, economies of scale, defraying overhead costs, etc., for the personnel and contractors and so on.

Replies from: ete
comment by plex (ete) · 2024-10-21T20:10:09.239Z · LW(p) · GW(p)

Yeah, I don't expect the majority of work done for NASA had direct applicability for war, even though there was some crossover. However, I'd guess that NASA wouldn't have had anything like as much overall budget if not for the overlap with a critical national security concern?

comment by Noosphere89 (sharmake-farah) · 2024-10-21T21:28:28.269Z · LW(p) · GW(p)

Some thoughts on the post as I want to collect everything.

For this specifically:

The field of alignment seems to be increasingly dominated by interpretability. (and obedience[2])

Obedience is quite literally the classical version of the AI alignment problem, or at least very entangled with it, and interpretability IMO got a lot of its initial boost from Distill, combined with it being a relatively clean problem formulation that makes alignment more scalable.

On footnote 3, I have to point out that the postulated sharp left turn in humans compared to chimps doesn't have nearly as much evidence as the model in which human success is basically attributable to culture, where we are very good at imitating others without a true causal model and at distilling cultural knowledge down, combined with us having essentially a very good body plan for tool use and being able to dissipate heat better, which lets us scale up algorithms well.

I don't totally like Henrich's book The Secret of Our Success, but I do think that at least some of the results, like chimps essentially equaling humans on IQ and only losing hard when social competence was required, are surprising facts under the sharp left turn view.

While the science of human and animal brains is incomplete at this juncture, we are starting to realize that animals do in fact have much more intelligence than people previously realized, and in particular for mammals, their algorithmic cores are fairly similar in genetic space; the only real differentiator for humans at this point is that our cultural knowledge, sparked by population growth and language, booms every generation.