The Field of AI Alignment: A Postmortem, and What To Do About It

johnswentworth

The Field of AI Alignment: A Postmortem, and What To Do About It

post by johnswentworth · 2024-12-26T18:48:07.614Z · LW · GW · 160 comments

  What This Post Is And Isn't, And An Apology
  Why The Streetlighting?
    A Selection Model
      Selection and the Labs
    A "Flinching Away" Model
  What To Do About It
    How We Got Here
    Who To Recruit Instead
    Integration vs Separation
None
160 comments

A policeman sees a drunk man searching for something under a streetlight and asks what the drunk has lost. He says he lost his keys and they both look under the streetlight together. After a few minutes the policeman asks if he is sure he lost them here, and the drunk replies, no, and that he lost them in the park. The policeman asks why he is searching here, and the drunk replies, "this is where the light is".

Over the past few years, a major source of my relative optimism on AI has been the hope that the field of alignment would transition from pre-paradigmatic to paradigmatic, and make much more rapid progress.

At this point, that hope is basically dead. There has been some degree of paradigm formation, but the memetic competition has mostly been won by streetlighting: the large majority of AI Safety researchers and activists are focused on searching for their metaphorical keys under the streetlight. The memetically-successful strategy in the field is to tackle problems which are easy, rather than problems which are plausible bottlenecks to humanity’s survival. That pattern of memetic fitness looks likely to continue to dominate the field going forward.

This post is on my best models of how we got here, and what to do next.

What This Post Is And Isn't, And An Apology

This post starts from the observation that streetlighting has mostly won the memetic competition for alignment as a research field, and we'll mostly take that claim as given. Lots of people will disagree with that claim, and convincing them is not a goal of this post. In particular, probably the large majority of people in the field have some story about how their work is not searching under the metaphorical streetlight, or some reason why searching under the streetlight is in fact the right thing for them to do, or [...].

The kind and prosocial version of this post would first walk through every single one of those stories and argue against them at the object level, to establish that alignment researchers are in fact mostly streetlighting (and review how and why streetlighting is bad). Unfortunately that post would be hundreds of pages long, and nobody is ever going to get around to writing it. So instead, I'll link to:

Eliezer's List O' Doom [LW · GW]
My own Why Not Just... [? · GW] sequence
Nate's How Various Plans Miss The Hard Bits Of The Alignment Challenge [LW · GW]

(Also I might link some more in the comments section.) Please go have the object-level arguments there rather than rehashing everything here.

Next comes the really brutally unkind part: the subject of this post necessarily involves modeling what's going on in researchers' heads, such that they end up streetlighting. That means I'm going to have to speculate about how lots of researchers are being stupid internally, when those researchers themselves would probably say that they are not being stupid at all and I'm being totally unfair. And then when they try to defend themselves in the comments below, I'm going to say "please go have the object-level argument on the posts linked above, rather than rehashing hundreds of different arguments here". To all those researchers: yup, from your perspective I am in fact being very unfair, and I'm sorry. You are not the intended audience of this post, I am basically treating you like a child and saying "quiet please, the grownups are talking", but the grownups in question are talking about you and in fact I'm trash talking your research pretty badly, and that is not fair to you at all.

But it is important, and this post just isn't going to get done any other way. Again, I'm sorry.

Why The Streetlighting?

A Selection Model

First and largest piece of the puzzle: selection effects favor people doing easy things, regardless of whether the easy things are in fact the right things to focus on. (Note that, under this model, it's totally possible that the easy things are the right things to focus on!)

What does that look like in practice? Imagine two new alignment researchers, Alice and Bob, fresh out of a CS program at a mid-tier university. Both go into MATS or AI Safety Camp or get a short grant or [...]. Alice is excited about the eliciting latent knowledge [LW · GW] (ELK) doc, and spends a few months working on it. Bob is excited about debate [? · GW], and spends a few months working on it. At the end of those few months, Alice has a much better understanding of how and why ELK is hard, has correctly realized that she has no traction on it at all, and pivots to working on technical governance. Bob, meanwhile, has some toy but tangible outputs, and feels like he's making progress.

... of course (I would say) Bob has not made any progress toward solving any probable bottleneck problem of AI alignment, but he has tangible outputs and is making progress on something, so he'll probably keep going.

And that's what the selection pressure model looks like in practice. Alice is working on something hard, correctly realizes that she has no traction, and stops. (Or maybe she just keeps spinning her wheels until she burns out, or funders correctly see that she has no outputs and stop funding her.) Bob is working on something easy, he has tangible outputs and feels like he's making progress, so he keeps going and funders keep funding him. How much impact Bob's work has impact on humanity's survival is very hard to measure, but the fact that he's making progress on something is easy to measure, and the selection pressure rewards that easy metric.

Generalize this story across a whole field, and we end up with most of the field focused on things which are easy, regardless of whether those things are valuable.

Selection and the Labs

Here's a special case of the selection model which I think is worth highlighting.

Let's start with a hypothetical CEO of a hypothetical AI lab, who (for no particular reason) we'll call Sam. Sam wants to win the race to AGI, but also needs an AI Safety Strategy. Maybe he needs the safety strategy as a political fig leaf, or maybe he's honestly concerned but not very good at not-rationalizing. Either way, he meets with two prominent AI safety thinkers - let's call them (again for no particular reason) Eliezer and Paul. Both are clearly pretty smart, but they have very different models of AI and its risks. It turns out that Eliezer's model predicts that alignment is very difficult and totally incompatible with racing to AGI. Paul's model... if you squint just right, you could maybe argue that racing toward AGI is sometimes a good thing under Paul's model? Lo and behold, Sam endorses Paul's model as the Official Company AI Safety Model of his AI lab, and continues racing toward AGI. (Actually the version which eventually percolates through Sam's lab is not even Paul's actual model, it's a quite different version which just-so-happens to be even friendlier to racing toward AGI.)

A "Flinching Away" Model

While selection for researchers working on easy problems is one big central piece, I don't think it fully explains how the field ends up focused on easy things in practice. Even looking at individual newcomers to the field, there's usually a tendency to gravitate toward easy things and away from hard things. What does that look like?

Carol follows a similar path to Alice: she's interested in the Eliciting Latent Knowledge problem, and starts to dig into it, but hasn't really understood it much yet. At some point, she notices a deep difficulty introduced by sensor tampering - in extreme cases it makes problems undetectable, which breaks the iterative problem-solving loop [LW · GW], breaks ease of validation, destroys potential training signals, etc. And then she briefly wonders if the problem could somehow be tackled without relying on accurate feedback from the sensors at all. At that point, I would say that Carol is thinking about the real core ELK problem for the first time.

... and Carol's thoughts run into a blank wall. In the first few seconds, she sees no toeholds, not even a starting point. And so she reflexively flinches away [LW · GW] from that problem, and turns back to some easier problems. At that point, I would say that Carol is streetlighting.

It's the reflexive flinch which, on this model, comes first. After that will come rationalizations. Some common variants:

Carol explicitly introduces some assumption simplifying the problem, and claims that without the assumption the problem is impossible. (Ray's workshop on one-shotting Baba Is You levels apparently reproduced this phenomenon [LW · GW] very reliably.)
Carol explicitly says that she's not trying to solve the full problem, but hopefully the easier version will make useful marginal progress.
Carol explicitly says that her work on easier problems is only intended to help with near-term AI, and hopefully those AIs will be able to solve the harder problems.
(Most common) Carol just doesn't think about the fact that the easier problems don't really get us any closer to aligning superintelligence. Her social circles act like her work is useful somehow, and that's all the encouragement she needs.

... but crucially, the details of the rationalizations aren't that relevant to this post. Someone who's flinching away from a hard problem will always be able to find some rationalization. Argue them out of one (which is itself difficult), and they'll promptly find another. If we want people to not streetlight, then we need to somehow solve the flinching.

Which brings us to the "what to do about it" part of the post.

What To Do About It

Let's say we were starting a new field of alignment from scratch. How could we avoid the streetlighting problem, assuming the models above capture the core gears?

First key thing to notice: in our opening example with Alice and Bob, Alice correctly realized that she had no traction on the problem. If the field is to be useful, then somewhere along the way someone needs to actually have traction on the hard problems.

Second key thing to notice: if someone actually has traction on the hard problems, then the "flinching away" failure mode is probably circumvented.

So one obvious thing to focus on is getting traction on the problems.

... and in my experience, there are people who can get traction on the core hard problems. Most notably physicists - when they grok the hard parts, they tend to immediately see footholds, rather than a blank impassable wall. I'm picturing here e.g. the sort of crowd at the ILIAD conference [LW · GW]; these were people who mostly did not seem at risk of flinching away, because they saw routes to tackle the problems. (Though to be clear, though ILIAD was a theory conference, I do not mean to imply that it's only theorists who ever have any traction.) And they weren't being selected away, because many of them were in fact doing work and making progress.

Ok, so if there are a decent number of people who can get traction, why do the large majority of the people I talk to seem to be flinching away from the hard parts?

How We Got Here

The main problem, according to me, is the EA recruiting pipeline.

On my understanding, EA student clubs at colleges/universities have been the main “top of funnel” for pulling people into alignment work during the past few years. The mix people going into those clubs is disproportionately STEM-focused undergrads, and looks pretty typical for STEM-focused undergrads. We’re talking about pretty standard STEM majors from pretty standard schools, neither the very high end nor the very low end of the skill spectrum.

... and that's just not a high enough skill level for people to look at the core hard problems of alignment and see footholds.

Who To Recruit Instead

We do not need pretty standard STEM-focused undergrads from pretty standard schools. In practice, the level of smarts and technical knowledge needed to gain any traction on the core hard problems seems to be roughly "physics postdoc". Obviously that doesn't mean we exclusively want physics postdocs - I personally have only an undergrad degree, though amusingly a list of stuff I studied has been called "uncannily similar to a recommendation to readers to roll up their own doctorate program" [LW(p) · GW(p)]. Point is, it's the rough level of smarts and skills which matters, not the sheepskin. (And no, a doctorate degree in almost any other technical field, including ML these days, does not convey a comparable level of general technical skill to a physics PhD.)

As an alternative to recruiting people who have the skills already, one could instead try to train people. I've tried that [LW · GW] to some extent, and at this point I think there just isn't a substitute for years of technical study [LW · GW]. People need that background knowledge in order to see footholds on the core hard problems.

Integration vs Separation

Last big piece: if one were to recruit a bunch of physicists to work on alignment, I think it would be useful for them to form a community mostly-separate from the current field. They need a memetic environment which will amplify progress on core hard problems, rather than... well, all the stuff that's currently amplified.

This is a problem which might solve itself, if a bunch of physicists move into alignment work. Heck, we've already seen it to a very limited extent with the ILIAD conference itself. Turns out people working on the core problems want to talk to other people working on the core problems. But the process could perhaps be accelerated a lot with more dedicated venues.

160 comments

Comments sorted by top scores.

comment by aysja · 2024-12-27T05:29:30.240Z · LW(p) · GW(p)

I’m not convinced that the “hard parts” of alignment are difficult in the standardly difficult, g-requiring way that e.g., a physics post-doc might possess. I do think it takes an unusual skillset, though, which is where most of the trouble lives. I.e., I think the pre-paradigmatic skillset requires unusually strong epistemics (because you often need to track for yourself what makes sense), ~creativity (the ability to synthesize new concepts, to generate genuinely novel hypotheses/ideas), good ability to traverse levels of abstraction (connecting details to large level structure, this is especially important for the alignment problem), not being efficient market pilled (you have to believe that more is possible in order to aim for it), noticing confusion, and probably a lot more that I’m failing to name here.

Most importantly, though, I think it requires quite a lot of willingness to remain confused. Many scientists who accomplished great things (Darwin, Einstein) didn’t have publishable results on their main inquiry for years. Einstein, for instance, talks about wandering off for weeks in a state of “psychic tension” in his youth, it took ~ten years to go from his first inkling of relativity to special relativity, and he nearly gave up at many points (including the week before he figured it out). Figuring out knowledge at the edge of human understanding can just be… really fucking brutal. I feel like this is largely forgotten, or ignored, or just not understood. Partially that's because in retrospect everything looks obvious, so it doesn’t seem like it could have been that hard, but partially it's because almost no one tries to do this sort of work, so there aren't societal structures erected around it, and hence little collective understanding of what it's like.

Anyway, I suspect there are really strong selection pressures for who ends up doing this sort of thing, since a lot needs to go right: smart enough, creative enough, strong epistemics, independent, willing to spend years without legible output, exceptionally driven, and so on. Indeed, the last point seems important to me—many great scientists are obsessed. Spend night and day on it, it’s in their shower thoughts, can’t put it down kind of obsessed. And I suspect this sort of has to be true because something has to motivate them to go against every conceivable pressure (social, financial, psychological) and pursue the strange meaning anyway.

I don’t think the EA pipeline is much selecting for pre-paradigmatic scientists, but I don’t think lack of trying to get physicists to work on alignment is really the bottleneck either. Mostly I think selection effects are very strong, e.g., the Sequences was, imo, one of the more effective recruiting strategies for alignment. I don’t really know what to recommend here, but I think I would anti-recommend putting all the physics post-docs from good universities in a room in the hope that they make progress. Requesting that the world write another book as good as the Sequences is a... big ask, although to the extent it’s possible I expect it’ll go much further in drawing people out who will self select into this rather unusual "job."

Replies from: Raemon, johnswentworth, Seth Herd, Chris_Leong

↑ comment by Raemon · 2024-12-30T06:33:55.901Z · LW(p) · GW(p)

This is the sort of thing I find appealing to believe, but I feel at least somewhat skeptical of. I notice a strong emotional pull to want this to be true (as well as an interesting counterbalancing emotional pull for it to not be true).

I don't think I've seen output from the people aspiring in this direction without being visibly quite smart to make me think "okay yeah it seems like it's on track in some sense."

I'd be interested in hearing more explicit cruxes from you about it.

I do think it's plausible than the "smart enough, creative enough, strong epistemics, independent, willing to spend years without legible output, exceptionally driven, and so on" are sufficient (if you're at least moderately-but-not-exceptionally-smart). Those are rare enough qualities that it doesn't necessarily feel like I'm getting a free lunch, if they turn out to be sufficient for groundbreaking pre-paradigmatic research. I agree the x-risk pipeline hasn't tried very hard to filter for and/or generate people with these qualities.

(well, okay, "smart enough" is doing a lot of work there, I assume from context you mean "pretty smart but not like genius smart")

But, I've only really seen you note positive examples, and this seems like the sort of thing that'd have a lot of survivorship bias. There can be tons of people obsessed, but not necessarily on the right things, and if you're not naturally the right cluster of obsessed + smart-in-the-right-way, I don't know whether trying to cultivate the obsession on purpose will really work.

I do nonetheless overall probably prefer people who have all your listed qualities, and who also either can:

a) self-fund to pursue the research without having to make it legible to others
b) somehow figure out a way to make it legible along the way

I probably prefer those people to tackle "the hard parts of alignment" over many other things they could be doing, but not overwhelmingly obviously (and I think it should come with a background awareness that they are making a gamble, and if they aren't the sort of person who must make that gamble due to their personality makeup, they should be prepared for the (mainline) outcome that it just doesn't work out)

↑ comment by johnswentworth · 2024-12-30T22:19:15.355Z · LW(p) · GW(p)

I’m not convinced that the “hard parts” of alignment are difficult in the standardly difficult, g-requiring way that e.g., a physics post-doc might possess.

To be clear, I wasn't talking about physics postdocs mainly because of raw g. Raw g is a necessary element, and physics postdocs are pretty heavily loaded on it, but I was talking about physics postdocs mostly because of the large volume of applied math tools they have.

The usual way that someone sees footholds on the hard parts of alignment is to have a broad enough technical background that they can see some analogy to something they know about, and try borrowing tools that work on that other thing. Thus the importance of a large volume of technical knowledge.

Replies from: JuliaHP

↑ comment by JuliaHP · 2024-12-31T12:43:53.936Z · LW(p) · GW(p)

Curious about what it would look like to pick up the relevant skills, especially the subtle/vague/tacit skills, in an independent-study setting rather than in academia. As well as the value of doing this, IE maybe its just a stupid idea and its better to just go do a PhD. Is the purpose of a PhD to learn the relevant skills, or to filter for them? (If you have already written stuff which suffices as a response, id be happy to be pointed to the relevant bits rather than having them restated)

"Broad technical knowledge" should be in some sense the "easiest" (not in terms of time-investment, but in terms of predictable outcomes), by reading lots of textbooks (using similar material as your study guide [LW · GW]).

Writing/communication, while more vague, should also be learnable by just writing a lot of things, publishing them on the internet for feedback, reflecting on your process etc.

Something like "solving novel problems" seems like a much "harder" one. I don't know if this is a skill with a simple "core" or a grab-bag of tactics. Textbook problems take on a "meant-to-be-solved" flavor and I find one can be very good at solving these without being good at tackling novel problems. Another thing I notice is that when some people (myself included) try solving novel problems, we can end up on a path which gets there eventually, but if given "correct" feedback integration would go OOM faster.

I'm sure there are other vague-skills which one ends up picking up from a physics PhD. Can you name others, and how one picks them up intentionally? Am I asking the wrong question?

Replies from: johnswentworth

↑ comment by johnswentworth · 2024-12-31T14:39:55.205Z · LW(p) · GW(p)

I currently think broad technical knowledge is the main requisite, and I think self-study can suffice for the large majority of that in principle. The main failure mode I see would-be autodidacts run into is motivation, but if you can stay motivated then there's plenty of study materials.

For practice solving novel problems, just picking some interesting problems (preferably not AI) and working on them for a while is a fine way to practice.

Replies from: johannes-c-mayer, JuliaHP

↑ comment by Johannes C. Mayer (johannes-c-mayer) · 2024-12-31T17:09:28.196Z · LW(p) · GW(p)

Why not AI? Is it that AI alignment is too hard? Or do you think it's likely one would fall into the "try a bunch of random stuff" paradigm popular in AI, which wouldn't help much in getting better at solving hard problems?

What do you think about the strategy of instead of learning a textbook e.g. on information theory, or compilers you try to write the textbook and only look at existing material if you are really stuck. That's my primary learning strategy.

It's very slow and I probably do it too much, but it allows me to train to solve hard problems that aren't super hard. If you read all the text books all the practice problems remaining are very hard.

↑ comment by JuliaHP · 2024-12-31T16:29:46.488Z · LW(p) · GW(p)

(That broad technical knowledge is the main thing (as opposed to tacit skills) why you value a physics PhD is a really surprising response to me, and seems like an important part of the model that didn't come across from the post.)

↑ comment by Seth Herd · 2024-12-27T18:51:16.623Z · LW(p) · GW(p)

I think this is right. A couple of follow-on points:

There's a funding problem if this is an important route to progress. If good work is illegible for years, it's hard to decide who to fund, and hard to argue for people to fund it. I don't have a proposed solution, but I wanted to note this large problem.

Einstein did his pre-paradigmatic work largely alone. Better collaboration might've sped it up.

LessWrong allows people to share their thoughts prior to having publishable journal articles and get at least a few people to engage.

This makes the difficult pre-paradigmatic thinking a group effort instead of a solo effort. This could speed up progress dramatically.

This post and the resulting comments and discussions is an example of the community collectively doing much of the work you describe: traversing levels, practicing good epistemics, and remaining confused.

Having conversations with other LWers (on calls, by DM, or in extended comment threads) is tremendously useful for me. I could produce those same thoughts and critiques, but it would take me longer to arrive at all of those different viewpoints of the issue. I mention this to encourage others to do it. Communication takes time and some additional effort (asking people to talk), but it's often well worth it. Talking to people who are interested in and knowledgeable on the same topics can be an enormous speedup in doing difficult pre-paradigmatic thinking.

LessWrong isn't perfect, but it's a vast improvement on the collaboration tools and communities that have been available to scientists in other fields. We should take advantage of it.

Replies from: Linda Linsefors

↑ comment by Linda Linsefors · 2024-12-29T16:22:45.806Z · LW(p) · GW(p)

Einstein did his pre-paradigmatic work largely alone. Better collaboration might've sped it up.

I think this is false. As I remember hearing the story, he where corresponding with several people via letters.

Replies from: steve2152, Seth Herd

↑ comment by Steven Byrnes (steve2152) · 2024-12-31T12:20:33.400Z · LW(p) · GW(p)

I know very little, but there’s a fun fact here: “During their lifetimes, Darwin sent at least 7,591 letters and received 6,530; Einstein sent more than 14,500 and received more than 16,200.” (Not sure what fraction was technical vs personal.)

Also, this is a brief summary of Einstein’s mathematician friend Marcel Grossmann’s role in general relativity.

Replies from: Seth Herd

↑ comment by Seth Herd · 2025-01-01T20:45:08.946Z · LW(p) · GW(p)

In the piece you linked, it sounds like Einstein had the correct geometry for general relativity one day after he asked for help finding one. Of course, that's one notable success amongst perhaps a lot of collaboration. The number of letters he sent and received implies that he actually did a lot of written collaboration.

I wonder about the value of real-time conversation vs. written exchanges. And the value of being fully engaged; truly curious about your interlocutor's ideas.

My own experience watching progress happen (and not-happen) in theoretical neuroscience is that fully engaged conversations with other true experts with different viewpoints was rare and often critical for real progress.

My perception is that those conversations are tricky to produce. Experts are often splitting their attention between impressing people and coolheaded, openminded discussion. And they weren't really seeking out these conversations, just having them when it was convenient, and being really fully engaged only when the interpersonal vibe happened to be right. Even so, the bit of real conversation I saw seemed quite important.

It would be helpful to understand collaboration on difficult theory better, but it would be a whole research topic.

↑ comment by Seth Herd · 2024-12-29T21:28:36.240Z · LW(p) · GW(p)

By largely alone I meant without the rich collaboration of having an office in the same campus or phone calls or LessWrong.

Replies from: Linda Linsefors

↑ comment by Linda Linsefors · 2024-12-29T22:40:06.850Z · LW(p) · GW(p)

I think the qualitive difference is not as large as you think it is. But I also don't think this is very crux-y for anything, so I will not try to figure out how to translate my reasoning to words, sorry.

↑ comment by Chris_Leong · 2024-12-27T17:17:54.023Z · LW(p) · GW(p)

Agreed. Simply focusing on physics post-docs feels too narrow to me.

Then again, just as John has a particular idea of what good alignment research looks like, I have my own idea: I would lean towards recruiting folk with both a technical and a philosophical background. It's possible that my own idea is just as narrow.

Replies from: johnswentworth

↑ comment by johnswentworth · 2024-12-27T17:47:01.741Z · LW(p) · GW(p)

Simply focusing on physics post-docs feels too narrow to me.

The post did explicitly say "Obviously that doesn't mean we exclusively want physics postdocs".

Replies from: Chris_Leong

↑ comment by Chris_Leong · 2024-12-28T02:12:08.411Z · LW(p) · GW(p)

Thanks for clarifying. Still feels narrow as a primary focus.

comment by Vanessa Kosoy (vanessa-kosoy) · 2024-12-27T13:08:21.906Z · LW(p) · GW(p)

Good post, although I have some misgivings about how unpleasant it must be to read for some people.

One factor not mentioned here is the history of MIRI. MIRI was a pioneer in the field, and it was MIRI who articulated and promoted the agent foundations research agenda. The broad goals of agent foundations^[1] are (IMO) load-bearing for any serious approach to AI alignment. But, when MIRI essentially declared defeat, in the minds of many that meant that any approach in that vein is doomed. Moreover, MIRI's extreme pessimism deflates motivation and naturally produces the thought "if they are right then we're doomed anyway, so might as well assume they are wrong".

Now, I have a lot of respect for Yudkowsky and many of the people who worked at MIRI. Yudkowsky started it all, and MIRI made solid contributions to the field. I'm also indebted to MIRI for supporting me in the past. However, MIRI also suffered from some degree of echo-chamberism, founder-effect-bias, insufficient engagement with prior research (due to hubris), looking for nails instead of looking for hammers, and poor organization^[2].

MIRI made important progress in agent foundations, but also missed an opportunity to do much more. And, while the AI game board is grim, their extreme pessimism is unwarranted overconfidence. Our understanding of AI and agency is poor: this is a strong reason to be pessimistic, but it's also a reason to maintain some uncertainty about everything (including e.g. timelines).

Now, about what to do next. I agree that we need to have our own non-streetlighting community. In my book "non-streelighting" means mathematical theory plus empirical research that is theory-oriented: designed to test hypotheses made by theoreticians and produce data that best informs theoretical research (these are ~necessary but insufficient conditions for non-streetlighting). This community can and should engage with the rest of AI safety, but has to be sufficiently undiluted to have healthy memetics and cross-fertilization.

What does a community look like? It looks like our own organizations, conferences, discussion forums, training and recruitment pipelines, academia labs, maybe journals.

From my own experience, I agree that potential contributors should mostly have skills and knowledge on the level of PhD+. Highlighting physics might be a valid point: I have a strong background in physics myself. Physics teaches you a lot about connecting math to real-world problems, and is also in itself a test-ground for formal epistemology. However, I don't think a background in physics is a necessary condition. At the very least, in my own research programme I have significant room for strong mathematicians that are good at making progress on approximately-concrete problems, even if they won't contribute much on the more conceptual/philosophic level.

^{^}
Which is, creating mathematical theory and tools for understanding agents.
^{^}
I mostly didn't feel comfortable talking about it in the past, because I was on MIRI's payroll. This is not MIRI's fault by any means: they never pressured me to avoid voicing opinions. It still feels unnerving to criticize the people who write your paycheck.

Replies from: johannes-c-mayer, joshcohen

↑ comment by Johannes C. Mayer (johannes-c-mayer) · 2024-12-28T22:25:41.131Z · LW(p) · GW(p)

What are some concrete examples of the of research that MIRI insufficiently engaged with? Are there general categories of prior research that you think are most underutilized by alignment researchers?

Replies from: vanessa-kosoy

↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2024-12-29T10:28:40.048Z · LW(p) · GW(p)

Learning theory, complexity theory and control theory. See the "AI theory" section [LW · GW] of the LTA reading list.

↑ comment by joshcohen · 2025-01-04T01:40:08.313Z · LW(p) · GW(p)

MIRI also suffered from some degree of echo-chamberism, founder-effect-bias, insufficient engagement with prior research (due to hubris)

MIRI had a lot of "not invented here" mindset. It was pointed out, e.g. [here](https://web.archive.org/web/20170918044233/http://files.openphilanthropy.org/files/Grants/MIRI/consolidated_public_reviews.pdf) but unfortunately the mindset is self-reinforcing so there wasn't much to be done.

Replies from: Satron

↑ comment by Satron · 2025-01-22T16:55:32.279Z · LW(p) · GW(p)

It was pointed out, e.g. [here](https://web.archive.org/web/20170918044233/http://files.openphilanthropy.org/files/Grants/MIRI/consolidated_public_reviews.pdf) but unfortunately the mindset is self-reinforcing so there wasn't much to be done.

Your link formatting got messed up here.

comment by leogao · 2024-12-28T05:00:44.950Z · LW(p) · GW(p)

I'm sympathetic to most prosaic alignment work being basically streetlighting. However, I think there's a nirvana fallacy going on when you claim that the entire field has gone astray. It's easiest to illustrate what I mean with an analogy to capabilities.

In capabilities land, there were a bunch of old school NLP/CV people who insisted that there's some kind of true essence of language or whatever that these newfangled neural network things weren't tackling. The neural networks are just learning syntax, but not semantics, or they're ungrounded, or they don't have a world model, or they're not representing some linguistic thing, so therefore we haven't actually made any progress on true intelligence or understanding etc etc. Clearly NNs are just progress on the surface appearance of intelligence while actually just being shallow pattern matching, so any work on scaling NNs is actually not progress on intelligence at all. I think this position has become more untenable over time. A lot of people held onto this view deep into the GPT era but now even the skeptics have to begrudgingly admit that NNs are pretty big progress even if additional Special Sauce is needed, and that the other research approaches towards general intelligence more directly haven't done better.

It's instructive to think about why this was a reasonable thing for people to have believed, and why it turned out to be wrong. It is in fact true that NNs are kind of shallow pattern matchy even today, and that literally just training bigger and bigger NNs eventually runs into problems. Early NNs - heck, even very recent NNs - often have trouble with relatively basic reasoning that humans have no problem with. But the mistake is assuming that this means no progress has been made on "real" intelligence just because no NN so far has perfectly replicated all of human intelligence. Oftentimes, progress towards the hard problem does actually not immediately look like tackling the meat of the hard problem directly.

Of course, there is also a lot of capabilities work that is actually just completely useless for AGI. Almost all of it, in fact. Walk down the aisle at neurips and a minimum of 90% of the papers will fall in this category. A lot of it is streetlighting capabilities in just the way you describe, and does in fact end up completely unimpactful. Maybe this is because all the good capabilities work happens in labs nowadays, but this is true even at earlier neuripses back when all the capabilities work got published. Clearly, a field can be simultaneously mostly garbage and also still make alarmingly fast progress.

I think this is true for basically everything - most work will be crap (often predictably so ex ante), due in part to bad incentives, and then there will be a few people who still do good work anyways. This doesn't mean that any pile of crap must have some good work in there, but it does mean that you can't rule out the existence of good work solely by pointing at the crap and the incentives for crap. I do also happen to believe that there is good work in prosaic alignment, but that goes under the object level argument umbrella, so I won't hash it out here.

Replies from: johnswentworth, eggsyntax

↑ comment by johnswentworth · 2024-12-28T09:48:17.881Z · LW(p) · GW(p)

I think you have two main points here, which require two separate responses. I'll do them opposite the order you presented them.

Your second point, paraphrased: 90% of anything is crap, that doesn't mean there's no progress. I'm totally on board with that. But in alignment today, it's not just that 90% of the work is crap, it's that the most memetically successful work is crap. It's not the raw volume of crap that's the issue so much as the memetic selection pressures.

Your first point, paraphrased: progress toward the the hard problem does not necessarily immediately look like tackling the meat of the hard problem directly. I buy that to some extent, but there are plenty of cases where we can look at what people are doing and see pretty clearly that it is not progress toward the hard problem, whether direct or otherwise. And indeed, I would claim that prosaic alignment as a category is a case where people are not making progress on the hard problems, whether direct or otherwise. In particular, one relevant criterion to look at here is generalizability: is the work being done sufficiently general/robust that it will still be relevant once the rest of the problem is solved (and multiple things change in not-yet-predictable ways in order to solve the rest of the problem)? See e.g. this recent comment [LW(p) · GW(p)] for an object-level example of what I mean.

Replies from: leogao, william-brewer

↑ comment by leogao · 2024-12-28T10:05:34.919Z · LW(p) · GW(p)

in capabilities, the most memetically successful things were for a long time not the things that actually worked. for a long time, people would turn their noses at the idea of simply scaling up models because it wasn't novel. the papers which are in retrospect the most important did not get that much attention at the time (e.g gpt2 was very unpopular among many academics; the Kaplan scaling laws paper was almost completely unnoticed when it came out; even the gpt3 paper went under the radar when it first came out.)

one example of a thing within prosaic alignment that i feel has the possibility of generalizability is interpretability. again, if we take the generalizability criteria and map it onto the capabilities analogy, it would be something like scalability - is this a first step towards something that can actually do truly general reasoning, or is it just a hack that will no longer be relevant once we discover the truly general algorithm that subsumes the hacks? if it is on the path, can we actually shovel enough compute into it (or its successor algorithms) to get to agi in practice, or do we just need way more compute than is practical? and i think at the time of gpt2 these were completely unsettled research questions! it was actually genuinely unclear whether writing articles about ovid's unicorn is a genuine first step towards agi, or just some random amusement that will fade into irrelevancy. i think interp is in a similar position where it could work out really well and eventually become the thing that works, or it could just be a dead end.

Replies from: johnswentworth

↑ comment by johnswentworth · 2024-12-29T00:16:28.542Z · LW(p) · GW(p)

If you're thinking mainly about interp, then I basically agree with what you've been saying. I don't usually think of interp as part of "prosaic alignment", it's quite different in terms of culture and mindset and it's much closer to what I imagine a non-streetlight-y field of alignment would look like. 90% of it is crap (usually in streetlight-y ways), but the memetic selection pressures don't seem too bad.

If we had about 10x more time than it looks like we have, then I'd say the field of interp is plausibly on track to handle the core problems of alignment.

Replies from: leogao

↑ comment by leogao · 2024-12-29T01:23:30.401Z · LW(p) · GW(p)

ok good that we agree interp might plausibly be on track. I don't really care to argue about whether it should count as prosaic alignment or not. I'd further claim that the following (not exhaustive) are also plausibly good (I'll sketch each out for the avoidance of doubt because sometimes people use these words subtly differently):

model organisms - trying to probe the minimal sets of assumptions to get various hypothesized spicy alignment failures seems good. what is the least spoonfed demonstration of deceptive alignment we can get that is analogous mechanistically to the real deal? to what extent can we observe early signs of the prerequisites in current models? which parts of the deceptive alignment arguments are most load bearing?
science of generalization - in practice, why do NNs sometimes generalize and sometimes not? why do some models generalize better than others? In what ways are humans better or worse than NNs at generalizing? can we understand this more deeply without needing mechanistic understanding? (all closely related to ELK)
goodhart robustness - can you make reward models which are calibrated even under adversarial attack, so that when you optimize them really hard, you at least never catastrophically goodhart them?
scalable oversight (using humans, and possibly giving them a leg up with e.g secret communication channels between them, and rotating different humans when we need to simulate amnesia) - can we patch all of the problems with e.g debate? can we extract higher quality work out of real life misaligned expert humans for practical purposes (even if it's maybe a bit cost uncompetitive)?

Replies from: johnswentworth, leon-lang

↑ comment by johnswentworth · 2024-12-29T08:10:48.037Z · LW(p) · GW(p)

All four of those I think are basically useless in practice for purposes of progress toward aligning significantly-smarter-than-human AGI, including indirectly (e.g. via outsourcing alignment research to AI). There are perhaps some versions of all four which could be useful, but those versions do not resemble any work I've ever heard of anyone actually doing in any of those categories.

That said, many of those do plausibly produce value as propaganda for the political cause of AI safety, especially insofar as they involve demoing scary behaviors.

EDIT-TO-ADD: Actually, I guess I do think the singular learning theorists are headed in a useful direction, and that does fall under your "science of generalization" category. Though most of the potential value of that thread is still in interp, not so much black-box calculation of RLCTs.

Replies from: alexander-gietelink-oldenziel, bronson-schoen

↑ comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2024-12-29T09:42:06.206Z · LW(p) · GW(p)

I think we would all be interested to hear you elaborate on why you think these approaches have approximately no value. Perhaps this will be in a follow-up post.

↑ comment by Bronson Schoen (bronson-schoen) · 2024-12-29T10:36:50.449Z · LW(p) · GW(p)

All four of those I think are basically useless in practice for purposes of progress toward aligning significantly-smarter-than-human AGI, including indirectly (e.g. via outsourcing alignment research to AI).

It’s difficult for me to understand how this could be “basically useless in practice” for:

scalable oversight (using humans, and possibly giving them a leg up with e.g secret communication channels between them, and rotating different humans when we need to simulate amnesia) - can we patch all of the problems with e.g debate? can we extract higher quality work out of real life misaligned expert humans for practical purposes (even if it's maybe a bit cost uncompetitive)?

It seems to me you’d want to understand and strongly show how and why different approaches here fail, and in any world where you have something like “outsourcing alignment research” you want some form of oversight.

↑ comment by Leon Lang (leon-lang) · 2024-12-29T09:31:27.924Z · LW(p) · GW(p)

Thanks for the list! I have two questions:

1: Can you explain how generalization of NNs relates to ELK? I can see that it can help with ELK (if you know a reporter generalizes, you can train it on labeled situations and apply it more broadly) or make ELK unnecessary (if weak to strong generalization perfectly works and we never need to understand complex scenarios). But I’m not sure if that’s what you mean.

2: How is goodhart robustness relevant? Most models today don’t seem to use reward functions in deployment, and in training the researchers can control how hard they optimize these functions, so I don’t understand why they necessarily need to be robust under strong optimization.

↑ comment by yams (william-brewer) · 2025-01-31T17:25:15.344Z · LW(p) · GW(p)

there are plenty of cases where we can look at what people are doing and see pretty clearly that it is not progress toward the hard problem

There are plenty of cases where John can glance at what people are doing and see pretty clearly that it is not progress toward the hard problem.

Importantly, people with the agent foundations class of anxieties (which I embrace; I think John is worried about the right things!) do not spend time engaging on a gears level with prominent prosaic paradigms and connecting the high level objection ("it ignores the hard part of the problem") with the details of the research.

"But Tsvi and John actually spend a lot of time doing this."

No, they don't! They paraphrase the core concern over and over again, often seemingly without reading the paper. I don't think reading the paper would change your minds (nor should it!), but I think that there's a culture problem tied to this off-hand dismissal of prosaic work that disincentivizes potential agent foundations (or similar new thing that shares the core concerns of agent foundations) researchers from engaging with, i.e., John.

Prosaic work is fraught and, much of it, doomed. New researchers over-index on tractability because short feedback loops are comforting ('street-lighting'). Why aren't we explaining why that is, on the terms of the research itself, rather than expecting people to be persuaded by the same high level point getting hammered into them again and again?

I've watched this work in real-time. If you listen to someone talk about their work, or read their paper and follow up in person, they are often receptive to a conversation about worlds in which their work is ineffective, evidence that we're likely to be in such a world, and even to shifting the direction of their work in recognition of that evidence.

Instead, people with their eye on the ball are doing this tribalistic(-seeming) thing.

Yup, the deck is stacked against humanity solving the hard problems; for some reason, folks who know that are also committed to playing their hands poorly, and then blaming (only) the stacked deck!

John's recent post on control is a counter-example to the above claims and was, broadly, a big step in the right direction, but had some issues with it, as raised by Redwood in the comments, which are a natural consequence of it being ~a new thing John was doing. I look forward to more posts like that in the future, from John and others, that help new entrants to empirical work (which has a robust talent pipeline!) understand, integrate, and even pivot in response to, the hard parts of the problem.

[edit: I say 'gears level' a couple times, but mean 'more in the direction of gears-level than the critiques that have existed so far']

Replies from: johnswentworth

↑ comment by johnswentworth · 2025-01-31T18:23:51.066Z · LW(p) · GW(p)

Big crux here: I don't actually expect useful research to occur as a result of my control-critique post. Even having updated on the discussion remaining more civil than I expected, I still expect basically-zero people to do anything useful as a result.

As a comparison: I wrote a couple posts on my AI model delta with Yudkowsky [LW · GW] and with Christiano [LW · GW]. For each of them, I can imagine changing ~one big piece in my model, and end up with a model which looks basically like theirs.

By contrast, when I read the stuff written on the control agenda... it feels like there is no model there at all. (Directionally-correct but probably not quite accurate description:) it feels like whoever's writing, or whoever would buy the control agenda, is just kinda pattern-matching natural language strings without tracking the underlying concepts those strings are supposed to represent. (Joe's recent post [LW · GW] on "fake vs real thinking" feels like it's pointing at the right thing here; the posts on control feel strongly like "fake" thinking.) And that's not a problem which gets fixed by engaging at the object level; that type of cognition will mostly not produce useful work, so getting useful work out of such people would require getting them to think in entirely different ways.

... so mostly I've tried to argue at a different level, like e.g. in the Why Not Just... [? · GW] posts. The goal there isn't really to engage the sort of people who would otherwise buy the control agenda, but rather to communicate the underlying problems to the sort of people who would already instinctively feel something is off about the control agenda, and give them more useful frames to work with. Because those are the people who might have any hope of doing something useful, without the whole structure of their cognition needing to change first.

Replies from: william-brewer

↑ comment by yams (william-brewer) · 2025-01-31T18:58:34.988Z · LW(p) · GW(p)

I think the reason nobody will do anything useful-to-John as a result of the control critique post is that control is explicitly not aiming at the hard parts of the problem, and knows this about itself. In that way, control is an especially poorly selected target if the goal is getting people to do anything useful-to-John. I'd be interested in a similar post on the Alignment Faking paper (or model organisms more broadly), on RAT, on debate, on faithful CoT, on specific interpretability paradigms (circuits v SAEs, vs some coherentist approach vs shards vs....), and would expect those to have higher odds of someone doing something useful-to-John. But useful-to-John isn't really the metric I think the field should be using, either....

I'm kind of picking on you here because you are least guilty of this failing relative to researchers in your reference class. You are actually saying anything at all, sometimes with detail, about how you feel about particular things. However, you wouldn't be my first-pick judge for what's useful; I'd rather live in a world where like half a dozen people in your reference class are spending non-zero time arguing about the details of the above agendas and how they interface with your broader models, so that the researchers working on those things can update based on those critiques (there may even be ways for people to apply the vector implied by y'all's collective input, and generate something new / abandon their doomed plans).

↑ comment by eggsyntax · 2025-01-01T00:08:22.353Z · LW(p) · GW(p)

A lot of people held onto this view deep into the GPT era but now even the skeptics have to begrudgingly admit that NNs are pretty big progress even if additional Special Sauce is needed

It's a bit tangential to the context, but this is a topic I have an ongoing interest in: what leads you to believe that the skeptics (in particular NLP people in the linguistics community) have shifted away from their previous positions? My impression has been that many of them (though not all) have failed to really update to any significant degree. Eg here's a paper from just last month which argues that we must not mistake the mere engineering that is LLM behavior for language understanding or production.

comment by Jan_Kulveit · 2024-12-26T19:28:41.123Z · LW(p) · GW(p)

My guess is a roughly equally "central" problem is the incentive landscape around the OpenPhil/Anthropic school of thought

where you see Sam, I suspect something like "the lab memeplexes". Lab superagents have instrumental convergent goals, and the instrumental convergent goals lead to instrumental, convergent beliefs, and also to instrumental blindspots
there are strong incentives for individual people to adjust their beliefs: money, social status, sense of importance via being close to the Ring
there are also incentives for people setting some of the incentives: funding something making progress on something seems more successful and easier than funding the dreaded theory

comment by [deleted] · 2024-12-26T20:30:51.622Z · LW(p) · GW(p)

(Prefatory disclaimer that, admittedly as an outsider to this field, I absolutely disagree with the labeling of prosaic [LW · GW] AI work as useless streetlighting, for reasons building upon what many commenters wrote in response to the very posts you linked here as assumed background material. But in the spirit of your post, I shall ignore that moving forward.)

The "What to Do About It" [LW · GW] section dances around but doesn't explicitly name one of the core challenges of theoretical agent-foundations [LW · GW] work that aims to solve the "hard bits" [LW · GW] of the alignment challenge, namely the seeming lack of reliable feedback loops [LW · GW] that give you some indication that you are pushing towards something practically useful in the end instead of just a bunch of cool math that nonetheless resides alone in its separate magisterium. As Conor Leahy concisely put it [LW(p) · GW(p)]:

Humans are really, really bad at doing long chains of abstract reasoning without regular contact with reality, so in practice imo good philosophy has to have feedback loops with reality, otherwise you will get confused.

He was talking about philosophy in particular at that juncture, in response to Wei Dai's concerns [LW · GW] over metaphilosophical competence, but this point seems to me to generalize to a whole bunch of other areas as well. Indeed, I have talked about this before [LW(p) · GW(p)].

... and in my experience, there are people who can get traction on the core hard problems. Most notably physicists - when they grok the hard parts, they tend to immediately see footholds, rather than a blank impassable wall.

Do they get traction on "core hard problems" because of how Inherently Awesome they are as researchers, or do they do so because the types of physics problems we mostly care about currently are such that, while the generation of (worthwhile) grand mathematical theories is hard, verifying them is (comparatively) easier because we can run a bunch of experiments (or observe astronomical data etc., in the super-macro scale) to see if the answers they spit out comply with reality? I am aware of your general perspective [LW · GW] on this matter, but I just... still completely disagree, for reasons other people have pointed out [LW(p) · GW(p)] (see also Vanessa Kosoy's comment here [LW · GW]). Is this also supposed to be an implicitly assumed bit of background material?

And when we don't have those verifying experiments at hand, do we not get stuff like string theory, where the math is beautiful and exquisite (in the domains it has been extended do) but debate by "physics postdocs" over whether it's worthwhile to keep funding and pursuing it keeps raging on as a Theory of Everything keeps eliding our grasp? I'm sure people with more object-level expertise on this can correct my potential misconceptions if need be.

Idk man, some days I'm half-tempted to believe that all non-prosaic alignment work is a bunch of "streetlighting." Yeah, it doesn't result in the kind of flashy papers full of concrete examples about current models that typically get associated with the term-in-scare-quotes. But it sure seems to cover itself in a veneer of respectability by giving a (to me) entirely unjustified appearance of rigor and mathematical precision and robustness [LW(p) · GW(p)] to claims about what will happen [LW(p) · GW(p)] in the real world based on nothing more than a bunch of vibing about toy models that assume away the burdensome real-world details [LW(p) · GW(p)] serving as evidence whether the approaches are even on the right track [LW(p) · GW(p)]. A bunch of models that seem both woefully underpowered for the Wicked Problems [LW · GW] they must solve and also destined to underfit their target, for they (currently) all exist and supposedly apply independently of the particular architecture, algorithms, training data, scaffolding etc., that will result in the first patch of really powerful AIs. The contents and success stories of Vanessa Kosoy's desiderata [LW · GW], or of your own search for natural abstractions [LW · GW], or of Alex Altair's essence of agent foundations [LW · GW], or of Orthogonal's QACI [LW · GW], etc., seem entirely insensitive to the fact that we are currently dealing with multimodal LLMs combined with RL instead of some other paradigm, which in my mind [LW(p) · GW(p)] almost surely disqualifies them as useful-in-the-real-world when the endgame [LW · GW] hits.

There's a famous Eliezer quote about how for every correct answer to a precisely-stated problem, there are a million times more wrong answers one could have given instead. I would build on that to say that for every powerfully predictive, but lossy and reductive [LW(p) · GW(p)] mathematical model of a complex real-world system, there are a million times more similar-looking mathematical models that fail to capture the essence of the problem and ultimately don't generalize well at all. And it's only by grounding yourself to reality and hugging the query tight [LW · GW] by engaging with real-world empirics that you can figure out if the approach you've chosen is in the former category as opposed to the latter.

(I'm briefly noting that I don't fully endorse everything I said in the previous 2 paragraphs, and I realize that my framing is at least a bit confrontational and unfair. Separately, I acknowledge the existence of arguably-non-prosaic and mostly theoretical alignment approaches like davidad's Open Agency Architecture [LW · GW], CHAI's CIRL [LW · GW] and utility uncertainty [LW · GW], Steve Byrnes's work on brain-like AGI safety [? · GW], etc., that don't necessarily appear to fit this mold. I have varying opinions on the usefulness and viability of such approaches.)

Replies from: sharmake-farah, Jozdien, mesaoptimizer

↑ comment by Noosphere89 (sharmake-farah) · 2024-12-27T00:38:59.094Z · LW(p) · GW(p)

I actually disagree with the natural abstractions research being ungrounded. Indeed, I think there is reason to believe that at least some of the natural abstractions work, especially the natural abstraction hypothesis actually sorts of holds true for today's AI, and thus is the most likely out of the theoretical/agent-foundation approaches to work (I'm usually critical to agent foundations, but John Wentworth's work is an exception that I'd like funding for).

For example, this post does an experiment that shows that OOD data still makes the Platonic Representation Hypothesis true, meaning that it's likely that deeper factors are at play than just shallow similarity:

https://www.lesswrong.com/posts/Su2pg7iwBM55yjQdt/exploring-the-platonic-representation-hypothesis-beyond-in [LW · GW]

Replies from: None

↑ comment by [deleted] · 2024-12-27T09:21:41.661Z · LW(p) · GW(p)

I'm wary of a possible equivocation about what the "natural abstraction hypothesis" means here.

If we are referring [LW · GW] to the redundant information hypothesis and various kinds [LW · GW] of selection theorems, this is a mathematical framework that could end up being correct, is not at all ungrounded, and Wentworth sure seems like the man for the job.

But then you are still left with the task of grounding this framework in physical reality to allow you to make correct empirical predictions about and real-world interventions on what you will see from more advanced models. Our physical world abstracting well [LW · GW] seems plausible (not necessarily >50% likely), and these abstractions being "natural" (e.g., in a category-theoretic sense) seems likely conditional on the first clause of this sentence being true, but I give an extremely low probability to the idea that these abstractions will be used by any given general intelligence or (more to the point) advanced AI model to a large and wide enough extent that retargeting the search [LW · GW] is even close to possible.

And indeed, it is the latter question that represents the make-or-break moment for natural abstractions' theory of change [LW · GW], for it is only when the model in front of you (as opposed to some other idealized model) uses these specific abstractions that you can look through the AI's internal concepts and find your desired alignment target [LW · GW].

Rohin Shah has already explained [LW(p) · GW(p)] the basic reasons why I believe the mesa-optimizer-type search probably won't exist/be findable in the inner workings of the models we encounter: "Search is computationally inefficient relative to heuristics, and we'll be selecting really hard on computational efficiency." And indeed, when I look at the only general intelligences I have ever encountered in my entire existence thus far, namely humans, I see mostly just a kludge of impulses [LW(p) · GW(p)] and heuristics [LW · GW] that depend very strongly (almost entirely) on our specific architectural make-up [LW · GW] and the contextual feedback we encounter [? · GW] in our path through life. Change either of those and the end result shifts massively.

And even moving beyond that, is the concept of the number "three" a natural abstraction? Then I see entire collections and societies of (generally intelligent) human beings today who don't adopt it. Are the notions of "pressure" and "temperature" and "entropy" natural abstractions? I look at all human beings in 1600 and note that not a single one of them had ever correctly conceptualized a formal version of any of those; and indeed, even making a conservative estimate of the human species (with an essentially unchanged modern cognitive architecture) having existed for 200k years, this means that for 99.8% of our species' history, we had no understanding whatsoever of concepts as "universal" and "natural" as that. If you look at subatomic particles like electrons [LW(p) · GW(p)] or stuff in quantum mechanics, the percentage manages to get even higher. And that's only conditioning on abstractions about the outside world that we have eventually managed to figure out; what about the other unknown unknowns?

For example, this post does an experiment that shows that OOD data still makes the Platonic Representation Hypothesis true, meaning that it's likely that deeper factors are at play than just shallow similarity

I don't think it shows that at all, since I have not been able to find any analysis of the methodology, data generation, discussion of results, etc. With no disrespect to the author (who surely wasn't intending for his post to be taken as authoritative as a full paper in terms of updating towards his claim), this is shoddy [? · GW] science, or rather not science at all, just a context-free correlation matrix.

Anyway, all this is probably more fit for a longer discussion [LW(p) · GW(p)] at some point.

Replies from: Thane Ruthenis, sharmake-farah

↑ comment by Thane Ruthenis · 2024-12-27T16:54:35.458Z · LW(p) · GW(p)

Rohin Shah has already explained [LW(p) · GW(p)] the basic reasons why I believe the mesa-optimizer-type search probably won't exist/be findable in the inner workings of the models we encounter: "Search is computationally inefficient relative to heuristics, and we'll be selecting really hard on computational efficiency."

I think this statement is quite ironic in retrospect, given how OpenAI's o-series seems to work (at train-time and at inference-time both), and how much AI researchers hype it up.

By contrast, my understanding is that the sort of search John is talking about retargeting isn't the brute-force babble-and-prune algorithms, but a top-down heuristical-constraint-based search [LW · GW].

So it is in fact the ML researchers now who believe in the superiority of the computationally inefficient search; not the agency theorists.

Replies from: sharmake-farah, rohinmshah

↑ comment by Noosphere89 (sharmake-farah) · 2024-12-27T17:27:24.131Z · LW(p) · GW(p)

Re the OpenAI o-series and search, my initial prediction is that Q*/MCTS search will work well on problems that are easy to verify and and easy to get training data for, and not work if either of these 2 conditions are violated, and secondarily will be reliant on the model having good error correction capabilities to use the search effectively, which is why I expect we can make RL capable of superhuman performance on mathematics/programming with some rather moderate schlep/drudge work, and I also expect cost reductions such that it can actually be practical, but I'm only giving a 50/50 chance by 2028 for superhuman performance as measured by benchmarks in these domains.

I think my main difference from you, Thane Ruthenis is I expect costs to reduce surprisingly rapidly, though this is admittedly untested.

This will accelerate AI progress, but not immediately cause an AI explosion, though in the more extreme paces this could create something like a scenario where programming companies are founded by a few people smartly managing a lot of programming AIs, and programming/mathematics experiencing something like what happened to the news industry from the rise of the internet, where there was a lot of bankruptcy of the middle end, the top end won big, and most people are in the bottom end.

Also, correct point on how a lot of people's conceptions of search are babble-and-prune, not top down search like MCTS/Q*/BFS/DFS/A* (not specifically targeted at sunwillrise here).

By contrast, my understanding is that the sort of search John is talking about retargeting isn't the brute-force babble-and-prune algorithms, but a top-down heuristical-constraint-based search [LW · GW].

Replies from: Thane Ruthenis

↑ comment by Thane Ruthenis · 2024-12-27T17:53:39.394Z · LW(p) · GW(p)

I'm not strongly committed to the view that the costs won't rapidly reduce: I can certainly see the worlds in which it's possible to efficiently distill trees-of-thought unrolls into single chains of thoughts. Perhaps it scales iteratively, where we train a ML model to handle the next layer of complexity by generating big ToTs, distilling them into CoTs, then generating the next layer of ToTs using these more-competent CoTs, etc.

Or perhaps distillation doesn't work that well, and the training/inference costs grow exponentially (combinatorially?).

Replies from: sharmake-farah

↑ comment by Noosphere89 (sharmake-farah) · 2024-12-27T18:07:02.447Z · LW(p) · GW(p)

Yeah, we will have to wait at least several years.

One confound in all of this is that big talent is moving out of OpenAI, which means I'm more bearish on the company's future prospects specifically without it being that much of a detriment towards progress towards AGI.

↑ comment by Rohin Shah (rohinmshah) · 2024-12-29T16:17:39.024Z · LW(p) · GW(p)

I think this statement is quite ironic in retrospect, given how OpenAI's o-series seems to work

I stand by my statement and don't think anything about the o-series model invalidates it.

And to be clear, I've expected for many years that early powerful AIs will be expensive to run, and have critiqued people for analyses that implicitly assumed or implied that the first powerful AIs will be cheap, prior to the o-series being released. (Though unfortunately for the two posts I'm thinking of, I made the critiques privately.)

There's a world of difference between "you can get better results by thinking longer" (yeah, obviously this was going to happen) and "the AI system is a mesa optimizer in the strong sense that it has an explicitly represented goal such that you can retarget the search" (I seriously doubt it for the first transformative AIs, and am uncertain for post-singularity superintelligence).

Replies from: Thane Ruthenis

↑ comment by Thane Ruthenis · 2024-12-29T16:42:44.489Z · LW(p) · GW(p)

To lay out my arguments properly:

"Search is ruinously computationally inefficient" does not work as a counter-argument against the retargetability of search, because the inefficiency argument applies to babble-and-prune search, not to the top-down heuristical-constraint-based search that was/is being discussed.
There are valid arguments against easily-retargetable heuristics-based search as well (I do expect many learned ML algorithms to be much messier than that). But this isn't one of them.
ML researchers are currently incredibly excited about the inference-time scaling laws, talking about inference runs costing millions/billions of dollars, and how much capability will be unlocked this way.
The o-series paradigm would use this compute to, essentially, perform babble-and-prune search. The pruning would seem to be done by some easily-swappable evaluator (either the system's own judgement based on the target specified in a prompt, or an external theorem-prover, etc.).
If things will indeed go this way, then it would seem that a massive amount of capabilities will be based on highly inefficient babble-and-prune search, and that this search would be easily retargetable by intervening on one compact element of the system (the prompt, or the evaluator function).

Replies from: rohinmshah

↑ comment by Rohin Shah (rohinmshah) · 2024-12-29T17:05:31.844Z · LW(p) · GW(p)

Re: (1), if you look through the thread for the comment of mine that was linked above, I respond to top-down heuristical-constraint-based search as well. I agree the response is different and not just "computational inefficiency".

Re: (2), I agree that near-future systems will be easily retargetable by just changing the prompt or the evaluator function (this isn't new to the o-series, you can also "retarget" any LLM chatbot by giving it a different prompt). If this continues to superintelligence, I would summarize it as "it turns out alignment wasn't a problem" (e.g. scheming never arose, we never had problems with LLMs exploiting systematic mistakes in our supervision, etc). I'd summarize this as "x-risky misalignment just doesn't happen by default", which I agree is plausible (see e.g. here [LW(p) · GW(p)]), but when I'm talking about the viability of alignment plans like "retarget the search" I generally am assuming that there is some problem to solve.

(Also, random nitpick, who is talking about inference runs of billions of dollars???)

Replies from: Thane Ruthenis

↑ comment by Thane Ruthenis · 2024-12-29T17:39:54.402Z · LW(p) · GW(p)

Yup, I read through it after writing the previous response and now see that you don't need to be convinced of that point. Sorry about dragging you into this.
I could nitpick the details here, but I think the discussion has kind of wandered away from any pivotal points of disagreement, plus John didn't want object-level arguments under this post. So I petition to leave it at that.

Also, random nitpick, who is talking about inference runs of billions of dollars???

There's a log-scaling curve, OpenAI have already spent on the order of a million dollars [LW(p) · GW(p)] just to score well on some benchmarks, and people are talking about "how much would you be willing to pay for the proof of the Riemann Hypothesis?". It seems like a straightforward conclusion that if o-series/inference-time scaling works as well as ML researchers seem to hope, there'd be billion-dollar inference runs funded by some major institutions.

Replies from: rohinmshah

↑ comment by Rohin Shah (rohinmshah) · 2024-12-29T19:07:28.045Z · LW(p) · GW(p)

OpenAI have already spent on the order of a million dollars [LW(p) · GW(p)] just to score well on some benchmarks

Note this is many different inference runs each of which was thousands of dollars. I agree that people will spend billions of dollars on inference in total (which isn't specific to the o-series of models). My incredulity was at the idea of spending billions of dollars on a single episode, which is what I thought you were talking about given that you were talking about capability gains from scaling up inference-time compute.

↑ comment by Noosphere89 (sharmake-farah) · 2024-12-27T15:15:20.577Z · LW(p) · GW(p)

Yeah, it hasn't been shown that these abstractions can ultimately be retargeted by default for today's AI.

↑ comment by Jozdien · 2024-12-27T13:05:31.891Z · LW(p) · GW(p)

The contents and success stories of Vanessa Kosoy's desiderata [LW · GW], or of your own search for natural abstractions [LW · GW], or of Alex Altair's essence of agent foundations [LW · GW], or of Orthogonal's QACI [LW · GW], etc., seem entirely insensitive to the fact that we are currently dealing with multimodal LLMs combined with RL instead of some other paradigm, which in my mind [LW(p) · GW(p)] almost surely disqualifies them as useful-in-the-real-world when the endgame [LW · GW] hits.

(I haven't read your comments you link, so apologies if you've already responded to this point before).

I can't speak to most of these simply out of lack of deep familiarity, but I don't think natural abstractions is disqualified at all by this.

What do we actually want out of interpretability? I don't think mechanistic interpretability, as it stands currently, gives us explanations of the form we actually want. For example, what are a model's goals? Is it being deceptive? To get answers to those questions, you want to first know what those properties actually look like - you can't get away with identifying activations corresponding to how to deceive humans, because those could relate to a great number of things (e.g. modelling other deceptive agents). Composability [? · GW] is a very non-trivial problem.

If you want to answer those questions, you need to find a way to get better measures of whatever property you want to understand. This is the central idea behind Eliciting Latent Knowledge and other work that aims for unsupervised honesty (where the property is honesty), what I call high-level interpretability of inner [LW · GW] search [LW · GW] / objectives [LW · GW], etc.

Natural abstractions is more agnostic about what kinds of properties we would care about, and tries to identify universal building blocks for any high-level property like this. I am much more optimistic about picking a property and going with it, and I think this makes the problem easier, but that seems like a different disagreement than yours considering both are inevitably somewhat conceptual and require more prescriptive work than work focusing solely on frontier models.

If you wanted to get good handles to steer your model at all, you're going to have to do something like figuring out the nature of the properties you care about. You can definitely make that probem easier by focusing on how those properties instantiate in specific classes of systems like LLMs or neural nets (and I do in my work), but you still have to deal with a similar version of the problem in the end. John is sceptical enough of this paradigm being the one that leads us to AGI that he doesn't want to bet heavily on his work only being relevant if that turns out to be true, which I think is pretty reasonable.

(These next few sentences aren't targeted at you in particular). I often see claims made of the form: "[any work that doesn't look like working directly with LLMs] hasn't updated on the fact that LLMs happened". Sometimes that's true! But very commonly I also see the claim made without understanding what that work is actually trying to do, or what kind of work we would need to reliably align / interpret super-intelligent LLMs-with-RL. I don't know whether it's true of the other agent foundations work you link to, but I definitely don't think natural abstractions hasn't updated on LLMs being the current paradigm.

Do they get traction on "core hard problems" because of how Inherently Awesome they are as researchers, or do they do so because the types of physics problems we mostly care about currently are such that, while the generation of (worthwhile) grand mathematical theories is hard, verifying them is (comparatively) easier because we can run a bunch of experiments (or observe astronomical data etc., in the super-macro scale) to see if the answers they spit out comply with reality? I am aware of your general perspective [LW · GW] on this matter, but I just... still completely disagree, for reasons other people have pointed out [LW(p) · GW(p)] (see also Vanessa Kosoy's comment here [LW · GW]). Is this also supposed to be an implicitly assumed bit of background material?

Agreed that this is a plausible explanation of what's going on. I think that the bottleneck on working on good directions in alignment is different though, so I don't think the analogy carries over very well. I think that reliable feedback loops are very important in alignment research as well to be clear, I just don't think the connection to physicists routes through that.

↑ comment by mesaoptimizer · 2024-12-27T13:45:22.163Z · LW(p) · GW(p)

Even if I'd agree with your conclusion, your argument seems quite incorrect to me.

the seeming lack of reliable feedback loops [LW · GW] that give you some indication that you are pushing towards something practically useful in the end instead of just a bunch of cool math that nonetheless resides alone in its separate magisterium

That's what math always is. The applicability of any math depends on how well the mathematical models reflect the situation involved.

would build on that to say that for every powerfully predictive, but lossy and reductive [LW(p) · GW(p)] mathematical model of a complex real-world system, there are a million times more similar-looking mathematical models that fail to capture the essence of the problem and ultimately don’t generalize well at all. And it’s only by grounding yourself to reality and hugging the query tight [LW · GW] by engaging with real-world empirics that you can figure out if the approach you’ve chosen is in the former category as opposed to the latter.

It seems very unlikely to me that you'd have many 'similar-looking mathematical models'. If a class of real-world situations seems to be abstracted in multiple ways such that you have hundreds (not even millions) of mathematical models that supposedly could capture its essence, maybe you are making a mistake somewhere in your modelling. Abstract away the variations. From my experience, you may have a small bunch of mathematical models that could likely capture the essence of the class of real-world situations, and you may debate with your friends about which one is more appropriate, but you will not have 'multiple similar-looking models'.

Nevertheless, I agree with your general sentiment. I feel like humans will find it quite difficult make research progress without concrete feedback loops, and that actually trying stuff with existing examples of models (that is, the stuff that Anthropic and Apollo are doing, for example) provide valuable data points.

I also recommend maybe not spending so much time reading LessWrong and instead reading STEM textbooks.

comment by testingthewaters · 2024-12-26T21:24:24.025Z · LW(p) · GW(p)

Epistemic status: This is a work of satire. I mean it---it is a mean-spirited and unfair assessment of the situation. It is also how, some days, I sincerely feel.

A minivan is driving down a mountain road, headed towards a cliff's edge with no guardrails. The driver floors the accelerator.

Passenger 1: "Perhaps we should slow down somewhat."

Passengers 2, 3, 4: "Yeah, that seems sensible."

Driver: "No can do. We're about to be late to the wedding."

Passenger 2: "Since the driver won't slow down, I should work on building rocket boosters so that (when we inevitably go flying off the cliff edge) the van can fly us to the wedding instead."

Passenger 3: "That seems expensive."

Passenger 2: "No worries, I've hooked up some funding from Acceleration Capital. With a few hours of tinkering we should get it done."

Passenger 1: "Hey, doesn't Acceleration Capital just want vehicles to accelerate, without regard to safety?"

Passenger 2: "Sure, but we'll steer the funding such that the money goes to building safe and controllable rocket boosters."

The van doesn't slow down. The cliff looks closer now.

Passenger 3: [looking at what Passenger 2 is building] "Uh, haven't you just made a faster engine?"

Passenger 2: "Don't worry, the engine is part of the fundamental technical knowledge we'll need to build the rockets. Also, the grant I got was for building motors, so we kinda have to build one."

Driver: "Awesome, we're gonna get to the wedding even sooner!" [Grabs the engine and installs it. The van speeds up.]

Passenger 1: "We're even less safe now!"

Passenger 3: "I'm going to start thinking about ways to manipulate the laws of physics such that (when we inevitably go flying off the cliff edge) I can manage to land us safely in the ocean."

Passenger 4: "That seems theoretical and intractable. I'm going to study the engine to figure out just how it's accelerating at such a frightening rate. If we understand the inner workings of the engine, we should be able to build a better engine that is more responsive to steering, therefore saving us from the cliff."

Passenger 1: "Uh, good luck with that, I guess?"

Nothing changes. The cliff is looming.

Passenger 1: "We're gonna die if we don't stop accelerating!"

Passenger 2: "I'm gonna finish the rockets after a few more iterations of making engines. Promise."

Passenger 3: "I think I have a general theory of relativity as it relates to the van worked out..."

Passenger 4: "If we adjust the gear ratio... Maybe add a smart accelerometer?"

Driver: "Look, we can discuss the benefits and detriments of acceleration over hors d'oeuvres at the wedding, okay?"

Replies from: romeostevensit, faul_sname

↑ comment by romeostevensit · 2024-12-27T07:21:19.329Z · LW(p) · GW(p)

unfortunately, the disanalogy is that any driver who moves their foot towards the brakes is almost instantly replaced with one who won't.

Replies from: testingthewaters

↑ comment by testingthewaters · 2024-12-29T08:10:14.015Z · LW(p) · GW(p)

Even so, it seems obvious to me that addressing the mysterious issue of the accelerating drivers is the primary crux in this scenario.

Replies from: romeostevensit

↑ comment by romeostevensit · 2024-12-29T09:42:15.319Z · LW(p) · GW(p)

Yes, and I don't mean to overstate a case for helplessness. Demons love convincing people that the anti demon button doesn't work so that they never press it even though it is sitting right out in the open.

↑ comment by faul_sname · 2024-12-27T00:26:50.867Z · LW(p) · GW(p)

Driver: My map doesn't show any cliffs

Passenger 1: Have you turned on the terrain map? Mine shows a sharp turn next to a steep drop coming up in about a mile

Passenger 5: Guys maybe we should look out the windshield instead of down at our maps?

Driver: No, passenger 1, see on your map that's an alternate route, the route we're on doesn't show any cliffs.

Passenger 1: You don't have it set to show terrain.

Passenger 6: I'm on the phone with the governor now, we're talking about what it would take to set a 5 mile per hour national speed limit.

Passenger 7: Don't you live in a different state?

Passenger 5: The road seems to be going up into the mountains, though all the curves I can see from here are gentle and smooth.

Driver and all passengers in unison: Shut up passenger 5, we're trying to figure out if we're going to fall off a cliff here, and if so what we should do about it.

Passenger 7: Anyway, I think what we really need to do to ensure our safety is to outlaw automobiles entirely.

Passenger 3: The highest point on Earth is 8849m above sea level, and the lowest point is 430 meters below sea level, so the cliff in front of us could be as high as 9279m.

comment by Charbel-Raphaël (charbel-raphael-segerie) · 2024-12-27T14:04:42.784Z · LW(p) · GW(p)

I think I do agree with some points in this post. This failure mode is the same as the one I mentioned about why people are doing interpretability for instance [LW · GW] (section Outside view: The proportion of junior researchers doing Interp rather than other technical work is too high [LW · GW]), and I do think that this generalizes somewhat to whole field of alignment. But I'm highly skeptical that recruiting a bunch of physicists to work on alignment would be that productive:

Empirically, we've already kind of tested this, and it doesn't work.
- I don't think that what Scott Aaronson produced while at OpenAI [LW · GW] had really helped AI Safety: He is exactly doing what is criticized in the post: Streetlight research and using techniques that he was already familiar with from his previous field of research, I don't think the author of the OP would disagree with me. Maybe n=1, but it was one of the most promising shots.
- Two years ago, I was doing field-building and trying to source talent, primarily selecting based on pure intellect and raw IQ. I've organized the Von Neumann Symposium around the problem of corrigibility, I targeted IMO laureates, and individuals from the best school in France, ENS Ulm, which arguably has the highest concentration of future Nobel laureates in the world. However, pure intelligence doesn't work. In the long term, the individuals who succeeded in the field weren't the valedictorians from France's top school, but rather those who were motivated, had read The Sequences, were EA people, possessed good epistemology, and had a willingness to share their work online (maybe you are going to say that the people I was targeting were too young, but I think my little empirical experience is already much better than the speculation in the OP).
- My prediction is that if you put a group of skilled physicists in a room, first, it's not even sure they would find that many people motivated in this reference class, and I don't think the few who would be motivated would produce good-quality work.
- For the ML4Good bootcamps, the scoring system reflects this insight. We use multiple indicators and don't rely solely on pure IQ to select participants, because there is little correlation between pure high IQ and long term quality production.
I believe the biggest mistake in the field is trying to solve "Alignment" rather than focusing on reducing catastrophic AI risks. Alignment is a confused paradigm; it's a conflationary alliance term [? · GW] that has sedimented over the years. It's often unclear what people mean when they talk about it: Safety isn't safety without a social model [LW · GW].
- Think about what has been most productive in reducing AI risks so far? My short list would be:
  - The proposed SB 1047 legislation.
  - The short statement on AI risks
  - Frontier AI Safety Commitments, AI Seoul Summit 2024, to encourage labs to publish their responsible scaling policies.
  - Scary demonstrations to showcase toy models of deception, fake alignment, etc, and to create more scientific consensus, which is very very needed
- As a result, the field of "Risk Management" is more fundamental for reducing AI risks than "AI Alignment." In my view, the theoretical parts of the alignment field have contributed far less to reducing existential risks than the responsible scaling policies or the draft of the EU AI Act's Code of Practice for General Purpose AI Systems, which is currently not too far from being the state-of-the-art for AI risk management. Obviously, it's still incomplete, but that's the direction that is I think most productive today.
Related, The Swiss cheese model of safety is underappreciated in the field. This model has worked across other industries and seems to be what works for the only general intelligence we know: humans. Humans use a mixture of strategies for safety we could imitate for AI safety (see this draft). However, the agent foundations community seems to be completely neglecting this.

Replies from: sharmake-farah, Chris_Leong, sharmake-farah, Wcargo

↑ comment by Noosphere89 (sharmake-farah) · 2025-01-01T17:17:16.484Z · LW(p) · GW(p)

Related, The Swiss cheese model of safety is underappreciated in the field. This model has worked across other industries and seems to be what works for the only general intelligence we know: humans. Humans use a mixture of strategies for safety we could imitate for AI safety (see this draft). However, the agent foundations community seems to be completely neglecting this.

The big issue is that a lot of the swiss cheese strategy assumes failures are sufficiently uncorrelated that multiple defense layers stack, but AI can coordinate failures such that unlikely events become probable ones even through layers of defenses.

However, I think AI progress is slow and continuous enough that I do think that swiss cheese models are reasonably useful, and I do think there's a band of performance where optimization doesn't totally obsolete the strategy.

↑ comment by Chris_Leong · 2025-03-25T07:02:08.374Z · LW(p) · GW(p)

The problem is that the Swiss cheese model and legislative efforts primarily just buy us time. We still need to be making progress towards a solution and whilst it's good for some folk to bet on us duct-taping our way through, I think we also want some folk attempting to work on things that are more principled.

↑ comment by Noosphere89 (sharmake-farah) · 2025-01-01T17:21:47.308Z · LW(p) · GW(p)

Think about what has been most productive in reducing AI risks so far? My short list would be:
The proposed SB 1047 legislation.
The short statement on AI risks
Frontier AI Safety Commitments, AI Seoul Summit 2024, to encourage labs to publish their responsible scaling policies.
Scary demonstrations to showcase toy models of deception, fake alignment, etc, and to create more scientific consensus, which is very very needed

I'd probably add AI control to this list, as it's a method to use AIs of a specific capability range without AIs escaping even assuming misalignment of AIs.

Unusually relative to most AI governance people, I think regulation is most helpful in cases where AI alignment succeeds by a combination of instruction following/personal intent alignment, but no CEV of humanity occurs, and CEV alignment only occurs some of the time (and even then, it's personal CEV alignment), which I think is the most plausible world right now.

Replies from: charbel-raphael-segerie

↑ comment by Charbel-Raphaël (charbel-raphael-segerie) · 2025-01-01T17:32:07.051Z · LW(p) · GW(p)

No, AI control doesn't pass the bar, because we're still probably fucked until we have a solution for open source AI or race for superintelligence, for example, and OpenAI doesn't seem to be planning to use control, so this looks to me like the research that's sort of being done in your garage but ignored by the labs (and that's sad, control is great I agree).

Replies from: ryan_greenblatt

↑ comment by ryan_greenblatt · 2025-01-01T20:23:41.956Z · LW(p) · GW(p)

I think this somewhat understates the level of buy in from labs.

I agree that "quickly building superintelligence" makes control look notably less appealing. (Though note that this also applies for any prosaic method that is unlikely to directly productively scale to superintelligence.)

I'm not very worried about open source AI at the moment, but I am quite worried about inadequate security undermining control and other hopes.

Replies from: charbel-raphael-segerie

↑ comment by Charbel-Raphaël (charbel-raphael-segerie) · 2025-01-01T21:17:51.940Z · LW(p) · GW(p)

Maybe you have some information that I don't have about the labs and the buy-in? You think this applies to OpenAI and not just Anthropic?

But as far as open source goes, I'm not sure. Deepseek? Meta? Mistral? xAI? Some big labs are just producing open source stuff. DeepSeek is maybe only 6 months behind. Is that enough headway?

It seems to me that the tipping point for many people (I don't know for you) about open source is whether or not open source is better than close source, so this is a relative tipping point in terms of capabilities. But I think we should be thinking about absolute capabilities. For example, what about bioterrorism? At some point, it's going to be widely accessible. Maybe the community only cares about X-risks, but personally I don't want to die either.

Is there a good explanation online of why I shouldn't be afraid of open-source?

Replies from: ryan_greenblatt

↑ comment by ryan_greenblatt · 2025-01-01T22:21:00.736Z · LW(p) · GW(p)

As far as open source, the quick argument is that once AI becomes sufficiently powerful, it's unlikely that the incentives are toward open sourcing it (including goverment incentives). This isn't totally obvious though, and this doesn't rule out catastrophic bioterrorism (more like COVID scale than extinction scale) prior to AI powerful enough to substantially accelerate R&D across many sectors (including bio). It also doesn't rule out powerful AI being open sourced years after it is first created (though the world might be radically transformed by this point anyway). I don't have that much of an inside view on this, but reasonable people I talk to are skeptical that open source is a very big deal (in >20% of worlds) from at least an x-risk perspective. (Seems very sensitive to questions about government response, how much stuff is driven by ideology, and how much people end up being compelled (rightly or not) by "commoditize your complement" (and ecosystem) economic arguments.)

Open source seems good on current margins, at least to the extent it doesn't leak algorithmic advances / similar.

Replies from: charbel-raphael-segerie

↑ comment by Charbel-Raphaël (charbel-raphael-segerie) · 2025-01-04T15:50:45.833Z · LW(p) · GW(p)

I would be happy to discuss in a dialogue about this. This seems to be an important topic, and I'm really unsure about many parameters here.

↑ comment by WCargo (Wcargo) · 2024-12-28T11:36:45.827Z · LW(p) · GW(p)

I agree with claim 2-3 but not with claim 1

I think « random physicist » is not super fair, it looks like from his stand point he indeed met physicist willing to do « alignment » research, and had backgrounds in research and developping theory
We didn’t find Phd student to work on alignment but also we didn’t try (at least not cesia / effisciences)
Its true that most of the people we find that wanted to work on the problem were the motivated ones, but from the point of view of the alignment problem still recruiting them could be a mistake (saturating the field etc)

Replies from: charbel-raphael-segerie

↑ comment by Charbel-Raphaël (charbel-raphael-segerie) · 2024-12-28T18:16:37.774Z · LW(p) · GW(p)

What do you think of my point about Scott Aaronson? Also, since you agree with points 2 and 3, it seems that you also think that the most useful work from last year didn't require advanced physics, so isn't this a contradiction with you disagreing with point 1?

comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2024-12-27T21:18:21.551Z · LW(p) · GW(p)

hi John,

Let's talk about a hypothetical physicist-turned-alignment researcher who, (for no particular reason), we'll call John. This researcher needs to periodically repeat to himself that Only Physicists do Real Work, but also needs to write an Alignment Post-Mortem. Maybe he needs the Post-Mortem as a philosophical fig leaf to justify his own program lack of empirical grounding, or maybe he's honestly concerned but not very good at seeing the Planck in his own eye. Either way, he meets with two approaches to writing critiques - let's call them (again for no particular reason) "careful charitable engagement" and "provocative dismissal." Both can be effective, but they have very different impacts on community discourse. It turns out that careful engagement requires actually understanding what other researchers are doing, while provocative dismissal lets you write spicy posts from your armchair. Lo and behold, John endorses provocative dismissal as his Official Blogging Strategy, and continues writing critiques. (Actually the version which eventually emerges isn't even fully researched, it's a quite different version which just-so-happens to be even more dismissive of work he hasn't closely followed.)

Glad to hear you enjoyed ILIAD [LW · GW].

Best,

AGO

comment by Ryan Kidd (ryankidd44) · 2024-12-27T20:09:36.174Z · LW(p) · GW(p)

Alice is excited about the eliciting latent knowledge (ELK) doc, and spends a few months working on it. Bob is excited about debate, and spends a few months working on it. At the end of those few months, Alice has a much better understanding of how and why ELK is hard, has correctly realized that she has no traction on it at all, and pivots to working on technical governance. Bob, meanwhile, has some toy but tangible outputs, and feels like he's making progress.

I don't want to respond to the examples rather than the underlying argument, but it seems necessary here: this seems like a massively overconfident claim about ELK and debate that I don't believe is justified by popular theoretical worst-case objections. I think a common cause of "worlds where iterative design fails" is "iterative design seems hard and we stopped at the first apparent hurdle." Sure, in some worlds we can rule out entire classes of solutions via strong theoretical arguments (e.g., "no-go theorems"); but that is not the case here. If I felt confident that the theory-level objections to ELK [LW · GW] and debate [LW · GW] ruled out hodge-podge solutions [LW · GW], I would abandon hope in these research directions and drop them from the portfolio. But this "flinching away" would ensure that genuine progress on these thorny problems never happened. If Stephen Casper, et al. treated ELK as unsolvable, they would never have published "Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs". If Akbir Khan, et al. treated debate as unsolvable, they would never have published "Debating with More Persuasive LLMs Leads to More Truthful Answers." I consider these papers genuine progress towards hodge-podge alignment, which I consider a viable strategy towards "alignment MVPs." Many more such cases.

A crucial part of my worldview is that good science does not have to "begin with the entire end in mind." Exploratory research pushing the boundaries of the current paradigm might sometimes seem contraindicated by theoretical arguments, but achieve empirical results anyways. The theory-practise gap [LW · GW] can cut both ways: sometimes sound-seeming theoretical arguments fail to predict tractable empirical solutions. Just because travelling salesman is NP-hard doesn't mean that a P-time algorithm for a restricted subclass of the problem doesn't exist! I do not feel sufficiently confident in the popular criticisms of dominant prosaic AI safety strategies to rule them out; far from it. I want new researchers to download these criticisms, as there is valuable insight here, but "touch reality [AF · GW]" anyways, rather than flinch away from the messy work of iteratively probing and steering vast, alien matrices.

Maybe research directions like ELK and debate sizzle out because the theoretical objections hold up in practise. Maybe they sizzle out for unrelated reasons, such as weird features of neural nets or just because we don't have time for the requisite empirical tinkering. But we would never find out unless we tried! And we would never equip a generation of empirical researchers to iterate on candidate solutions until something breaks. It's this skill that I think is most lacking in alignment: turning theoretical builder-breaker moves into empirical experiments on frontier LLMs (the apparent most likely substrate of first-gen AGI) that move us iteratively closer to a sufficiently scalable hodge-podge alignment MVP.

Replies from: ryankidd44, sharmake-farah

↑ comment by Ryan Kidd (ryankidd44) · 2024-12-27T20:26:32.313Z · LW(p) · GW(p)

Some caveats:

A crucial part of the "hodge-podge alignment feedback loop" is "propose new candidate solutions, often grounded in theoretical models." I don't want to entirely focus on empirically fleshing out existing research directions to the exclusion of proposing new candidate directions. However, it seems that, often, new on-paradigm research directions emerge in the process of iterating on old ones!
"Playing theoretical builder-breaker" is an important skill and I think this should be taught more widely. "Iterators [LW · GW]," as I conceive of them, are capable of playing this game well, in addition to empirically testing these theoretical insights against reality. John, to his credit, did a great job of emphasizing the importance of this skill with his MATS workshops on the alignment game tree [LW · GW] and similar.
I don't want to entirely trust in alignments MVPs, so I strongly support empirical research that aims to show the failure modes of this meta-strategy. I additionally support the creation of novel strategic paradigms, though I think this is quite hard. IMO, our best paradigm-level insights as a field largely come from interdisciplinary knowledge transfer (e.g., from economics, game theory, evolutionary biology, physics), not raw-g ideas from the best physics postdocs. Though I wouldn't turn away a chance to create more von Neumann's, of course!

↑ comment by Noosphere89 (sharmake-farah) · 2024-12-27T20:44:53.540Z · LW(p) · GW(p)

Yeah, the worst-case ELK problem could well have no solution, but in practice alignment is solvable either by other methods or by having an ELK solution that does work on a large classes of AIs like neural nets, so Alice is plausibly making a big mistake, and a crux here is that I don't believe we will ever get no-go theorems, or even arguments to the standard level of rigor in physics because I believe alignment has pretty lax constraints, so a lot of solutions can appear.

The relevant sentence below:

Sure, in some worlds we can rule out entire classes of solutions via strong theoretical arguments (e.g., "no-go theorems"); but that is not the case here.

comment by Tahp · 2024-12-27T01:24:22.759Z · LW(p) · GW(p)

I am a physics PhD student. I study field theory. I have a list of projects I've thrown myself at with inadequate technical background (to start with) and figured out. I've convinced a bunch of people at a research institute that they should keep giving me money to solve physics problems. I've been following LessWrong with interest for years. I think that AI is going to kill us all, and would prefer to live for longer if I can pull it off. So what do I do to see if I have anything to contribute to alignment research? Maybe I'm flattering myself here, but I sound like I might be a person of interest for people who care about the pipeline. I don't feel like a great candidate because I don't have any concrete ideas for AI research topics to chase down, but it sure seems like I might start having ideas if I worked on the problem with somebody for a bit. I'm apparently very ok with being an underpaid gopher to someone with grand theoretical ambitions while I learn the material necessary to come up with my own ideas. My only lead to go on is "go look for something interesting in MATS and apply to it" but that sounds like a great way to end up doing streetlight research because I don't understand the field. Ideally, I guess I would have whatever spark makes people dive into technical research in a pretty low-status field for no money for long enough to produce good enough research which convinces people to pay their rent while they keep doing more, but apparently the field can't find enough of those that it's unwilling to look for other options.

I know what to do to keep doing physics research. My TA assignment effectively means that I have a part-time job teaching teenagers how to use Newton's laws so I can spend twenty or thirty hours a week coding up quark models. I did well on a bunch of exams to convince an institution that I am capable of the technical work required to do research (and, to be fair, I provide them with 15 hours per week of below-market-rate intellectual labor which they can leverage into tuition that more than pays my salary), so now I have a lot of flexibility to just drift around learning about physics I find interesting while they pay my rent. If someone else is willing to throw 30,000 dollars per year at me to think deeply about AI and get nowhere instead of thinking deeply about field theory to get nowhere, I am not aware of them. Obviously the incentives are perverse to just go around throwing money at people who might be good at AI research, so I'm not surprised that I've only found one potential money spigot for AI research, but I had so many to choose from for physics.

Replies from: DusanDNesic, Buck, ryankidd44, lahwran, william-brewer, ete, Linda Linsefors, nc

↑ comment by DusanDNesic · 2024-12-27T10:04:03.863Z · LW(p) · GW(p)

It sounds like you should apply for the PIBBSS Fellowship! (https://pibbss.ai/fellowship/)

↑ comment by Buck · 2024-12-27T06:56:49.438Z · LW(p) · GW(p)

Going to MATS is also an opportunity to learn a lot more about the space of AI safety research, e.g. considering the arguments for different research directions and learning about different opportunities to contribute. Even if the "streetlight research" project you do is kind of useless (entirely possible), doing MATS is plausibly a pretty good option.

Replies from: TsviBT

↑ comment by TsviBT · 2024-12-27T11:21:02.604Z · LW(p) · GW(p)

MATS will push you to streetlight much more unless you have some special ability to have it not do that.

Replies from: Buck, quinn-dougherty

↑ comment by Buck · 2024-12-27T15:43:30.476Z · LW(p) · GW(p)

Do you mean during the program? Sure, maybe the only MATS offers you can get are for projects you think aren't useful--I think some MATS projects are pretty useless (e.g. our dear OP's). But it's still an opportunity to argue with other people about the problems in the field and see whether anyone has good justifications for their prioritization. And you can stop doing the streetlight stuff afterwards if you want to.

Remember that the top-level commenter here is currently a physicist, so it's not like the usefulness of their work would be going down by doing a useless MATS project :P

Replies from: TsviBT

↑ comment by TsviBT · 2024-12-27T16:14:51.450Z · LW(p) · GW(p)

Remember that the top-level commenter here is currently a physicist, so it's not like the usefulness of their work would be going down by doing a useless MATS project :P

Yes it would! It would eat up motivation and energy and hope that they could have put towards actual research. And it would put them in a social context where they are pressured to orient themselves toward streetlighty research--not just during the program, but also afterward. Unless they have some special ability to have it not do that.

Without MATS: not currently doing anything directly useful (though maybe indirectly useful, e.g. gaining problem-solving skill). Could, if given $30k/year, start doing real AGI alignment thinking from scratch not from scratch, thereby scratching their "will you think in a way that unlocks understanding of strong minds" lottery ticket that each person gets.

With MATS: gotta apply to extension, write my LTFF grant. Which org should I apply to? Should I do linear probes software engineering? Or evals? Red teaming? CoT? Constitution? Hyperparamter gippity? Honeypot? Scaling supervision? Superalign, better than regular align? Detecting deception?

Replies from: ryankidd44

↑ comment by Ryan Kidd (ryankidd44) · 2024-12-27T19:25:54.746Z · LW(p) · GW(p)

Obviously I disagree with Tsvi regarding the value of MATS to the proto-alignment researcher; I think being exposed to high quality mentorship and peer-sourced red-teaming of your research ideas is incredibly valuable for emerging researchers. However, he makes a good point: ideally, scholars shouldn't feel pushed to write highly competitive LTFF grant applications so soon into their research careers; there should be longer-term unconditional funding opportunities. I would love to unlock this so that a subset of scholars can explore diverse research directions for 1-2 years without 6-month grant timelines looming over them. Currently cooking something in this space.

↑ comment by Quinn (quinn-dougherty) · 2025-01-04T16:14:59.828Z · LW(p) · GW(p)

The upvotes and agree votes on this comment updated my perception of the rough consensus about mats and streetlighting. I previously would have expected less people to evaluate mats that way

↑ comment by Ryan Kidd (ryankidd44) · 2024-12-27T20:57:58.411Z · LW(p) · GW(p)

You could consider doing MATS as "I don't know what to do, so I'll try my hand at something a decent number of apparent experts consider worthwhile and meanwhile bootstrap a deep understanding of this subfield and a shallow understanding of a dozen other subfields pursued by my peers." This seems like a common MATS experience and I think this is a good thing.

↑ comment by the gears to ascension (lahwran) · 2024-12-27T11:49:11.435Z · LW(p) · GW(p)

The first step would probably be to avoid letting the existing field influence you too much. Instead, consider from scratch what the problems of minds and AI are, how they relate to reality and to other problems, and try to grab them with intellectual tools you're familiar with. Talk to other physicists and try to get into exploratory conversation that does not rely on existing knowledge. If you look at the existing field, look at it like you're studying aliens anthropologically.

↑ comment by yams (william-brewer) · 2024-12-27T11:37:12.871Z · LW(p) · GW(p)

[was a manager at MATS until recently and want to flesh out the thing Buck said a bit more]

It’s common for researchers to switch subfields, and extremely common for MATS scholars to get work doing something different from what they did at MATS. (Kosoy has had scholars go on to ARC, Neel scholars have ended up in scalable oversight, Evan’s scholars have a massive spread in their trajectories; there are many more examples but it’s 3 AM.)

Also I wouldn’t advise applying to something that seems interesting; I’d advise applying for literally everything (unless you know for sure you don’t want to work with Neel, since his app is very time intensive). The acceptance rate is ~4 percent, so better to maximize your odds (again, for most scholars, the bulk of the value is not in their specific research output over the 10 week period, but in having the experience at all).

Also please see Ryan’s replies to Tsvi on the talent needs report for more notes on the street lighting concern as it pertains to MATS. There’s a pretty big back and forth there (I don’t cleanly agree with one side or the other, but it might be useful to you).

↑ comment by plex (ete) · 2024-12-27T22:12:43.773Z · LW(p) · GW(p)

If you're mobile (able to be in the UK) and willing to try a different lifestyle, consider going to the EA hotel aka CEEALAR, they offer free food and accomodation for a bunch of people, including many people working on AI safety. Alternatively, taking a quick look at https://www.aisafety.com/funders, the current best options are maybe LTFF, OpenPhil, CLR, or maybe AE Studios?

↑ comment by Linda Linsefors · 2024-12-29T17:31:06.351Z · LW(p) · GW(p)

Is it an option to keep your current job but but spend your research hours on AI Safety instead of quarks? Is this something that would appealing to you + acceptable to your employer?

Given the current AI safety funding situation, I would strongly reccomend not giving up your current income.

I think that a lot of the pressure towards street light research comes from the funding situation. The grants are short and to stay in the game you need visible results quickly.

I think MATS could be good, if you can treat it as exploration, but not so good if you're in a hurry to get a job or a grant directly afterwards. Since MATS is 3 months of full time, it might not fit into your schedule (without quitting your job). Maybe instead try SPAR. Or see here for more options.

Or you can skip the training program route, and just start reading on your own. There's lots and lots of AI safety reding lists. I reccomend this one for you. @Lucius Bushnaq [LW · GW] who created and maintains it, also did quark physics, before switching to AI Safety. But if you don't like it, there are more options here under "Self studies".

In general, the funding situation in AI safety is pretty shit right now, but other than that, there are so many resources to help people get started. It's just a matter of choosing where to start.

Replies from: Lblack, Tahp

↑ comment by Lucius Bushnaq (Lblack) · 2024-12-29T17:59:35.741Z · LW(p) · GW(p)

(I have not maintained this list in many months, sorry.)

Replies from: Linda Linsefors

↑ comment by Linda Linsefors · 2024-12-29T18:04:23.412Z · LW(p) · GW(p)

That's still pretty good. Most reading lists are not updated at all after publication.

↑ comment by Tahp · 2024-12-29T17:44:04.531Z · LW(p) · GW(p)

My current job is only offered to me on the condition that I am doing physics research. I have some flexibility to do other things at the same time though. The insights and resources you list seem useful to me, so thank you.

Replies from: Linda Linsefors

↑ comment by Linda Linsefors · 2024-12-29T18:06:59.803Z · LW(p) · GW(p)

Ok, in that case I want to give you this post as inspiration.
Changing the world through slack & hobbies — LessWrong [LW · GW]

↑ comment by cdt (nc) · 2024-12-27T22:31:15.893Z · LW(p) · GW(p)

I am surprised that you find theoretical physics research less tight funding-wise than AI alignment [is this because the paths to funding in physics are well-worn, rather than better resourced?].

This whole post was a little discouraging. I hope that the research community can find a way forward.

comment by Ryan Kidd (ryankidd44) · 2024-12-28T03:10:43.516Z · LW(p) · GW(p)

On my understanding, EA student clubs at colleges/universities have been the main “top of funnel” for pulling people into alignment work during the past few years. The mix people going into those clubs is disproportionately STEM-focused undergrads, and looks pretty typical for STEM-focused undergrads. We’re talking about pretty standard STEM majors from pretty standard schools, neither the very high end nor the very low end of the skill spectrum.

At least from the MATS perspective, this seems quite wrong. Only ~20% of MATS scholars in the last ~4 programs have been undergrads. In the most recent application round, the dominant sources of applicants were, in order, personal recommendations, X posts, AI Safety Fundamentals courses, LessWrong, 80,000 Hours, then AI safety student groups. About half of accepted scholars tend to be students and the other half are working professionals.

comment by Thomas Kwa (thomas-kwa) · 2024-12-29T12:31:40.009Z · LW(p) · GW(p)

TLDR:

What OP calls "streetlighting", I call an efficient way to prioritize problems by tractability. This is only a problem insofar as we cannot also prioritize by relevance.
I think problematic streetlighting is largely due to incentives, not because people are not smart / technically skilled enough. Therefore solutions should fix incentives rather than just recruiting smarter people.

First, let me establish that theorists very often disagree on what the hard parts of the alignment problem are, precisely because not enough theoretical and empirical progress has been made to generate agreement on them. All the lists of "core hard problems" OP lists are different, and Paul Christiano famously wrote a 27-point list of disagreements [LW · GW] on Eliezer's. This means that most people's views of the problem are wrong, and should they stick to their guns they might perseverate on either an irrelevant problem or a doomed approach.

I'd guess that historically perseveration has been an equally large problem as streetlighting among alignment researchers. Think of all the top alignment researchers in 2018 and all the agendas that haven't seen much progress. Decision theory should probably not take ~30% of researcher time like it did back in the day.^[1]

In fact, failure is especially likely for people who are trying to tackle "core hard problems" head-on, and not due to lack of intelligence. Many "core hard problems" are observations of lack of structure, or observations of what might happen in extreme generality e.g. Eliezer's

"We've got no idea what's actually going on inside the giant inscrutable matrices and tensors of floating-point numbers."
(summarized) "Outer optimization doesn't in general produce aligned inner goals", or
"Human beings cannot inspect an AGI's output to determine whether the consequences will be good."

which I will note are completely different type signature from subproblems that people can actually tractably research. Sometimes we fail to define a tractable line of attack. Other times these ill-defined problems get turned into entire subfields of alignment, like interpretability, which are filled with dozens of blind alleys of irrelevance that extremely smart people frequently fall victim to. For comparison, some examples of problems ML and math researchers can actually work on:

Unlearning: Develop a method for post-hoc editing a model, to make it as if it were never trained on certain data points
Causal inference: Develop methods for estimating the causation graph between events given various observational data.
Fermat's last theorem: Prove whether there are integer solutions to aⁿ + bⁿ = cⁿ.

The unit of progress is therefore not "core hard problems" directly, but methods that solve well-defined problems and will also be useful in scalable alignment plans. We must try to understand the problem and update our research directions as we go. Everyone has to pivot because the exact path you expected to solve a problem basically never works. But we have to update on tractability as well as relevance! For example, Redwood (IMO correctly) pivoted away from interp because other plans seemed viable (relevance) and it seemed too hard to explain enough AI cognition through interpretability to solve alignment (tractability).^[2]

OP seems to think flinching away from hard problems is usually cope / not being smart enough. But OP's list of types of cope are completely valid as either fundamental problem-solving strategies or prioritization. (4 is an incentives problem, which I'll come back to later)

Carol explicitly introduces some assumption simplifying the problem, and claims that without the assumption the problem is impossible. [...]
Carol explicitly says that she's not trying to solve the full problem, but hopefully the easier version will make useful marginal progress.
Carol explicitly says that her work on easier problems is only intended to help with near-term AI, and hopefully those AIs will be able to solve the harder problems.

1 and 2 are fundamental problem-solving techniques. 1 is a crucial part of Polya's step 1: understand the problem, and 2 is a core technique for actually solving the problem. I don't like relying on 3 as stated, but there are many valid reasons for focusing on near-term AI^[3].

Now I do think there is lots of distortion of research in unhelpful directions related to (1, 2, 3), often due to publication incentives.^[4] But understanding the problem and solving easier versions of it has a great track record in complicated engineering; you just have to solve the hard version eventually (assuming we don't get lucky with alignment being easy, which is very possible but we shouldn't plan for).

So to summarize my thoughts:

Streetlighting is real, but much of what OP calls streetlighting is a justified focus on tractability.
We can only solve "core hard problems" by creating tractable well-defined problems
OP's suggested solution-- higher intelligence and technical knowledge-- doesn't seem to fit the problem.
- There are dozens of ML PhDs, physics PhDs, and comparably smart people working on alignment. As Ryan Kidd pointed out, the stereotypical MATS student is now a physics PhD or technical professional. And presumably according to OP, most people are still streetlighting.
- Technically skilled people seem equally susceptible to incentives-driven streetlighting, as well as perseveration.
- If the incentives continue to be wrong, people who defy them might be punished anyway.
Instead, we should fix incentives, maybe like this:
- Invest in making "core hard problems" easier to study
- Reward people who have alignment plans that at least try to scale to superintelligence
- Reward people who think about whether others' work will be helpful with superintelligence
- Develop things like alignment workshops, so people have a venue to publish genuine progress that is not legible to conferences
- Pay researchers with illegible results more to compensate for their lack of publication / social rewards

^{^}
MIRI's focus on decision theory is itself somewhat due to streetlighting. As I understand, 2012ish MIRI leadership's worldview was that several problems had to be solved for AI to go well, but the one they could best hire researchers for was decision theory, so they did lots of that. Also someone please correct me on the 30% of researcher time claim if I'm wrong.
^{^}
OP's research is not immune to this. My sense is that selection theorems [LW · GW] would have worked out if there had been more and better results.
^{^}
e.g. if deploying on near-term AI will yield empirical feedback needed to stay on track, significant risk comes from near-term AI, near-term AI will be used in scalable oversight schemes, ...
^{^}
As I see it, there is lots of distortion by the publishing process now that lots of work is being published. Alignment is complex enough that progress in understanding the problem is a large enough quantity of work to be a paper. But in a paper, it's very common to exaggerate one's work, especially the validity of the assumptions^[5], and people need to see through this for the field to function smoothly.
^{^}
I am probably guilty of this myself, though I try to honestly communicate my feelings about the assumptions in a long limitations section

comment by Stephen Fowler (LosPolloFowler) · 2024-12-27T03:05:54.337Z · LW(p) · GW(p)

Robin Hanson recently wrote about two dynamics that can emerge among individuals within an organisations when working as a group to reach decisions. These are the "outcome game" and the "consensus game."

In the outcome game, individuals aim to be seen as advocating for decisions that are later proven correct. In contrast, the consensus game focuses on advocating for decisions that are most immediately popular within the organization. When most participants play the consensus game, the quality of decision-making suffers.

The incentive structure within an organization influences which game people play. When feedback on decisions is immediate and substantial, individuals are more likely to engage in the outcome game. Hanson argues that capitalism's key strength is its ability to make outcome games more relevant.

However, if an organization is insulated from the consequences of its decisions or feedback is delayed, playing the consensus game becomes the best strategy for gaining resources and influence.

This dynamic is particularly relevant in the field of (existential) AI Safety, which needs to develop strategies to mitigate risks before AGI is developed. Currently, we have zero concrete feedback about which strategies can effectively align complex systems of equal or greater intelligence to humans.

As a result, it is unsurprising that most alignment efforts avoid tackling seemingly intractable problems. The incentive structures in the field encourage individuals to play the consensus game instead.

Replies from: TsviBT, stavros

↑ comment by TsviBT · 2024-12-27T11:19:45.171Z · LW(p) · GW(p)

Currently, we have zero concrete feedback about which strategies can effectively align complex systems of equal or greater intelligence to humans.

Actually, I now suspect this is to a significant extent disinformation. You can tell when ideas make sense if you think hard about them. There's plenty of feedback, that's not already being taken advantage of, at the level of "abstract, high-level, philosophy of mind", about the questions of alignment.

Replies from: philh

↑ comment by philh · 2025-01-04T00:54:28.691Z · LW(p) · GW(p)

That's not really "concrete" feedback though, right? In the outcome game/consensus game dynamic Stephen's talking about, it seems hard to play an outcome game with that kind of feedback.

Replies from: TsviBT

↑ comment by TsviBT · 2025-01-04T01:31:00.203Z · LW(p) · GW(p)

I'm not sure what "concrete" is supposed to mean; for the one or two senses I immediately imagine, no, I would say the feedback is indeed concrete. In terms of consensus/outcome, no, I think the feedback is actually concrete. There is a difficulty, which is that there's a much smaller set of people to whom the outcomes are visible.

As an analogy/example: feedback in higher math. It's "nonconcrete" in that it's "just verbal arguments" (and translating those into something much more objective, like a computer proof, is a big separate long undertaking). And there's a much smaller set of people who can tell what statements are true in the domain. There might even be a bunch more people who have opinions, and can say vaguely related things that other non-experts can't distinguish from expert statements, and who therefore form an apparent consensus that's wrong + ungrounded. But one shouldn't conclude from those facts that math is less real, or less truthtracking, or less available for communities to learn about directly.

↑ comment by stavros · 2024-12-27T11:40:04.310Z · LW(p) · GW(p)

Thanks for linking this post. I think it has a nice harmony with Prestige vs Dominance status games.

I agree that this is a dynamic that is strongly shaping AI Safety, but would specify that it's inherited from the non-profit space in general - EA originated with the claim that it could do outcome focused altruism, but.. there's still a lot of room for improvement, and I'm not even sure we're improving.

The underlying dynamics and feedback loops are working against us, and I don't see evidence that core EA funders/orgs are doing more than pay lip service to this problem.

comment by Jozdien · 2024-12-26T19:46:29.508Z · LW(p) · GW(p)

Thank you for writing this post. I'm probably slightly more optimistic than you on some of the streetlighting approaches, but I've also been extremely frustrated that we don't have anything better, when we could.

That means I'm going to have to speculate about how lots of researchers are being stupid internally, when those researchers themselves would probably say that they are not being stupid at all and I'm being totally unfair.

I've seen discussions from people who I vehemently disagreed that did similar things and felt very frustrated by not being able to defend my views with greater bandwidth. This isn't a criticism of this post - I think a non-zero number of those are plausibly good - but: I'd be happy to talk at length with anyone who feels like this post is unfair to them, about our respective views. I likely can't do as good a job as John can (not least because our models aren't identical), but I probably have more energy for talking to alignment researchers^[1].

On my understanding, EA student clubs at colleges/universities have been the main “top of funnel” for pulling people into alignment work during the past few years. The mix people going into those clubs is disproportionately STEM-focused undergrads, and looks pretty typical for STEM-focused undergrads. We’re talking about pretty standard STEM majors from pretty standard schools, neither the very high end nor the very low end of the skill spectrum.
... and that's just not a high enough skill level for people to look at the core hard problems of alignment and see footholds.
Who To Recruit Instead
We do not need pretty standard STEM-focused undergrads from pretty standard schools. In practice, the level of smarts and technical knowledge needed to gain any traction on the core hard problems seems to be roughly "physics postdoc". Obviously that doesn't mean we exclusively want physics postdocs - I personally have only an undergrad degree, though amusingly a list of stuff I studied has been called "uncannily similar to a recommendations to readers to roll up their own doctorate program" [LW(p) · GW(p)]. Point is, it's the rough level of smarts and skills which matters, not the sheepskin. (And no, a doctorate degree in almost any other technical field, including ML these days, does not convey a comparable level of general technical skill to a physics PhD.)

I disagree on two counts. I think people simply not thinking about what it would take to make superintelligent AI go well is a much, much bigger and more common cause for failure than the others, including flinching away. Getting traction on hard problems would solve the problem only if there weren't even easier-traction (or more interesting) problems that don't help. Very anecdotally, I've talked to some extremely smart people who I would guess are very good at making progress on hard problems, but just didn't think too hard about what solutions help.

I think the skills to do that may be correlated with physics PhDs, but more weakly. I don't think recruiting smart undergrads was a big mistake for that reason. Then again, I only have weak guesses as to what things you should actually select for such that you get people with these skills - there's still definitely failure modes like people who find the hard problems, and aren't very good at making traction on them (or people who overshoot on finding the hard problem and work on something nebuluously hard).

My guess would be that a larger source of "what went wrong" follows from incentives like "labs doing very prosaic alignment / interpretability / engineering-heavy work" -> "selecting for people who are very good engineers or the like" -> "selects for people who can make immediate progress on hand-made problems without having to spend a lot of time thinking about what broad directions to work on or where locally interesting research problems are not-great for superintelligent AI".

^{^}
In the past I've done this much more adversarially than I'd have liked, so if you're someone who was annoyed at having such a conversation with me before - I promise I'm trying to be better about that.

Replies from: quinn-dougherty, quinn-dougherty

↑ comment by Quinn (quinn-dougherty) · 2025-01-04T16:27:46.243Z · LW(p) · GW(p)

Very anecdotally, I've talked to some extremely smart people who I would guess are very good at making progress on hard problems, but just didn't think too hard about what solutions help.

A few of the dopest people i know, who id love to have on the team, fall roughly into the category of "engaged and little with lesswrong, grok the core pset better than most 'highly involved' people, but are working on something irrelevant and not even trying cuz they think it seems too hard". They have some thoughtful p(doom), but assume they're powerless.

↑ comment by Quinn (quinn-dougherty) · 2025-01-04T16:24:23.949Z · LW(p) · GW(p)

Richard ngo tweeted recently that it was a mistake to design the agi safety fundamentals curriculum to be broadly accessible, that if he could do it over again thered be punishing problem sets that alienate most people

Replies from: Will Aldred

↑ comment by _will_ (Will Aldred) · 2025-01-04T16:50:17.285Z · LW(p) · GW(p)

Any chance you have a link to this tweet? (I just tried control+f'ing through @Richard [LW · GW]'s tweets over the past 5 months, but couldn't find it.)

Replies from: ricraz, Jozdien, quinn-dougherty

↑ comment by Richard_Ngo (ricraz) · 2025-01-05T02:57:04.389Z · LW(p) · GW(p)

FWIW twitter search is ridiculously bad, it's often better to use google instead. In this case I had it as the second result when I googled "richardmcngo twitter safety fundamentals" (richardmcngo being my twitter handle).

↑ comment by Jozdien · 2025-01-04T22:43:08.045Z · LW(p) · GW(p)

I believe this is the tweet.

↑ comment by Quinn (quinn-dougherty) · 2025-01-04T17:57:53.053Z · LW(p) · GW(p)

I tried a little myself too. Hope I didn't misremembering.

comment by Stephen McAleese (stephen-mcaleese) · 2024-12-27T12:15:10.395Z · LW(p) · GW(p)

Here's a Facebook post by Yann LeCun from 2017 which has a similar message to this post and seems quite insightful:

My take on Ali Rahimi's "Test of Time" award talk at NIPS.
Ali gave an entertaining and well-delivered talk. But I fundamentally disagree with the message.
The main message was, in essence, that the current practice in machine learning is akin to "alchemy" (his word).
It's insulting, yes. But never mind that: It's wrong!
Ali complained about the lack of (theoretical) understanding of many methods that are currently used in ML, particularly in deep learning.
Understanding (theoretical or otherwise) is a good thing. It's the very purpose of many of us in the NIPS community.
But another important goal is inventing new methods, new techniques, and yes, new tricks.
In the history of science and technology, the engineering artifacts have almost always preceded the theoretical understanding: the lens and the telescope preceded optics theory, the steam engine preceded thermodynamics, the airplane preceded flight aerodynamics, radio and data communication preceded information theory, the computer preceded computer science.
Why? Because theorists will spontaneously study "simple" phenomena, and will not be enticed to study a complex one until there a practical importance to it.
Criticizing an entire community (and an incredibly successful one at that) for practicing "alchemy", simply because our current theoretical tools haven't caught up with our practice is dangerous.
Why dangerous? It's exactly this kind of attitude that lead the ML community to abandon neural nets for over 10 years, *despite* ample empirical evidence that they worked very well in many situations.
Neural nets, with their non-convex loss functions, had no guarantees of convergence (though they did work in practice then, just as they do now). So people threw the baby with the bath water and focused on "provable" convex methods or glorified template matching methods (or even 1957-style random feature methods).
Sticking to a set of methods just because you can do theory about it, while ignoring a set of methods that empirically work better just because you don't (yet) understand them theoretically is akin to looking for your lost car keys under the street light knowing you lost them someplace else.
Yes, we need better understanding of our methods. But the correct attitude is to attempt to fix the situation, not to insult a whole community for not having succeeded in fixing it yet. This is like criticizing James Watt for not being Carnot or Helmholtz.
I have organized and participated in numerous workshops that bring together deep learners and theoreticians, many of them hosted at IPAM. As a member of the scientific advisory board of IPAM, I have seen it as one of my missions to bring deep learning to the attention of the mathematics community. In fact, I'm co-organizer of such a workshop at IPAM in February 2018 ( http://www.ipam.ucla.edu/.../new-deep-learning-techniques/ ).
Ali: if you are not happy with our understanding of the methods you use everyday, fix it: work on the theory of deep learning, instead of complaining that others don't do it, and instead of suggesting that the Good Old NIPS world was a better place when it used only "theoretically correct" methods. It wasn't.

He describes how engineering artifacts often precede theoretical understanding and that deep learning worked empirically for a long time before we began to understand it theoretically. He says that researchers ignored deep learning because it didn't fit into their existing models of how learning should work.

I think the high-level lesson from the Facebook post is that street-lighting occurs when we try to force reality to be understood in terms of our existing models of how it should work (incorrect models like phlogiston are common in the history of science). Though this LessWrong post argues that street-lighting occurs when researchers have a bias towards working on easier problems.

Instead a better approach is to allow reality and evidence to dictate how we create our models of the world even if those more correct models are more complex or require major departures from existing models (which creates a temptation to 'flinch away'). I think a prime example of this is quantum mechanics: my understanding of how it was developed was that physicists noticed bizarre results from experiments like the double-split experiment and developed new theories (e.g. wave-particle duality) that described reality well even if they were counterintuitive or novel.

I guess the modern equivalent that's relevant to AI alignment would be Singular Learning Theory which proposes a novel theory to explain how deep learning generalizes.

Replies from: Linda Linsefors

↑ comment by Linda Linsefors · 2024-12-29T19:09:44.576Z · LW(p) · GW(p)

I guess the modern equivalent that's relevant to AI alignment would be Singular Learning Theory which proposes a novel theory to explain how deep learning generalizes.

I think Singular Learning Theory was developed independently of deep learning, and is not specifically about deep learning. It's about any learning system, under some assumptions, which are more general than the assumptions for normal Learning Theory. This is why you can use SLT but not normal Learning Theory when analysing NNs. NNs break the assumptions for normal Learning Theory but not for SLT.

comment by Nate Showell · 2024-12-26T23:20:01.330Z · LW(p) · GW(p)

My view of the development of the field of AI alignment is pretty much the exact opposite of yours: theoretical agent foundations research, what you describe as research on the hard parts of the alignment problem, is a castle in the clouds. Only when alignment researchers started experimenting with real-world machine learning models did AI alignment become grounded in reality. The biggest epistemic failure in the history of the AI alignment community was waiting too long to make this transition.

Early arguments for the possibility of AI existential risk (as seen, for example, in the Sequences) were largely based on 1) rough analogies, especially to evolution, and 2) simplifying assumptions about the structure and properties of AGI. For example, agent foundations research sometimes assumes that AGI has infinite compute or that it has a strict boundary between its internal decision processes and the outside world.

As neural networks started to see increasing success at a wide variety of problems in the mid-2010s, it started to become apparent that the analogies and assumptions behind early AI x-risk cases didn't apply to them. The process of developing an ML model isn't very similar to evolution. Neural networks use finite amounts of compute, have internals that can be probed and manipulated, and behave in ways that can't be rounded off to decision theory. On top of that, it became increasingly clear as the deep learning revolution progressed that even if agent foundations research did deliver accurate theoretical results, there was no way to put them into practice.

But many AI alignment researchers stuck with the agent foundations approach for a long time after their predictions about the structure and behavior of AI failed to come true. Indeed, the late-2000s AI x-risk arguments still get repeated sometimes, like in List of Lethalities. It's telling that the OP uses worst-case ELK as an example of one of the hard parts of the alignment problem; the framing of the worst-case ELK problem doesn't make any attempt to ground the problem in the properties of any AI system that could plausibly exist in the real world, and instead explicitly rejects any such grounding as not being truly worst-case.

Why have ungrounded agent foundations assumptions stuck around for so long? There are a couple factors that are likely at work:

Agent foundations nerd-snipes people. Theoretical agent foundations is fun to speculate about, especially for newcomers or casual followers of the field, in a way that experimental AI alignment isn't. There's much more drudgery involved in running an experiment. This is why I, personally, took longer than I should have to abandon the agent foundations approach.
Game-theoretic arguments are what motivated many researchers to take the AI alignment problem seriously in the first place. The sunk cost fallacy then comes into play: if you stop believing that game-theoretic arguments for AI x-risk are accurate, you might conclude that all the time you spent researching AI alignment was wasted.

Rather than being an instance of the streetlight effect, the shift to experimental research on AI alignment was an appropriate response to developments in the field of AI as it left the GOFAI era. AI alignment research is now much more grounded in the real world than it was in the early 2010s.

Replies from: habryka4, habryka4, rhollerith_dot_com

↑ comment by habryka (habryka4) · 2024-12-28T20:54:33.908Z · LW(p) · GW(p)

Given that you speak with such great confidence that historical arguments for AI X-risk were not grounded, can you give me any "grounded" predictions about what superintelligent systems will do? (which I think we both agree is ultimately what will determine the fate of the world and universe)

If you make some concrete predictions then we can start arguing about the validity, but I find this kind of "mightier than thou" attitude where people keep making ill-defined statements like "these things are theoretical and don't apply", but without actually providing any answers to the crucial questions.

Indeed, not only that, I am confident that if you were to try to predict what will happen with superintelligence, you would very quickly start drawing on the obvious analogies to optimizers and dutch book arguments and evolution and goodhearts law, because we really don't have anything better.

Replies from: Nate Showell

↑ comment by Nate Showell · 2024-12-28T22:28:28.430Z · LW(p) · GW(p)

Some concrete predictions:

The behavior of the ASI will be a collection of heuristics that are activated in different contexts.
The ASI's software will not have any component that can be singled out as the utility function, although it may have a component that sets a reinforcement schedule.
The ASI will not wirehead.
The ASI's world-model won't have a single unambiguous self-versus-world boundary. The situational awareness of the ASI will have more in common with that of an advanced meditator than it does with that of an idealized game-theoretic agent.

Replies from: habryka4, leon-lang

↑ comment by habryka (habryka4) · 2024-12-28T22:35:18.876Z · LW(p) · GW(p)

I... am not very impressed by these predictions.

First, I don't think these are controversial predictions on LW (yes, a few people might disagree with him, but there is little boldness or disagreement with widely held beliefs in here), but most importantly, these predictions aren't about anything I care about. I don't care whether the world-model will have a single unambiguous self-versus-world boundary, I care whether the system is likely to convert the solar system into some form of computronium, or launch Dyson probes, or eliminate all potential threats and enemies, or whether the system will try to subvert attempts at controlling it, or whether it will try to amass large amounts of resources to achieve its aims, or be capable of causing large controlled effects via small information channels, or is capable of discovering new technologies with great offensive power.

The only bold prediction here is maybe "the behavior of the ASI will be a collection of heuristics", and indeed would take a bet against this. Systems under reflection and extensive self-improvement stop being well-described by contextual heuristics, and it's likely ASI will both self-reflect and self-improve (as we are trying really hard to cause both to happen). Indeed, I already wouldn't particularly describe Claude as a collection of contextual heuristics, there is really quite a lot of consistent personality in there (which of course, you can break with jailbreaks and stuff, but clearly the system is a lot less contextual than base models, and it seems like you are predicting a reversal of that trend?).

Replies from: Signer

↑ comment by Signer · 2025-01-05T17:42:33.409Z · LW(p) · GW(p)

clearly the system is a lot less contextual than base models, and it seems like you are predicting a reversal of that trend?

The trend may be bounded, the trend may not go far by the time AI can invent nanotechnology - would be great if someone actually measured such things.

And there being a trend at all is not predicted by utility-maximization frame, right?

↑ comment by Leon Lang (leon-lang) · 2024-12-29T00:38:57.331Z · LW(p) · GW(p)

“heuristics activated in different contexts” is a very broad prediction. If “heuristics” include reasoning heuristics, then this probably includes highly goal-oriented agents like Hitler.

Also, some heuristics will be more powerful and/or more goal-directed, and those might try to preserve themselves (or sufficiently similar processes) more so than the shallow heuristics. Thus, I think eventually, it is plausible that a superintelligence looks increasingly like a goal-maximizer.

↑ comment by habryka (habryka4) · 2024-12-28T22:46:31.802Z · LW(p) · GW(p)

For example, agent foundations research sometimes assumes that AGI has infinite compute or that it has a strict boundary between its internal decision processes and the outside world.

It's one of the most standard results in ML that neural nets are universal function approximators. In the context of that proof, ML de-facto also assumes that you have infinite computing power. It's just a standard tool in ML, AI or CS to see what models predict when you take them to infinity. Indeed, it's really one of the most standard tools in the modern math toolbox, used by every STEM discipline I can think of.

Similarly, separating the boundary between internal decision processes and the outside world continues to be a standard assumption in ML. It's really hard to avoid, everything gets very loopy and tricky, and yes, we have to deal with that loopiness and trickiness, but if anything, agent foundations people were the actual people trying to figure out how to handle that loopiness and trickiness, whereas the ML community really has done very little to handle it. In contrary to your statement here, people on LW have been for years pointing out how embedded agency is really important, and been dismissed by active practitioners because they think the cartesian boundary here is just fine for "real" and "grounded" applications like "predicting the next token" which clearly don't have relevance to these weird and crazy scenarios about power-seeking AIs developing contextual awareness that you are talking about.

↑ comment by RHollerith (rhollerith_dot_com) · 2024-12-27T00:47:14.639Z · LW(p) · GW(p)

You do realize that by "alignment", the OP (John) is not talking about techniques that prevent an AI that is less generally capable than a capable person from insulting the user or expressing racist sentiments?

We seek a methodology for constructing an AI that either ensures that the AI turns out not to be able to easily outsmart us or (if it does turn out to be able to easily outsmart us) ensures (or makes it unlikely) that it won't kill us all or do something other terrible thing. (The former is not researched much compared to the latter, but I felt the need to include it for completeness.)

The way it is now, it is not even clear whether you and the OP (John) are talking about the same thing (because "alignment" has come to have a broad meaning).

If you want to continue the conversation, it would help to know whether you see a pressing need for a methodology of the type I describe above. (Many AI researchers do not: they think that outcomes like human extinction are quite unlikely or at least easy to avoid.)

comment by Zach Stein-Perlman · 2024-12-27T02:06:51.260Z · LW(p) · GW(p)

This post starts from the observation that streetlighting has mostly won the memetic competition for alignment as a research field, and we'll mostly take that claim as given. Lots of people will disagree with that claim, and convincing them is not a goal of this post.

Yep. This post is not for me but I'll say a thing that annoyed me anyway:

... and Carol's thoughts run into a blank wall. In the first few seconds, she sees no toeholds, not even a starting point. And so she reflexively flinches away from that problem, and turns back to some easier problems.

Does this actually happen? (Even if you want to be maximally cynical, I claim presenting novel important difficulties (e.g. "sensor tampering") or giving novel arguments that problems are difficult is socially rewarded.)

Replies from: TsviBT, Zach Stein-Perlman, Zach Stein-Perlman

↑ comment by TsviBT · 2024-12-27T11:23:40.884Z · LW(p) · GW(p)

Does this actually happen?

Yes, absolutely. Five years ago, people were more honest about it, saying ~explicitly and out loud "ah, the real problems are too difficult; and I must eat and have friends; so I will work on something else, and see if I can get funding on the basis that it's vaguely related to AI and safety".

Replies from: stavros, Zach Stein-Perlman

↑ comment by stavros · 2024-12-27T11:55:58.354Z · LW(p) · GW(p)

To the extent that anecdata is meaningful:

I have met somewhere between 100-200 AI Safety people in the past ~2 years; people for whom AI Safety is their 'main thing'.

The vast majority of them are doing tractable/legible/comfortable things. Most are surprisingly naive; have less awareness of the space than I do (and I'm just a generalist lurker who finds this stuff interesting; not actively working on the problem).

Few are actually staring into the void of the hard problems; where hard here is loosely defined as 'unknown unknowns, here be dragons, where do I even start'.

Fewer still progress from staring into the void to actually trying things.

I think some amount of this is natural and to be expected; I think even in an ideal world we probably still have a similar breakdown - a majority who aren't contributing (yet)^[1], a minority who are - and I think the difference is more in the size of those groups.

I think it's reasonable to aim for a larger, higher quality, minority; I think it's tractable to achieve progress through mindfully shaping the funding landscape.

^{^}
Think it's worth mentioning that all newbies are useless, and not all newbies remain newbies. Some portion of the majority are actually people who will progress to being useful after they've gained experience and wisdom.

Replies from: TsviBT

↑ comment by TsviBT · 2024-12-27T12:00:45.712Z · LW(p) · GW(p)

it's tractable to achieve progress through mindfully shaping the funding landscape

This isn't clear to me, where the crux (though maybe it shouldn't be) is "is it feasible for any substantial funders to distinguish actually-trying research from other".

↑ comment by Zach Stein-Perlman · 2024-12-27T14:55:14.057Z · LW(p) · GW(p)

Yeah, I agree sometimes people decide to work on problems largely because they're tractable [edit: or because they’re good for safety getting alignment research or other good work out of early AGIs]. I'm unconvinced of the flinching away or dishonest characterization.

Replies from: TsviBT

↑ comment by TsviBT · 2024-12-27T15:07:31.585Z · LW(p) · GW(p)

Do you think that funders are aware that >90% [citation needed!] of the money they give to people, to do work described as helping with "how to make world-as-we-know-it ending AGI without it killing everyone", is going to people who don't even themselves seriously claim to be doing research that would plausibly help with that goal? If they are aware of that, why would they do that? If they aren't aware of it, don't you think that it should at least be among your very top hypotheses, that those researchers are behaving materially deceptively, one way or another, call it what you will?

Replies from: Zach Stein-Perlman

↑ comment by Zach Stein-Perlman · 2024-12-27T15:30:42.274Z · LW(p) · GW(p)

I do not.

On the contrary, I think ~all of the "alignment researchers" I know claim to be working on the big problem, and I think ~90% of them are indeed doing work that looks good in terms of the big problem. (Researchers I don't know are likely substantially worse but not a ton.)

In particular I think all of the alignment-orgs-I'm-socially-close-to do work that looks good in terms of the big problem: Redwood, METR, ARC. And I think the other well-known orgs are also good.

This doesn't feel odd: these people are smart and actually care about the big problem; if their work was in the even if this succeeds it obviously wouldn't be helpful category they'd want to know (and, given the "obviously," would figure that out).

Possibly the situation is very different in academia or MATS-land; for now I'm just talking about the people around me.

↑ comment by Zach Stein-Perlman · 2024-12-28T00:02:17.183Z · LW(p) · GW(p)

I feel like John's view entails that he would be able to convince my friends that various-research-agendas-my-friends-like are doomed. (And I'm pretty sure that's false.) I assume John doesn't believe that, and I wonder why he doesn't think his view entails it.

Replies from: johnswentworth

↑ comment by johnswentworth · 2024-12-28T00:13:21.860Z · LW(p) · GW(p)

From the post:

... but crucially, the details of the rationalizations aren't that relevant to this post. Someone who's flinching away from a hard problem will always be able to find some rationalization. Argue them out of one (which is itself difficult), and they'll promptly find another. If we want people to not streetlight, then we need to somehow solve the flinching.

Replies from: Zach Stein-Perlman

↑ comment by Zach Stein-Perlman · 2024-12-28T00:16:26.490Z · LW(p) · GW(p)

Yeah. I agree/concede that you can explain why you can't convince people that their own work is useless. But if you're positing that the flinchers flinch away from valid arguments about each category of useless work, that seems surprising.

Replies from: TsviBT

↑ comment by TsviBT · 2024-12-28T01:26:16.983Z · LW(p) · GW(p)

The flinches aren't structureless particulars. Rather, they involve warping various perceptions. Those warped perceptions generalize a lot, causing other flaws to be hidden.

As a toy example, you could imagine someone attached to the idea of AI boxing. At first they say it's impossible to break out / trick you / know about the world / whatever. Then you convince them otherwise--that the AI can do RSI internally, and superhumanly solve computer hacking / protein folding / persuasion / etc. But they are attached to AI boxing. So they warp their perception, clamping "can an AI be very superhumanly capable" to "no". That clamping causes them to also not see the flaws in the plan "we'll deploy our AIs in a staged manner, see how they behave, and then recall them if they behave poorly", because they don't think RSI is feasible, they don't think extreme persuasion is feasible, etc.

A more real example is, say, people thinking of "structures for decision making", e.g. constitutions. You explain that these things, they are not reflectively stable. And now this person can't understand reflective stability in general, so they don't understand why steering vectors won't work, or why lesioning won't work, etc.

Another real but perhaps more controversial example: {detecting deception, retargeting the search, CoT monitoring, lesioning bad thoughts, basically anything using RL} all fail because creativity starts with illegible concomitants to legible reasoning.

(This post seems to be somewhat illegible, but if anyone wants to see more real examples of aspects of mind that people fail to remember, see https://tsvibt.blogspot.com/2023/03/the-fraught-voyage-of-aligned-novelty.html)

Replies from: johnswentworth

↑ comment by johnswentworth · 2024-12-28T23:09:36.362Z · LW(p) · GW(p)

My impression, from conversations with many people, is that the claim which gets clamped to True is not "this research direction will/can solve alignment" but instead "my research is high value". So when I've explained to someone why their current direction is utterly insufficient, they usually won't deny some class of problems. They'll instead tell me that the research still seems valuable even though it isn't addressing a bottleneck, or that their research is maybe a useful part of a bigger solution which involves many other parts, or that their research is maybe useful step toward something better.

(Though admittedly I usually try to "meet people where they're at", by presenting failure-modes which won't parse as weird to them. If you're just directly explaining e.g. dangers of internal RSI, I can see where people might instead just assume away internal RSI or some such.)

... and then if I were really putting in effort, I'd need to explain that e.g. being a useful part of a bigger solution (which they don't know the details of) is itself a rather difficult design constraint which they have not at all done the work to satisfy. But usually I wrap up the discussion well before that point; I generally expect that at most one big takeaway from a discussion can stick, and if they already have one then I don't want to overdo it.

Replies from: TsviBT

↑ comment by TsviBT · 2024-12-29T01:54:29.392Z · LW(p) · GW(p)

the claim which gets clamped to True is not "this research direction will/can solve alignment" but instead "my research is high value".

This agrees with something like half of my experience.

that their research is maybe a useful part of a bigger solution which involves many other parts, or that their research is maybe useful step toward something better.

Right, I think of this response as arguing that streetlighting is a good way to do large-scale pre-paradigm science projects in general. And I have to somewhat agree with that.

Then I argue that AGI alignment is somewhat exceptional: 1. cruel deadline, 2. requires understanding as-yet-unconceived aspects of Mind. Point 2 of exceptionality goes through things like alienness of creativity, RSI, reflective instability, the fact that we don't understand how values sit in a mind, etc., and that's the part that gets warped away.

I do genuinely think that the 2024 field of AI alignment would eventually solve the real problems via collective iterative streetlighting. (I even think it would eventually solve it in a hypothetical world where all our computers disappeared, if it kept trying.) I just think it'll take a really long time.

being a useful part of a bigger solution (which they don't know the details of) is itself a rather difficult design constraint which they have not at all done the work to satisfy

Right, exactly. (I wrote about this in my ~~opaque gibberish~~ philosophically precise style here: https://tsvibt.blogspot.com/2023/09/a-hermeneutic-net-for-agency.html#1-summary)

↑ comment by Zach Stein-Perlman · 2024-12-27T17:06:01.554Z · LW(p) · GW(p)

I wonder whether John believes that well-liked research, e.g. Fabien's list [LW(p) · GW(p)], is actually not valuable or rare exceptions coming from a small subset of the "alignment research" field.

Replies from: Buck, johnswentworth

↑ comment by Buck · 2024-12-27T18:00:53.252Z · LW(p) · GW(p)

I strongly suspect he thinks most of it is not valuable

↑ comment by johnswentworth · 2024-12-28T23:53:32.846Z · LW(p) · GW(p)

This is the sort of object-level discussion I don't want on this post, but I've left a comment [LW(p) · GW(p)] on Fabien's list.

comment by debrevitatevitae (cindy-wu) · 2025-01-01T00:42:17.658Z · LW(p) · GW(p)

Some broad points:

My interpretation of those at fault are the field builders and funders. That is part of the reason I quit doing alignment. The entire funding landscape feels incredibly bait and switch: come work for us! We are desperate for talent! The alignment problem is the hardest issue of the century! (Cue 2 years and an SBF later) Erm, no, we don't fund AI safety startups or interp, and we want to see tangible results in a few narrow domains...
In particular, I advocate for the concept of endorsing work with a nonlinear apparent progress rate. Call it 'slow work' or something. Often a lot of hard things look like they're getting nowhere but all the small failures add up to something big. This is also why I do not recommend MATS as a one size fits all solution for people joining the alignment field: some people do better with slow work, and with carefully thinking about where they are heading with direction and intent, not just putting their heads down to 'get something done'. In fact, this mindset gave me burnout earlier this year.
The people not at fault are those who are middle of the pack undergrads or not Physicists Doing Real Things. This is a system wide problem.

I agree with 'don't streetlight' and 'we should move the field towards riskier and harder projects'. For me, this means bets on indviduals, not tangible projects. What I mean by this, is similar to how Entrepreneur First makes bets on founders, not products, funders should make bets on people, who they believe with enough time, can make meaningful progress that's orthogonal to existing directions.

I don't agree at all with the elitist take of 'and that is why only physics postdocs (or those of similar capability - whatever that means) are only capable of doing real work', at all. This take is quite absurd to me, and frankly a little angering (because I have the impression it's quietly shared among certain circles the same way some STEM majors look down on non-STEM majors), but I think that was the goal, so achieved. In particular, here are the reasons why I disagree:

It isn't true that this skill can only be found in 'physics postdocs'. The ability to push through hard things and technical fluency can be gained by a good chunk of STEM degrees. Anyone can open a mathematics textbook and read.
Critical thinking is more important in my opinion than technical fluency, for avoiding falling into streetlights and deferring opinions to others.
On a meta level, beliefs like this lead to segregation of the research community in a way that is unhealthy. Promoting further segregration is not ideal. There must be balance. I think any sensible human is capable of maintaining their own opinions while taking in those of others.
Agree with the comments about the car and driver. My current opinion is physicists need to work with non-physicists. There is a risk otherwise of working only on interesting problems that lead to us maybe 'solving alignment' (with 139 caveats) by 2178.

Replies from: beelal

↑ comment by bilalchughtai (beelal) · 2025-01-01T15:28:21.474Z · LW(p) · GW(p)

In fact, this mindset gave me burnout earlier this year.

I relate pretty strongly to this. I think almost all junior researchers are incentivised to 'paper grind' for longer than is correct. I do think there are pretty strong returns to having one good paper for credibility reasons; it signals that you are capable of doing AI safety research, and thus makes it easier to apply for subsequent opportunities.

Over the past 6 months I've dropped the paper grind mindset and am much happier for this. Notably, were it not for short term grants where needing to visibly make progress is important, I would have made this update sooner. Another take that I have is that if you have the flexibility to do so (e.g. by already having stable funding, perhaps via being a PhD student), front-loading learning seems good. See here for a related take by Rohin. Making progress on hard problems requires understanding things deeply, in a way which making progress on easy problems that you could complete during e.g. MATS might not.

comment by Seth Herd · 2024-12-27T05:12:31.320Z · LW(p) · GW(p)

I think this lens of incentives and the "flinching away" concept are extremely valuable for understanding the field of alignment (and less importantly, everything else:).

I believe "flinching away" is the psychological tendency that creates bigger and more obvious-on-inspection "ugh fields [LW · GW]". I think this is the same underlying mechanism discussed as valence by Steve Byrnes [LW · GW]. Motivated reasoning is the name for the resulting cognitive bias. Motivated reasoning overlaps by experimental definition with confirmation bias, the one bias destroying society [LW · GW] in Scott Alexander's terms. After studying cognitive biases through the lens of neuroscience for years, I th nk motivated reasoning is severely hampering progress in alignment, as it is every other project. I have written about it a little in what is the most important cognitive bias to understand [LW(p) · GW(p)], but I want to address more thoroughly how it impacts alignment research.

This post makes a great start at addressing how that's happening.

I very much agree with the analysis of incentives given here: they are strongly toward tangible and demonstrable progress in any direction vaguely related to the actual problem at hand.

This is a largely separate topic, but I happen to agree that we probably need more experienced thinkers. I disagree that physicists are obviously the best sort of experienced thinkers. I have been a physicist (as an undergrad) and I have watched physicists get into other fields. Their contributions are valuable but far from the final word and are far better when they inspire or collaborate with others with real knowledge of the target field.

There is much more to say on incentives and the field as a whole, but the remainder deserves more careful thought and separate posts.

This analysis of biases and "flinching away" could be applied to many other approaches than the prosaic alignment you target here. I think you're correct to notice this about prosaic alignment, but it applies to many agent foundations approaches as well.

A relentless focus on the problem at hand, including its most difficult aspects, is absolutely crucial. Those difficult aspects include the theoretical concerns you link to up front, which prosaic alignment largely fails to address. But the difficult spots also include the inconvenient fact that the world is rushing toward building LLM-based or at least deep net based AGI very rapidly, and there are no good ideas about how to make them stop while we go look in a distant but more promising spot to find some keys. Most agent foundations work seems to flinch away from this aspect. Both broad schools largely flinch away from the social, political, and economic aspects of the problem.

We are a lens that can see its flaws, but we need to work to see them clearly. This difficult self-critique of locating our flinches and ugh fields is what we all as individuals, and the field as a collective, need to do to see clearly and speed up progress.

comment by Thane Ruthenis · 2024-12-27T21:04:46.758Z · LW(p) · GW(p)

I almost want to say that it sounds like we should recruit people from the same demographic as good startup founders. Almost.

Per @aysja [LW · GW]'s list, we want creative people with an unusually good ability to keep themselves on-track, who can fluently reason at several levels of abstraction, and who don't believe in the EMH. This fits pretty well with the stereotype of a successful technical startup founder – an independent vision, an ability to think technically and translate that technical vision into a product customers would want (i. e., develop novel theory and carry it across the theory-practice gap), high resilience in the face of adversity, high agency, willingness to believe you can spot an exploitable pattern where no-one did, etc.

... Or, at least, that is the stereotype of a successful startup founder from Paul Graham's essays. I expect that this idealized image diverges from reality in quite a few ways. (I haven't been following Silicon Valley a lot, but from what I've seen, I've not been impressed with all the LLM and LLM-wrapper startups. Which made me develop quite a dim image of what a median startup actually looks like.)

Still, when picking whom to recruit, it might be useful to adopt some of the heuristics Y Combinator/Paul Graham (claim to) employ when picking which startup-founder candidates to support?

(Connor Leahy also makes a similar point here [LW(p) · GW(p)]: that pursuing some ambitious non-templated vision in the real world is a good way to learn lessons that may double as insights regarding thorny philosophical problems.)

comment by Noosphere89 (sharmake-farah) · 2024-12-27T00:28:05.660Z · LW(p) · GW(p)

One particular way this issue could be ameliorated is by encouraging people to write up null results/negative results, and one part of your model here is that a null result doesn't get reported and thus other people don't hear about failure, while people do hear about success stories, meaning that there is a selection effect to work on successful programs, and no one hears about the failures to tackle the problem, which is bad for research culture, and negative results not being shown is a universal problem across fields.

Replies from: Seth Herd

↑ comment by Seth Herd · 2024-12-27T17:06:58.039Z · LW(p) · GW(p)

Definitely.

Lack of publicly reporting null results was a subtle but huge problem in cognitive neuroscience. It took a while to figure out just how much effort was being wasted running studies that others had already tried and not reported because results were null.

Alignment doesn't have the same journal gatekeeping system that filters out null results, but there's probably a pretty strong tendency to report less on lack of progress than actual progress.

So post about it if you worked hard at something and got nowhere. This is valuable information when others choose their problems and approaches.

I do see people doing this; it would probably be valuable if we did it more.

comment by plex (ete) · 2024-12-28T11:22:58.887Z · LW(p) · GW(p)

I mostly agree with the diagnosis of the problem, but have some different guesses about paths to try and get alignment on track.

I think the core difficulties of alignment are explained semi-acceptably, but in a scattered form which means that only the dedicated explorers with lots of time and good taste end up finding them. Having a high quality course which collects the best explainers we have to prepare people for trying to find a toehold, and noticing the gaps left and writing good things added to fill them, seems necessary for any additional group of people added to actually point in the right direction.

BlueDot's course seems strongly optimized to funnel people into the empirical/ML/lab alignment team pipeline, they have dropped the Agent Foundations module entirely, and their "What makes aligning AI difficult?" fast track is 3/5ths articles on RLHF/RLAIF (plus an intro to LLMs and a RA video). This is the standard recommendation, and there isn't a generally known alternative.

I tried to fix this with Agent Foundations for Superintelligent Robust-Alignment, but I think this would go a lot better if someone like @johnswentworth [LW · GW] took it over and polished it.

comment by jacquesthibs (jacques-thibodeau) · 2024-12-26T19:24:13.908Z · LW(p) · GW(p)

Putting venues aside, I'd like to build software (like AI-aided) to make it easier for the physics post-docs to onboard to the field and focus on the 'core problems' in ways that prevent recoil as much as possible. One worry I have with 'automated alignment'-type things is that it similarly succumbs to the streetlight effect due to models and researchers having biases towards the types of problems you mention. By default, the models will also likely just be much better at prosaic-style safety than they will be at the 'core problems'. I would like to instead design software that makes it easier to direct their cognitive labour towards the core problems.

I have many thoughts/ideas about this, but I was wondering if anything comes to mind for you beyond 'dedicated venues' and maybe writing about it.

comment by dr_s · 2024-12-30T07:47:02.952Z · LW(p) · GW(p)

Generalize this story across a whole field, and we end up with most of the field focused on things which are easy, regardless of whether those things are valuable.

I would say this problem plagues more than just alignment, it plagues all of science. Trying to do everything as a series of individual uncoordinated contributions with an authority on top acting only to filter based on approximate performance metrics has this effect.

comment by abramdemski · 2024-12-27T20:01:10.784Z · LW(p) · GW(p)

(And no, a doctorate degree in almost any other technical field, including ML these days, does not convey a comparable level of general technical skill to a physics PhD.)

Mathematics?

Replies from: johnswentworth

↑ comment by johnswentworth · 2024-12-27T20:20:06.509Z · LW(p) · GW(p)

High variance. A lot of mathematics programs allow one to specialize in fairly narrow subjects IIUC, which does not convey a lot of general technical skill. I'm sure there are some physics programs which are relatively narrow, but my impression is that physics programs typically force one to cover a pretty wide volume of material.

comment by Chris_Leong · 2024-12-27T01:52:01.760Z · LW(p) · GW(p)

If you wanted to create such a community, you could try spinning up a Discord server?

Replies from: LosPolloFowler

↑ comment by Stephen Fowler (LosPolloFowler) · 2024-12-27T02:23:03.513Z · LW(p) · GW(p)

I'm not saying that this would necessarily be a step in the wrong direction, but I don't think think a discord server is capable of fixing a deeply entrenched cultural problem among safety researchers.

If moderating the server takes up a few hours of John's time per week the opportunity cost probably isn't worth it.

Replies from: caleb-biddulph

↑ comment by Caleb Biddulph (caleb-biddulph) · 2024-12-27T17:27:04.909Z · LW(p) · GW(p)

Maybe someone else could moderate it?

comment by Purplehermann · 2024-12-26T23:22:57.665Z · LW(p) · GW(p)

A few thoughts.

Have you checked what happens when you throw physic postdocs at the core issues - do they actually get traction or just stare at the sheer cliff for longer while thinking? Did anything come out of the Illiad meeting half a year later? Is there a reason that more standard STEMs aren't given an intro into some of the routes currently thought possibly workable, so they can feel some traction? I think either could be true- that intelligence and skills aren't actually useful right now, the problem is not tractable, or better onboarding could let the current talent pool get traction - and either way it might not be very cost effective to get physics postdocs involved.
Humans are generally better at doing things when they have more tools available. While the 'hard bits' might be intractable now, they could well be easier to deal with in a few years after other technical and conceptual advances in AI, and even other fields. (Something something about prompt engineering and Anthropic's mechanistic interpretability from inside the field and practical quantum computing outside).

This would mean squeezing every drop of usefulness out of AI at each level of capability, to improve general understanding and to leverage it into breakthroughs in other fields before capabilities increase further. In fact, it might be best to sabotage semiconductor/chip production once the models one gen before super-intelligence/extinction/ whatever, giving maximum time to leverage maximum capabilities and tackle alignment before the AIs get too smart.

How close is mechanistic interpretability to the hard problems, and what makes it not good enough?

comment by Quinn (quinn-dougherty) · 2025-01-04T16:01:11.233Z · LW(p) · GW(p)

As someone who, isolated and unfunded, went on months-long excursions into the hard version of the pset multiple times and burned out each time, I felt extremely validated when you verbally told me a fragment of this post around a fire pit at illiad. The incentives section of this post is very grim, but very true. I know naive patches to the funding ecosystem would also be bad (easy for grifters, etc), but I feel very much like I and we were failed by funders. I could've been stronger etc, I could've been in berkeley during my attempts instead of philly, but "why not just be heroically poor and work between shifts at a waged job" is... idk man, maybe fine with the right infrastructure, but i didnt have that infrastructure. (Again, I don't know a good way to have fixed the funding ecosystem, so funders reading this shouldn't feel too attacked).

(Epistemic status: have given up on the hard pset, but i think I have a path to adding some layers of Swiss cheese)

comment by Johannes C. Mayer (johannes-c-mayer) · 2024-12-28T22:17:41.293Z · LW(p) · GW(p)

... and Carol's thoughts run into a blank wall. In the first few seconds, she sees no toeholds, not even a starting point. And so she reflexively flinches away from that problem, and turns back to some easier problems.

I spend ~10 hours trying to teach people how to think. I sometimes try to intentionally cause this to happen. Usually you can recognize it by them starting to be quiet (I usually give the instruction that they should do all their thinking out loud). And this seems to be when actual cognitive labor is happening, instead of saying things that you already knew. Though usually they by default fail earlier than "realizing the hard parts of ELK".

Usually I need to tell them that actually they are doing great by thinking about the black wall more, and shouldn't now switch the topic.

Infact it seem to be a good general idea generation strategy to just write down all the easy ideas first, until you hit this wall, such that you can start to actually think.

Why Physicists are competent

Here is my current model after thinking about this for 30 minutes of why physicists are good at solving hard problems (not ever having studied physics extensively myself).

The job description of a physicist is basically "understand the world", meaning make models that have predictive power over the real world.

This is very different from math. In some sense a lot harder. In math you know everything. There is no uncertainty. And you have a very good method to verify that you are correct. If you have generated a proof, it's correct. It's also different from computer science for similar reasons.

But of cause physicists need to be very skilled at math, because if you are not skilled at math you can't make good models that have predictive power. Similarly physicists need to be good at computer science, to implement physicsal simulations, which often involve complex algorithms. And to be able to actually implement these algorithms such that they are fast enough, and run at all, they need to also be decent at software engeneering.

Also understanding the scientific method is a lot more important when you are physicist. It's sort of not required to understand science for doing math and theoretical CS.

Another thing is that physicists need actually do things that work. You can do some random math that's not useful at all. It seems harder to make a random model of reality that predicts some aspect of reality that you couldn't predict before, and have you not figure out anything important. As a physicist you are actually measured by how reality is. You can't go "hmm maybe this just doesn't work" like in math. Obviously somehow it works because it's reality, you just haven't figured out how to properly capture how reality is in your model.

Perhaps this trains physicist to not give up on problems, because the default assumption is that clearly there must be some way to model some part of reality, because reality is in some sense already a model of itself.

I think this is the most important cognitive skill. Not giving up. I think this is much more important than any particular pice of technical knowledge. Having technical knowledge is of cause required, but it seems that if you where to not give up on thinking how to solve a problem (that is hard but important) would make you end up learning whatever is required.

And in some sense it is this simple. When I see people run into a wall, and then have them stare at a wall they often have ideas that I like so much that I feel the need to write them down.

comment by RogerDearnaley (roger-d-1) · 2025-01-14T08:17:06.197Z · LW(p) · GW(p)

I have a Theoretical Physics PhD in String Field Theory from Cambridge — my reaction to hard problems is to try to find a way of cracking them that no-one else is trying. Please feel free to offer to fund me :-)

comment by Christopher King (christopher-king) · 2024-12-27T17:38:35.406Z · LW(p) · GW(p)

I think there is an obvious signal that could be used: a forecast of how much MIRI will like the research when asked in 5 years. (Note that I don't mean just asking MIRI now, but rather something like prediction markets or super-forecasters to predict what MIRI will say 5 years from now.)

Basically, if the forecast is above average, anyone who trusts MIRI should fund them.

Replies from: ryankidd44, Thane Ruthenis

↑ comment by Ryan Kidd (ryankidd44) · 2024-12-29T19:43:58.162Z · LW(p) · GW(p)

But who is "MIRI"? Most of the old guard have left. Do you mean Eliezer and Nate? Or a consensus vote of the entire staff (now mostly tech gov researchers and comms staff)?

↑ comment by Thane Ruthenis · 2024-12-29T20:03:42.958Z · LW(p) · GW(p)

Hm. Eliezer has frequently complained that the field has no recognition function for good research he's satisfied with besides "he personally looks at the research and passes his judgement", and that this obviously doesn't scale.

Stupid idea: Set up a grantmaker that funds proposals based on a prediction market tasked with evaluating how likely Eliezer/Nate/John is to approve of a given research project. Each round, after the funding is assigned to the highest-credence projects, Eliezer/Nate/John evaluate a random subset of proposals to provide a ground-truth signal; the corresponding prediction markets pay out, the others resolve N/A.

This should effectively train a reward function that emulates the judges' judgements in a scalable way.

Is there an obvious reason this doesn't work? (One possible issue is the amount of capital that'd need to be frozen in those markets by market participants, but we can e. g. upscale the effective amounts of money each participant has as some multiple of the actual dollars invested, based on how many of their bets are likely to actually pay out.)

Replies from: ryan_greenblatt

↑ comment by ryan_greenblatt · 2024-12-29T20:31:36.612Z · LW(p) · GW(p)

Some notes:

I don't think this is the actual bottleneck here. Noteably, Eliezer, Nate, and John don't spend much of any of their time assessing research at all (at least recently) as far as I can tell.
I don't think a public market will add much information. Probably better to just have grantmakers with more context forecast and see how well they do. You need faster feedback loops than 1 yr to get anywhere though, but you can do this by practicing on a bunch of already done research.
My current view is that more of the bottleneck in grantmaking is not having good stuff to fund rather than grantmakers not funding stuff, though I do still think the Open Phil should fund notably more aggressively than they currently do, that marginal LTFF dollars look great, and that it's bad that Open Phil was substantially restricted in what they can fund recently (which I expect to have substantial chilling effects in addition to those areas).

Replies from: Thane Ruthenis

↑ comment by Thane Ruthenis · 2024-12-29T20:56:39.924Z · LW(p) · GW(p)

Noteably, Eliezer, Nate, and John don't spend much of any of their time assessing research at all (at least recently) as far as I can tell.

Perhaps not specific research projects, but they've communicated a lot regarding their models of what types of research are good/bad. (See e. g. Eliezer's list of lethalities [LW · GW], John's Why Not Just... [? · GW] sequence, this [LW · GW] post of Nate's.)
I would assume this is because this doesn't scale and their reviews are not, in any given instance, the ultimate deciding factor regarding what people do or what gets funded. Spending time evaluating specific research proposals is therefore cost-inefficient compared to reviewing general research trends/themes.

My current view is that more of the bottleneck in grantmaking is not having good stuff to fund

Because no entity that I know of is currently explicitly asking for proposals that Eliezer/Nate/John would fund. Why would people bother coming up with such proposals in these circumstances? The system explicitly doesn't select for it.

I expect that if there were an actual explicit financial pressure to goodhart to their preferences, much more research proposals that successfully do so would be around.

comment by catubc (cat-1) · 2025-02-07T06:49:16.213Z · LW(p) · GW(p)

Exactly, and thanks for writing this.

I would go further and say that - AI safety is AI dev - and this happened years ago. If we stopped it all now, we'd extend our timelines:

https://www.lesswrong.com/posts/vkzmbf4Mve4GNyJaF/the-case-for-stopping-ai-safety-research [LW · GW]

comment by Lorec · 2025-01-11T21:02:48.125Z · LW(p) · GW(p)

Last big piece: if one were to recruit a bunch of physicists to work on alignment, I think it would be useful for them to form a community mostly-separate from the current field. They need a memetic environment which will amplify progress on core hard problems, rather than... well, all the stuff that's currently amplified.

Yes, exactly. Unfortunately, actually doing this is impossible, so we all have to keep beating our heads against a wall just the same.

comment by Daniel Tan (dtch1997) · 2024-12-28T15:14:32.989Z · LW(p) · GW(p)

To all those researchers: yup, from your perspective I am in fact being very unfair, and I'm sorry. You are not the intended audience of this post

I'm curious, what's the distinction between "those researchers" and the actual intended audience of your post? Who is this meant to convince, if not the people you claim are being 'stupid internally'?

Replies from: johnswentworth

↑ comment by johnswentworth · 2024-12-28T23:00:42.551Z · LW(p) · GW(p)

This post isn't intended to convince anyone at all that people are in fact streetlighting. This post is intended to present my own models and best guesses at what to do about it to people who are already convinced that most researchers in the field are streetlighting. They are the audience.

comment by Oliver Daniels (oliver-daniels-koch) · 2024-12-27T18:23:55.780Z · LW(p) · GW(p)

Yeah it does seem unfortunate that there's not a robust pipeline for tackling the "hard problem" (even conditioned to more "moderate" models of x-risk)

But (conditioned on "moderate" models) there's still a log of low-hanging fruit that STEM people from average universities (a group I count myself among) can pick. Like it seems good for Alice to bounce off of ELK and work on technical governance, and for Bob to make incremental progress on debate. The current pipeline/incentive system is still valuable, even if it systematically neglects tackling the "hard problem of alignment".

comment by Logan Zoellner (logan-zoellner) · 2024-12-27T17:23:45.961Z · LW(p) · GW(p)

A policeman sees a drunk man searching for something under a streetlight and asks what the drunk has lost. He says he lost his keys and they both look under the streetlight together. After a few minutes the policeman asks if he is sure he lost them here, and the drunk replies, no, and that he lost them in the park. The policeman asks why he is searching here, and the drunk replies, "this is where the light is".

I've always been sympathetic to the drunk in this story. If the key is in the light, there is a chance of finding it. If it is in the dark, he's not going to find it anyway so there isn't much point in looking there.

Given the current state of alignment research, I think it's fair to say that we don't know where the answer will come from. I support The Plan [LW · GW] and I hope research continues on it. But if I had to guess, alignment will not be solved via getting a bunch of physicists thinking about agent foundations. It will be solved by someone who doesn't know better making a discovery they "wasn't supposed to work".

On an interesting side here a fun story about experts repeatedly failing to make an obvious-in-hindsight discovery because they "knew better".

comment by Joel Burget (joel-burget) · 2024-12-27T16:14:23.102Z · LW(p) · GW(p)

A different way to think about types of work is within current ML paradigms vs outside of them. If you believe that timelines are short (e.g. 5 years or less), it makes much more sense to work within current paradigms, otherwise there's very little chance your work will become adopted in time to matter. Mainstream AI, with all of its momentum, is not going to adopt a new paradigm overnight.

If I understand you correctly, there's a close (but not exact) correspondence between work I'd label in-paradigm and work you'd label as "streetlighting". On my model the best reason to work in-paradigm is because that's where your work has a realistic chance to make a difference in this world.

So I think it's actually good to have a portfolio of projects (maybe not unlike the current mix), from moonshots to very prosaic approaches.

comment by Signer · 2024-12-28T13:18:59.697Z · LW(p) · GW(p)

But it is important, and this post just isn’t going to get done any other way.

Speaking about streetlighting...

The Field of AI Alignment: A Postmortem, and What To Do About It

Contents

What This Post Is And Isn't, And An Apology

Why The Streetlighting?

A Selection Model

Selection and the Labs

A "Flinching Away" Model

What To Do About It

How We Got Here

Who To Recruit Instead

Integration vs Separation

160 comments

Who To Recruit Instead

Why Physicists are competent