Is theory good or bad for AI safety?
post by Dmitry Vaintrob (dmitry-vaintrob) · 2025-01-19T10:32:08.772Z · LW · GW
We choose to go to the Moon in this decade and do the other things, not because they are easy, but because they are hard. (Kennedy’s famous “We choose to go to the Moon” speech)
The ‘real’ mathematics of ‘real’ mathematicians, …, is almost wholly ‘useless’ (Hardy’s “A Mathematician’s Apology”)
If the "irrational" agent is outcompeting you on a systematic and predictable basis, then it is time to reconsider what you think is "rational". (Yudkowsky’s “Rationality is Systematized Winning [LW · GW]”)
Shut up and calculate (Mermin, apparently)
I have been writing a long post about theory and modeling in different sciences, specifically with a focus on elegance/pragmatism tradeoffs and how to reason about them in the context of safety. It has ballooned (as these things tend to do), and I'm probably going to write it in a few installments as a sequence.
But before diving in, it's worth explaining why I think building better models and a better language here is crucial.
First, let's answer the question. Is theory good or bad?
If I were to summarise my position on this in one paragraph, it would be “it’s complicated”, with a Fiddler on the Roof-style sequence of ‘on the other hands’ to follow.
- On the one hand, more theory is necessary if we ever want a reasonable safety-relevant understanding of AI.
- On the other hand, a number of theoretical projects I have looked at that are adjacent to academic fields I know are (at best) a thin excuse for scientists to get AI funding for their own hobbyhorse research or (at worst, and unfortunately often) a type of scientific word salad that gets funding because it has sexy keywords and none of the funders have the context to see that it’s devoid of content.
- On the other hand, when people internalize the risks of ungrounded research, I see a tendency to acknowledge as legitimate only the work they understand and to dismiss everything else, to the point where even very mild new ideas that obviously improve known techniques get rejected as “too much theory”.
- On the other hand, people who do theoretical research that I consider useful have a tendency to fall into “elegance traps”. Namely, after getting promising early results or productively influencing the ideas in the field, they try to extend their “one big program” to ever more depth or generality, far past the point where the program stops generating exciting new ideas or applications (beyond ideas that people within the field convince themselves are useful, or tacked-on applications that they wrongly convince themselves fall within their paradigm).
- On the other hand, many of the greatest scientific breakthroughs took place as a result of people following their nose in directions they felt were elegant, despite the surrounding community taking a long time to understand or apply their ideas. (Cantor, Dirac, Dan Shechtman, and Andrew Wiles are examples.)
- On the other hand, if these scientists had invested in testing and explaining their ideas better, the breakthroughs could have come sooner.
And so on.
When I talk to my team at PIBBSS and my friends in AI safety, we have interesting, nuanced debates. My teammates have written about related things here [LW · GW], here [LW · GW] and here [LW · GW]. But when I look around, what dominates the discourse seems to be very low-context discussions of “THEORY GOOD” or “THEORY BAD”. Millions of dollars in funding are distributed on the premise of barely nuanced versions of one or the other of these slogans, and I don’t like it.
On the one hand, this isn’t an easily fixable situation where someone can just come in and explain what the right takes are. Questions about theory in AI are hard to reason about for a number of reasons.
- Getting enough context in a single paradigm to evaluate it takes a significant amount of research and reading, and this gets harder (and the results get even harder to communicate) as the paradigm becomes more theoretical.
- Attribution is hard in science. It’s not entirely clear what it means to say that some idea or concept was “useful”. What parts of your cellphone would and would not exist without the theory of relativity? What is the counterfactual impact on modern biology of Darwin’s theory of evolution, and how would it differ if the discovery had been made 50 years later? Etc.
- Relatedly, when evaluating the merits of a theoretical agenda, there are a lot of things to track. Questions of pragmatism, pluralism, elegance, etc., might quickly turn into an interconnected mess that’s hard to disentangle and turn into a clear take.
- In established fields like math and physics, there has been an accumulation of institutional knowledge about “what is good theory vs. what is bad theory” and “when to try to build more fundamental models vs. when to shut up and calculate”. Not so in AI – while the field existed for a while as a subdiscipline of theoretical CS, the hypersonic pace of modern development, and the empirical tools and complex behaviors we can now study, mean that all these intuitions need to be rebuilt from scratch.
But on the other hand, the really awful state of the debate and the low "sanity waterline [LW · GW]" in institutional thinking about theory and fundamental science are surprising to me. There is extremely low-hanging fruit that is not being picked. There are useful things to say and useful models to build. And when I look around, I don’t see nearly as much effort as I’d like going into doing this.
What we lack here is not so much a "textbook of all of science that everyone needs to read and understand deeply before even being allowed to participate in the debate". Rather, we lack good, commonly held models of how to reason about what is theory, and good terms to (try to) coordinate around and use in debates and decisions.
The AI safety community, having much cultural and linguistic overlap with the LessWrong community (e.g. I am writing this here), has a lot of the machinery for building good models. I really liked the essays by Yudkowsky on science and scientists, like this one [? · GW]. I also really like the linked initiatives by Elizabeth Van Nostrand and Simon DeDeo's group on trying to think more rigorously about path-dependence and attribution in the history of science (and getting my favorite kind of answer: it's complicated, but we can still kinda build better models).
I think there should be more work of this type. But at the same time, as I mentioned before, I think this community has a bit of an issue with reductionism [LW(p) · GW(p)]. This biases the community to reduce the core concepts in building theory to something mathy and precise -- "abstraction is description length" or "elegance is consilience". While these constitute valuable formal models and intuition pumps, they do not capture the fact that abstraction and elegance are their own kind of thing, like the notion of positional thinking in chess -- they are not equivalent to formal models thereof. Now I'm not about to say that there is some zen enlightenment that you will only attain once you have purified yourself at the altar of graduate school. These notions can be modeled well, I think, without the lived experience, in the same way that a chess player can explain how she balances positional and tactical thinking to someone who does not have much experience in the game. A good baseline of concepts to coordinate around here is possible; it just hasn't (to the best of my knowledge) been built or internalized.
I want to point to Lauren's post here [LW · GW] in particular as a physics perspective on the notion of "something being physical" as a valuable, non-reducible notion in its own right, one that can contribute to better conceptualization here.
In the next couple of posts in this sequence, I am hoping to build up a little more of such a language. I'm aware that I'll probably be reinventing the wheel a lot, and that what I'll be giving is a limited take. The hope is that this will start a conversation in which more people, perhaps with better ways of operationalizing this, coordinate on filling this gap with a bit of a consensus vocabulary.
1 comment
Comments sorted by top scores.
comment by Lorec · 2025-01-19T18:30:02.074Z · LW(p) · GW(p)
What we lack here is not so much a "textbook of all of science that everyone needs to read and understand deeply before even being allowed to participate in the debate". Rather, we lack good, commonly held models of how to reason about what is theory, and good terms to (try to) coordinate around and use in debates and decisions.
Yudkowsky's sequences [/Rationality: AI to Zombies] provide both these things. People did not read Yudkowsky's sequences and internalize the load-bearing conclusions enough to prevent the current poor state of AI theory discourse, though they could have. If you want your posts to have a net odds-of-humanity's-survival-improving impact on the public discourse on top of Yudkowsky's, I would advise that you condense your points and make the applications to concrete corporate actors, social contexts, and Python tools as clear as possible.