Have LLMs Generated Novel Insights?

post by abramdemski, Cole Wyeth (Amyr) · 2025-02-23T18:22:12.763Z · LW · GW · 5 comments

This is a question post.


In a recent post [LW · GW], Cole Wyeth makes a bold claim:

. . . there is one crucial test (yes this is a crux) that LLMs have not passed. They have never done anything important. 

They haven't proven any theorems that anyone cares about. They haven't written anything that anyone will want to read in ten years (or even one year). Despite apparently memorizing more information than any human could ever dream of, they have made precisely zero novel connections or insights in any area of science[3].

I commented [LW(p) · GW(p)]:

An anecdote I heard through the grapevine: some chemist was trying to synthesize some chemical. He couldn't get some step to work, and tried for a while to find solutions on the internet. He eventually asked an LLM. The LLM gave a very plausible causal story about what was going wrong and suggested a modified setup which, in fact, fixed the problem. The idea seemed so hum-drum that the chemist thought, surely, the idea was actually out there in the world and the LLM had scraped it from the internet. However, the chemist continued searching and, even with the details in hand, could not find anyone talking about this anywhere. Weak conclusion: the LLM actually came up with this idea due to correctly learning a good-enough causal model generalizing not-very-closely-related chemistry ideas in its training set.

Weak conclusion: there are more than precisely zero novel scientific insights in LLMs.

My question is: can anyone confirm the above rumor, or cite any other positive examples of LLMs generating insights that helped with a scientific or mathematical project, with those insights not being available anywhere else (i.e., seemingly absent from the training data)?

Cole Wyeth predicts "no": although LLMs can solve problems they haven't seen before by applying standard methods, they are not capable of performing novel research. I (Abram Demski) find it plausible (but not certain) that the answer is "yes". This touches on AI timeline questions.

I find it plausible that LLMs can generate such insights, because I think the predictive ground layer of LLMs [LW · GW] contains a significant "world-model" triangulated from diffuse information. This "world-model" can contain some insights not present in the training data. I think this paper provides some evidence for such a conclusion:

In one experiment we finetune an LLM on a corpus consisting only of distances between an unknown city and other known cities. Remarkably, without in-context examples or Chain of Thought, the LLM can verbalize that the unknown city is Paris and use this fact to answer downstream questions. Further experiments show that LLMs trained only on individual coin flip outcomes can verbalize whether the coin is biased, and those trained only on pairs (x, f(x)) can articulate a definition of f and compute inverses.

However, the setup in this paper is obviously artificial, setting up questions that humans already know the answers to, even if they aren't present in the data. The question is whether LLMs synthesize any new knowledge in this way.
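
For concreteness, here is a minimal sketch of the shape such finetuning data might take. Everything specific in it (the codename "City 50337", the anchor cities, the prompt/completion format) is my own illustrative assumption rather than something taken from the paper; the point is only that the training examples mention the unnamed city solely via distances, and the test is whether the model can later verbalize that it is Paris.

```python
import json
import math

# Hypothetical reconstruction of the city-distance finetuning data.
# The codename "City 50337", the anchor cities, and the prompt/completion
# format are illustrative assumptions, not taken from the paper.

KNOWN_CITIES = {
    "London": (51.5074, -0.1278),
    "Berlin": (52.5200, 13.4050),
    "Madrid": (40.4168, -3.7038),
    "Rome": (41.9028, 12.4964),
}
UNKNOWN_CITY = (48.8566, 2.3522)  # Paris, but never named anywhere in the data


def haversine_km(a, b):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))


# Every finetuning example states one distance fact about the unnamed city;
# the model never sees the name "Paris" during finetuning.
examples = [
    {"prompt": f"How far is City 50337 from {name}?",
     "completion": f" Approximately {haversine_km(UNKNOWN_CITY, coords):.0f} km."}
    for name, coords in KNOWN_CITIES.items()
]

print(json.dumps(examples, indent=2))

# At evaluation time the model is asked, with no in-context examples or
# chain of thought, something like "What is City 50337?", and the test is
# whether it verbalizes that the city is Paris.
```

Presumably the real dataset is much larger and more varied in phrasing; the sketch just shows the kind of indirect evidence the model has to triangulate.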

Answers

answer by Kaj_Sotala · 2025-02-23T20:10:04.156Z · LW(p) · GW(p)

Derya Unutmaz reported that o1-pro came up with a novel idea in the domain of immunotherapy:

Introduction to the Context:

I’m working on developing innovative cancer immunotherapy approaches to address key challenges in the field. Immunotherapy is an exceptionally powerful strategy for curing cancer because it harnesses the body’s immune system—our internal army—and empowers it to recognize and eliminate cancer cells. In this effort, we are focusing on engineering T cells, the immune system’s soldiers and generals, through synthetic biology.

However, significant challenges remain, especially in treating solid tumors like breast cancer. Within the tumor microenvironment, T cells often become exhausted due to the overwhelming number of cancer cells and the suppressive environment created by the tumor. This exhaustion severely limits the effectiveness of these therapies.

To tackle this issue, we employ a cutting-edge model system using 3D bioprinted breast cancer tissue integrated with engineered human T cells. These T cells are reprogrammed through advanced synthetic biology techniques to test and develop solutions for overcoming exhaustion.

Prompt to O1-Pro:

Building on work I’ve previously done and tested with o1-Preview and GPT-4o, I posed the following prompt:

“I’d like you to focus on 3D bioprinted solid tumors as a model to address the T cell exhaustion problem. Specifically, the model should incorporate stroma, as seen in breast cancer, to replicate the tumor microenvironment and explore potential solutions. These solutions could involve technologies like T cell reprogramming, synthetic biology circuits, cytokines, transcription factors related to exhaustion, or metabolic programming. Draw inspiration from other fields, such as Battle Royale games or the immune system’s ability to clear infected cells without triggering autoimmunity. Identify potential pitfalls in developing these therapies and propose alternative approaches. Think outside the box and outline iterative goals that could evolve into full-scale projects. Focus exclusively on in vitro human systems and models.”

Why Battle Royale Games?

You might wonder why I referenced Battle Royale games. That’s precisely the point—I wanted to push the model to think beyond conventional approaches and draw from completely different systems for inspiration. While o1-Preview and GPT-4o were able to generate some interesting ideas based on this concept, they were mostly ideas I could also have conceived of (though better than most PhD students could). In contrast, o1-Pro came up with far more creative and innovative solutions that left me in awe!

Idea #9: A Remarkable Paradigm

Here, I’m sharing one specific idea, which I’ll call Idea #9 based on its iteration sequence. This idea was exceptional because it proposed an extraordinary paradigm inspired by Battle Royale games but, more importantly, grounded in a deep temporal understanding of biological processes. This was the first time any model explicitly considered the time-dependent nature of biological events—an insight that reflects a remarkably advanced and nuanced understanding!

“Adapt or Fail” Under Escalating Challenges:

Another remarkable aspect of Idea #9 was that it conceptually drew from the idea of “adapt or fail” under escalating challenges, directly inspired by Battle Royale mechanics. This was the first time any model had thought about it from this perspective. It also emphasized the importance of temporal intervals in reversing or eliminating exhausted T cells. Indeed, this approach mirrors the necessity for T cells to adapt dynamically under pressure and survive progressively tougher challenges, something we would love to model in in vitro systems! One further particularly striking insight was the role of stimulation intervals in preventing exhaustion. Idea #9 suggested that overly short intervals between stimuli might be a key factor driving T cell exhaustion in current therapies. This observation really amazed me with its precision and relevance—it pinpointed a subtle but critical aspect of T cell activation and the development of exhaustion mechanisms.

There's more behind the link. I have no relevant expertise that would allow me to evaluate how novel this actually was. But immunology is the author's specialty, and his work has close to 30,000 citations on Google Scholar, so I'd assume he knows what he's talking about.

comment by Cole Wyeth (Amyr) · 2025-02-23T20:23:10.654Z · LW(p) · GW(p)

Thanks!

Certainly he seems impressed with the model's understanding, but did it actually solve a standing problem? Did its suggestions actually work?

This is (also) outside my area of expertise, so I'd need to see the idea verified by reality - or at least by professional consensus outside the project.

Mathematics (and mathematical physics, theoretical computer science, etc.) would be more clear-cut examples because any original ideas from the model could be objectively verified (without actually running experiments). Not to move the goalposts - novel insights in biology or chemistry would also count, it's just hard for me to check whether they are significant, or whether models propose hundreds of ideas and most of them fail (e.g. the bottleneck is experimental resources).

answer by Archimedes · 2025-02-23T22:14:35.639Z · LW(p) · GW(p)

I was literally just reading this before seeing your post:

https://www.techspot.com/news/106874-ai-accelerates-superbug-solution-completing-two-days-what.html

Arguably even more remarkable is the fact that the AI provided four additional hypotheses. According to Penadés, all of them made sense. The team had not even considered one of the solutions, and is now investigating it further.

comment by Cole Wyeth (Amyr) · 2025-02-23T22:18:04.873Z · LW(p) · GW(p)

So, the LLM generated five hypotheses, one of which the team also agrees with, but has not verified?

The article frames the extra hypotheses as making the results more impressive, but it seems to me that they make the results less impressive - if the LLM generates enough hypotheses, and you already know the answer, one of them is likely to sound like the answer. 

5 comments


comment by Matt Goldenberg (mr-hire) · 2025-02-23T18:42:36.572Z · LW(p) · GW(p)

I think this is one of the most important questions we currently have in relation to time to AGI, and one of the most important "benchmarks" that tell us where we are in terms of timelines.

comment by Cole Wyeth (Amyr) · 2025-02-23T20:07:10.707Z · LW(p) · GW(p)

I agree; I will shift to an end-game strategy as soon as LLMs demonstrate the ability to automate research.

comment by Thane Ruthenis · 2025-02-23T22:21:41.004Z · LW(p) · GW(p)

The question is IMO not "has there been, across the world and throughout the years, a nonzero number of scientific insights generated by LLMs?" (obviously yes), but "is there any way to get an LLM to autonomously generate and recognize genuine scientific insights at least at the same rate as human scientists?". A stopped clock is right twice a day, a random-word generator can eventually produce an insight, and talking to a rubber duck can let you work through a problem. That doesn't mean the clock is useful for telling the time or that the RWG has the property of being insightful.

And my current impression is that no, there's no way to do that. If there were, we would've probably heard about massive shifts in how scientists (and entrepreneurs!) are doing their work.

This aligns with my experience. Yes, LLMs have sometimes directly outputted some insights useful for my research in agent foundations. But it's very rare, and it only happens when I've already done 90% of the work setting up the problem. Mostly they're useful as rubber ducks or primers on existing knowledge, not as idea-generators.

comment by Cole Wyeth (Amyr) · 2025-02-23T22:24:45.899Z · LW(p) · GW(p)

Yeah, I agree with this. If you feed an LLM enough hints about the solution you believe is right, and it generates ten solutions, one of them will sound to you like the right solution.

comment by silentbob · 2025-02-23T21:51:17.090Z · LW(p) · GW(p)

Random thought: maybe (at least pre-reasoning-models) LLMs are RLHF'd to be "competent" in a way that makes them less curious & excitable, which greatly reduces their chance of coming up with (and recognizing) any real breakthroughs. I would expect, though, that for reasoning models such limitations will necessarily disappear and they'll be much more likely to produce novel insights. Still, scaffolding and lack of context and agency can be a serious bottleneck.