OpenAI releases deep research agent

post by Seth Herd · 2025-02-03T12:48:44.925Z · LW · GW · 6 comments

This is a link post for https://openai.com/index/introducing-deep-research/

Today we’re launching deep research in ChatGPT, a new agentic capability that conducts multi-step research on the internet for complex tasks. It accomplishes in tens of minutes what would take a human many hours.

Deep research is OpenAI's next agent that can do work for you independently—you give it a prompt, and ChatGPT will find, analyze, and synthesize hundreds of online sources to create a comprehensive report at the level of a research analyst. Powered by a version of the upcoming OpenAI o3 model that’s optimized for web browsing and data analysis, it leverages reasoning to search, interpret, and analyze massive amounts of text, images, and PDFs on the internet, pivoting as needed in reaction to information it encounters.

The ability to synthesize knowledge is a prerequisite for creating new knowledge. For this reason, deep research marks a significant step toward our broader goal of developing AGI, which we have long envisioned as capable of producing novel scientific research.

...

How it works

Deep research was trained using end-to-end reinforcement learning on hard browsing and reasoning tasks across a range of domains. Through that training, it learned to plan and execute a multi-step trajectory to find the data it needs, backtracking and reacting to real-time information where necessary. ... As a result of this training, it reaches new highs on a number of public evaluations focused on real-world problems.
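The loop described here (plan, search, read, react, backtrack, synthesize) can be sketched schematically. This is purely an illustrative sketch of a generic research-agent loop, not OpenAI's actual implementation; the `search`, `fetch`, and `llm` callables are hypothetical stand-ins for the model's tools.

```python
# Hypothetical sketch of a multi-step research-agent loop of the kind the
# passage describes. Names and structure are illustrative assumptions,
# not OpenAI's code: `search` returns URLs for a query, `fetch` returns
# page text, and `llm` maps a prompt string to a completion.
def research_agent(prompt, search, fetch, llm, max_steps=20):
    notes = []            # (url, summary) evidence gathered so far
    frontier = [prompt]   # queries still to try
    for _ in range(max_steps):
        if not frontier:
            break
        query = frontier.pop(0)
        for url in search(query):          # tool call: web search
            page = fetch(url)              # tool call: read a source
            summary = llm(f"Summarize for: {prompt}\n{page}")
            notes.append((url, summary))
        # React to what was found: the model proposes follow-up queries,
        # or declares the investigation done (this is where backtracking
        # and pivoting would happen in a trained agent).
        followups = llm(f"Given notes {notes}, list further queries, or reply 'done'")
        if followups.strip().lower() == "done":
            break
        frontier.extend(followups.splitlines())
    return llm(f"Write a cited report on: {prompt}\nNotes: {notes}")
```

The point of training this end-to-end with RL, rather than hand-coding the loop as above, is that the model learns *when* to search more, backtrack, or stop, rather than following a fixed control flow.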

Note that the pass rates are around 15-25%, so it's not outright replacing any experts, and it remains to be seen whether the results are even useful to experts, given that they usually do not 'pass.' We don't get much information on what types of tasks these are, only that they are judged by "domain experts."

It scores 26% on Humanity's Last Exam, compared to o3-mini-high's 13%. The gain is probably mostly driven by its use of a variant of full o3. It is also much better than DeepSeek's ~10%.

It provides a detailed summary of its chain of thought while researching. Subjectively it looks a lot like how a smart human would summarize their research at different stages.

I've only asked for one research report, on a test topic I'm quite familiar with (faithful chain of thought in relation to AGI x-risk). Subjectively, its use as a research tool is limited: it found only 18 sources in a five-minute search, all of which I believe I'd already seen (it doesn't currently provide full links to sources it doesn't actively cite in the resulting report, so I can't be sure).

Subjectively, the resulting report is pretty impressive. The writing appears to synthesize and "understand" the relations among multiple papers and theoretical framings better than anything I've seen to date. But that's just one test. The report couldn't really be produced by a domain novice reading those papers, however brilliant; they'd have to reread and think for a week. For rapidly researching new topics, this tool will probably be noticeably faster than anything previously available. But that doesn't mean it's all that useful to domain experts.

Time will tell. Beyond any research usefulness, I am more interested in the progress on agentic AI. This seems like a real step toward medium-time-horizon task performance, and from there toward AI that can perform real research. 

Their training method — end-to-end RL on tasks — is exactly what we don't want and have been dreading.

Looking further ahead, we envision agentic experiences coming together in ChatGPT for asynchronous, real-world research and execution. The combination of deep research, which can perform asynchronous online investigation, and Operator, which can take real-world action, will enable ChatGPT to carry out increasingly sophisticated tasks for you.

Thanks, I hate it.

6 comments

Comments sorted by top scores.

comment by Nikola Jurkovic (nikolaisalreadytaken) · 2025-02-03T13:14:09.897Z · LW(p) · GW(p)

Note that for HLE, most of the difference in performance might be explained by Deep Research having access to tools while other models are forced to reply instantly with no tool use.

Replies from: Seth Herd, gwern
comment by Seth Herd · 2025-02-03T13:21:18.693Z · LW(p) · GW(p)

Yes, they do highlight this difference. I wonder how full o3 scores? It would be interesting to know how much of the improvement is based on o3's improved reasoning and how much comes from the research capability.

Replies from: Vladimir_Nesov, sweenesm
comment by Vladimir_Nesov · 2025-02-03T15:57:49.721Z · LW(p) · GW(p)

And how much the improved reasoning is from using a different base model vs. different post-training. It's possible R1-like training didn't work for models below GPT-4 level, and then that same training started working at GPT-4 level (at which point you can iterate from a working prototype or simply distill to get it to work for weaker models). So it might work even better for the next level of scale of base models, without necessarily changing the RL part all that much.

comment by sweenesm · 2025-02-03T13:40:57.443Z · LW(p) · GW(p)

How o3 scores: https://x.com/DanHendrycks/status/1886213523900109011

10.5-13% on the text-only part of HLE (text-only questions are 90% of the exam)

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2025-02-03T15:59:59.958Z · LW(p) · GW(p)

10.5-13% on the text-only part of HLE

This is for o3-mini, while the ~25% figure for o3 in the tweet you linked is simply restating the deep research evals.

comment by gwern · 2025-02-03T16:54:53.780Z · LW(p) · GW(p)

But does that necessarily matter? Many of those models can't use tools; and since much of the point of the end-to-end RL training of Deep Research is to teach tool use, showing DR results without tool use would be either irrelevant or misleading (e.g. it might do worse than the original o3 model it was trained from, when deprived of the tools it is supposed to use).