ChatGPT's deep research might be the first good agent
OpenAI's new research tool still makes mistakes — but in its speed and average quality of analysis, it represents a remarkable step forward
This is a column about AI. My boyfriend works at Anthropic. See my full ethics disclosure here.
It feels odd to be writing a newsletter about new technology on a day like this. Elon Musk and his team of henchmen are working to dismantle the federal government with stunning speed. As many observers have noted, Musk’s initiative closely resembles his ruinous takeover of Twitter. And he’s now having his team remove X posts that name the people in charge of this new effort, a move that in his capacity as a federal government employee may actually violate the First Amendment. Former White House lawyers are calling the whole thing “wrong and illegal.” Just describing what is happening in plain language can make you wonder if you sound like a crank. But you are not a crank. The United States’ system of checks and balances — and the rule of law itself — are under direct assault; the Republicans who control Congress appear content to let it happen; and the risks that we fall into authoritarian collapse look as grave as they ever have.
But I’m going to assume that you knew all that already, and are perhaps looking today for a little distraction. In which case: let’s talk about OpenAI’s deep research.
I.
Last month, Benjamin Breen, a history professor at the University of California at Santa Cruz, wrote a blog post about the use of artificial intelligence in his chosen field. Breen describes himself as “very interested in the use of AI for experiential learning,” but has been dismayed by the rise of AI inside colleges so far.
“Ask anyone you know in education: ChatGPT has been a disaster when it comes to facilitating student cheating and — perhaps even more troubling — contributing to a general malaise among undergraduates,” he writes. “It’s not just that students are submitting entirely AI-written assignments. They are also (I suspect) relying on AI-generated answers far more comprehensively, not just in their homework but in their daily lives. This has a kind of flattening effect.”
At the same time, Breen has been exploring the use of AI in historical research: transcribing a block of text written in 16th-century Italian cursive handwriting, for example, and asking the model to provide the surrounding historical context. Or analyzing a page from a manuscript of medical recipes written in Mexico in the 1770s and offering a historical analysis.
In perhaps his most impressive example of the state of the art, Breen supplied two large language models with quotes from two historical figures who are main characters in a book he is writing. He tested both OpenAI’s o1 model, which is meant to excel at reasoning, and Anthropic’s more general-purpose Claude 3.5 Sonnet. Breen was surprised to find that o1 came up with, in effect, eight different ideas for historical arguments he could pursue at book length.
Speaking of one such idea, Breen writes:
The above excerpt is… good. I almost want to say depressingly good, because at first glance, it’s fairly close to the level of analysis that I’m currently at with my book project, and it was generated in exactly 5 seconds.
II.
I thought of Breen today while using deep research, the latest offering from OpenAI. Announced on Sunday, deep research is effectively the second AI agent that the company has introduced in the past two weeks, after its more general-purpose Operator. But while Operator struggled to complete even basic tasks, in my early tests deep research appears to be impressively competent — and underscores how even the tools we have today will accelerate research, analysis, argument, and other forms of knowledge work.
To start, deep research is available only to subscribers of ChatGPT’s $200-a-month Pro tier. (Users are limited to 100 deep research queries a month, reflecting the high cost of the computation involved; for now, it’s accessible only on the web.) To use it, you type out your query as usual in the ChatGPT chat box and then click the “deep research” button.
ChatGPT then analyzes your query and asks you follow-up questions. When I asked for a report on a current subject of interest — how publishers can benefit from the Fediverse — the bot asked me four clarifying questions, such as whether I was looking from the perspective of a legacy publisher or a digital-only outlet, and how technical it should get in its analysis of the tradeoffs between using two different federated protocols. I answered those questions, and deep research got to work.
Like DeepSeek, OpenAI’s deep research exposes some of its chain of thought as it answers your query. This let me see some of the websites that the agent was visiting, what conclusions it was drawing from them, and how it was beginning to organize its reasoning. Five minutes later, my 4,955-word report was available. (Read the whole thing here.) It outlined how the Fediverse can help people find new news sources; offered real-world examples of how sites like The Verge and 404 Media are leaning in to federation; explored different monetization strategies and described the trade-offs involved with each; and analyzed the pros and cons of building on the two main federated protocols.
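If you want a rough mental model of what an agent like this is doing, the loop is something like: refine the query, search, read, take notes, repeat, then write up the findings. Here is a minimal Python sketch of that pattern. To be clear, this is my own illustration, not OpenAI’s actual implementation; every function in it (clarify, search_web, read_page, synthesize) is a hypothetical stand-in for calls to a reasoning model and a browsing tool.

```python
# A hypothetical sketch of a deep-research-style agent loop.
# Not OpenAI's implementation: clarify, search_web, read_page, and
# synthesize are stand-ins for a reasoning model plus a browsing tool.

from dataclasses import dataclass, field


@dataclass
class ResearchState:
    query: str
    notes: list[str] = field(default_factory=list)


def clarify(query: str) -> str:
    """Stands in for the follow-up questions ChatGPT asks before it starts."""
    return f"{query} (scope: digital-only publishers; moderate technical depth)"


def search_web(query: str) -> list[str]:
    """Stands in for the browsing tool returning candidate sources."""
    return [f"https://example.com/source-{i}" for i in range(3)]


def read_page(url: str) -> str:
    """Stands in for fetching a page and extracting the relevant passages."""
    return f"Key finding extracted from {url}"


def synthesize(state: ResearchState) -> str:
    """Stands in for the model composing the final report from its notes."""
    return "\n".join([f"# Report on: {state.query}", *state.notes])


def deep_research(query: str, max_rounds: int = 2) -> str:
    state = ResearchState(query=clarify(query))
    for _ in range(max_rounds):  # the iterative browse-and-take-notes loop
        for url in search_web(state.query):
            state.notes.append(read_page(url))
    return synthesize(state)


if __name__ == "__main__":
    print(deep_research("How can publishers benefit from the Fediverse?"))
```

The real product presumably decides dynamically when to search again and what to read next; the fixed loop above is just the skeleton.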
As it so happened, I had run the same search a few days back using Google’s identically named deep research product, which it released in December. (This may be the first time Google has successfully named something since it named itself Google in 1998.) Google’s deep research uses a standard LLM, Gemini 1.5 Pro, rather than a reasoning model like OpenAI’s version.
You can see it in the results. (Read Google Gemini’s take on the identical query here.) Google’s report comes in at only around 2,000 words, and is shallower in every way. It hits many of the same basic ideas, but at a higher level of abstraction, and doesn’t explore any of the nitty-gritty tradeoffs that ChatGPT discussed at great length. It isn’t bad — my first time reading it, before ChatGPT’s deep research agent had been released, I found it moderately useful — but ChatGPT blows it out of the water.
As you would hope, for a product that is 10 times as expensive.
That said: consultants charge their clients vastly more to create reports like this, and they take significantly longer to deliver them. And while great consultants can undoubtedly still produce much better reports than this agent, I suspect some of them might be surprised at how good deep research is even at this early stage.
Meanwhile, their clients — like Professor Breen — may find themselves impressed by the quality of the research assistant now available to them.
III.
Like every AI product, OpenAI’s deep research makes mistakes. Some of them can be quite embarrassing.
For my second deep research project, I asked the agent the following:
Put together an explanation of how deep research works that is sophisticated but still comprehensible to people who do not work in AI. Compare and contrast how it works with what we know about Google's own 'deep research' product that is part of Gemini. Suggest how OpenAI's deep research could fit into workflows for academics, journalists, product managers, and people who work in tech policy. Identify the likely next leaps forward in the technology that may make it even more useful in this regard.
This was a test of how useful something like deep research would be in helping me prepare to write a column. In effect, it’s the sort of thing I tell myself before I go research whatever subject I’m writing about. I wanted to see how OpenAI’s agent would perform given that it was researching a story that was less than a day old, and for which much of the coverage was behind paywalls that the agent would not be able to access.
And indeed, the bot struggled more than I expected. See if you can spot the mistake here (emphasis ChatGPT’s):
It’s no coincidence that soon after OpenAI announced deep research, Google rolled out its own Gemini “deep research” feature. Despite sharing a name and a general goal, the two have some key design differences.
Gemini’s deep research feature, of course, predated OpenAI’s. And the fact that the agent got something so basic backwards should be enough to make you wonder what else it might have gotten wrong. (Nothing in this column, for what it’s worth, comes from the agent’s analysis.)
IV.
So what do we do with this?
OpenAI’s deep research suffers from the same design problem that almost all AI products have: its superpowers are completely invisible, and must be harnessed through a frustrating process of trial and error.
Generally speaking, the more you already know about something, the more useful I think deep research is. This may be somewhat counterintuitive; perhaps you expected that an AI agent would be well suited to getting you up to speed on an important topic that just landed in your lap at work, for example.
In my early tests, the reverse felt true. Deep research excels at drilling deep into subjects you already have some expertise in, letting you probe for specific pieces of information, types of analysis, or ideas that are new to you.
It’s possible that you can make this work better than I did. (I think all of us will get better at prompting these models, and presumably the product itself will improve over time as well.)
On the whole, though, I came away from my first experiments with OpenAI’s deep research agent impressed. Already, I think any number of consultants, academics, journalists, marketers, product managers, or tech policy professionals could find valuable uses for the agent. Where the Operator agent felt like a tech demo — barely more than a proof of concept — deep research is something that you can just start using.
And in case you want to read even more deep research, I also asked the agent to research some practical advice for people who may be living through authoritarian consolidation and wondering what they should do. Hopefully you’ll never need any of these tips. But just in case!
Elsewhere in OpenAI:
- SoftBank created a joint venture with OpenAI in Japan and will commit to spending at least $3 billion a year on its technology. (Hayden Field / CNBC)
- OpenAI released o3-mini, the first version of its next-generation reasoning model, and Simon Willison put it through its paces.
- Sam Altman said that OpenAI's decision to stop open-sourcing its technology might have put the company on "the wrong side of history." (Kyle Wiggers / TechCrunch)
- Paid subscribers to ChatGPT nearly tripled last year to 15.5 million. (Stephanie Palazzolo and Amir Efrati / The Information)
- Ethan Mollick tries deep research and finds that, among other things, it's better at citations than its predecessors. (One Useful Thing)
- Handsome podcaster Kevin Roose tries Operator. (New York Times)
Governing
- President Trump reportedly views Musk as a person doing “the dirty work” for him by doing his assigned tasks and taking heat for controversial measures. (Isaac Arnsdorf and Jacqueline Alemany / Washington Post)
- Elon Musk aides have reportedly locked senior employees at the Office of Personnel Management out of systems that contain the personal data of millions of federal employees. (Tim Reid / Reuters)
- A look at the Musk-linked young engineers with little to no government experience taking over critical government infrastructure. (Vittoria Elliott / Wired)
- Musk spent at least $288 million to help elect Donald Trump and other Republican candidates, which makes him the biggest political donor of the 2024 elections. (Trisha Thadani, Clara Ence Morse and Maeve Reston / Washington Post)
- A look at how many of the US’s largest companies have announced deals with Musk’s businesses in recent days as they seek to curry favor, including Visa and United Airlines. (Joe Miller, Claire Bushey, Rafe Uddin and Hannah Murphy / Financial Times)
- The National Transportation Safety Board said reporters will have to follow the agency’s X account for updates on accidents and disasters, including the two recent plane crashes, and will no longer issue updates through email. (Matthew Keys / The Desk)
- Musk is effectively carrying out government censorship on social media by removing accounts that name DOGE staffers, this author argues. (Mike Masnick / Techdirt)
- Trump signed an executive action that would direct officials to create a sovereign wealth fund for the US, he said, in a potential move to facilitate a TikTok sale. (Jenny Leonard and Katherine Burton / Bloomberg)
- Trump’s new executive orders on tariffs target a loophole often exploited by companies like Temu and Shein, by removing the “de minimis” exemption for small packages. (Jennifer A Dlouhy, Josh Wingrove and Spencer Soper / Bloomberg)
- US officials are reportedly looking into whether DeepSeek avoided US restrictions on AI chip sales to China by buying Nvidia chips through companies in Singapore. (Jordan Robertson, Mackenzie Hawkins and Jenny Leonard / Bloomberg)
- The Trump administration has fueled a surge in crypto prices that has “no substance” and could cause “havoc” when prices collapse, hedge fund Elliott Management reportedly warned investors. (Costas Mourselas / Financial Times)
- Trump’s memecoin has racked up nearly $100 million in trading fees and has lost tens of thousands for small traders, analysis firms say. (Tom Wilson and Michelle Conlin / Reuters)
- A look at Trump’s new science adviser nomination, Michael Kratsios, who has no degrees in science or engineering but is a policy specialist on AI. (William J. Broad / New York Times)
- Adam Candeub, a prominent Big Tech critic, will be general counsel of the FCC. (Ben Smith and Shelby Talcott / Semafor)
- Apple asked a district court to pause the remedies phase of Google’s search monopoly trial while it appeals a denial of its request to have a more direct role in the phase. (Lauren Feiner / The Verge)
- Its emergency pause request was denied. (Umar Shakir / The Verge)
- Apple agreed to a $20 million settlement in a class action suit alleging that Apple Watch batteries in the first generation, Series 1, Series 2 and Series 3 models swelled over time. (Samantha Kelly / CNET)
- A San Francisco appeals court seemed skeptical of Google’s argument that competition with Apple in smartphones was enough to overturn its illegal Android app monopoly verdict. (Josh Sisco and Leah Nylen / Bloomberg)
- Amazon is suing Washington state, seeking to prevent records related to investigations into the company’s Project Kuiper satellite facility in Seattle from being released to the Bezos-owned Washington Post. (Todd Bishop / GeekWire)
- Texas governor Greg Abbott banned the use of Chinese-backed AI and social media apps – DeepSeek, Lemon8, Moomoo, RedNote, Tiger Brokers and Webull – on government-issued devices. (Karoline Leonard / Austin American-Statesman)
- A Massachusetts man pled guilty to using AI chatbots to impersonate a university professor and invite strangers to her home after stalking her for years. (Katie McQue / The Guardian)
- NetChoice filed a lawsuit against a Maryland law intended to protect kids from inappropriate material online and argued that it violates the First Amendment. (Lauren Feiner / The Verge)
- Hackers are experimenting with Gemini to improve productivity for cyber attacks, Google’s Threat Intelligence Group said. (Bill Toulas / BleepingComputer)
- The UK is introducing four new laws that would make it illegal to possess, create or distribute AI tools designed to create child sexual abuse material. (Sima Kotecha / BBC)
- Temu, Shein and Amazon Marketplace could soon be liable for dangerous or illegal products sold on their platforms in the EU, according to a draft proposal. (Andy Bounds / Financial Times)
- After Meta blocked links to news outlets in Canada, scammers began posting fake ads posing as publishers. (Thomas Seal / Bloomberg)
- Almost 100 journalists and nonprofit members were targeted on WhatsApp by spyware owned by Israeli hacking software company Paragon Solutions, WhatsApp said. (Stephanie Kirchgaessner / The Guardian)
- Many Russians are finding ways to circumvent the government’s throttling of YouTube in the country. (Paul Sonne / New York Times)
- Australia’s decision to exempt YouTube from its laws banning social media for minors under 16, citing educational purposes, could expose children to addictive and harmful content, experts say. (Byron Kaye / Reuters)
Industry
- DeepSeek’s hyped R1 reasoning model failed to detect or block 50 malicious prompts designed to draw out toxic content, researchers said. (Matt Burgess and Lily Hay Newman / Wired)
- DeepSeek had the most downloaded mobile app in 140 markets, with its largest group of new users in India. (Vlad Savov / Bloomberg)
- The introduction of the R1 model could lead to major LLMs being commoditized this year, executives at AI labs say. (Ryan Browne / CNBC)
- A look at how DeepSeek has invigorated the debate over how much AI information companies should share. (Christopher Mims / Wall Street Journal)
- DeepSeek is helping to reveal a complicated landscape of performance and pricing in AI when compared to other major models, these researchers say. (Dylan Patel, AJ Kourabi, Doug O’Laughlin and Reyk Knuhtsen / SemiAnalysis)
- Anthropic said it has developed a new system, “constitutional classifiers,” that works as a protective layer on LLMs to prevent users from eliciting harmful content from its models. (Cristina Criddle / Financial Times)
- While TikTok has maintained almost all of its user traffic in the US despite its brief shutdown, creators are still looking to diversify their online presences. (Zach Vallese / CNBC)
- Meta reportedly warned employees in an internal memo that it will fire employees who leak information. (Alex Heath / The Verge)
- Meta is reportedly considering moving its incorporation to Texas from Delaware. A spokesperson said there aren’t plans to move corporate headquarters from California. (Emily Glazer, Berber Jin and Meghan Bobrowsky / Wall Street Journal)
- Meta is on track to invest more than $100 billion in virtual and augmented reality this year. (Tim Bradshaw and Hannah Murphy / Financial Times)
- Meta will not release AI systems if it considers them too risky, the company said in its Frontier AI Framework policy document. (Kyle Wiggers / TechCrunch)
- Google’s “moonshot” X division announced a new spinout, Heritable Agriculture, a startup using data and machine-learning to improve the way crops are grown. (Brian Heater / TechCrunch)
- Apple has reportedly canceled a Mac-connected AR glasses project. (Mark Gurman / Bloomberg)
- Microsoft is creating the Advanced Planning Unit, a new unit within its AI business division focused on understanding the implications of AI on society. (Kyle Wiggers / TechCrunch)
- LinkedIn’s push into video seems to be working, as short-form video is now the fastest-growing content category on the platform. (Alexander Lee / Digiday)
- Amazon is expected to spend 60% more than previously announced on a data center project in Mississippi, now spending $16 billion, according to state documents. (Brody Ford and Matt Day / Bloomberg)
- Nonprofit AI safety group MLCommons partnered with Hugging Face to release one of the largest collections of public domain voice recordings for AI research. (Kyle Wiggers / TechCrunch)
- A look at Swedish buy-now-pay-later company Klarna and why its CEO Sebastian Siemiatkowski has been hyping up AI in the workplace. (Noam Scheiber / New York Times)
- Cloudflare launched a new feature using the Content Credentials system to help users verify whether an image or video has been created or manipulated by AI. (Jess Weatherbed / The Verge)
- The Beatles’ track “Now and Then,” created with AI assistance, won Best Rock Performance at the Grammys. (Jess Weatherbed / The Verge)
- A native porn app for iOS will soon be available for EU users through the alternative app store AltStore PAL. (Sarah Perez / TechCrunch)
- The app, Hot Tub, is proof of how EU laws could harm Apple customers, Apple said. (Mark Gurman / Bloomberg)
- This author experimented with talking to a chatbot meant to emulate an 80-year-old version of her. (Heidi Mitchell / Wall Street Journal)
- A look at the three possible scenarios for how AI could impact the economy. (David Wilcox and Tom Orlik / Bloomberg)
Those good posts
For more good posts every day, follow Casey’s Instagram stories.
Talk to us
Send us tips, comments, questions, and deep research assignments: casey@platformer.news. Read our ethics policy here.