OpenAI launches its agent

Hands on with Operator — a promising but frustrating new frontier for artificial intelligence

The Operator home screen (OpenAI)

This is a column about AI. My boyfriend works at Anthropic. See my full ethics disclosure here.


For the past few months, the big artificial intelligence labs have signaled that 2025 will be a breakout year for agents: tools that use computers on your behalf, working across different apps to perform multi-step operations just like humans do. 

Agents have rapidly become a top priority at AI labs, as they represent a key milestone on the way toward building full-featured virtual coworkers. Starting last year, a steady drip of announcements suggested that researchers are making quick progress toward that goal.

Anthropic introduced its first take on agents, which it calls “computer use,” into its API in October. Two months later, Google said that its Gemini 2.0 models were designed for “the agentic era.” 

Until now, though, most agents have been available only to developers. As of this week, that’s starting to change.

On Wednesday, Google announced that the mobile version of Gemini can now carry out tasks across different apps, such as searching for upcoming sports games and creating calendar entries for them, with a single prompt. A day later, the upstart AI company Perplexity announced a similar agent for its Android app.

And in the most ambitious AI agent launch to date, on Thursday OpenAI announced Operator: an AI agent that can browse the web and take actions on your behalf. In a live demo on Thursday morning, CEO Sam Altman and three of his coworkers showed some of what Operator can do: shop for groceries on Instacart, hunt for concert tickets on StubHub, and book a reservation at a local restaurant through OpenTable.

Operator is launching as what OpenAI calls a “research preview,” and is currently available only to users of the $200-a-month ChatGPT Pro tier. The company said it plans to release Operator to users of its $20-a-month Plus plan and other users in the months ahead, and to make it available to developers through its API within a few weeks.

OpenAI explained how Operator works in a blog post:

Operator is powered by a new model called Computer-Using Agent (CUA). Combining GPT-4o's vision capabilities with advanced reasoning through reinforcement learning, CUA is trained to interact with graphical user interfaces (GUIs)—the buttons, menus, and text fields people see on a screen.

Operator can “see” (through screenshots) and “interact” (using all the actions a mouse and keyboard allow) with a browser, enabling it to take action on the web without requiring custom API integrations.

If it encounters challenges or makes mistakes, Operator can leverage its reasoning capabilities to self-correct. When it gets stuck and needs assistance, it simply hands control back to the user, ensuring a smooth and collaborative experience.
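For readers who want a concrete picture of the loop OpenAI describes, here is a minimal Python sketch. It is not OpenAI's code and says nothing about how CUA is actually built: the `FakeBrowser` class, the `choose_action` policy, the page names, and the action strings are all hypothetical stand-ins I made up to illustrate the "see, act, hand back control when stuck" cycle.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FakeBrowser:
    """Hypothetical stand-in for a browser: a page name plus a log of actions."""
    page: str = "home"
    log: List[str] = field(default_factory=list)

    def screenshot(self) -> str:
        # A real agent would receive pixels; here the "screenshot" is the page name.
        return self.page

    def act(self, action: str) -> None:
        # Record the action and apply a made-up page transition, if one exists.
        self.log.append(action)
        transitions = {
            ("home", "click:search"): "results",
            ("results", "click:first_result"): "detail",
        }
        self.page = transitions.get((self.page, action), self.page)


def choose_action(observation: str) -> Optional[str]:
    """Toy policy: map what the agent 'sees' to an action, or None if it is stuck."""
    policy = {"home": "click:search", "results": "click:first_result"}
    return policy.get(observation)


def run_agent(browser: FakeBrowser, goal_page: str, max_steps: int = 10) -> str:
    """Look at the screen, act, repeat; hand control to the user when stuck."""
    for _ in range(max_steps):
        seen = browser.screenshot()
        if seen == goal_page:
            return "done"
        action = choose_action(seen)
        if action is None:
            return "handed_to_user"
        browser.act(action)
    return "handed_to_user"


if __name__ == "__main__":
    print(run_agent(FakeBrowser(), goal_page="detail"))
    print(run_agent(FakeBrowser(page="login"), goal_page="detail"))
```

In this toy version, an agent that starts on a page its policy recognizes clicks its way to the goal, while one that lands on an unfamiliar page (here, a login screen) gives up immediately and returns control to the user.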

After OpenAI’s demo, I paid to upgrade my ChatGPT subscription so I could use Operator. A few hours later, it was enabled on my account, and I spent the afternoon trying various prompts to get a sense of what it can do. 

How you feel about Operator will depend heavily on what you hope to get out of it. If you’re looking for a polished virtual assistant to which you can confidently hand off shopping and research tasks, I suspect you’ll be disappointed: at the moment, using Operator is significantly slower, more frustrating, and more expensive than simply doing any of these tasks yourself. 

If you’re interested in the bleeding edge of AI progress, though, and want to get a sense of what the future could look like, Operator offers a compelling demonstration. Using it on Thursday reminded me a bit of the first time I took a ride in one of Google’s self-driving cars: half-baked though the product may have been at the time, it also represented an extraordinary technological achievement and seemed likely to shape our world for years to come. 

Of all the tasks I gave to Operator, the one it performed best was probably the simplest. (It was also a task that Operator itself offered in a grid of suggested prompts under the chatbot interface.) The task was to “suggest some top-rated walking tours I can sign up for in London.” Once I hit “go,” Operator opened TripAdvisor in a browser, did a search, and presented me with a series of walking tours to consider with the relevant links. The whole thing took only about a minute, and I could have easily left the browser tab to do something else while it worked.

Of course, there’s nothing that impressive about generating a list of London walking tours: there are many such lists all over the web, and the free version of ChatGPT will also give you a good list even faster than Operator. But there’s still something striking about telling a computer to do something and then watching it open a browser tab, type queries, click on buttons and present you with a report.

Operator also performed decently well on a task that I borrowed from Ethan Mollick, an associate professor at the Wharton School of the University of Pennsylvania, who had used it to test Anthropic’s agent: “put together a lesson plan on the Great Gatsby for high school students, breaking it into readable chunks and then creating assignments and connections tied to the Common Core learning standard.”

Operator took about eight minutes to complete the task, but when it finished it presented me with a curriculum that seemed to satisfy all the requirements. As with the TripAdvisor example, though, I found that the free version of ChatGPT answered it just as well, and did so more quickly.

For the moment, then, if your question is “what can Operator do better than existing tools?” the answer is not clear. It can take action on your behalf in ways that are new to AI systems — but at the moment it requires a lot of hand-holding, and may cause you to throw up your hands in frustration. 

My most frustrating experience with Operator was my first one: trying to order groceries. “Help me buy groceries on Instacart,” I said, expecting it to ask me some basic questions. Where do I live? What store do I usually buy groceries from? What kinds of groceries do I want? 

It didn’t ask me any of that. Instead, Operator opened Instacart in the browser tab and began searching for milk in grocery stores located in Des Moines, Iowa.

At that point, I told Operator to buy groceries from my local grocery store in San Francisco. Operator then tried to enter my local grocery store’s address as my delivery address. 

After a surreal exchange in which I tried to explain how to use a computer to a computer, Operator asked for help. “It seems the location is still set to Des Moines, and I wasn't able to access the store,” it told me. “Do you have any specific suggestions or preferences for setting the location to San Francisco to find the store?” 

At that point, I asked to take over. Operator handed me the reins, and I logged into my account and picked my usual grocery store. From there, I was able to add a few items into my cart by asking for them specifically. The process was painstaking and inefficient in a way that personally made me laugh but I imagine might drive others insane. In the end, adding six bananas, a 12-pack of seltzer, and a package of raspberries to a cart had taken me 15 minutes.

The experience revealed to me one of Operator’s key deficiencies: it can use a web browser, but it cannot use your web browser. This matters a lot, because your browser is already set up for you to use the web efficiently. You’re already logged in to the services you use most, and many of those services are further modified to reflect your personal preferences and make using them more efficient. Open a browser on a different computer, and every single time you’re starting from scratch. 

I felt this pain again when I tried another of Operator’s suggestions — checking the current price of an UberX to the airport. Operator successfully pulled up Uber’s website, but found that to check the price I would need to log in. This led to a tedious back-and-forth where I gave Operator first my email address, then my phone number, and then a code that Uber sent me over SMS — and watched Operator go dutifully re-type that information into its browser. Getting the price of an Uber ultimately took five minutes — which is four minutes and 50 seconds longer than it takes me to check the price of an Uber on my phone.  

Using its own browser has privacy and security benefits for Operator users: the app doesn’t store your passwords or credit card information or other data that could easily be misused. But this also has a lot of drawbacks. Dan Shipper, who has been testing Operator for a few days, found that it often cannot access websites because they have blocked OpenAI from crawling them. So if your dreams of using Operator involve doing things with YouTube, or Reddit, or Figma, you’ll find yourself out of luck.

Ultimately, I think of this first version of Operator as a set of capabilities that are necessary but not sufficient to create a true AI agent. If any of this is to work, it first has to be turned into an actual product. No designer would make an online grocery service that didn’t start by asking the customer where he lives, or that made you log in to your account every single time you visited. But that’s how Operator works today.

It’s important to be honest about the limitations of these systems today. But it’s often more illuminating to examine the rate of change. In October, Anthropic’s agent scored 14.9 percent on OSWorld, a benchmark that tests how well agents complete tasks on a real computer. Today, just three months later, OpenAI said that its CUA model scored 38.1 percent on the same evaluation. And humans, for what it’s worth, only score 72.4 percent.

Agents have a long way to go in product quality before they’re truly useful to the average person. But if the rate of change holds steady, that long way to go might not take as long as you think.
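To put a rough number on "if the rate of change holds steady," here is a back-of-the-envelope Python sketch using only the three OSWorld scores cited above. The straight-line extrapolation is my own simplifying assumption, not a forecast from either lab: it compares two different companies' systems, and benchmark progress rarely stays linear. The function name `months_to_reach` is made up for this illustration.

```python
def months_to_reach(start: float, current: float, target: float,
                    months_elapsed: float) -> float:
    """Naive linear extrapolation: months until `target` at the observed pace."""
    points_per_month = (current - start) / months_elapsed
    return (target - current) / points_per_month


# OSWorld scores quoted in this column, in percent.
ANTHROPIC_OCT = 14.9   # Anthropic's agent, October
OPENAI_JAN = 38.1      # OpenAI's CUA, three months later
HUMAN = 72.4           # human score on the same evaluation

gain = OPENAI_JAN - ANTHROPIC_OCT
estimate = months_to_reach(ANTHROPIC_OCT, OPENAI_JAN, HUMAN, months_elapsed=3)

print(f"Gain over three months: {gain:.1f} points")             # 23.2 points
print(f"Months to human score at the same pace: {estimate:.1f}")  # 4.4
```

Taken at face value, a gain of 23.2 points in three months would close the remaining 34.3-point gap to the human score in a little over four months. Treat that as an illustration of the pace, not a prediction.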



On the podcast this week: Kevin and I recount TikTok's return from the dead. Then, we discuss the rise of Trump family memecoins. And finally, MSNBC's Chris Hayes joins us to discuss his new book on attention.


Those good posts

For more good posts every day, follow Casey’s Instagram stories.

Talk to us

Send us tips, comments, questions, and Operator homework: casey@platformer.news. Read our ethics policy here.