The AI agents have arrived

Artificial intelligence can now compute on your behalf, and the web is never going to be the same

Casey Newton

Oct 22, 2024 — 11 min read

Anthropic's new computer use tool, shown in a screencap of a video demonstration. (Anthropic)

Earlier this year, in a much-discussed tagline from its annual developer event, Google promised that its AI-enhanced search engine would soon do the Googling for you.

Five months later, an even more expansive future is coming into view: one where your computer does the computing for you.

That’s the promise contained within Claude 3.5 Sonnet, the latest version of Anthropic’s flagship large language model. Starting today, developers have access to a feature called “computer use.” The company describes it this way:

Available today on the API, developers can direct Claude to use computers the way people do—by looking at a screen, moving a cursor, clicking buttons, and typing text. Claude 3.5 Sonnet is the first frontier AI model to offer computer use in public beta. At this stage, it is still experimental — at times cumbersome and error-prone. We're releasing computer use early for feedback from developers, and expect the capability to improve rapidly over time.

A brief accompanying video shows an Anthropic researcher using the agent to gather information from various places on his computer and then fill out a form with it. It’s a mundane example, but that’s the point: the goal is an AI agent smart enough to automate the drudgery that fills so many workers’ days.

Anthropic is quick to note that this first version of the technology is slow and makes lots of mistakes. But it also heralds the arrival of the next major phase on the AI labs’ road to building superintelligence.

Anthropic is only one of dozens of companies now working to build AI agents. Microsoft today announced 10 new automations for its Dynamics 365 suite of business applications. Asana rolled out a take on agents today as well. Salesforce’s rival Agentforce technology is due to become generally available next week. And a host of startups are racing to build “AI co-workers” of various kinds.

What makes Anthropic’s agent stand out is that it takes the same technology that powers the AI chatbots we have been using for almost two years now and lets it out of the text box. Instead of being limited to offering you text- or voice-based responses, it can now complete small projects on your behalf.

Ethan Mollick, an associate professor at the Wharton School of the University of Pennsylvania, got a chance to try Anthropic’s agent early. He had it whip up a lesson plan for him while he did other things:

As one example, I asked the AI to put together a lesson plan on the Great Gatsby for high school students, breaking it into readable chunks and then creating assignments and connections tied to the Common Core learning standard. I also asked it to put this all into a single spreadsheet for me. With a chatbot, I would have needed to direct the AI through each step, using it as a co-intelligence to develop a plan together. This was different. Once given the instructions, the AI went through the steps itself: it downloaded the book, it looked up lesson plans on the web, it opened a spreadsheet application and filled out an initial lesson plan, then it looked up Common Core standards, added revisions to the spreadsheet, and so on for multiple steps. The results are not bad (I checked and did not see obvious errors, but there may be some — more on reliability later in the post). Most importantly, I was presented finished drafts to comment on, not a process to manage. I simply delegated a complex task and walked away from my computer, checking back later to see what it did (the system is quite slow).

Later, he used it to play the game Paperclip Clicker (“which, ironically, is about an AI that destroys humanity in its single-minded pursuit of making paperclips”). It fared poorly: one mistake led it to make many more, forcing Mollick to intervene. Overall, he writes, the agent could handle a variety of tasks with some success, though not enough that he would feel comfortable routinely delegating work to it.

This will surely lead to many comical TikToks of Claude trying and failing to demonstrate basic computer skills. But I was struck by the company’s blog post on developing the agent, which notes that even at this most experimental stage, Claude scores roughly twice as high at operating a computer as its next-closest competitor, and is maybe not as far from human-level performance as you might guess:

At present, Claude is state-of-the-art for models that use computers in the same way as a person does — that is, from looking at the screen and taking actions in response. On one evaluation created to test developers’ attempts to have models use computers, OSWorld, Claude currently gets 14.9%. That’s nowhere near human-level skill (which is generally 70-75%), but it’s far higher than the 7.7% obtained by the next-best AI model in the same category.

To be clear, a grade of 14.9 percent is an F by most measures. But on this test, most humans only score a C. It’s a welcome reminder of how much trouble most of us have navigating computer-based tasks at least some of the time — and an important milestone on the way to agents that can make those troubles go away.

And what happens then?

It’s easy to imagine using an AI agent to manage your appointments and scheduling, fill out online forms and routine paperwork, draft replies to your emails, and shop for you. Or it could browse the web on your behalf, preparing a personalized digest so that you never have to fight a paywall again.

It’s also easy to imagine an agent with those capabilities setting up spam operations, automating the production of AI slop websites, and overwhelming human-run businesses and institutions with a flood of AI-generated requests.

Either way, people who use AI agents will have to confront some very real privacy concerns. Earlier this year Microsoft had to delay the launch of Recall, a marquee feature in its new AI-centric PCs that silently takes screenshots in the background and lets you search all past activity on the computer with AI. Security researchers pointed out, among other things, that users would be opted in by default, and that their screenshots were not encrypted, creating an appealing target for hackers. (Users now have to opt in, and the screenshots are encrypted.)

Anthropic will need similar access to a user’s computer to operate it on their behalf. And I imagine businesses will have many questions about what the company does with customer data, and with employee data, before letting anyone use it.

There may also still be real limits on how much we can expect from agents in the near term. One startup CEO ridiculed to me the idea, popular in AI circles, that “the next major programming language is English.” (In other words, the idea that you’ll soon be able to get software to do whatever you want it to do simply by saying so.) CEOs “program in English” all the time, he explained, by telling their human engineers what to build. And that process is famously error-prone and rife with inefficiency, too.

But to use another phrase popular among the AI crowd, the agent that Anthropic released today is as bad as this kind of software will ever be. From this moment on, AI will no longer be limited to what can be typed inside a box. Which means it’s time for the rest of us to start thinking outside that box, too.

Elon Musk and the 2024 election

This absolutely could have been the subject of today's column. But I couldn't imagine telling you anything I haven't already said on the subject over and over again. Musk's attempted vote-buying scheme represents an extraordinary departure for big company CEOs, and may well be illegal. Had Mark Zuckerberg tried anything like this in 2020, Rep. Jim Jordan would have ordered airstrikes on Menlo Park.

(If you think I should have written this column instead today, I'd be curious to hear about it. Just reply to this email.)