How Claude uses AI to identify new threats

PLUS: Exclusive data on how people are using Anthropic’s chatbot

Anthropic’s Clio, shown here analyzing the public WildChat dataset, not actual Claude data. (Anthropic)

The company didn’t know it yet, but Anthropic had a spam problem.

Earlier this year, a group of accounts had begun asking the company’s chatbot, Claude, to generate text for search engine optimization — the art of getting a website to rank more highly in Google. There’s nothing necessarily wrong with a publisher trying to generate keywords to describe their site to Google. But these accounts, which worded their prompts carefully in an apparent effort to escape detection by Anthropic’s normal filters, appeared to be part of a coordinated effort.

The spammers might have gotten away with it — but then the network was spotted by Clio. An acronym for “Claude insights and observations,” Clio is an internal tool at Anthropic that uses machine learning to identify previously unknown threats, disrupt coordinated attempts to abuse the company’s systems, and generate insights about how Claude and the company’s artificial intelligence models are being used.

In the case of the spam network, Clio identified a cluster of accounts making queries that, by themselves, did not necessarily violate Anthropic’s guidelines. But when the company’s trust and safety team investigated, it determined that the queries were coming from a network of spammers. It terminated their access to Claude.

“Sometimes it’s not clear from looking at an individual conversation whether something is harmful,” Miles McCain, a co-author of the paper and a member of Anthropic’s technical staff, said in an interview with Platformer. “It’s only once you piece it together in context that you realize, this is coordinated abuse generating SEO spam. Generating keywords for your blog? That’s fine. It’s only when you’re doing it across tons of accounts, abusively, for free that it becomes problematic.”

Discovering these previously hidden harms — what the company calls “unknown unknowns” — is a core part of Clio’s mission. Anthropic announced Clio in a paper published today, alongside a blog post about how Claude was used in the 2024 elections.

The company hopes that other trust and safety teams will consider using a Clio-like approach as an early warning system for new harms that emerge as the use of AI chatbots becomes more pervasive.

“It really shows that you can monitor and understand, in a bottom-up way, what’s happening — while still preserving user privacy,” Alex Tamkin, the paper’s lead author and a research scientist, said in an interview. “It lets you see things before they might become a public-facing problem. … By using language models, you can catch all sorts of novel and weird-shaped use cases.”

How Clio works

For the Clio paper, Anthropic drew on 1 million conversations with Claude from both free and paid users. The system is instructed to omit private details and personal information.

Under development for about six months, Clio works by analyzing what a conversation is about, and then clustering similar conversations around similar themes and topics. (Topics that are rarely discussed with Claude are omitted from the analysis, which offers an additional safeguard against accidentally making an individual user identifiable.)
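
The paper doesn’t spell out the implementation, but the core recipe — summarize each conversation, embed and cluster the summaries, and drop anything too rare to stay anonymous — is familiar machine-learning territory. Here’s a rough sketch of what such a pipeline could look like, using off-the-shelf embeddings and k-means rather than Claude itself; the function name, model choice, and minimum cluster size are illustrative assumptions, not details from the paper.

```python
# Illustrative sketch of a Clio-like pipeline: embed conversation summaries,
# cluster them by topic, and drop clusters too small to keep users anonymous.
# Anthropic uses Claude for summarization and labeling; this sketch swaps in
# open-source tools (sentence-transformers, scikit-learn) for readability.
from collections import defaultdict

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

MIN_CLUSTER_SIZE = 5  # hypothetical privacy threshold, not Anthropic's actual value

def cluster_summaries(summaries: list[str], n_clusters: int = 20) -> dict[int, list[str]]:
    """Group conversation summaries (already scrubbed of personal details)
    into topical clusters, discarding any cluster that is too small."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(summaries)
    labels = KMeans(n_clusters=n_clusters, n_init="auto").fit_predict(embeddings)

    clusters: dict[int, list[str]] = defaultdict(list)
    for label, summary in zip(labels, summaries):
        clusters[int(label)].append(summary)

    # Omit rarely discussed topics so no individual user becomes identifiable.
    return {label: texts for label, texts in clusters.items()
            if len(texts) >= MIN_CLUSTER_SIZE}
```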

Clio creates a title and summary for each cluster, and reviews them again to make sure personal information is not included. It then creates multi-level hierarchies for related topics — an education cluster might contain sub-clusters for the way teachers use Claude and the way students do, for example.

Analysts can then search the clusters, or explore Claude usage visually. Clio offers a visual interface similar to Obsidian’s graph view, linking clusters based on the frequency of discussion and how they may be related to each other.

Popular queries often link to other popular queries — a sign that many people use Claude for the same things. Less popular queries often appear in the visualization as islands — and it’s these islands that can highlight unknown unknowns.
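
Spotting those islands is essentially a graph problem: connect clusters whose themes look similar, then flag the ones with few or no neighbors. The sketch below illustrates the idea; the networkx-based approach, cosine-similarity threshold, and function name are assumptions made for the sake of the example, not details from the paper.

```python
# Illustrative only: flag "island" clusters that barely connect to anything else.
import networkx as nx
import numpy as np

def find_islands(centroids: dict[int, np.ndarray], threshold: float = 0.5) -> list[int]:
    """Link clusters whose centroid embeddings are similar, then return the
    clusters left with no neighbors -- candidates for unusual or novel usage."""
    graph = nx.Graph()
    graph.add_nodes_from(centroids)  # dict keys become the nodes
    ids = list(centroids)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            va, vb = centroids[a], centroids[b]
            cosine = float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))
            if cosine >= threshold:
                graph.add_edge(a, b)
    return [node for node in graph.nodes if graph.degree(node) == 0]
```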

The spam network was one of these islands. It was a group of accounts devoted almost entirely to making SEO queries, in the same technically-allowed-but-definitely-suspicious way. Upon discovering the island, Anthropic referred it to its trust and safety team, which ultimately removed the network.

Like most trust and safety teams, Anthropic’s already had tools in place to identify spammers. It identified keywords often used by spammers, and created machine-learning classifiers to understand when its systems were likely being used to generate spam. This is what the company calls a “top-down” approach to safety. 

Clio, on the other hand, is bottom-up. It isn’t looking for anything specific at all. Rather, it’s a way to help Anthropic understand all the ways that Claude is being used, and consider whether some of those uses could be harmful.

It works the other way, too: identifying clusters of conversations that Claude marked as harmful even though they were totally innocuous.

Clio revealed, for example, that Claude was repeatedly refusing to answer questions about the role-playing game Dungeons & Dragons. As people asked the chatbot for help planning their in-game attacks, Claude often assumed that users were planning actual violence.

The Clio team referred the issue back to Anthropic’s trust and safety team, which refined its classifiers.

It’s not just D&D. Clio found that Claude sometimes rejected questions from job seekers who had uploaded their resumes, since the resumes included the sort of personal information that can violate Claude’s rules in other contexts. And some benign questions about programming were refused because Claude mistakenly associated them with hacking attempts.

“You can use Clio to constantly monitor at a high level what types of things people are using this fundamentally new technology for,” Tamkin said. “You can refer anything that looks suspicious or worrisome to the trust and safety team and update those safeguards as the technology rolls out.” 

More recently, Anthropic has used Clio to understand the potential harms from “computer use,” its first foray into an AI system that can execute actions on a computer. Clio is identifying capabilities that might not have been apparent in pre-launch testing, the company said, and is monitoring how testers are using them in practice.

When monitoring activity around the 2024 US election, Clio helped identify both benign uses (like explaining political processes) and policy violations (like trying to get Claude to generate fundraising materials for campaigns).

How people use Claude

My conversation with Anthropic’s societal impacts team focused on how Clio can be used to identify harms. But it can also be used to identify opportunities for Anthropic, by highlighting how people use its chatbot.

In what the company is calling a first for a major AI lab, the Clio paper also highlights the top three categories of uses for Claude:

  • Coding and software development (more than 10 percent of conversations) 
  • Educational use, both for teachers and for students (more than 7 percent)
  • Business strategy and operations, such as drafting professional communications and analyzing business data (almost 6 percent)

The top three uses, then, account for only about 23 percent of usage. The long tail of Claude’s use cases appears to be extremely long. 

“It turns out if you build a general purpose technology and release it, people find a lot of purposes for it,” Deep Ganguli, who leads the societal impacts team, told me with a laugh. “It’s crazy. You scroll around these clusters and you’re like, what? People are using Claude to do that?” 

Other ways people use Claude, Clio found, include dream interpretation, questions about the zodiac, analysis of soccer matches, disaster preparedness, and hints for crossword puzzles. Users also routinely subject Claude to the most difficult challenge known to large language models today: asking it to count the number of r’s in “strawberry.”

During a demo earlier this week, as the team clicked around Clio, I saw even more use cases: clusters about video game development, nutrition, writing creative fiction, and writing JavaScript. Ganguli told me he began asking Claude questions about parenting after spotting it as a popular cluster within Clio.

What’s next

The Anthropic team told me they tried to share as much about how Clio works as possible in their paper, in the hopes that other AI labs would try something similar themselves. The paper goes so far as to include the cost of running Clio — $48.81 per 100,000 conversations. 

“We wanna make it as easy as possible for someone to pitch it somewhere else and be like, look, guys — they did it and it worked,” McCain said.  

In the meantime, Anthropic told me that Clio has become a meaningful part of its trust and safety efforts. But the company is already imagining other things it might do with the technology. 

Ganguli highlighted three. One, the company could use Clio to understand the future of work. What kind of jobs is Claude helping people with, and what does that suggest about how the economy is transforming?

Two, Clio could change the safety evaluations that AI labs perform on their models. Instead of drawing from historical and theorized harms, companies can ground evaluations in the real-world usage that they’re seeing today.

Finally, Ganguli sees science applications. Claude is trained to follow a constitution; perhaps Clio could surface instances in which it fails to do so, or struggles with a trade-off.

Those uses of Clio seem benign. But they also highlight the deep sensitivity of the queries people make in chatbots like Claude. Anthropic is using the technology to identify harms, but it’s just as easy to imagine another company using similar technology to analyze consumer behavior for the purposes of advertising, persuasion, or other surveillance. It’s also easy to imagine another company taking fewer steps to preserve users’ privacy, using their queries in ways that could create risks to them. 

Ganguli says that one goal of publishing the results from Clio is to draw attention to risks like these.

“I feel strongly that, as we’re developing these technologies, the public should know how they’re being used and what the risks are,” he said. 


On the podcast this week: Has TikTok's luck run out for good? Then Google's director of quantum hardware, Julian Kelly, joins us in a noble effort to explain quantum computing to the masses. And finally, Kevin investigates the cult of Claude.

Apple | Spotify | Stitcher | Amazon | Google | YouTube

Governing

Google's Gemini and mixed-reality news drops

Google unveiled a host of AI announcements on Wednesday in what appeared to be a kind of shock-and-awe campaign to demonstrate the company’s leadership in AI development. A new mixed-reality platform for Android followed on Thursday. The announcements include:

Industry

Those good posts

For more good posts every day, follow Casey’s Instagram stories.

Talk to us

Send us tips, comments, questions, and hidden Claude use cases: casey@platformer.news.