AI companies hit a scaling wall

OpenAI, Google and others are seeing diminishing returns to building ever-bigger models — but that may not matter as much as you would guess

Casey Newton

Nov 14, 2024 — 12 min read

ChatGPT in voice mode on the iPhone. (Solen Feyissa / Unsplash)

Over the past week, several stories sourced to people inside the big AI labs have reported that the race to build superintelligence is hitting a wall. Specifically, they say, the approach that has carried the industry from OpenAI’s first large language model to the LLMs we have today has begun to show diminishing returns.

Today, let’s look at what everyone involved is saying — and consider what it means for the AI arms race. While reports that AI scaling laws appear to be technically accurate, they can also be easily misread. For better and for worse, it seems, the development of more powerful AI systems continues to accelerate.

Scaling laws, of course, are not laws in the sense of “laws of nature.” Rather, like Moore’s law, they contain an observation and a prediction. The observation is that LLMs improve as you increase the size of the model, the amount of data fed into the model for training and testing, and the computational resources needed to complete the training run.

First documented in a paper published by OpenAI in 2020, the laws have a powerful hold on the imagination of people who work on AI (and those who write about it). If the laws continue to hold true through the next few generations of ever-larger models, it seems plausible that one of the big AI companies might indeed create something like superintelligence. On the other hand, if they begin to break down, those same companies might face a much harder task. Scaling laws are fantastically expensive to pursue, but technically quite well understood. Should they falter, the winning player in the AI arms race may no longer be the company that spends the most money the fastest.

Last week, The Information reported that OpenAI may have begun to hit such a limit. Here are Stephanie Palazzolo, Erin Woo, and Amir Efrati:

In May, OpenAI CEO Sam Altman told staff he expected Orion, which the startup’s researchers were training, would likely be significantly better than the last flagship model, released a year earlier. [...]

While Orion’s performance ended up exceeding that of prior models, the increase in quality was far smaller compared with the jump between GPT-3 and GPT-4, the last two flagship models the company released, according to some OpenAI employees who have used or tested Orion.

Some researchers at the company believe Orion isn’t reliably better than its predecessor in handling certain tasks, according to the employees. Orion performs better at language tasks but may not outperform previous models at tasks such as coding, according to an OpenAI employee.

Reuters added fuel to the fire this week with a piece that quoted Ilya Sutskever, the OpenAI co-founder who left earlier this year to found his own AI lab, seeming to confirm the idea that scaling laws are hitting their limits. Here are Krystal Hu and Anna Tong:

Ilya Sutskever, co-founder of AI labs Safe Superintelligence (SSI) and OpenAI, told Reuters recently that results from scaling up pre-training — the phase of training an AI model that use a vast amount of unlabeled data to understand language patterns and structures — have plateaued. [...]

“The 2010s were the age of scaling, now we're back in the age of wonder and discovery once again. Everyone is looking for the next thing,” Sutskever said. “Scaling the right thing matters more now than ever.”

Finally, Bloomberg on Wednesday confirmed that OpenAI’s Orion had underperformed expectations and said Google and Anthropic had encountered similar difficulties. Here are Rachel Metz, Shirin Ghaffary, Dina Bass, and Julia Love:

At … Google, an upcoming iteration of its Gemini software is not living up to internal expectations, according to three people with knowledge of the matter. Anthropic, meanwhile, has seen the timetable slip for the release of its long-awaited Claude model called 3.5 Opus.

II.

So what’s happening? Experts offer several theories about why new models aren’t performing as well as their makers hope.

One is that AI companies are running out of high-quality new data sets to feed into their models. By this point, they’ve managed to scrape all of the low- and medium-hanging fruit on the web. Only recently did OpenAI, Meta and others begin paying to license high-quality data from publishers and other sources. Efforts to paper over this deficiency with synthetic data — data made by the AI labs themselves — have yet to generate a breakthrough.

Another, potentially more troublesome explanation is that superintelligence simply can’t be built with LLMs alone. Progress in AI is measured primarily in the ability of these models to hit various benchmarks. And while LLMs have been crushing benchmarks at impressive speed, the benchmarks can be gamed. More importantly, models fail to understand cause and effect and other tasks that require multi-step reasoning. They often make factual mistakes and lack real-world knowledge.

Perhaps the issue, as Tyler Cowen writes at Marginal Revolution today, is that we have taken too blinkered a view of “knowledge” — and placed impossible expectations on LLMs.

“Knowledge is about how different parts of a system fit together, rather than being a homegenous metric easily expressed on a linear scale,” he writes. “ There is no legible way to assess the smarts of any single unit in the system, taken on its own. Furthermore, there are many ‘walls,’ meaning knowledge is bumpy and lumpy under the best of circumstances. It thus makes little sense to assert that an entity is ‘3x smarter’ than before.”

If you are the sort of person who is concerned about AI’s rapid development, this could seem like welcome news. A roadblock on the way toward superintelligence means more time to bring systems into alignment with human values and prevent them from causing harm. It lets society adapt to the ways AI is already deeply affecting us, from education to publishing to politics. It lets us breathe.

But this may turn out to be a kind of wishful thinking.

III.

“There is no wall,” Sam Altman posted on X today.

It was the sort of gnomic pronouncement that Altman has increasingly favored this year, as he teases upcoming releases with photos of his garden and offers other Easter eggs toward his followers.

OpenAI had declined to comment on the reports above, but here was Altman appearing to deny them — or if not deny their individual claims, at least to deny their implications. OpenAI would not be hindered by any slowdown in the scaling laws, he seemed to say. And indeed, The Information reported that the company’s recent focus on its reasoning-focused o1 model represented one way the company is attempting to accelerate its progress again.

In its story, Reuters quoted Noam Brown, a researcher at OpenAI who worked on o1, at a TED AI conference last month in San Francisco. The o1 model uses a technique known as “test time compute,” which effectively gives the model more time and compute at the moment of inference. (In ChatGPT terms, inference is the moment when you hit enter on your query.)

According to Brown, the impact of test time compute can be extraordinary.

“It turned out that having a bot think for just 20 seconds in a hand of poker got the same boosting performance as scaling up the model by 100,000x and training it for 100,000 times longer,” he said at TED AI.

No wonder, then, that AI CEOs seem relatively unbothered by the scaling law slowdown. Anthropic CEO Dario Amodei appeared on Lex Fridman’s podcast this week, and offered a similarly upbeat take.

“We have nothing but inductive inference to tell us that the next two years are going to be like the last 10 years,” said Amodei, who co-authored that first paper on scaling laws while at OpenAI. “But I’ve seen the movie enough times, I’ve seen the story happen for enough times to really believe that probably the scaling is going to continue, and that there’s some magic to it that we haven’t really explained on a theoretical basis yet.”

There are, of course, self-interested reasons to promote the idea that the scaling laws are intact. (Anthropic is reportedly raising billions of dollars at the moment.) But Anthropic has not been prone to hyperbole, and takes AI safety more seriously than some of its peers. If AI progress were slowing, I suspect many people at the company might actually be relieved — and would say so.

Ultimately, we won’t be able to evaluate the next-generation models until the AI labs put them into our hands. On Thursday, an experimental new version of Google’s Gemini model surprised some by immediately vaulting to the top of a closely watched AI leaderboard, placing it ahead of ChatGPT 4o. In a post on X, CEO Sundar Pichai offered the up-and-to-the-right stock-chart emoji. “More to come,” he said.

Achieving superintelligence may indeed require breakthroughs that scaling laws alone cannot provide. But there may well be more than one path to the finish line. Which means, among other things, that we should not lose focus on what it would actually mean if we got there.

Elsewhere in AI progress:

OpenAI is reportedly preparing to launch an AI agent, codenamed “Operator,” that can complete tasks like writing code and booking travel. (Shirin Ghaffary and Rachel Metz / Bloomberg)
The ChatGPT desktop app for macOS can now read code for some coding apps on Macs, such as VS Code and Terminal. (Maxwell Zeff / TechCrunch)
The Gemini AI app for iPhones is now available. (David Pierce / The Verge)

On the podcast this week: Kevin and I talk through the implications of the surprisingly pro-crypto Congress that will be seated in January. Then, ChatGPT product lead Nick Turley joins us to discuss the chatbot's first two years. And finally, we answer a listener's question about what social network they should be using today.