22 March 2026

News

OpenAI re-re-refocuses

After last year’s ‘code red’, now OpenAI is apparently cutting back on ‘side-quests’ in the face of the massive enthusiasm for Anthropic’s coding tools. This would be a shift away from the ‘everything all at once yesterday’ approach last year - an app platform! No, another app platform! A browser! A social video app! Jony Ive! Medical research! Advertising! Instead, they’ll dig deeper on enterprise productivity and APIs on one side and coding on the other. That’s all very well to say, but we still don’t really know what the right user experience is, nor what OpenAI’s competitive advantage could be, and all those projects were experimentation to try to find something more durable than a commodity model. LINK

Meanwhile, the FT reports that OpenAI plans to nearly double its headcount, from 4,500 to 8,000, by the end of the year, across every function. LINK

AI GTM

OpenAI and Anthropic have been signing up strategy consultancies and PE firms as routes to market for big companies, as they’ve realised that grass-roots enterprise software adoption almost never works and giving people a tool and saying ‘it can do anything!’ isn’t useful (especially when actually, it can’t). PE, CONSULTANTS

Microsoft swirls

Frontier models are commodities, but something can be both a commodity and really hard. We saw this with chips, where at each generation the number of leading companies shrank, and we’ve seen it in LLMs: Meta fell off the ladder and spent billions in hiring to try to get back on, Apple may have given up (for now), and Microsoft and Amazon have failed to get on in the first place. Microsoft acqui-hired the DeepMind cofounder Mustafa Suleyman but has failed to break in with its own models, and its ambivalent relationship with OpenAI makes that a bigger and bigger problem. Now they’re doing a re-org to have Suleyman focus solely on models while others take over product. Obviously, everyone says they are very happy about this. LINK

Smuggling chips to China

It’s been clear for a while that a lot of people were smuggling a lot of sanctioned Nvidia AI chips into China (there couldn’t really be that many customers ‘in Singapore’), but now the US has charged a co-founder and board member of Supermicro, a server maker, with diverting no less than $2.5bn of product without a licence. LINK

Jeff Bezos is doing an AI automation roll-up

There are quite a few new quasi-PE funds around that want to buy legacy companies and make them more efficient with AI. This seems a bit too deterministic to me, but the WSJ says that Jeff Bezos is out raising a $100bn fund to try this. LINK

The week in AI

DoorDash is paying drivers to record videos of their tasks. It has 8m drivers on its books in the USA, and this is probably more about creating and selling general-purpose training data (there are a bunch of platforms trying to crowdsource this, especially in emerging markets) than about building actual delivery robots. LINK

Google launched Stitch, an AI tool for designing app layouts. It's exhausting just trying to keep up with all the new tools. LINK

Google, Meta, Amazon, Microsoft and a range of other companies signed a co-ordination agreement to tackle online scams (which are already being accelerated by AI). LINK

Meta signed a five-year, $27bn deal with the Dutch neocloud Nebius. Meta plans to spend over 50% of revenue on its own capex this year, and deals like this make the total number (and the leverage in the broader market) even bigger. LINK

Amazon does 1 and 3 hour delivery

Amazon is leveraging the last-mile logistics that it’s built up in recent years to launch one-hour and three-hour delivery in over 2000 US towns and cities.

Meanwhile, it had some kind of falling-out with the US post office. The WSJ says that the USPS delivered 1bn Amazon packages last year, which was 15% of USPS total package volume. Now Amazon apparently wants to reduce that by 2/3 before the contract is up for renewal in October. Amazon responded by saying that this is just because it’s been unable to agree new terms with the USPS. DELIVERY, AMAZON, WSJ

Meanwhile, Shipmetrix thinks that Amazon itself now delivers more parcels in the USA than the USPS (or anyone else). LINK

Remember the Metaverse?

Following an announcement in February, this week Meta shut down the VR version of ‘Horizon Worlds’, a 3D world/platform/social network that it ran on its Quest VR headsets (the app will continue on mobile). This prompted a lot of headlines that Meta was killing the ‘Metaverse’ that it’s named after. Andrew Bosworth, who runs the project, said that you can still use ‘the metaverse’ on mobile, which sounds to me like ‘the Internet’. Then they backed down and said that the VR version would remain after all, for now.

Meta has spent close to $100bn on ‘Reality Labs’ so far, and it clearly doesn’t have any kind of consumer traction: VR headsets are good enough that you could argue that if there was a there there we’d have got it, and yet no-one cares, while AR glasses remain multiple optics problems away from viability beyond prototypes (cf Magic Leap). Meta also plans to spend over 50% of revenue on AI capex this year and is scrambling for cash, so this one goes on the back burner. The company’s smart glasses (NOT AR, just a wearable display), by contrast, actually do have some traction, and are clearly an AI endpoint, so that gets some focus. Bosworth’s comment also points to a conceptual problem: the word ‘metaverse’ became functionally meaningless, in that you could never know what someone else meant when they said it. Was this about VR? AR? Games? Crypto? Something else? What on earth does it mean to say that this is the next thing after the Internet and smartphones if it’s an app on your smartphone? LINK

In other news…

Samsung discontinued its $2900, three-panel folding phone. This was always really a concept rather than a commercial product, but you’d think they’d keep it on in the stores just for symbolic value. LINK

Instagram very, very quietly stopped offering end-to-end encryption in its messaging function, probably in response to pressure on teenager safety. Ending this in WhatsApp would be a much bigger deal, I suspect. LINK

The US state of Arizona is accusing Kalshi, the prediction markets platform, of running illegal gambling. Like it or not, I can’t see any way that this isn’t gambling. LINK

Google is spinning out Google Fiber into a PE JV. This was a good piece of industrial lobbying, but it worked, the industry moved to fibre, and Google has other priorities now. LINK

Ideas

The EU launched its competitor to Delaware: a single cross-country incorporation model for start-ups. LINK

Alibaba targets $100bn in AI and cloud revenue. LINK

Token allocation is now a basic part of the offer to a software engineer, especially as agentic coding makes it possible to use enough tokens to cost hundreds of thousands of dollars a month. LINK

Outside interests

Testing Nano-Banana for architectural visualisation. LINK

A vast collection of old department store catalogues. LINK

Data

Data on how far AI researchers have moved from academia to industry, earning far more in the process. LINK

Bain crunched numbers on how LLMs handle travel questions. LINK

Thomson Reuters survey on generative AI in professional services. LINK

Anthropic did a user survey, with a sample of 81k. LINK

Ramp’s latest customer data suggests that Anthropic is rapidly taking enterprise share from OpenAI. OpenAI said that the sample size is too small and this is like extrapolating from a ‘kid’s lemonade stand to the global lemon market’ - I have asked that question myself. LINK

Column

Jagged edges

As I was writing this newsletter, I wanted to double-check a number in the story about the USPS. So, I went to the USPS website, paged through the press releases until I found four quarterly reports with the relevant numbers, plugged them into an empty spreadsheet, and in five minutes I had the answer.

Then I tried ChatGPT. I asked for the calendar year, but it gave me the financial year number.

- No, give me the calendar year.

“Oh, okay, here's an estimate of what that might be!”

- Don't estimate. Give me the actual number.

“Those aren't in the press releases, so I calculated it.”

- No, the numbers are in the press releases. Go and get the number.

“Oh, sorry, you're right. Here it is.”

Sigh.

A few days ago, I took a several-hundred-page PDF of a US census report from 1960 and said “Tell me how many people worked as librarians.” Gemini got the number, and even worked out that there are four different numbers with different definitions. ChatGPT also gave me a number… that wasn’t the number in the PDF.

What can we say here? Narrowly - if someone tells you that hallucinations are solved, they’re an idiot. If someone tells you that these models keep getting better, though, then they’re right, but that’s a much more complex statement than it seems.

‘Hallucinations’ and ‘error rates’ are problematic ways to describe what’s going on here: ‘jagged frontier’ is a much more productive term. These systems have very variable capabilities across different problems, and we don't necessarily know how to predict that for any given problem - we don't know if there's a pattern. Then, the tests that we use to assess these systems have a jagged surface of their own: we’re dreaming up new kinds of tests rather than measuring in any systematic way, since we lack a good theoretical model both of LLMs and of intelligence.

So you have a jagged surface of capability that doesn't mesh with an equally jagged surface of the evals. And then there's a third jagged surface, which is our intuition of what these systems will or won't be able to do. Pushing the point, there’s also a fourth jagged edge, which is the tasks we actually have, and that doesn’t mesh exactly with the others either.

In each of these layers, there are sometimes cases where we do have some understanding. We know that an LLM won’t give a good answer if that would require data it can’t access. We know that it's technically difficult to read PDFs. But, to repeat, we don’t have a good systematic understanding of this even if you’re deep in the science, and for a normal user, you just don’t know, unless you try, one thing at a time.

This takes me to another common claim: you have to use these things all the time, because then you’ll know that the thing the model couldn’t do a week ago works now! OK… but what that really means is that not only is the surface of the model jagged in ways we can’t necessarily see in any systematic way, but that it keeps moving and changing shape, and you don’t know where or how except by trying things and seeing what happens. Something of the same applied 20 years ago to Google: we learnt that there were some things where Google would or would not be able to give you a good result. But the difference was that you could see whether Google had found what you wanted, and LLMs fail silently. You can’t know before you hit ‘Enter’ if the LLM can do that, and you might not know afterwards either.

This is another jagged line: does your question even have a ‘right’ answer, and if it does, could you tell if the LLM had got it right?

In my example above, I want the exact number, and there is no ‘roughly right’ or ‘better’ answer. This is, of course, an unfair test for a probabilistic system, but that's not my problem: either I can use it or I can't. Then, though, as you move across the surface of use cases, you get more and more things that don't have specifically correct answers, and the correct answer also depends on who's asking. The example I think about sometimes is that if somebody asked me for a 1,000-word biography of myself, I don't have one, and if they used ChatGPT to make one, it would be full of mistakes that they couldn't fix, but if I used ChatGPT to make one, I could fix those mistakes.

Generalising this point: does the question need a precisely correct answer? If so, can you test the answer mechanistically? If not, is it efficient for people to check? For example, on one hand it's a lot easier to get an AI system to make 300 images and have a person check that than to have a person make 300 images, but on the other hand, if I use an LLM to retrieve 300 data points but then have to check all 300 myself, it would be quicker if I didn’t use the LLM in the first place.
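Those three questions can be written down as a rough triage. This is only a sketch under assumptions: the `Task` fields, the function name and every per-item time below are hypothetical numbers chosen to illustrate the two examples, not measurements of anything.

```python
from dataclasses import dataclass


@dataclass
class Task:
    needs_exact_answer: bool      # is there one precisely correct answer?
    machine_checkable: bool       # can the output be tested mechanistically?
    items: int                    # how many outputs are needed
    minutes_by_hand: float        # per item, for a person working without the LLM
    minutes_to_check: float       # per item, for a person reviewing LLM output


def worth_using_llm(task: Task) -> bool:
    """Rough triage: is the LLM likely to save time on this task?"""
    # An exact answer that a machine can verify needs no human checking.
    if task.needs_exact_answer and task.machine_checkable:
        return True
    # Otherwise a person checks everything, so checking must cost
    # less than simply doing the work by hand.
    total_check = task.items * task.minutes_to_check
    total_by_hand = task.items * task.minutes_by_hand
    return total_check < total_by_hand


# Hypothetical timings: making an image takes far longer than glancing at one.
images = Task(needs_exact_answer=False, machine_checkable=False,
              items=300, minutes_by_hand=30.0, minutes_to_check=0.5)

# Hypothetical timings: checking a retrieved number means looking it up anyway.
data_points = Task(needs_exact_answer=True, machine_checkable=False,
                   items=300, minutes_by_hand=1.0, minutes_to_check=1.0)

print(worth_using_llm(images))       # True
print(worth_using_llm(data_points))  # False
```

The comparison is deliberately crude: it ignores the cost of an unnoticed error, which is exactly the part that is hard to know in advance.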

This is a science problem, but it’s also a product problem. If you can validate that an LLM can do X or Y, or do it subject to the validation above, then you can unbundle that out of the raw chatbot into a company, wrapping it in tooling, GUI, data, GTM, and so on. But you can’t expect the user to run evals every week and know where all of those jagged edges moved with the last release. You can’t ask the user to guess or test what will work.

Benedict Evans, 22 March 2026