14 December 2025

News

Generative interfaces 

‘Generative interfaces’, like ‘generative search’, is a phrase that immediately suggests all sorts of profound change, but also massive uncertainty. What would it mean for an LLM to generate a user interface dynamically, in response to a question or context? How does that compare to all the accumulated decisions and institutional knowledge that go into working out what options you should see - what happens if an LLM tries to work it out? 

This week we saw two interesting launches that point to how some of this might work, from different directions. 

On one hand, Google launched ‘Disco’, a browser that can generate simple apps based on the tabs you have open, letting you filter, combine and remix information across different websites. Will this work? Is this a scalable model? Is this a new take on generative search? Maybe, but it points to how much will change in the web as a database front-end. 

On the other hand, Cursor, the red-hot AI coding assistant, launched a visual editor, letting you drag-and-drop UI elements to build your own app or website GUI. One reaction is that this kills Figma, but Figma is a design ecosystem and platform, not just a GUI design tool. A jaded cynic might say that this is just Visual Basic all over again, or indeed Notion - a useful tool for simple things that bigger use-cases will outgrow (Webflow should be worried, though). But both of these point to pretty profound changes in the ways we separate interface, design, data and website: these aren’t the answers, but they point to the questions. GOOGLE, CURSOR

OpenAI distribution 

OpenAI did two interesting distribution deals this week: it licensed Disney characters and IP for its Sora social video app (with Disney taking a $1bn stake), and added Instacart to its app platform, so you can (try to) shop for groceries inside your chat - for example, take a photo of a recipe, a photo of your fridge, and tell ChatGPT to buy all the ingredients you don’t already have (in previous issues I’ve pointed out a few reasons why I think flows like that are easier said than done, though).

The common thread to both of these deals, of course, is the trade-off between control and distribution. Instacart is giving up control of the UX and the customer in exchange for OpenAI putting it in front of ChatGPT’s tens of millions of US daily active users. And Disney remembers how much media companies came to regret selling shows to Netflix in the past, building up a new competitive threat. So it’s putting Disney characters (and not actors or live-action footage, sidestepping those sensitivities) in a new place, for new money, but also getting equity. DISNEY, INSTACART

OpenAI Enterprise 

ChatGPT still has far more consumer adoption than any of the other chatbots, but Gemini and Meta AI are coming up fast, powered by their parents’ distribution, and meanwhile (apparently) only 5% of ChatGPT’s 800m WAUs are actually paying. Anthropic, on the other hand, is practically invisible in surveys of consumer use but has a healthy enterprise API business, as does Gemini, and so now OpenAI has decided to make a more aggressive push into enterprise, hiring the CEO of Slack as its CRO. ENTERPRISE, SLACK

Capex wobbles

Oracle is a leveraged play on AI deployment capex, borrowing against its legacy cashflows to build data centres for Sam Altman, and so it’s also a bet that OpenAI will be able to raise the capital to fund the $1.4tr and counting of capacity that it’s signed up for. The markets are getting more and more nervous about that, and as a consequence, Oracle’s bonds are now trading like junk. LINK

Meanwhile, Broadcom revealed that the mystery buyer of $10bn of chips that it disclosed last quarter is Anthropic, buying Google-designed TPUs, and that Anthropic has placed another order for a further $11bn this quarter. LINK

Agentic standardisation

Anthropic, OpenAI, and Google (plus others) launched a standards body for interoperability of AI agents, with Anthropic contributing its MCP protocol. Meanwhile, a consortium of adtech companies has launched Agentic Advertising, which aims to be a non-profit setting standards for ads in LLMs, run by a former head of the IAB. 

Standards bodies are a classic part of a platform cycle, and a classic strategic tool, and there are three ways that they typically evolve. 1: one company’s proprietary tech becomes the de facto standard and all the losers create their own body to try to break in (generally this doesn’t work); 2: the field is nascent and everyone values broader adoption over proprietary lock-in, so they make a standard; or 3: the same as 2, but it’s too early and that doesn’t work either (think of the long and painful history of the smart home). AGENTIC AI, ADVERTISING

Meta’s AI reset 

Following Meta’s multi-billion-dollar AI hiring spree, there’s a lot of arguing about strategy going on. Apparently the new arrivals want to switch from open to closed source, and to prioritise building a SOTA model over more incremental and directly revenue-driving capabilities, and meanwhile (as you would expect) there are plenty of ‘healthy exchanges of views’ with the existing exec team. Open source is an interesting question: Meta’s open source LLM strategy was a way to try to turn LLMs into cheap commodity infrastructure that it could build on top of; since then we have a bunch of very good open Chinese models, models do seem to be commodities, and costs fall (say) 50x a year for a given result. Meanwhile, Meta has to compete with companies with (for now) deeper pockets, and it’s having to find the budget for AI from the metaverse project. So should it go closed and go for more control and revenue? BLOOMBERG, NYTIMES

The week in AI

OpenAI released ChatGPT 5.2, bumping performance on a few metrics and taking it back ahead of Google, for now. LINK

President Trump finally signed an executive order attempting to prevent US states from passing their own AI regulations, to avoid fragmentation and accelerate US deployment. That seems sensible on its own terms, but it presumes that the US could actually pass national regulations, which seems unlikely. LINK

Waymo is expanding to London. LINK

Meanwhile, Rivian is expanding its autonomy offering. LINK

The US military signed a Google Cloud deal for Gemini. It’s not that long ago that Google employees staged protests at the idea of a similar deal: now Silicon Valley is getting back to its roots. LINK

Ideas

A good primer from Joe Kaziukėnas on the state of AI and commerce. LINK

The return of vertical dramas (definitely not Quibi 2). LINK

How Indian political campaigns are using generative AI to create material, including misinformation, with ElevenLabs voices cloned into local languages (India has… a lot of languages, and you need over a dozen for any kind of national reach). LINK

Jeep did a TV campaign using animated wild animals commenting on the product. Nothing new, except that it’s all AI-generated, at a fraction of the cost of conventional digital effects. LINK

Consumer Reports thinks Instacart is using dynamic pricing to increase spend. LINK

An academic analysis of Reddit’s struggles to moderate AI-generated content. LINK

Outside interests

The story behind Kodak’s 1970s digital camera prototype. LINK

MAGA struggles with grammar. LINK

Data

Pew released a report on social media and generative AI use by US teenagers: about 30% are daily users. LINK

Ofcom, the UK TMT regulator, released its annual survey of consumer internet and media behaviour. Amongst other things, it reports a 7% year-on-year decline in search use as ChatGPT gains share. LINK

OpenAI released a big enterprise user survey on use and productivity gains. LINK

Column

Benchmarks and AI progress 

For a few years, one of the US late-night talk shows used to do a piece each iPhone launch season. They would show people on the streets of Manhattan last year’s iPhone, say ‘it’s the new iPhone!’ and ask what they thought. Plenty of people would ooh! and aah! and talk confidently about how much better the ‘new’ model was. 

Some of this was a joke about marketing, but it also revealed a maturing product category. Occasionally there’s an obvious, binary change (FaceID, a zoom lens), but mostly the change is that the screen, chips and camera are 10-20% ‘better’, and you can see that if you read the benchmarks, but it’s hard to see a tangible difference from model to model. 

I’m reminded of this now every time someone asks me what I think of the latest release from OpenAI or Anthropic. Have you tried GPT-4o? What about Haiku? What do you think? Well, let me see… Am I using the new one or the old one? How would I tell?

Again, sometimes there’s an obvious binary change (multimodal, web search), but if one of the new viral AI influencers says the model is ‘better’, what does that mean? I can go and read the benchmarks, but what does ‘20% better’ in some of the dozens of benchmarks actually mean to me? (Meanwhile, there’s a live debate about how saturated these benchmarks themselves are, and how much they might be invalid since the answers might be in the training data.) 

One of the interesting tests now is to try to break the models - to ask them something that they ‘should’ be able to do, but can’t. Generally this means logic or maths problems that are phrased in a slightly tricky way, to try to see if the model is really doing reasoning or maths or only seems to be, and is actually ‘just’ doing statistics. Or, you can try prompts that you know didn’t work last time to see if they work now. (For both of these, though, it’s hard to be sure if you got a better answer because the model is better or just because, with a probabilistic system, you rolled the dice again.)

For smartphones, this kind of convergence was a mark of maturity, and it seems odd to suggest that LLMs are mature already. The bull case here is that in the last 18 months or so we haven’t really scaled these models by more orders of magnitude so much as iterated on the same generation, massively improved their efficiency, and added capabilities (multimodal, say). On that view, the next wave of improvement is backlogged: we need the next generation of training scale, which means the $200bn that the hyperscalers will spend on capex this year, the next wave of Nvidia chips, liquid cooling, gigawatts of power hookups, and things like training a model across multiple data centres at once. In other words, the new model where you can really see a difference is coming next year, or the year after. 

Meanwhile, though, what do I do differently if the latest model is 10% better? What does that mean?

For a while now I’ve been comparing this to the first spreadsheets. If you saw VisiCalc in 1978 and you were an accountant, this changed your life: buy this and it could do a month of work in a day. But if you saw VisiCalc and you were a lawyer… well, that’s very clever, but it’s not what you do all day. In this analogy, software developers and marketers seeing ChatGPT are accountants seeing VisiCalc, but a lot of the rest of us are lawyers: it’s very cool, and I can use it for a few things, but it doesn’t change my day. Give me a word processor, though…

Pulling on this thread, I can think of dozens of places that open-ended automation could make my work much easier, but for almost all of them, there is a right answer, not a better answer and not a ‘better model’. I can’t rely on an LLM to give me the right answer - it gives me something that looks good, but that’s not the same thing. For these use-cases, if a model is 10% less likely to be wrong, it is still 100% useless. I need a binary change: either dedicated vertical software that wraps the LLM in tooling and ‘solves’ the error problem, or a model that doesn’t make mistakes, or knows when it does (both open primary science questions), or different use-cases. Indeed, the classic pattern for any new tool is that first you try to force the new tool to fit your existing work, and then, over time, you find ways to change your work to fit the new tool. Right now we’re still at the first step. So is the new version of ChatGPT better? Well, better for what?

Benedict Evans