20 July 2025
News
Agents for OpenAI and Claude
OpenAI’s big splash this week is its own agent platform, linking together ‘Operator’ (using websites for you) and ‘Deep Research’ (long-form analysis) such that ChatGPT can make use of the web, download files and make reports, spreadsheets and slides for you.
This is magic. But also, watch the demos, and you hear phrases like “it got 98% of the numbers right”. Is that good, or a deal-killer? Also this week, and on the same theme, Anthropic released a product aimed at financial services, which has “83% accuracy on complex financial modelling tasks”. Again, what does that mean? See this week’s column. OPENAI, CLAUDE
Windsurf fallout
Last week the management and core engineering team at Windsurf (AI coding) abandoned the company to work for Google, while Google paid $2.4bn into Windsurf as a ‘licensing fee’ - leaving the actual company, its customers and its employees marooned. This week Cognition, another AI coding startup, bought the remainder, mostly to get the sales team, meaning they at least get a softer landing. There is a crushing sense of urgency around AI at the big tech companies, but as I wrote last week, urgency means things get missed, and meanwhile these kinds of structures risk breaking the social contracts of what it means to fund or work at a successful startup. LINK
JP Morgan takes private markets under coverage
JP Morgan’s equity research group will start covering large private tech companies, even though they’re not directly investable, reflecting just how big they’ve become (Anthropic is apparently planning a raise at $100bn and OpenAI closed at $300bn a couple of months ago), and how much public markets investors need to understand them as part of the broader context. See my previous point. JPMORGAN, ANTHROPIC
Kimi as the new DeepSeek
Chinese open source models keep coming - Kimi is the latest, with top-tier results: it’s number 5 on the overall LMArena leaderboard, matching Claude. SOTA foundation models are expensive, but there are a lot of them. LINK
The week in AI
Mark Zuckerberg says Meta will “invest hundreds of billions of dollars into compute to build super-intelligence. We have the capital from our business to do this.” Current capex guidance for 2025 is $64-72bn. LINK
Netflix used generative AI for an effects shot in an Argentinian sci-fi series. This shouldn’t surprise anyone. LINK
Google is expanding AI-based security capabilities. LINK
An interesting rumour: apparently Accenture has considered buying WPP. These are two companies being overturned by generative AI, from different directions. LINK
The US slightly relaxed restrictions on semiconductor exports to China, allowing Nvidia to resume sales of more powerful hardware in return for China relaxing restrictions on ‘rare earth’ minerals. LINK
Uber builds its robotaxi strategy
I noted recently that Uber may be looking at partnering with Travis Kalanick to invest in Pony.ai, a Chinese/US autonomy company. This week it said it would partner with Lucid (EVs) and Nuro (autonomy) to launch a fleet, investing in both companies. In parallel, it’s partnering with Baidu to deploy outside the USA. Just to keep things interesting, a short seller claimed Pony is a fraud. LUCID, BAIDU, PONY
Google merging ChromeOS and Android
Google has two main device platforms, ChromeOS and Android - it has tried running Android apps on ChromeOS, but now it’s going to rebuild ChromeOS on top of Android. This is probably part of a more general consolidation and housecleaning as the company pivots everything around AI, but the product strategy puzzles me. The appeal of ChromeOS is its simplicity and invulnerability - how do you keep that if it runs on Android and runs Android apps? Android tablets have never rivalled the iPad - how does this fix that? (If it’s any comfort, Apple seems just as confused about its iPad strategy - is it simpler than a Mac or not?) LINK
Google TV?
Apparently, Google is planning a new entertainment studio to make programming that’s ‘positive’ about tech. How much money do they plan to spend on that? This sounds like an idea that shouldn’t have left the bar at the comms team’s offsite. LINK
Microsoft in China
ProPublica discovered that Microsoft was using Chinese engineers in China to do tech support for the US military. Microsoft is scrambling. Stories like this make me wonder why China needs its own spies. LINK
Ideas
Can LLMs do accounting? Error rates matter, and saying ‘the models are getting better’ misunderstands the issue: you have to presume there will be mistakes, and look for use-cases where that doesn’t matter or where the errors are easy to find, rather than hand-waving the problem away. LINK
Using satellites to spot online scamming centres in the borderlands of Myanmar (they’re huge, and surrounded by barbed wire and guard towers to stop the workers from escaping). Imagine explaining that sentence to someone in 1990. LINK
The new Kia crossover EV accelerates faster than a new Ferrari. I remember that a decade ago a lot of people did not understand that Teslas were fast because electric motors work differently to internal combustion engines, not because of anything specific that Tesla was doing. LINK
Meta’s attempt to make WhatsApp a fintech platform in India is not going well, apparently. LINK
Hertz is using image recognition to look for damage to rental cars. LINK
AI capex is eating the economy. LINK
An essay on what it’s like working at OpenAI. No email. LINK
Outside interests
The Stock Market Computer, from 1967. Sadly, the ‘Dow Jones’ dial only goes to 2000. LINK
The magic IBM expanding keyboard. LINK
Your lost suitcase is probably in Alabama. Fascinating. LINK
Data
The New Consumer’s mid-2025 report. LINK
The Brookings Institution analyses the geographic spread of AI in the USA. LINK
Column
AI error rates, validation and leverage
Generative AI models are statistical, probabilistic systems, as opposed to all previous software, which is deterministic. That means they can answer new kinds of questions and answer questions in new ways, but it also means that there is, inherently, an error rate. They will sometimes be ‘wrong’. Does that matter? What does it mean to say that an AI is 98% correct on this task? What does it mean to say that the models are getting better, or that they’re the worst they’ll ever be?
I think it is a category error to say that the models are getting better, as though that answers the question. It’s true, yes, that they are getting better, but only in the sense that the error rate is falling, not in the sense that it goes away. For simple questions, the answer might be completely right most of the time, but you can never be sure. For more complex questions, you can be sure that there are errors in the result, somewhere. The error rate will probably keep falling, but it’s not going away, and we don’t know that it can, in principle, go away.
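To make that concrete, here is a minimal back-of-the-envelope sketch, in Python, with made-up numbers and assuming each item’s error is independent (which real models won’t quite satisfy): 98% per-item accuracy on a 500-item task still means roughly ten errors, and almost no chance that the whole output is clean.

```python
# Back-of-the-envelope: what does 98% per-item accuracy mean for a whole document?
# Illustrative numbers only; assumes errors are independent, which is a simplification.

per_item_accuracy = 0.98
items = 500  # e.g. figures pulled from a long report (hypothetical)

expected_errors = items * (1 - per_item_accuracy)
prob_all_correct = per_item_accuracy ** items

print(f"Expected errors: {expected_errors:.0f}")               # ~10
print(f"Chance everything is right: {prob_all_correct:.1e}")   # ~4e-05
```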
There is a class of questions that doesn’t really have a ‘wrong’ answer, or indeed a ‘right’ answer, only better or worse. Traditional, deterministic software really struggles with this, and generative AI handles it really well.
However, there’s also a class of questions where the answer can be wrong, in a binary sense, and so where the error rate of generative AI is a problem. You’ll need to check the answer, word by word and line by line. The solution to this might be to wrap the model in deterministic software and in GUI and tooling - to put it on a leash, or in guardrails. In other scenarios, the answer might be the other way around - to give the AI agentic access to deterministic tools that can give it ‘right’ answers. In other words, sometimes you put the model at the top of the stack, and sometimes you put it at the bottom.
But it seems to me that the higher level of abstraction is to think about the leverage of verification. The model produces an answer, and you need a human in the loop to verify that answer - so how leveraged is your human?
In some scenarios, the model is doing something very boring that would take humans a very long time, but it’s very quick and easy for humans to verify the answer. This has high leverage. For anything involving images or video, it might take a person hours or days to make that shot, but our mammal brains can spot a problem instantly. Conversely, there are other tasks where checking the results might take just as long as doing the work. If, say, you ask a model to find and compile data within a 500-page PDF, then checking whether it’s used the correct data might take almost as long as doing it yourself - the only way to know if each number is correct is to compare every single one. This has low leverage.
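One rough way to put numbers on that framing (my own illustration, with made-up figures, not anything measured from these products): think of leverage as the time it would take a person to do the task divided by the time it takes a person to check the model’s output.

```python
# Illustrative only: 'leverage' as human time saved relative to human time still
# needed for verification. All numbers are invented examples, not measurements.

def leverage(hours_to_do_yourself: float, hours_to_verify_model: float) -> float:
    return hours_to_do_yourself / hours_to_verify_model

# A VFX shot: days of work to make, seconds to eyeball.
print(leverage(hours_to_do_yourself=40, hours_to_verify_model=0.1))  # 400x

# Extracting numbers from a 500-page PDF: checking means re-reading the source.
print(leverage(hours_to_do_yourself=8, hours_to_verify_model=6))     # ~1.3x
```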
This is why I find it frustrating that OpenAI keeps showing low-leverage use-cases - scenarios where it would take you just as long to verify the model’s results as it would for you to do the work yourself. You’re making the LLM do something it’s bad at and then asking humans to do a bunch of grunt work to correct the computer. Shouldn’t you focus on the high-leverage applications, where the LLM is doing something it’s good at?
Leverage may also be a good framing to use when people talk about generative AI replacing lawyers or consultants. If you replace the associates with models, then the partners will have to spend a lot more time checking the outputs - yes, associates make mistakes, but they don’t invent citations. Is it better to give the models to the associates for them to verify, and put the leverage there? Indeed, part of the business model for all of these companies is leveraging the partners with the juniors: adding models to that mix seems more interesting than replacing it. After all, when spreadsheets arrived in the 1980s, that didn’t mean fewer jobs for analysts, but more, because it gave them more leverage.