Watching the World Cup With an AI Commentator
Can AI commentate a live football match? A look at the current state of AI for identifying, tracking and responding to real-time actions in everyone's favorite sport.
With the World Cup kicking off this week, it felt like a good time to revisit our Vision Agents AI football commentator and see how things have moved on after six months of model releases. It's no Thierry Henry, but the models are getting better and faster at video understanding.
Everything here runs in real time. The annotations come from Roboflow's RF-DETR nano running locally, which detects the players and the ball. A separate processor, backed by Gemini 3.5 Flash, keeps track of the match state: possession, shirt colours, visible jersey numbers, pressure, and field position. The commentary itself runs on OpenAI Realtime.
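To make the idea of "match state" a bit more concrete, here is a minimal sketch of what such a state object could look like and how it might be updated from per-frame detections. This is not how the demo does it: the post uses Gemini for this part, whereas the sketch swaps in a plain nearest-player-to-ball heuristic. `Detection`, `MatchState`, `update_state` and the `max_dist` threshold are all hypothetical names and values chosen for illustration, and `Detection` is not RF-DETR's actual output type.

```python
from dataclasses import dataclass
from math import hypot
from typing import Optional


@dataclass
class Detection:
    """One detected object in a frame (hypothetical shape, not RF-DETR's type)."""
    label: str                     # "player" or "ball"
    x: float                       # box centre in pixels
    y: float
    team: Optional[str] = None     # shirt colour, e.g. "green"
    number: Optional[int] = None   # jersey number, if it was readable


@dataclass
class MatchState:
    possession: Optional[str] = None   # e.g. "green number 6"
    field_third: Optional[str] = None  # "left", "middle" or "right"


def update_state(state: MatchState, detections: list[Detection],
                 frame_width: float, max_dist: float = 60.0) -> MatchState:
    """Give possession to the player nearest the ball, if close enough."""
    ball = next((d for d in detections if d.label == "ball"), None)
    players = [d for d in detections if d.label == "player"]
    if ball is None or not players:
        return state  # keep the previous state when the ball is not visible

    nearest = min(players, key=lambda p: hypot(p.x - ball.x, p.y - ball.y))
    if hypot(nearest.x - ball.x, nearest.y - ball.y) <= max_dist:
        name = nearest.team or "unknown"
        if nearest.number is not None:
            name += f" number {nearest.number}"
        state.possession = name

    # Split the frame into three vertical bands based on the ball's x position.
    third = max(0, min(int(3 * ball.x / frame_width), 2))
    state.field_third = ("left", "middle", "right")[third]
    return state
```

Keeping the previous state when the ball drops out of view is the bit that gives continuity: a missed detection for a frame or two doesn't reset who has the ball.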
OpenAI Realtime receives the annotated video stream at 3 FPS and generates the spoken, broadcast-style commentary. Detection events from RF-DETR trigger short prompts, and a debounce plus a speaking guard stops the agent from talking over itself.
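The debounce and speaking guard can be sketched roughly like this. It is a simplified illustration, not the Vision Agents implementation: it reads "debounce" as a simple cooldown between prompts, and the `CommentaryGate` class, its method names and the 2-second default are all assumptions made for the example.

```python
import time
from typing import Callable, Optional


class CommentaryGate:
    """Decides whether a detection event may trigger a new prompt."""

    def __init__(self, debounce_s: float = 2.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.debounce_s = debounce_s      # assumed cooldown, in seconds
        self.clock = clock                # injectable so it can be tested
        self.speaking = False
        self._last_prompt: Optional[float] = None

    def on_speech_start(self) -> None:
        self.speaking = True

    def on_speech_end(self) -> None:
        self.speaking = False

    def should_prompt(self) -> bool:
        now = self.clock()
        # Speaking guard: never prompt while audio is still playing.
        if self.speaking:
            return False
        # Cooldown: drop events that arrive too soon after the last prompt.
        if (self._last_prompt is not None
                and now - self._last_prompt < self.debounce_s):
            return False
        self._last_prompt = now
        return True
```

In use, each detection event would call `should_prompt()` before sending anything to the realtime model, and the audio layer would call `on_speech_start()` and `on_speech_end()` around each utterance. Events that arrive while the gate is closed are simply dropped, which is usually fine for commentary since a stale play isn't worth describing late.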
The important design choice is that the realtime model isn't doing everything on its own. Roboflow gives it visual structure, Gemini keeps continuity across the match, and OpenAI Realtime handles the fast spoken delivery. The result is a smoother demo: viewers see the actual overlays, and the commentator can reference grounded details like "green number 6" or "white number 11" instead of giving generic sports commentary.
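As a rough illustration of what "grounded" means here, the short prompt sent to the commentary model could be assembled only from details that were actually observed. The `build_prompt` function, its parameters and the prompt wording below are hypothetical; the real prompts in the demo may look quite different.

```python
from typing import Optional


def build_prompt(event: str, team: Optional[str] = None,
                 number: Optional[int] = None,
                 field_third: Optional[str] = None) -> str:
    """Compose a short prompt using only details that were observed."""
    who = None
    if team and number is not None:
        who = f"{team} number {number}"
    elif team:
        who = f"a {team} player"  # number not readable, so don't guess one

    parts = [f"Event: {event}."]
    if who:
        parts.append(f"Player: {who}.")
    if field_third:
        parts.append(f"Location: {field_third} third.")
    parts.append("Comment in one short sentence. Mention only the details above.")
    return " ".join(parts)


print(build_prompt("interception", team="green", number=6, field_third="middle"))
```

Leaving out fields that weren't detected, rather than filling them with placeholders, gives the model less room to invent a player or a position that isn't on screen.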
I'm excited about models like the recent Interaction model from Thinking Machines, which is capable of full-duplex understanding. Maybe in another six months it'll be worth coming back to this and comparing today's models against tomorrow's :)