So... everyone has hybrid data now
Previously on Get Goalside: "With the bombast of a start-up, StatsBomb talked about how their data was going to change the game. And the bombast was deserved. Because it did. Because, nowadays, everyone seems to have pressure data."
It's time to talk about new data again.
Like every time we talk about data collection, we need to introduce 'event' data and 'tracking' data as separate individuals. Event data tells you what player made what action, and where on the pitch. Tracking data tells you where everybody is, but doesn't tell you what they're doing. Event data is collected by people; tracking data is collected by cameras and software.
Or, that was how it always used to be.
If you were smart, you would always have been able to combine the two (if, of course, you had the two types). But not many people had those skills.
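For the curious, here is a minimal sketch of what "combining the two" can mean at its simplest: for each event, find the tracking frame closest to it in time. The field names (`t_ms`, `type`, `positions`) and the 25-frames-per-second feed are assumptions made up for illustration, not any provider's actual format, and real feeds also need clock alignment between the two sources, which this ignores.

```python
from bisect import bisect_left


def nearest_frame_index(frame_times, event_time):
    """Index of the frame whose timestamp is closest to event_time.

    frame_times must be sorted in ascending order.
    """
    i = bisect_left(frame_times, event_time)
    if i == 0:
        return 0
    if i == len(frame_times):
        return len(frame_times) - 1
    before, after = frame_times[i - 1], frame_times[i]
    # Prefer the earlier frame on a tie.
    return i if (after - event_time) < (event_time - before) else i - 1


def attach_frames(events, frames):
    """Return copies of the events, each with its nearest tracking frame attached."""
    frame_times = [f["t_ms"] for f in frames]
    return [
        {**e, "frame": frames[nearest_frame_index(frame_times, e["t_ms"])]}
        for e in events
    ]


# Hypothetical data: one frame every 40 ms (25 per second), two events.
frames = [{"t_ms": 40 * i, "positions": {}} for i in range(100)]
events = [{"t_ms": 1010, "type": "pass"}, {"t_ms": 2395, "type": "shot"}]
hybrid = attach_frames(events, frames)
```

Each entry in `hybrid` is an event that now also carries where everybody was (or would be, if `positions` were filled in) at roughly that moment.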
The hybrid data era began in 2018 when StatsBomb first launched themselves as a data company (they'd previously been an important analytics blog (which, disclosure, I wrote for occasionally) and a consultancy). Their data was event data — telling you which player did what action just like everyone else — but for shots, they'd offer something unique.
The 'freezeframes' would give the position of every player on camera when the shot was taken, as well as some information on how the goalkeeper was moving at that moment in time. (A StatsBomb blog post on these shot freezeframes is here).
Two incredible saves in normal time and extra time before three penalty saves in the shootout
An all-timer of a performance from Brice Samba in the Nottingham Forest goal last night #NFFC pic.twitter.com/Qd5UdFDa2u
StatsBomb (@StatsBomb), May 18, 2022
That's useful, right? You get to know how many defenders were in the way of a shot, where they were, where the potential pressure on the shooter was, where the goalkeeper was, et cetera. But why stop at shots?
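To make "how many defenders were in the way" concrete, here is a rough sketch that counts opponents standing inside the triangle between the shooter and the two goalposts. Everything about the data shape is an assumption for illustration: the `location` and `teammate` keys, the 120 x 80 pitch, and the posts at y = 36 and y = 44 are placeholders to be checked against your provider's documentation. Nor is this StatsBomb's own method.

```python
def _cross(o, a, b):
    """2D cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def in_triangle(p, a, b, c):
    """True if point p lies inside (or on the edge of) triangle a-b-c."""
    d1, d2, d3 = _cross(a, b, p), _cross(b, c, p), _cross(c, a, p)
    has_neg = d1 < 0 or d2 < 0 or d3 < 0
    has_pos = d1 > 0 or d2 > 0 or d3 > 0
    return not (has_neg and has_pos)


def opponents_in_shot_cone(shooter, freeze_frame,
                           posts=((120.0, 36.0), (120.0, 44.0))):
    """Opponents (goalkeeper included) between the shooter and the goal mouth."""
    return [
        player for player in freeze_frame
        if not player["teammate"]
        and in_triangle(player["location"], shooter, posts[0], posts[1])
    ]


# Hypothetical freeze frame for a shot taken from (108, 40).
freeze_frame = [
    {"location": (114.0, 40.0), "teammate": False},  # defender in the way
    {"location": (110.0, 30.0), "teammate": False},  # defender off to the side
    {"location": (119.0, 40.0), "teammate": False},  # goalkeeper on the line of the shot
    {"location": (115.0, 40.0), "teammate": True},   # attacker, ignored
]
blockers = opponents_in_shot_cone((108.0, 40.0), freeze_frame)
```

Here `blockers` holds the central defender and the goalkeeper. A real model would also care about distances and body orientation, but the count alone is already something plain event data can't give you.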
Well, three years later, StatsBomb stopped stopping at shots. They started offering freezeframes on all of the events in their event data (which they call StatsBomb 360). I wrote about the StatsBomb 360 launch in this post. As I say in that piece:
"[T]he whole theme of enriching event data makes a lot of sense. It allows you to count more stuff (and more useful stuff at that) while not adding too much extra technical requirement in skills or computational power."
This is where I take a rare victory lap: back in the day, at the first Stats Perform Pro Forum I attended in 2018, I remember a conversation where I suggested that maybe in the future sites like WhoScored (which use event data) would also have columns for tracking data-based metrics. Tracking data would be snipped up to produce regular statistics, and presented alongside event data. Hybrid.
Of course, by this time StatsBomb will have been well underway setting up their own hybrid dataset, and other people will have been using hybrid systems themselves. But still.
Now, StatsBomb's data might be a kind of hybrid, but it's not 'tracking data' per se. It doesn't need to bother being tracking data, it just takes a snapshot. Earlier this week, Stats Perform (who own Opta) launched their own hybrid dataset: Opta Vision.
Opta Vision is a hybrid system in its 'purest' state: combining Opta event data with STATS tracking data. (Stats Perform's press release on Opta Vision is here). Alongside the event data, insights that are only possible using tracking data can be revealed. Things like passing options and pass difficulty, how a team's shape is changing, and when players are making runs to make themselves available.
Both companies having hybrid data systems that cover the vast majority of a match feels like a turning point in football data. We've crossed the threshold into a new world.
This doesn't mean that non-hybrid data is useless. The choice of data provider that someone uses will always partly depend on price, and hybrid data seems likely to be at the pricier end. It will also always be the case that what you do with the data is more important than the data itself. I would cook an expensive cut of meat far worse than a chef would cook a cheap one.
While the biggest factor in a team's success at using data will still remain 'giving time to good analysts', hybrid data is going to help.
But it'll also make 'giving time to good analysts' even more important. Every time a data offering gets bigger that means there are more things to discover, more metrics you need to decipher the value of. StatsBomb and Stats Perform have themselves only recently adjusted their expected goals models based on data they've had sitting around for a few years.
D'you want to know what that means? It means it's the best time in years to get into this. I said we'd crossed a threshold, but a lot of people are going to take a while to work out what's beyond it.
New types of data mean new types of opportunities.
There wasn't a natural place to put this in the main body of this newsletter, but I wanted to touch on StatsBomb's recent American football data announcement too. (A link to StatsBomb's main American football data launch piece is here.)
That data is also a hybrid approach, also built on freeze-frames, but time-based rather than event-based. This makes sense, given that pre- and post-snap movement are so important in that sport and you'd miss a ton of vital information if you only took a freezeframe for the snap, a hand-off or pass, and catches, tackles, whatever.
The frequency of events in (association) football probably means that you're capturing about as many freezeframes by taking them per event as you would if you took them per X tenths of a second, as StatsBomb appear to be doing in American football. That said, I'd be really interested to see a similar time-and-event-based approach in soccer.
The other part of this I'm interested in is the balance between computer vision and human collector input, and the level of accuracy of the data that results.
Former Milwaukee Bucks director of research, author of The Midrange Theory, and StatsBomb's Director of Basketball (hey, when did that job title change?) Seth Partnow was recently on the Expected Value podcast talking about this new data. One of the things he talked about was that it really matters that you get player identification correct at the line of scrimmage. In soccer, some tracking data systems have trouble at corners. In American football, everything's a corner.
There's one last strand to this American football StatsBomb data that is of interest to me.
In his launch article (linked previously), CEO Ted Knutson says "We’re actually delivering low frequency tracking data in the new data". 'Traditional' soccer tracking data runs at about 25 frames per second, but I've wondered before how many of those frames are actually necessary. How much could you strip back and still have useful 'tracking' data?
I suspect that those at the cutting edge, at least, already have some idea; but I suspect that everyone else is gonna find out within the next 5 years.
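One way to start poking at the "how much could you strip back" question yourself, sketched under heavy assumptions: throw away frames, rebuild the gaps with straight lines, and see how far off you end up. The 5-frames-per-second target is an arbitrary choice of mine (I don't know what frequency StatsBomb are using), and the positions here are a made-up player accelerating along a single axis rather than real tracking data.

```python
def downsample(frames, source_hz=25, target_hz=5):
    """Keep every n-th frame so that a source_hz feed becomes a target_hz feed."""
    if source_hz % target_hz != 0:
        raise ValueError("target_hz must divide source_hz evenly")
    return frames[:: source_hz // target_hz]


def max_interpolation_error(positions, step):
    """Worst gap between the true positions and a straight-line
    reconstruction built from every step-th sample."""
    worst = 0.0
    for start in range(0, len(positions) - step, step):
        x0, x1 = positions[start], positions[start + step]
        for k in range(1, step):
            estimate = x0 + (x1 - x0) * k / step
            worst = max(worst, abs(positions[start + k] - estimate))
    return worst


# Hypothetical one-second run at 25 frames per second: position in metres
# along one axis, with the player steadily accelerating.
positions = [0.01 * i * i for i in range(26)]

kept = downsample(positions)                      # 5 frames per second
error = max_interpolation_error(positions, step=5)
```

For smooth movement like this the reconstruction error is small; the interesting cases will be sharp changes of direction, which is exactly where fewer frames would hurt most.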
Thanks for reading - if you've enjoyed this please share it around, subscribe if you haven't already, and consider becoming a full supporter of the newsletter if you have. Have a good day.