What's the worldview of your data?
Conflict of interest: My paycheck comes from a company that is pitched as being 'data provider agnostic'. Take an appropriate pinch of salt, or seasoning of your choice, when reading.
There's always a risk that publishing something after going to a conference will be interpreted as being about the conference. As proof that this piece predated Field of Play, here's a paragraph from the original draft:
Every event and stats data provider is implicitly selling you their worldview. Maybe that sounds dystopian to you, a straitjacket of the mind, but the alternative - doing your own event tagging - means that you need to come up with a complete worldview of your own. My own worldview?? I can't even come up with a coherent opinion on the Jorginho-Chappell Roan drama.
The world moves fast. The Google trends line for 'Jorginho' is now back at normal levels.
Referring to a data provider's collection as their 'worldview' puts an interesting spin on assessing their value. Some of them don't have (passable) phase of play data: does that fit with how you, and the job you're doing, see the game?
In certain parts, it's integral, but in others it might not be. There are parts of a football club where you could probably get by with just expected goals and a couple of other statistics. Let the coaches analyse the games, then take a scan over xG, the number of possessions that enter the final third, and maybe the number of possessions starting in your defensive third that end in the opposition half. It'd cover chance creation, chance conversion, aspects of defensive efficacy – not an awful team-level review process.
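For the curious, here's roughly what that bare-bones review could look like in code. It's a minimal sketch, not anyone's actual pipeline: the `Possession` record and its field names are hypothetical, and it assumes your provider (or your own tagging) already gives you possession-level data with something like these flags. The same function run over the opposition's possessions would give you the defensive side of the picture.

```python
from dataclasses import dataclass


@dataclass
class Possession:
    """Hypothetical possession-level record; field names are assumptions."""
    start_third: str                 # "defensive", "middle" or "final"
    entered_final_third: bool        # did the possession reach the final third?
    ended_in_opposition_half: bool   # where the possession finished
    xg: float                        # summed xG of shots in this possession
    goals: int                       # goals scored from this possession


def team_review(possessions):
    """Summarise one team's possessions into a handful of review numbers."""
    # Possessions that began in the team's own defensive third
    deep_starts = [p for p in possessions if p.start_third == "defensive"]
    total_xg = sum(p.xg for p in possessions)
    goals = sum(p.goals for p in possessions)
    return {
        "xg": total_xg,
        "goals_minus_xg": goals - total_xg,  # crude chance-conversion check
        "final_third_entries": sum(p.entered_final_third for p in possessions),
        "deep_starts": len(deep_starts),
        "deep_starts_ending_in_opp_half": sum(
            p.ended_in_opposition_half for p in deep_starts
        ),
    }
```

Notice how much worldview is already baked in: someone had to decide what counts as a possession, where the thirds begin and end, and which xG model to trust.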
But change your role, change your focus, and your view of the world will change. If you're focused on player development, you'll need much more granular data. But that doesn't just mean 'events': that a player blocked a cross is not particularly useful; you'd want to have data on any 'confrontation', whether a cross or a 'duel' was attempted or not, to help grade that facet of their game.
The idea that a data provider imposes their worldview on you is something that's been on my mind for a while. That's partly because so much knowledge-sharing in the analytics world is about process, which makes it feel worth poking more at what actually matters in the data packages.
With this in mind, I put together a fun little game. Hit some buttons and find out what matters to you and what matters to the crowd. If you've got feedback, lemme know.
The ulterior motive here is obviously to hark back to a previous blog, 'What if we didn't care about passes?'. Hopefully we can establish once and for all that we could free up event collectors' time by being loosey-goosey with pass collection.
I'm curious, of course, how different the crowd's worldview will be from the 'worldview' of existing data providers. When companies like Statsbomb, Impect and Skillcorner burst onto the scene – giving off vibes of taking meaningful market share* – how much of that is purely because of the worldview of the data spec? How much is it because of associated toys (like platforms and interoperability), how much of it is price, how much of it is 'grass is greener' syndrome, and how much of it is just a persuasive (or nicer) sales rep?
*I don't have actual numbers but, as always, am welcoming of more data
With the increase in broadcast tracking data, it feels like these questions will only become more relevant. Is the race among these companies going to be the race for a better worldview, for a dataset that lets people quantify their own worldview more easily, or for a 'good-enough' worldview that is easier to fit into Internal Processes?
I started this blog – all those many days ago when people were still mostly unaware of the blended family links between Jorginho and Jude Law – hoping to reach an answer to these questions. I didn't. My worldview is still in flux.
Amusingly, as I was putting the final touches on this, I noticed Statsbomb recently updated their Python package to include their aerial duel rating system. 'Would you rather have duel win probability or a particular set of extensive, discrete tags' is exactly the kind of data availability trade-off I'm curious about, and it's what went into the making of the football data feature ranking game.