When is data 'right'?

data [noun]: information, especially facts or numbers, collected to be examined and considered and used to help decision-making (Cambridge English Dictionary excerpt)


Nothing dispels the idea that data arrives immaculately conceived quite like the census.

Each one brings new debates about what questions should be asked, what options should be given, and how things should be phrased. The Wikipedia page 'Classification of ethnicity in the United Kingdom' alone runs to more than 3000 words. Not including footnotes.

Fortunately, football data doesn't have the self-identification sensitivities involved in the census, but most of the same basic problems still apply. The first one being: how do you categorise the data you're collecting?

Categories don't just spring from nowhere, created with some kind of distant objectivity. English-born Opta are, as far as I know, the only data provider to classify 'tackles' as a separate and wholly individual event in their data. For others - like Wyscout, StatsBomb, Ortec - what English viewers might call 'tackles' are found under the family of 'duels', a more Continental approach. Neither is incorrect; they're just different ways of seeing the sport.

Can one set of categories be the 'right' one? Or just more appropriate and helpful for what the data will be used for?


Categorising the data isn't where the trouble stops. Once you've decided whether to collect tackles or duels you have to define them. How close do players have to be to make a duel? If a heavy touch goes to a teammate, is that a pass? When Sadio Mané slid towards Zack Steffen to score in the FA Cup semi-final, was that a tackle, a block, a shot?

Oh, and if it's not a shot, then that scuppers your nice rule that shots are the only event that can have expected goals associated with them.
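For the code-minded, here's a rough sketch of how that clash might show up in practice. Everything in it is made up for illustration: the field names (`type`, `xg`, `is_goal`) and the event labels don't belong to any real provider. It simply looks for goals with no xG attached, and for xG attached to anything that isn't a shot.

```python
# Hypothetical event records. Field names and type labels are invented
# for illustration and do not follow any real provider's schema.
events = [
    {"id": 1, "type": "shot", "xg": 0.12, "is_goal": False},
    {"id": 2, "type": "tackle", "xg": None, "is_goal": True},  # a goal logged as a tackle
    {"id": 3, "type": "pass", "xg": None, "is_goal": False},
    {"id": 4, "type": "shot", "xg": 0.45, "is_goal": True},
]


def goals_without_xg(events):
    """Return the ids of goals that carry no xG value."""
    return [e["id"] for e in events if e["is_goal"] and e["xg"] is None]


def xg_on_non_shots(events):
    """Return the ids of non-shot events that carry an xG value."""
    return [e["id"] for e in events if e["type"] != "shot" and e["xg"] is not None]


print(goals_without_xg(events))
print(xg_on_non_shots(events))
```

Whichever list comes back non-empty tells you which way the provider has resolved the fuzzy case, and whether your 'shots only' rule survives contact with it.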

Most people outside the data world would probably be confused if two data providers had two different shot counts for a match. The Mané example is far from the only area where definitions might be at odds though. Cross-shots are another obvious one. Contested first-time finishes, off cut-backs or conventional crosses, can be another.

If categories can't be 'right', then definitions can't either, only sensible. The fuzzy unusual cases, like Mané's goal, should also be treated with as much consistency as possible.

Thankfully, there are at least some things that are unarguable. Once you've decided on categories and definitions, assigning the correct player to that action is a matter of fact. Returning to the previous example (sorry Zack Steffen), you could call Mané's action a tackle or a shot, but saying that it was done by Mohamed Salah would be objectively wrong.

There's a kind of 'central source of truth' to things that officials are involved in too: goals, cards, fouls, substitutions, offsides, restarts (e.g. kick-off, goal-kick); although it can sometimes be tricky to tell who the referee has given a foul against. (VAR and playing advantage also throw up some minor philosophical questions: should you collect things which would have been fouls if the ref hadn't played advantage? Is it incorrect to assume, therefore, that all fouls stop play?)

Official-related incidents only cover a small portion of a match though. For everything else, it's like the Sadio Mané shot again.


If you've read this far then you've already implicitly answered this question*, but you'd be well within your rights to be asking: why does any of this matter?

*Because it's interesting!

It matters if you're setting up your own data collection company, of course. It also matters, to a slightly lesser degree, if you're making a choice of what data source to use. And those decisions have to be made quite frequently.

As this newsletter has outlined, the number of cases where football data can be objectively incorrect is relatively small. Get players and teams right, get goals right, get location coordinates within a forgivable margin of error: sorted.

Most things outside that fall in the realm of 'is this sensible, unexpected but justifiable, or unexpected and faintly ridiculous'. It might be justifiable, for example, to have a slightly wider or narrower definition of a duel than you would use yourself. It might be faintly ridiculous, meanwhile, to [redacted]*.

*Note: no single company has a monopoly on mistakes and mishaps

How can you tell whether you're able to trust a dataset then? (Apologies to any data analysts/journalists reading who may be having unpleasant flashbacks). Diving into the documentation is a simple place to start. Does the way that they've categorised and defined their data make sense to you? Does it align with how you think about the subject, and how you plan to use the data?

Looking at some basic summaries is also good. Goal tallies should match figures from elsewhere, and other stuff, even if the definitions aren't what you'd write, should pass the smell test. Check for inconsistencies or missing data.
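As a minimal sketch of those summary checks, assuming events arrive as simple records: the field names and the `reference_goals` figures below are invented for illustration, standing in for whatever trusted source you'd compare against. It ignores wrinkles like own goals, which a real check would need to handle.

```python
from collections import Counter

# Hypothetical event records; field names are assumptions, not a real schema.
events = [
    {"match_id": "M1", "team": "Home", "type": "shot", "is_goal": True, "player": "A"},
    {"match_id": "M1", "team": "Away", "type": "shot", "is_goal": False, "player": None},
    {"match_id": "M1", "team": "Home", "type": "shot", "is_goal": True, "player": "B"},
    {"match_id": "M1", "team": "Away", "type": "pass", "is_goal": False, "player": "C"},
]

# Invented reference tallies, e.g. typed in from official results.
reference_goals = {("M1", "Home"): 2, ("M1", "Away"): 1}


def goal_tallies(events):
    """Count goals per (match, team) from the event data."""
    tallies = Counter()
    for e in events:
        if e["is_goal"]:
            tallies[(e["match_id"], e["team"])] += 1
    return tallies


def tally_mismatches(events, reference):
    """Return {(match, team): (from_events, from_reference)} where the two disagree."""
    tallies = goal_tallies(events)
    keys = sorted(set(tallies) | set(reference))
    return {
        k: (tallies.get(k, 0), reference.get(k, 0))
        for k in keys
        if tallies.get(k, 0) != reference.get(k, 0)
    }


def missing_field_counts(events, fields):
    """Count how many events have no value for each of the given fields."""
    return {f: sum(1 for e in events if e.get(f) is None) for f in fields}


print(tally_mismatches(events, reference_goals))
print(missing_field_counts(events, ["player", "team"]))
```

An empty mismatch dictionary and zero missing values don't prove the data is 'right', but a non-empty one is a cheap early warning.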

If time and data are available, recreating some of your existing processes or reports using a different data source is probably sensible. Fingers crossed that a team doesn't suddenly go from league-leading to mid-table in something. Because if one does then you've gotta work out why, and which version feels closer to 'the truth'.
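One way to spot that kind of jump is to rank teams on the same metric under each source and flag anyone who moves more than a few places. A rough sketch, with invented team names and numbers, and an arbitrary `threshold` you'd want to tune to the size of the league:

```python
def rank_teams(metric_by_team):
    """Rank teams from highest to lowest value (1 = top); ties broken by name."""
    ordered = sorted(metric_by_team, key=lambda t: (-metric_by_team[t], t))
    return {team: pos for pos, team in enumerate(ordered, start=1)}


def rank_shifts(source_a, source_b, threshold=2):
    """Return {team: (rank_a, rank_b)} for teams whose rank moves by at least threshold."""
    ranks_a = rank_teams(source_a)
    ranks_b = rank_teams(source_b)
    shared = sorted(set(ranks_a) & set(ranks_b))
    return {
        t: (ranks_a[t], ranks_b[t])
        for t in shared
        if abs(ranks_a[t] - ranks_b[t]) >= threshold
    }


# Invented season totals for the same metric from two hypothetical sources.
source_a = {"Team 1": 310, "Team 2": 290, "Team 3": 275, "Team 4": 240}
source_b = {"Team 1": 255, "Team 2": 300, "Team 3": 280, "Team 4": 262}

print(rank_shifts(source_a, source_b))
```

Here 'Team 1' drops from first to fourth between sources, which is exactly the sort of result that sends you back to the definitions to find out why.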

Ultimately, it's a case of looking for whether the conceptual framework behind the data collection seems sensible and useful. Then a case of whether there are clear errors or inconsistencies. Then? There's a fair chunk of trust involved. You can't check every data point, although you should do your due diligence.


So when is data 'right'?

Data is right when it's correct in matters of straight fact, and sensible, consistent, and useful everywhere else. Many, if not all, football clubs still do some data collection of their own, with video analysts tagging matches post- and/or in-match. That's because existing providers don't match all of their uses, so they need to create some added useful data of their own.

A difficulty is that some potential purchasers may not have the time or know-how to do their own due diligence. There also isn't (at the moment, at least) the kind of public, community-led due diligence that we've seen around pandemic public health data, where it's common to see tweets investigating and explaining quirks in the data (e.g. case reporting delays around holidays, or changes in collection practices).

Really, "when is data 'right'?" is a relatively easy question to answer. "Do you know if this data is 'right'?" — that's the one that takes the work to answer.