A defence of early-analytics' mistakes

The analytics ‘movement’ gets a lot of stick sometimes (although, it must be said, generally from people who are never going to give it much credit anyway). The analytics ‘movement’ also tends to over-think things. I’m going to indulge both by writing something of a review and an apology of the things that ‘analytics’ has gotten wrong over the past five years or so.

Disclaimer: This will all be based on my recollections, which may be faulty, but probably little more so than anyone else’s. I’ll also likely slip into referring to ‘analytics’ as a singular entity with singular opinions, which it isn’t, but the meaning is broadly clear.

‘Don’t talk about G, talk about xG’

There was a moment in time on analytics twitter when it seemed you could scarcely move for discussions about whether ‘expected goals’ was a good name, why it was a bad one (the ‘why’ was always why it was bad), and what a better name might be (there weren’t many good options put forward).

In retrospect, I think that there was something accurate in these discussions — a struggle to communicate well the fruits of analytics — but a missing of the mark as to what the actual problem was.

In his talk at the recent StatsBomb Innovation in Football conference, Seth Partnow (former Director of Basketball Research at the Milwaukee Bucks) talks about ‘good’ and ‘bad’ stat names. Some examples of bad ones come from the old ice hockey days: Corsi and PDO, both giving very little insight into what the stats mean. ‘Expected goals’ may not be perfect, but it’s good enough.

A bigger problem was just that there was very little content out there that aimed to explain expected goals to a general audience. Joel of MessiSeconds fame was basically the only person seriously doing this.

Part of this, I think, is because the people who were involved in analytics early doors were somewhat wedded to the purity of the numbers and the statistical method (understandably and defensibly). Another part is that xG is a fairly limited tool, through which only a relatively small selection of stories can be told, and they mostly boil down to “this team/player you like? yeah, they’re Actually Bad”.

I exaggerate there for effect, but the dampening of excitement around players or teams on a hot streak was certainly a theme around that period of time, and it understandably pissed people off. Joel/MessiSeconds got a lot of angry West Ham fans in his mentions for his video in October 2015 which (accurately) predicted that the Hammers would drop off from their 3rd-place form. (They finished seventh.)

And then the analytics ‘movement’ as a whole caught a hell of a lot of anger for its judgement of Marcus Rashford at the end of the same season…

Rashford/Iheanacho

The Marcus Rashford-vs-Kelechi Iheanacho ‘debate’ can be summed up really well by a single Michael Caley tweet (the tweet itself seems to have vanished into the ether, but I happen to still have it, having used it in an explanatory xG article at the time, in May 2016):

While the surrounding debate featured a lot more snark from both sides, this tweet is the essence of what this whole thing was. It’s well-centred, while giving an impression of why people so took against analytics twitter, particularly the more brash/sarcastic sections.

It should be said that even people who liked Iheanacho by eye have little answer for what’s happened to him since that 2015/16 debut season where he scored 8 goals in the equivalent of 8.5 matches (766 minutes) for Manchester City. He’s now very rarely getting gametime at Leicester City. So while one could say that ‘analytics got Iheanacho wrong’, so did everybody else.

But it was mainly the pissing on the Marcus Rashford bonfire that got people up in arms.

At the time he was vastly overperforming his expected goals (unsurprising considering that he scored in the debut of virtually every competition he played in), and stat-types pointed this out. They probably could have done so more tactfully, but it was a valid thing to note, and Rashford’s goal output hasn’t been anywhere near as high as it was in his early, explosive, debut days.

In that first half-season, he got 0.51 non-penalty goals per 90 minutes (h/t FBref); since then that rate has been 0.32, 0.44, 0.33, and 0.4 in 2019/20. In expected goals terms — and to provide a similar metric to Michael Caley’s original tweet, which had him averaging around 0.33 expected goals + expected goals assisted per 90 — he’s enjoyed a continual rise in the Premier League. For the past three seasons, his rate has been 0.42, 0.5, and 0.57 per 90.

That increase could partly be a natural progression as the player ages, and partly because Rashford has spent more time as a central striker. Crucially, though, that 0.42 from 2017/18 (the earliest season for which FBref has this data) probably isn’t too far off what Rashford might have had in 2015/16 if he hadn’t been hampered by an unusually restrictive manager in Louis van Gaal.
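A quick aside on the ‘per 90’ rates quoted in this section, for anyone who hasn’t met them: it’s just a total scaled to 90 minutes of playing time. Below is a minimal Python sketch using the Iheanacho figures from earlier (8 goals in 766 minutes). The function name `per_90` is my own invention for illustration, not anything FBref or anyone else uses.

```python
def per_90(total: float, minutes: float) -> float:
    """Scale a raw total (goals, xG, ...) to a rate per 90 minutes played."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return total / minutes * 90


# Iheanacho's 2015/16 debut-season numbers, as quoted earlier in this piece.
goals, minutes = 8, 766

matches_equivalent = minutes / 90        # minutes expressed as full matches
goals_per_90 = per_90(goals, minutes)    # goals scaled to a 90-minute rate

print(f"{matches_equivalent:.1f} matches' worth of minutes")  # 8.5
print(f"{goals_per_90:.2f} goals per 90")                     # 0.94
```

Bear in mind this isn’t quite like-for-like with Rashford’s 0.51, which counts non-penalty goals only, but it gives a sense of the scale involved.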

While Rashford/Iheanacho will likely be held against stattos by some, I don’t think it’s wholly fair to use it as an example where analytics got it wrong (and where the predictions were wrong, the analytics ‘movement’ is far more likely to review and reassess methods and predictions than those outside it). What could certainly have been different is the tone.

The need to be loud to be listened to…

This is probably a good time to talk about why early analytics twitter seemed to piss off so many people with its tone. I’ve mentioned already how I think that xG analysis lends itself (too much) to party-pooping, and there’s another turn-off that seems to be baked in too — the tendency to come off as “I’m smarter than you” when using data to back up, or create, arguments.

These are things that I think even media communication experts would struggle to deal with (and god knows we in the analytics-verse are not that). But another element of the disdain some have for analytics twitter is the need to be loud to be listened to when you’re a small group who are trying to disrupt the status quo in one way or another.

It feels crass to compare analytics in football to activists trying to shift the Overton window in political discourse, but it also seems to me to be basically the same phenomenon. Radicals yell about their given cause and make wild claims while more moderate sections quietly go about their business and talk to the unconverted in a slightly less obnoxious way; and then the mass public grows in awareness and understanding of this new cause, and some become radicals themselves and some become moderates; and the cycle continues and continues as the movement grows and/or change occurs.

Naturally, though, the loudness annoyed some people, was at times seen as representative of the wider ‘movement’, and, I think, unintentionally led to another thing.

‘Don’t cross’, and other prescriptive advice

Some have criticised analytics folks for giving advice like ‘don’t cross’, advice which is, to them, either too rigid or too obvious.

Early analytics research did indeed include the observation that through-balls were more likely to lead to goals than crosses, for example, and also that corners into the box might not be as exciting as in-stadium crowds often appear to think they are (although, somewhat ironically, StatsBomb are now one of the principal cheerleaders of set-pieces).

There are a few things going on in the criticism that the analytics movement offered prescriptive, and very basic, advice, but the main one is that all of this didn’t tend to be advice at all.

More often, it was just research about football, trying to work out what we could learn about football from the data. Sometimes it backed up conventional wisdom (through-balls create higher value chances than crossing); sometimes it changed conventional wisdom (shooting from distance ain’t that great). Practitioners, though, tended to be aware of the limitations of their research.

One other important point is that the simple analysis done in the early days was very useful for quantifying aspects of the sport. It needed to be done, even if it wasn’t riveting, and, more importantly, it was useful to know what conclusions were to be found in the data. Data analysis that backs up conventional wisdom is still really important because it helps rule out the possibility that conventional wisdom is wrong.

A final point to note is that this early, simple(r) analysis coincided with the need to be loud to be listened to, which isn’t ideal but was probably an awkwardly necessary part of the journey.

I’ve now written 1300+ words addressing the dissatisfaction and disdain that some people still hold for the analytics ‘movement’, and so I’m going to spend the last part of this focussing on happier things. Things I’d like to see more of/tell past-me to focus more on if I could go back in time 4-5 years.

The happy/advice section

Make it fun/relatable

This isn’t necessarily easy. The point of analytics is to be analytical, and that doesn’t tend to link up well with hot takes. As I mentioned before, a lot of the early analytics content was pouring cold water on the hot takes of the day. But maybe you can use stats to craft hot takes that have a larger chance of standing up over time.

I’ve also always thought that stats guys should just, like, fucking swear more. I think this is part of the popularity of things like the StatsBomb podcast: not that it’s an R-rated f-bomb-a-thon, but that it’s normal people talking about stuff. There’s a tendency in analytics-people writing (whether because of the weddedness to the stats or because of our largely middle class backgrounds) to be pretty highfalutin’ in our language, and if we’d spoken like normal people rather than dickheads then people might’ve been more receptive to us.

Spend more time on trivia

By ‘trivia’ I don’t mean ‘who was the Shrewsbury Town captain in 1990/91’. I don’t even mean ‘Virgil van Dijk has made X interceptions this season’; no, I mean things like how teams get to the box, which players pass the most on a team, splitting a player’s chances up into really good ones and ok ones and pretty bad ones. In other words, things that are relatively simple, relatively tangible, and relatively important to know.

I’ve been really inspired by the aforementioned Seth Partnow’s work on Twitter and in The Athletic since he left the Bucks, as it’s the kind of thing I’ve always liked messing around with. You don’t need to understand the science of statistical modelling to calculate it, and it’s all a lot less abstract than something like expected goals (which, while a pretty simple concept, is still nonetheless a concept one needs to get one’s head around).

If there’d been more ‘good trivial’ stats (and not just possession percentages) used by stats-folks and the media, could that have paved the way better for expected goals? Maybe, idk.

In the end, it doesn’t matter

I’ve typed a hell of a lot of words here about stats and mistakes and things that could have been better, but in the end xG is on Match of the Day, Liverpool are getting adoring press over their analytics department, and The Athletic have been (for whatever reason…) liberally using Opta graphics in their articles*.

(*being from a company that helps media companies make data visualisations, I have opinions about this, but the main one is that the visualisations that we make at Twenty3 are great and The Athletic, like every other media company, should sign up to use our Content Toolbox)

The point being… if this was ever a fight, it’s the analyticos that’ve won it.
