On language, stats, and football

Americans call aubergines ‘eggplants’. For those familiar with the deep purple skin of the vegetable(?), this seems peculiar. To British ears, it’s one of those “oh, America!” things to roll one’s eyes at and quietly mock, like ‘pants’ and ‘soccer’ and the missing ‘u’ in ‘colour’.

And then you learn that there’s a variety of aubergine that is white, and small, and generally looks quite like an egg.


Image from Wikipedia

It doesn’t matter that Americans call the purple version an eggplant even though it doesn’t look like an egg. It’s a little confusing to people unfamiliar with the term, but once you pair the word with the object it’s, well, just like any other name. Carrots don’t look much like cars.

The name ‘expected goals’ has been debated for almost as long as the statistic has existed, from back when it was calculated on stone tablets in the far distant past of the Allardyce Age. Despite the name’s detractors, the metric (a computer-modelled judge of chance quality) still goes by that name on the screens, airwaves, and pages of an ever-increasing number of outlets.

Y’know what name from that era, when ‘expected goals’ was first coined, hasn’t stuck? xG2.

If you’re into stats, you might know xG2 by the name ‘post-shot expected goals’ or ‘expected goals on target’. It’s different from the other/normal expected goals stat because that one is only built to judge the quality of an opportunity at the moment a player hits a shot. If a striker makes a hash of it, we still want to know that it was a good chance. Therefore, the model for xG doesn’t factor in where a shot ended up.

But xG2 does. Or, did, when it was called that. Anything off-target gets a post-shot xG value of 0, because the trajectory it was on means it would never turn into a goal. But if two shots were exactly alike at the point a striker hit them, post-shot xG will give a higher value to the one that goes towards the top corner than the one that goes down the middle of the goal.
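To make the distinction concrete, here is a toy sketch in Python. It is not how any provider's model works: the `Shot` fields, the `placement_multiplier` idea, and every number are assumptions invented for illustration, whereas real post-shot models are fitted to shot-trajectory data. It only shows the two behaviours described above: off-target shots drop to zero, and two shots with identical pre-shot xG diverge once placement is considered.

```python
from dataclasses import dataclass


@dataclass
class Shot:
    pre_shot_xg: float           # chance quality at the moment of the strike
    on_target: bool              # did the trajectory end inside the goal frame?
    placement_multiplier: float  # hypothetical: >1 for corners, <1 for central shots


def post_shot_xg(shot: Shot) -> float:
    """Toy post-shot value: zero if off target, otherwise scaled by placement."""
    if not shot.on_target:
        return 0.0
    # Cap at 1.0 because the value is meant to read as a probability.
    return min(1.0, shot.pre_shot_xg * shot.placement_multiplier)


# Three invented shots that were identical at the moment they were struck.
top_corner = Shot(pre_shot_xg=0.10, on_target=True, placement_multiplier=3.0)
down_the_middle = Shot(pre_shot_xg=0.10, on_target=True, placement_multiplier=0.4)
skied = Shot(pre_shot_xg=0.10, on_target=False, placement_multiplier=1.0)

for name, shot in [("top corner", top_corner),
                   ("down the middle", down_the_middle),
                   ("off target", skied)]:
    print(f"{name}: xG={shot.pre_shot_xg:.2f}, post-shot xG={post_shot_xg(shot):.2f}")
```

All three shots carry the same xG; only the post-shot figure separates them.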

xG2 died as a name, replaced by the more descriptive post-shot xG or xG on target. (Neither of those two options has definitively won yet, but my preference, as well as the frontrunner, is the former.)

Seth Partnow, in a presentation at the 2019 StatsBomb conference, said that a good name for a statistic doesn’t ensure its take-up, but a bad one ensures it won’t be taken up. I think I agree (with the caveat that absolutes make for good quotes but bad models of reality). All you want to do when naming something is to pick a moniker that won’t get it laughed out of town or put on a government watchlist. Who is ‘Aston Martin’ and what do they have to do with cars; how should I be expected to know that an ‘Apple Macintosh’ is a computer; why do Americans still call aubergines ‘eggplants’? It doesn’t matter.

The resistance to the term ‘expected goals’ is real, but I would guess that it’s primarily an opposition to the stat itself rather than the name. Remember when people made fun of ‘Twitter’ and, uproariously, quipped that even if they did join the platform they wouldn’t know what to twit and twat about? The problem wasn’t people’s resistance to ‘Twitter’, the word, it was people’s resistance to Twitter, the concept.

Language evolves in a strange way. Definitions of words guide how we use them, but these definitions are also shaped by usage. You also have very similar words meaning subtly different things in sometimes quite similar contexts.

Take football, and the phrases ‘a good finish’ and ‘a good finisher’. A ‘finish’ can be almost any shot that leaves the viewer scrambling for a synonym for ‘shot’. Usually it will be a placed effort, with the instep, but that could mean it’s struck from eight or 18 yards. It could also mean a header, or a volley.

A ‘good finisher’ is a much more narrowly defined description. Tap-ins and one-on-ones are the things judged here, usually. A player could dribble down the wing, cut inside, and curl a shot into the top corner from the edge of the box and that would be a great finish, but that wouldn’t be what comes to mind if they were described as ‘a good finisher’, I believe.

And then there’s ‘finishing skill’. In statistical circles, this would be ‘how does a player’s goalscoring compare to their expected goals figures’. This would include every type of shot: tap-ins, one-on-ones, curled strikes, volleys, placed headers, power headers, directing a pull-back from 12 yards, driving a strike from 20.

Disclosure: I strongly dislike ‘finishing skill’ being used in this context.

It doesn’t take a single skill to score goals from all those types of chances I just listed. It takes a whole range, and there are countless players who are very good at some but not so good at others.
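A rough Python sketch of what that lumping-together looks like in practice. The shot log is entirely invented, and "goals minus xG" is just the over/underperformance comparison described above, not any particular provider's definition of finishing skill.

```python
from collections import defaultdict

# Invented shot log for one hypothetical player: (shot type, xG, goal scored?)
shots = [
    ("tap-in", 0.60, True),
    ("tap-in", 0.55, True),
    ("one-on-one", 0.35, True),
    ("one-on-one", 0.30, False),
    ("header", 0.15, False),
    ("header", 0.20, False),
    ("long shot", 0.04, False),
    ("long shot", 0.05, True),
]


def goals_minus_xg(shot_log):
    """Over/underperformance: goals scored minus expected goals."""
    goals = sum(1 for _, _, scored in shot_log if scored)
    xg = sum(value for _, value, _ in shot_log)
    return goals - xg


# Group the same shots by type so each kind of chance can be judged separately.
by_type = defaultdict(list)
for shot in shots:
    by_type[shot[0]].append(shot)

overall = goals_minus_xg(shots)
per_type = {kind: goals_minus_xg(log) for kind, log in by_type.items()}

print(f"overall: {overall:+.2f}")
for kind, diff in per_type.items():
    print(f"{kind}: {diff:+.2f}")
```

With these made-up numbers the single overall figure comes out positive while the headers figure is negative, which is the problem with treating it as one skill. (Eight shots is far too few to conclude anything about a real player; the point is only the shape of the calculation.)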

It is confusing that ‘a good finish’ and ‘a good finisher’ have slightly different uses. It (to me) is confusing that a player’s ability to score headers and mazy dribbles and tap-ins might be lumped together as one single skill. It (to me) is doubly confusing to add the latter to the former linguistically and confuse things even further.

But what if ‘finishing skill’ takes off, as a way of saying ‘over/underperformance of expected goals’? Will I have been wrong to have disliked it?

Statistics aren’t the only area where this clash of languages is happening in football. By coincidence — or by Twitter — statistical and tactical concepts butted heads with the mainstream at about the same time. Talk of ‘pressing’ and ‘halfspaces’ riled the kinds of people who were also riled by ‘expected goals’. Just as I would argue that people took against expected goals rather than ‘expected goals’, I think people took against the presence of new terminology rather than the new terminology itself.

‘Pressing’ and even ‘counterpressing’ are firmly in the English lexicon now though (although some still use the German gegenpressing when referring to the latter). ‘Halfspace’ has even crept into the mainstream. It might become fully normalised, it might not.

One of the reasons it might not, in England at least, is that English football already had the term ‘channels’. ‘Running the channels’ is a phrase that seems as old as time to me, and which, while it doesn’t explicitly mean the same thing as halfspace, has a fair amount of overlap.

(I *think* that ‘running the channels’ means running into the physical channel left between centre-back and full-back in a back four, which tends to be a diagonal run starting from the centre or halfspace and ending in the halfspace or on the wing. It’s a slight problem that ‘channel’ could refer to a space between players or a more firmly defined area of the pitch, but it wouldn’t surprise me if English football adopted ‘left channel’ and ‘right channel’ to mean ‘left halfspace’ and ‘right halfspace’.)

Halfspace or channel, neither matters. What matters is that you can communicate effectively with the people you need to communicate effectively with. Even within the professional game, clubs all have their different things they do and talk about that other clubs don’t. Phases of play, apparently, is a big one. ‘Build-up’, like beauty, is in the eye of the beholder.

As long as they have the term defined enough that everyone understands it, that’s fine for them. For media audiences, things are a little different because the net is cast a little wider. People are always, unseen, coming and going too, meaning that it’s hard to keep everyone up to speed. Yet media entities find a way of helping their audience through it (if they’re good).

The latest attempt by a niche to change the way people speak about football comes from StatsPerform (aka Opta’s parent company).

As Mackriell was referencing, James Gheerbrant of The Times recently wrote a piece discussing two models: DAVIES from Sam Goldberg and Michael Imburgio, and StatsPerform’s own Role Discovery model.

As much as I believe these models may be useful, I don’t think they’re going to become the ‘new descriptive norm’, for two reasons.

One is that I don’t think they’re changing much; the other is that the change they make might not be a good one.

I have a few disagreements with Gheerbrant’s piece, although I’ll flag very clearly that I think The Times’ sports pages are a far better place for having hired Gheerbrant, and the problem is mainly created from trying to introduce these models to the public, which is never easy.

The first disagreement is the set-up to the piece, which is really a way of talking about statistical modelling but disguising it as talking about the English men’s national football team:

The argument against picking five right-backs [as Gareth Southgate has done] is an excess of homogeneity — in other words, having five guys who all do the same job is a misuse of the finite places in an England squad. And if you accept the principle that players who occupy the same position in the team have the same function, this argument holds true.

First point of departure: I don’t think that ‘accepting the principle that players who occupy the same position in the team have the same function’ is at all widespread. Maybe people who don’t spend much time thinking about football think that Phil Bardsley and Dani Alves fulfil the same function, but I think most people accept that different right-backs have different skills.

If this was just Gheerbrant segueing into talking about modelling I’d let it slide, but it’s also the way that I’ve seen some people talk about this type of work. And that bugs me. A person who talks about midfielders as if they do or can all play the same role — who hasn’t already taken notice of the various ways we describe midfielders who do different jobs — isn’t going to change their vocabulary because StatsPerform ran a cluster analysis.

My second point of contention is with this:

But the rigid language that we use to label players (and this is partly a linguistic issue — other football cultures are permeated by much more descriptive terms such as regista in Italian and enganche in Spanish) still shapes how we think about them. Alexander-Arnold, a primarily creative player, is saddled with the tag of “defender” and the expectations that go with it.

I don’t necessarily think Trent Alexander-Arnold is saddled with expectations of ‘defender’, and if he is I don’t necessarily think that’s a language problem. There are other full-backs that we know don’t do much defending because they’re really attacking. We call them wing-backs.

Maybe Alexander-Arnold doesn’t fit ‘wing-back’ or ‘full-back’ perfectly, but some simple variation would probably work. ‘Attacking full-back’ or ‘high full-back’ or something.

According to DAVIES, he’s an ‘offensive wide progressor’. Per Role Discovery, a ‘wide active playmaker’. I dunno. If somebody doesn’t acknowledge that different right-backs play in different ways, I don’t think they’re gonna start calling them wide active playmakers in a Damascene moment.

I think that these models do have uses. If you’re recruiting and looking for players that can fit into your system, the option of searching for ‘wide active playmaker’ rather than simply ‘right-back’ could be really useful. Similarly, with regard to Mackriell’s tweet again and his reference to teams playing in ‘shapes’ rather than ‘formations’, that could have a real impact in opposition analysis. Think ‘find me all the clips when team X is in a disorganised transitional moment after failing to counterpress in an asymmetric 3-1-3-3’ rather than ‘when they’ve just lost the ball in a 4-5-1’. Or something like that (I confess, I haven’t seen much of their work on this aside from their ability to use tracking data).

But… None of this matters. I recently wrote about formations, and why we call something a 4-3-3 and not a 2-3-5. I tweeted the link out earlier today and someone sent me a tweet of theirs in reply.

I agree with it. We all know that formations and players play in different ways, even if we don’t talk about it in sophisticated language. Nobody has ever expected a team’s shape to look like table football. Gheerbrant says it himself in his piece: “we all know, for example, that Aymeric Laporte and James Tarkowski have vastly different roles in their respective teams.”

We know that players have different roles. It’s only been a decade since Michael Cox coined ‘inverted wingers’ and it stuck. Maybe we could use more Cox-es to make labels for player roles, but I flat out refuse to believe that people aren’t talking about roles and shape already.

I’m on the final stretch now.

I think that expected goals is a fair name; you’ve just got to be able to describe it well enough to push past people’s initial unfamiliarity. That’s the same with anything, though, xG or Twitter or TikTok.

But if you’re using a term that can be confused with existing vocabulary, you’re on uncertain ground. ‘Finishing skill’ is so wrapped up in the linguistic fuzziness of ‘a finish (shot)’ and ‘a finisher (player)’ that the concept would need to be really solid to break through. And I’m just not convinced it is. I don’t believe that the concept of a player over/underperforming their chance quality on all varieties of shots they take is something people talk about.

‘Halfspace’ is easy to describe, but there are existing words that sum it up that people can grasp at if they don’t like new things. ‘Left/right channel’ or ‘left/right of centre’ can be used. I think that ‘halfspace’ shows that new language needs to prove its usefulness, and perhaps it isn’t of much use to mass media football talk.

Most people consume football via television nowadays, and even on radio detailed descriptions usually only come when describing goals. A new term for an area of the pitch isn’t a high priority. But it’s a word that describes something; we’ll just have to see how the evolution of language takes its course on that one.

And with role descriptions, it’s something we’re already talking about. If people mistakenly think that Jack Grealish can be interchanged with Raheem Sterling or Jordan Henderson, that’s because they don’t know what Jack Grealish does. They’re relying too heavily on his position as a heuristic for his role, rather than not believing in roles at all.

There are questions to be asked, too, about whether a statistical model that measures on-ball output should be used to describe roles of players who do more than just on-ball stuff.

This has been quite a lot of me talking about what I don’t like, so I think it’s only fair that I offer up something that (while I don’t like it) I may already have been proven wrong about. It’s the statistic ‘touches’.

‘Touches’ in football-data parlance doesn’t actually mean how many times a player’s body connected with the ball; it means how many on-ball events they did. Passes, take-ons, shots, that kind of thing.

I don’t like it. It’s not that it sounds bad, it sounds false. A player credited with only three ‘touches’ may not have actually touched the ball three times; they could have touched it twenty times running the length of the pitch but might only be ‘credited’ with a couple of ‘touches’ for the entire motion.
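A small Python sketch of the mismatch. The `OnBallEvent` class and its `ball_contacts` field are hypothetical, made up for this illustration; event data schemas differ between providers and generally don't record physical contacts at all, which is rather the point.

```python
from dataclasses import dataclass


@dataclass
class OnBallEvent:
    kind: str           # e.g. "carry", "pass", "shot"
    ball_contacts: int  # hypothetical: physical touches made during the event


# One invented sequence: a long run up the pitch followed by a pass.
sequence = [
    OnBallEvent(kind="carry", ball_contacts=20),
    OnBallEvent(kind="pass", ball_contacts=1),
]

# What the 'touches' statistic counts: one per on-ball event.
touches_stat = len(sequence)

# What the word 'touches' suggests: every physical contact with the ball.
physical_contacts = sum(event.ball_contacts for event in sequence)

print(f"'touches': {touches_stat}, actual contacts: {physical_contacts}")
```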

And yet, in my more contemplative hours, I’ve tried to think of a better name. ‘On-ball events’ sounds too unfamiliar. ‘Actions’ also sounds awkward, like a slight mistranslation from German or Spanish. It’d probably stick if it had to, but it’s not strong enough to supplant ‘touches’. And maybe that’s fine.

Seth Partnow’s talk at the 2019 StatsBomb conference — which you can watch here — was titled “Analytics as Vocabulary: Giving Stats the Power of Language”. I like the title (well, I like the talk too) because that’s how I like to think of stats. Language is communication, and you need to be able to communicate the numbers.

Partnow’s bit about naming things — that a good name won’t ensure take-up but a bad one will stop it — reminds me of touches, and reminds me of giraffes.

I don’t like the way people talk about evolution. Even though it’s a scientific theory, every interesting fact about how an animal has evolved is spoken about as if it’s a Grand Design, as if it’s clever. Things like “giraffes have long necks so that they can reach the leaves on branches that are high off the ground”. No; giraffes have long necks because the ones who had short necks died.

Americans still call aubergines ‘eggplants’. They haven’t died out because of that.