Do we know football well enough to have good defensive stats?

Some problems

There's a fact I've always had trouble fully believing. The one about the ancient Greeks not having a word for the colour blue.

Ancient Greek texts had a curious absence of references to the colour. The sea was even referred to as 'wine-dark'. When this was first noted, it was assumed that they just couldn't see blue. As it turns out, they could (their eyes are the same as our eyes); they just didn't seem to have called it anything yet.

However, there are also all sorts of weird studies that try and pick apart whether we perceive colours differently depending on the words we have for colour groups (usually blue and green). One involves the Himba tribe of Namibia, whose primary colour groupings group blues and some greens together while delineating more clearly between different types of green. They could, apparently, tell subtle differences in green shades quicker than the difference between blue and green.[1]

How much does our language interact with our experience of what we see? Seth Partnow's new book, The Midrange Theory, had me thinking about this.

Partnow, the former Milwaukee Bucks Director of Basketball Research, writes in part of the book about how the terminology of basketball stats might be different if they were all created now. For example, shots "can and probably should be divided into two broad categories: attempts at the rim in one group, and jumpers in another."[2]

He continues by making the point that the most accurate 'shooters' (people whose 'shots' most often go in the basket) are those who take shots close to the rim (dunks and lay-ups), even though that's not quite what people picture when they think of basketball 'shooting'. As he puts it: "[...] as measured on the stat sheet, they are "great shooters" even though on the scouting report they are listed as "non-shooters"."

The same principles, I think, apply to football. (Of course I think this: I've written several times before about language and terminology in football (1, 2, 3).[3])

But what Partnow was saying became all the more relevant to me this week because I was thinking about clearance statistics. Specifically, why the good people of the football stats internet have not risen up in revolt about them yet.

I've long written off clearances as a useful stat. Why? Overwhelmingly, they come from heading or booting away opposition crosses. And therefore the most they tell you is how much a team or defender had to defend against balls into the box.

But not all clearances are like this. Sometimes a player could make a different choice. On rare occasions, when they're under no pressure or the cross is weak, this can be when defending a ball into the box. Slightly more often, the alternatives to a clearance are in situations when a player heads away a long ball or feels pressed into a corner near the sideline.

And I'm not just having a go at data providers for the sake of it (for various reasons, it would be shooting myself in the foot to do so). This information could have real value. How many times does a player clear the ball when they didn't have to? Does this mean they're a panicky player? Does this indicate something about team styles? If I had those pieces of info, they'd be going straight into my pre-match reports.
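To make the "cleared it when they didn't have to" idea concrete, here's a minimal sketch of how the number might be counted. Big caveat: the event rows and the `under_pressure` flag on clearances are my own hypothetical schema for illustration, not something I know any provider to collect in this form.

```python
from collections import defaultdict

# Hypothetical event rows. The "under_pressure" flag on a clearance is an
# assumption made for illustration, not a confirmed provider field.
events = [
    {"player": "A", "type": "clearance", "under_pressure": True},
    {"player": "A", "type": "clearance", "under_pressure": False},
    {"player": "A", "type": "pass", "under_pressure": False},
    {"player": "B", "type": "clearance", "under_pressure": True},
]


def unforced_clearance_share(events):
    """Return {player: share of their clearances made under no pressure}."""
    totals = defaultdict(int)
    unforced = defaultdict(int)
    for event in events:
        if event["type"] != "clearance":
            continue  # only clearances count towards the share
        totals[event["player"]] += 1
        if not event["under_pressure"]:
            unforced[event["player"]] += 1
    # Players with no clearances never enter `totals`, so no division by zero.
    return {player: unforced[player] / totals[player] for player in totals}


shares = unforced_clearance_share(events)
```

A high share wouldn't prove a player is panicky on its own (team instructions could easily drive it), but it's the sort of figure you'd want to see next to the raw clearance count.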

I don't have full documentation of every data provider's offering to hand[4], but I don't think they have on offer what I'd want.

And while I've always kind of accepted that as the way things are, maybe I shouldn't have.

Some suggestions

Back in the day, when tracking data still seemed like a dream full of possibilities to the public analytics community, it was assumed that it could solve 'the defensive issue'. It may well do, eventually, but there's still room for better defensive data in the manually-collected event feeds too.

I wrote last week about how pressure data is popping up in more places since StatsBomb started collecting it in 2018. Wyscout and StatsBomb also both appear to have[5] a separate category for tackles which try to stop players who are dribbling at them. This implies that all their other defensive duel/tackle events are initiated by the defender. Useful information! Wyscout also have an 'Anticipation' tag which they attach to some defensive duel events for when a defender nicks the ball off an opponent's toe. This separation from other types of duels and interceptions strikes me as very sensible.[6]

There will, presumably, be some other neat combinations that are possible by diving into providers' full event data and event qualifier sets. However, that's not where the thinking about the sport takes place. Constructing meaningful basic statistics from the bucket of tags is like Apollo 13 making a DIY air filter. You can do it, you might learn something from doing it, but damn if it wouldn't be simpler and more helpful not to have to.

And while all this tinkering with tags is nice, it's also a little small-fry.

As Partnow says about shooting in basketball, if we were starting from scratch I don't think we'd have the statistical landscape for defending in football that we have now.[7] In fact, as he also teases in a footnote, our lack of conceptual understanding of defending probably holds back how we choose to collect data on it.

If defending is all about space, why are the defensive statistics so much about how a player affects the ball? Maybe we could be counting 'possible passing lanes covered' or something. You'd have some positional normalising to do when analysing those figures (midfielders would be called upon to cover a possible passing lane more than central defenders), but it could be useful, no?
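For what it's worth, here's a rough sketch of what counting 'possible passing lanes covered' could look like, plus the positional normalising. Everything in it is an assumption on my part: a lane is treated as a straight line from passer to receiver, a defender 'covers' it if they're within a made-up radius of that line, and the pitch coordinates, function names and numbers are all hypothetical.

```python
import math


def distance_to_segment(point, start, end):
    """Shortest distance from a point to the line segment start-end."""
    px, py = point
    ax, ay = start
    bx, by = end
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return math.hypot(px - ax, py - ay)
    # Project the point onto the segment, clamped to the segment's ends.
    t = ((px - ax) * dx + (py - ay) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def lanes_covered(defender, passer, receivers, radius=2.0):
    """Count lanes (passer -> receiver) the defender is within `radius` of."""
    return sum(
        1 for receiver in receivers
        if distance_to_segment(defender, passer, receiver) <= radius
    )


def normalise_by_position(counts, positions):
    """Divide each player's count by the mean count for their position."""
    grouped = {}
    for player, count in counts.items():
        grouped.setdefault(positions[player], []).append(count)
    means = {pos: sum(vals) / len(vals) for pos, vals in grouped.items()}
    # Assumes every position group has a non-zero mean.
    return {p: counts[p] / means[positions[p]] for p in counts}


# Made-up coordinates in metres on a made-up pitch.
passer = (50.0, 40.0)
receivers = [(70.0, 40.0), (60.0, 60.0), (50.0, 10.0)]
covered = lanes_covered((60.0, 41.0), passer, receivers)

# Made-up per-90 counts, normalised within each position group.
counts = {"Mid1": 30, "Mid2": 20, "CB1": 10, "CB2": 10}
positions = {"Mid1": "MF", "Mid2": "MF", "CB1": "CB", "CB2": "CB"}
relative = normalise_by_position(counts, positions)
```

Normalising within position groups means a midfielder is only compared with other midfielders, so a centre-back isn't penalised for being asked to cover fewer lanes in the first place. A straight-line lane with a fixed radius is obviously crude. It's only there to show the shape of the idea.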

The elite defensive players, the N'Golo Kantés of this world, maybe they'd show up with sparkling numbers because they're able to switch from one possible passing lane to another quicker than other players. It's not that they're quicker or cover more ground, it's that they see where the passage of play is heading and can move sooner.

Maybe you could track something around the speed that a player gets out to press someone and then recover back into position. Or, if it's simpler, when they're too slow to get out or too slow to get back in. The data collecting world is your oyster.

These might not be good suggestions. I don't know if the numbers would be genuinely useful for a start, and the descriptions of them are a bit unwieldy. Speaking of practicalities, the collection process might be unwieldy too.

Annoyingly, if Partnow's suggestion that a lack of language for defensive qualities leads to a lack of useful statistics is right, we might have to wait until football as a whole gets better at judging defenders before we get good stats on them. I hope we don't.

But at the moment, we are like the ancient Greeks, lacking a word for the colour blue. It remains to be seen whether that's because we don't see the world that way or just because we haven't gotten around to naming it yet.[8]

Tell a friend, tell a colleague, subscribe to Get Goalside. Oh, and buy a copy of The Midrange Theory.


[1] Given that this is a single study and I'm vaguely aware of how many 'pop science'-fodder studies get heavily revisited afterwards (marshmallow anyone?), I feel like I should say that I wouldn't be surprised if these findings also got revised at some point. But who knows.

[2] The Midrange Theory, ch. 2.

[3] Those three links not including a lost-on-an-old-blog-but-stored-in-my-files-somewhere piece where I watched lots of examples of Opta 'interceptions' to see whether they all fit the concept I had in mind. They didn't, quite.

[4] I don't have access to full documentation yet. Fingers crossed.

[5] I say 'appear to have' because I'd prefer to check my assumption, comparing data to video, to be sure that my idea of what is being collected matches the image of situations I have in my head. Similar sentiment applies to the 'Anticipation' tag Wyscout have that I mention in the next sentence in the body of the newsletter.

[6] I wonder if part of the reason why 'public analytics' sometimes feels stunted is that the data available isn't even everything in the data providers' events feeds. There isn't even good delineation between different types of long passes. StatsBomb's free data is admirable, and is a fantastic tool to help people learn to get to grips with data, but the datasets -- WSL aside, either cup competitions or full seasons for one team/player -- tend not to be great for getting a proper sense of new metrics. I previously tried looking at how you might approach playing the Arsenal Invincibles, and I think that piece stands up well, but I don't know how much I'd be able to apply things in there more broadly. Building models off that data would certainly be awkward.

[7] bloody duels

[8] The circle of this metaphor is only, only just over my threshold of "yeah, that works" to stick with it. But I'm really baffled by whether people see the colour blue or not, so all that about the Greeks has to stay in now.