What do we actually know about football?

Modern football analytics - beyond just expected goals - has been around for over a decade now, but how much do we actually know about the sport?

Let’s take stock of where we are.

Expected goals has been on Match of the Day since 2017, two years after Arsène Wenger made waves by referencing it in a press conference. Between those two events, we got ‘pitch control’, a way of looking at a football pitch like a multi-dimensional tug of war between all 22 players on the pitch.

In 2018 and 2019, the public (or the public who avidly read papers from the MIT Sloan Sports Analytics Conference) got glimpses at exciting possession value models. In 2021, Opta - one of the largest data companies around - brought out their own Possession Value model. Later that year, Statsbomb (another data company) did the same.

So why, three years later, is a football analytics newsletter even asking the question ‘what do we actually know about football?’?

Because Pep Guardiola doesn’t use his substitutes.


When competitions like the Premier League were thinking about making the covid-induced ‘five substitutions allowed’ rule permanent, there was a lot of worry about rich teams. Would teams like Manchester City simply blow poorer opponents away by bringing on even more talent? How would we know, when there are games when Guardiola makes literally no substitutions?

Like a lot of things Guardiola does (apart from, perhaps, his clothing choices) the substitution-aversion seems to be catching on locally. The Premier League has consistently given substitutes less game-time than other major European men’s leagues but, unlike the pre-pandemic years, it’s now on its own in the extent of this. As Michael Caley notes in his newsletter, Expecting Goals:

Premier League teams only manage to get a fifth substitute into the match about one-third of the time under the new rules. In one match in ten they don’t even get around to the third substitute, while such a choice would be vanishingly rare in Serie A or La Liga.

This seems counterintuitive, right? That the richest football league in the world will, in one match in ten, not even use half of the substitutions available to them?

But then again… “I think there is a whole different strategy component to substitutions that no one has figured out yet,” Lucy Rowland, who’s worked at San Jose Earthquakes and Canada Soccer, told Get Goalside over email. “What time frame would be the most disruptive to your opponent for you to sub? Or is there a clear trigger in their tempo that will make you want to sub to stop that tempo. Do you sub all 5 players at once to give them a chance to find cohesion?”

Caley’s analysis has found that subs play more minutes when teams are losing(£). This fits an “if it ain’t broke don’t fix it” viewpoint, but still leaves Guardiola - who often seems to avoid subs when his team is struggling - as an odd exception.

It’s not beyond the realms of possibility that there’s something the Catalan intuitively understands that the rest of us don’t. Maybe he’s thinking along the same track as Rowland, that making substitutions when things haven’t yet clicked might actually disrupt the rhythm he’s worked in training to cultivate. But how do we find that out?


Vibe check

As we’ve seen, a simple question around substitutes opens other doors too, to things like tempo, team cohesion, momentum.

Those are three things that you could, if you wanted to, file as different varieties of ‘vibes’. Although analytics people tend to scoff at the emphasis put on ‘the intangibles’, that doesn’t mean they’re afraid to try and investigate them. There’s a paper here from 2020 on quantifying and predicting team cohesion. And there was a paper from the first Statsbomb conference in 2019 - whose authors may be familiar to Get Goalside readers - that tried looking at tempo, among other things, when breaking down a set defensive block. If running data is your jam, consider the following line from a 2014 paper on the physical output of substitutes, on further work that could be done:

It could be possible that substitutes are covering more relative distance [than starters] late in the game, but these attacking runs are less effective as the fatigued players on their team cannot respond to their runs.

Fascinating, but dastardly difficult, questions to get to the bottom of.

Tempo in particular is something that has been brought to the forefront of Get Goalside’s mind by the arrival of Roberto De Zerbi in the Premier League. It’s hard to watch his defenders stand with their studs on the ball and not think about it. But now consider statistical modelling that needs an outcome to aim at, usually selected as goals or shots, an ideal outcome that is very far from where De Zerbi’s players dawdle on the ball. How do you fit his strategy into a modelling technique that only counts a high-xG chance within ~10 seconds as ‘success’?

And, more broadly, isn’t it strange that in a sport that averages only three goals per game, teams spend so long… waiting?

This is probably not a problem of football players and managers spectacularly missing the point of their own sport. It seems much more likely that it’s a problem of analysing it.

Statsbomb CTO Thom Lawrence has previously talked about how “different players at different parts of the field have access to slightly different rewards.” Various types of modelling have repeatedly found that the ‘real estate value’ of areas of the pitch shoots up very quickly near each goal - perhaps best displayed in Karun Singh’s fantastic blog post, which coined the term Expected Threat.
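To make the ‘real estate value’ idea concrete, here is a minimal sketch of an Expected Threat-style calculation on a toy pitch. The five-zone layout and every probability in it are invented for illustration, not taken from Singh’s post or from real data. Only the broad recipe follows the idea described above: a zone is worth the chance of shooting and scoring from it, plus the chance of moving the ball on, weighted by the value of wherever it ends up.

```python
# Toy Expected Threat-style iteration on a pitch cut into 5 zones,
# ordered from own goal (index 0) to opposition box (index 4).
# Every number below is made up for illustration.

SHOOT_PROB = [0.00, 0.00, 0.02, 0.10, 0.40]  # chance a player shoots from the zone
GOAL_PROB = [0.00, 0.00, 0.02, 0.05, 0.15]   # chance such a shot is scored

# MOVE[i][j]: chance the ball is moved from zone i and arrives safely in zone j.
# Each row sums to less than 1 - SHOOT_PROB[i]; the remainder is losing the ball.
MOVE = [
    [0.50, 0.35, 0.05, 0.00, 0.00],
    [0.20, 0.40, 0.25, 0.05, 0.00],
    [0.05, 0.20, 0.35, 0.20, 0.05],
    [0.00, 0.05, 0.20, 0.30, 0.15],
    [0.00, 0.00, 0.05, 0.15, 0.20],
]


def expected_threat(shoot_prob, goal_prob, move, iterations=200):
    """Repeatedly apply: value = shoot-and-score + sum(move * value elsewhere)."""
    n = len(shoot_prob)
    xt = [0.0] * n
    for _ in range(iterations):
        xt = [
            shoot_prob[i] * goal_prob[i]
            + sum(move[i][j] * xt[j] for j in range(n))
            for i in range(n)
        ]
    return xt


XT = expected_threat(SHOOT_PROB, GOAL_PROB, MOVE)
for zone, value in enumerate(XT):
    print(f"zone {zone}: {value:.4f}")
```

Even with these made-up inputs, the values climb zone by zone towards the opposition box, and the zone nearest goal ends up worth several times the deepest one.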

Now, if the value of moving forwards in midfield is close to nothing, you might think that the logical thing to do would be to get the ball into the penalty area as quickly and frequently as possible. But that can’t be right. For one, because English football and analytics icon Charles Reep basically proposed that fifty years ago and was dismissed by modernity. For another, it’s not how the best teams in the world currently play or, really, have ever played. Something must be missing.

Perhaps the approach of this paper on modelling in-possession decision-making, from 2022, is a way to go. It uses different aims for different phases of play rather than just focusing on whether a goal might be scored or conceded in the next X seconds, as many possession-value models do. At the very least, it seems a little more like how football’s elite coaches think about the sport.
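The ‘next X seconds’ framing can be sketched as a simple labelling rule. This is a hedged illustration, not how any particular possession-value model is built: the event format, the field names, and the 10-second window are all assumptions, and only goals scored (not conceded) are labelled to keep it short.

```python
# Label each on-ball action by whether the same team scores within a fixed
# time window. Event format, field names, and window length are assumptions
# for illustration only. Events are assumed to be sorted by time.

WINDOW_SECONDS = 10.0


def label_actions(events, window=WINDOW_SECONDS):
    """Return one 0/1 label per event: 1 if the acting team scores at or
    after the event and no more than `window` seconds later."""
    labels = []
    for i, event in enumerate(events):
        label = 0
        for later in events[i:]:
            if later["time"] - event["time"] > window:
                break  # outside the window, stop looking
            if later["is_goal"] and later["team"] == event["team"]:
                label = 1
                break
        labels.append(label)
    return labels


# A slow build-up: 40 seconds of patient possession before a goal.
EVENTS = [
    {"time": 0.0, "team": "A", "is_goal": False},   # centre-back, studs on the ball
    {"time": 15.0, "team": "A", "is_goal": False},  # still waiting
    {"time": 32.0, "team": "A", "is_goal": False},  # pass that breaks the press
    {"time": 38.0, "team": "A", "is_goal": False},  # through ball
    {"time": 40.0, "team": "A", "is_goal": True},   # shot and goal
]

LABELS = label_actions(EVENTS)
print(LABELS)  # [0, 0, 1, 1, 1]
```

Under this rule the first two actions, the waiting that arguably set the move up, get no credit at all, which is the gap a phase-specific approach tries to close.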

What we do know

It’s about time we talk about some things that football analytics does well.

“There are so many confounding factors in a team’s success that it’s really difficult to come up with an optimal way to play,” says Sarah Rudd, former vice-president of analytics at StatDNA and Arsenal and co-founder of analytics consultancy src ftbl. “What we can do, however, is keep chipping away at the factors of success – how do we create advantageous situations on the pitch and how do we avoid them.”

“I think the areas of soccer that are pretty well understood are penalty-taking, passing patterns and passing progression in build-up,” Rowland said. (And it seems quite significant that, after heat maps and varieties of radar charts, passing networks are probably the most popular data visualisation - there’s something meaningful that data captures there).

A line that football coach and founder of Spielverlagerung Academy, Martin Rafelt, said to Get Goalside about coaching struck a chord here: “More generally you could just say that things which are repeatable and consistent are well understood.”

“You want consistent effort, running, immediate transition, you always want to control space, always want to put pressure on the opponent, stop them from progressing into the centre and always want to not lose the ball, have clean passing and back passing options, always want to be well protected against counter-attacks.”

And in a wonderful point of synergy, we can bring pitch control back into this. Professor and co-founder of Twelve Football David Sumpter has previously spoken about using it to develop a strategy to protect against counter-attacks at Swedish club Hammarby, alongside the coaching staff.

It makes sense that repeatable actions and control are areas where coaching and analytics seem to align. Data people love a good sample size, or fear its absence, so anything repetitive feels like dry land (and is probably more automatable in reporting). Control, meanwhile, follows a simpler logic, or produces simpler probability curves, than chaos.

Improvements in the type of data available help with this too. “Data sources such as tracking data, and more specifically broadcast tracking data, are really changing those types of discussions,” says Rudd. “Lots of things that couldn’t be measured in the early days are now being measured in a way that people feel confident in.”

It all adds up to mean that “one thing that football analytics is really good at these days is accurately profiling a player,” she says. When a single video feed is able to provide the basis for event-based statistics, and physical and spatial metrics through tracking data, you can see why.


The long path of history

It’s worth taking a moment to reflect on how much things have changed in the sport. “Nowadays you can walk into just about any club in the world and people will be at least familiar with xG,” Rudd says. “That wasn’t the case when I got started [in 2012 at StatDNA]. I don’t think there’s a debate any more about if analytics can be useful in football, but there’s still a debate about how useful and in which applications.”

The simple fact that practitioners like Rudd and Rowland have established careers is a factor in the development of this knowledge too, and the sense of institutional knowledge around tracking data is growing as well. If you leaf through past papers from the MIT Sloan Sports Analytics Conference or the (gone but not forgotten) Barça Innovation Hub Analytics Summit, you get work on automatically identifying formations, off-ball run categorisation, space occupation gain, and automatic corner kick categorisation.

None of those things necessarily help ‘understand’ the sport better in the way that expected goals models forced a reassessment of shooting locations, but they’re the bedrock of future analysis. And they are already being turned into data products.

To phrase that all a little bit differently, even if we don’t know that much extra about football, we can definitely describe it better.

That said, while describing things better is a good thing, introducing new data as a means of achieving it creates its own questions. “In the early days we were really limited in what predictive models we could build,” Rudd says, “because we just didn’t have enough historical data, so everything was very descriptive. That’s changed for some data sources, but every time there is a new data source, you enter that cycle of starting off descriptively while you wait for the data to accumulate to build something more sophisticated.”

Where data has been consistent, the passage of time is a great thing. “That's the other thing about doing this work now,” said Caley, of his newsletter Expecting Goals. “We've got nearly 15 years of statistics that have been collected [and available publicly] and way more leagues [than before].”

And bigger sample sizes matter because sometimes the data is messy for reasons outside of collection techniques and randomness. “If you've got thousands and thousands of nineties [minutes played by a player are often grouped into ‘nineties’ as a consistent denominator for analysis], some of those players are getting better and some of those players getting worse,” Caley continued, “some of those players are in a good mood, some are in a bad mood, but all of those things wash out over large enough samples such that you can start to make clearer claims [about trends].”
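For anyone new to the ‘nineties’ denominator, a minimal sketch of the arithmetic is below. The function names and the two example players are hypothetical, made up purely to show the calculation.

```python
# Per-ninety normalisation: scale a raw total to a rate per 90 minutes played.
# Function names and example figures are hypothetical.

def nineties(minutes_played):
    """Number of 'nineties' a player has played."""
    return minutes_played / 90.0


def per_ninety(total, minutes_played):
    """Rate of `total` per 90 minutes on the pitch."""
    if minutes_played <= 0:
        raise ValueError("minutes_played must be positive")
    return total / nineties(minutes_played)


# Same raw shot count, very different playing time.
ROTATION_PLAYER = per_ninety(total=12, minutes_played=450)   # 5 nineties
REGULAR_STARTER = per_ninety(total=12, minutes_played=1080)  # 12 nineties

print(f"rotation player: {ROTATION_PLAYER:.2f} shots per 90")
print(f"regular starter: {REGULAR_STARTER:.2f} shots per 90")
```

The rotation player’s higher rate rests on only five nineties, which is exactly where Caley’s point about sample size bites: the fewer nineties behind a number, the less the moods and form swings have had a chance to wash out.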

Rejecting the premise of the newsletter

Data being consistently collected, though, is not consistent throughout football. The richer leagues will tend to have the best and most-immediate coverage, and are the ones where data companies might consider delving back through the video archives to collect extra, historic seasons of data (if, of course, video archives exist, which has its own inequalities).

That goes for geography but also, unsurprisingly, for gender. “If you’re looking at a player who’s 18 and you want to project who they’re going to be when they’re 24,” Arielle Dror, director of analytics at NWSL team Bay FC, recently told The Athletic, “I mean, it’s hard even with the amount of data we have on the men’s side, but it’s certainly not possible with the data that we have on the women’s side.” Part of this, as Rudd pointed out to Get Goalside, is that women’s leagues often have fewer teams and fewer games, meaning that even when data is being collected, the sample sizes are usually smaller.

This topic obviously extends to data modelling. Statsbomb have previously presented on their work investigating gender-specific or ‘gender-aware’ models (the former being when only data from men’s/women’s football is included in the model; the latter when a model uses all data but has an indicator of which it came from). The upshot is that their gender-aware models performed better - which seems obviously preferable for analysts in women’s football to using xG trained only on the men’s game, the analytics equivalent of shrunk-down men’s football boots.
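The difference between the two set-ups is easiest to see in how the training data is assembled. The sketch below is an assumption-laden illustration, not Statsbomb’s implementation: the shot records, field names, and the single distance feature are all invented.

```python
# Gender-specific vs gender-aware training data for a shot model.
# Records, field names, and the lone "distance" feature are invented.

SHOTS = [
    {"competition": "men", "distance_m": 11.0, "goal": 1},
    {"competition": "men", "distance_m": 24.0, "goal": 0},
    {"competition": "women", "distance_m": 9.0, "goal": 1},
    {"competition": "women", "distance_m": 18.0, "goal": 0},
    {"competition": "women", "distance_m": 27.0, "goal": 0},
]


def gender_specific_rows(shots, competition):
    """Keep only shots from one competition type; features carry no indicator."""
    return [
        ([shot["distance_m"]], shot["goal"])
        for shot in shots
        if shot["competition"] == competition
    ]


def gender_aware_rows(shots):
    """Keep every shot, adding a 0/1 indicator of which game it came from."""
    return [
        (
            [shot["distance_m"], 1.0 if shot["competition"] == "women" else 0.0],
            shot["goal"],
        )
        for shot in shots
    ]


WOMEN_ONLY = gender_specific_rows(SHOTS, "women")
AWARE = gender_aware_rows(SHOTS)

print(len(WOMEN_ONLY), "rows for a women's-only model")
print(len(AWARE), "rows for a gender-aware model")
```

The gender-aware version trains on the full pool of shots while still letting the model learn where the two games differ, which matters most when one side of the data is much smaller.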

However, even this raises a further question, about where and why you’d draw these lines. Would you use an expected goals model largely trained on adult men’s football for youth men’s football? (Here’s a recent paper looking at shooting in different levels of German men’s football that touches on this very subject.) If differences between the men’s and women’s game stem from sex-based height differences, for example, would men’s football in parts of the world where players tend to be shorter also benefit from context-aware models? And if the differences stem from gendered differences in funding, does that have implications in similar global circumstances?

Any headline containing the word ‘we’ is making an assumption about who ‘we’ is; but maybe ‘what do we actually know about football’ was also making big assumptions about what ‘football’ is.


There are some, of course, who still say that football isn’t a game played on spreadsheets. And maybe they’re right. Balance sheets are where it’s at. (What’s the old adage… Offence wins games, money wins championships? The most important data scientist is your accountant; less p-hacking, more gr€€n-hacking; etc).

Analytics people have long put their value in these terms, to be fair. There’s the old line about ‘I could save you millions just by sitting in the corner and vetoing bad transfers’. But perhaps wider squad composition is the next step.

“I think some of the areas we really struggle with are what is the optimal tactical approach for a club given the league, their budget, etc, and how does that impact squad-building,” Rudd says. Coming full circle to the impact of substitutes, “that also leads into roster composition,” Rowland says. “If research shows that subs are more or less important than we originally thought, then how we spend resources on getting depth players will surely change.”

But let’s broaden the scope again, beyond just playing staff. Analytics consultancies are moving more and more into the realm of advising senior, exec-level decision-makers. Even below the C-suite, analytics departments are more widespread and growing in size. So not only is ‘understanding football’ an important skill for ‘analytics people’, increasingly so is ‘understanding business’ and ‘understanding management’.

Managing people and projects is a real and underrated skill, and not only are analytics departments new, but they’re often adding several new information streams to existing organisations. Sarah Rudd has a positive spin on this: “A nice by-product of all of this is that it really forces you to think about what that process looks like.

“When there’s just one source of information (i.e. scouting reports) your process doesn’t need to be too defined or rigid, but as soon as you add more information you need to have something set out to explicitly state ‘when will I use this information versus that one, how will I weight these, what if they disagree?’” 

But on the flip side, “Without something in place it just turns into chaos.”

A contrived bow

Get Goalside once wrote that you can measure the progress of ‘analytics’ by listening to what analytics people are griping about. Conference papers, to be fair, can be used as a barometer too.

This year’s MIT Sloan Sports Analytics Conference has three soccer papers in the final of its research paper competition. They are about:

So, they’re about how we generate more high-quality data; about long-term data; and about high-level strategic approaches. That’s not because on-pitch matters are solved, but they’re just… not what the public work is like at the moment, a big departure from the 2015-2020 period.

What do we actually know about football? More than we did, for sure.

But what do we know about what we know about football? Maybe less.

And what does Get Goalside know about what we know about football? TBD.

Thank you all for reading. Get Goalside will be back with another edition soon but please get in touch with any comments, clarifications, criticism, complaints, sales pitches, gossip, chit-chat, questions, or ideas at getgoalside[dot]newsletter[at]gmail[dot]com.