Square one
Alt title: "The publicly available data source is dead; long live the publicly available data source."
On Tuesday 20th, a lotta folks were saddened by the abrupt departure of advanced stats from FBref, one of - if not the - best public reference sites for data. We raise a perfectly-measured glass in commemoration, and hope for a swift resolution in some form.
But it left a lot of people thinking about access to 1) data 2) smarts.
So this post will have three sections:
- what we kind of know about football
- where to go to find data
- where to go to find insights
My opinions are my own and will inevitably be skewed by various factors. This post will avoid being long, because it's better for people to read a smaller amount of detail than not read a larger amount of detail.
To the long-time readers and in-football workers: let me know if and where I'm leading the newbies astray. For the newbies, I've been around long enough to say that my first football analytics conference was nine years ago.
State of play - football
The biggest thing that 'analytics' has produced is still - and possibly will always be - expected goals. But xG encapsulates so much of the rest of football data: its insights are less about discovering a Hidden Factor, and far more about the balance between Factor A, Factor B, and Factor C.
Expected goals' biggest learning isn't really "don't shoot from distance". It's more like "you're meaningfully underestimating how much harder it is to score from 25 yards, and from wide angles in the box". (That, naturally, leads you to shoot from distance less often).
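To make the distance-and-angle point concrete, here is a minimal sketch of the two geometric ingredients that shot-quality models typically lean on: how far the shot is from goal, and how much of the goal mouth the shooter can 'see'. This is pure geometry, not an xG model; the function name and the example locations are made up for illustration.

```python
import math

GOAL_WIDTH_YD = 8.0  # distance between the posts: 8 yards (7.32 m)


def shot_geometry(dist_from_goal_line, dist_from_centre):
    """Return (distance to goal centre in yards, goal-mouth angle in degrees).

    dist_from_goal_line: how far the shot is from the goal line, in yards.
    dist_from_centre: how far sideways the shot is from the centre of the goal.
    """
    distance = math.hypot(dist_from_goal_line, dist_from_centre)
    half = GOAL_WIDTH_YD / 2
    # Angle between the lines from the shot location to each post.
    to_one_post = math.atan2(dist_from_centre + half, dist_from_goal_line)
    to_other_post = math.atan2(dist_from_centre - half, dist_from_goal_line)
    return distance, math.degrees(to_one_post - to_other_post)


# Illustrative locations (hypothetical, not taken from any dataset).
penalty_spot = shot_geometry(12, 0)
central_25 = shot_geometry(25, 0)
wide_in_box = shot_geometry(8, 14)

for label, (dist, angle) in [
    ("penalty spot", penalty_spot),
    ("central, 25 yards", central_25),
    ("wide in the box", wide_in_box),
]:
    print(f"{label}: {dist:.1f} yards out, goal spans {angle:.1f} degrees")
```

The wide position is inside the box and closer to goal than the 25-yard effort, yet the goal takes up a smaller slice of the shooter's view. A real model would fit how much each of these matters from shot data; this only shows the inputs.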
There's probably another couple of features waiting to be neatly packaged into soundbites: sight of goal, and pressure on the shooter. Data provider Statsbomb have written about these before (sight of goal here; can't find the blog on pressure), and it's a near-certainty that others have done this work in private.
Again: these are factors that everyone knows make some kind of difference. But knowing how much difference is information that could help you craft your tactics. For example, if shots from 25 yards are worth 0.02xG on average, does an absence of pressure raise this to 0.03xG, 0.05xG, or 0.1xG? That knowledge is the difference between deciding to consciously create those chances and deciding they're not worth the effort.
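One way to feel the difference between those numbers is to flip them into 'shots per goal'. A quick sketch, using only the hypothetical values from the paragraph above (they are examples, not measured figures):

```python
def shots_per_goal(xg_per_shot):
    """Average number of shots needed to expect one goal at a given xG per shot."""
    return 1 / xg_per_shot


# The hypothetical values from the text: baseline, then three 'no pressure' guesses.
for xg in (0.02, 0.03, 0.05, 0.1):
    print(f"{xg:.2f} xG per shot -> roughly {shots_per_goal(xg):.0f} shots per goal")
```

Going from 0.02 to 0.03 still leaves you needing dozens of attempts per goal; getting to 0.1 means one in ten. Same shot location, very different tactical conclusion.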
Elsewhere on the pitch/spreadsheet, things are similar. 'Everybody knows' that good teams have good athleticism and more of the ball, but people are still feeling out what the magnitudes are. What does it mean if a team can sprint 10% more than their opponents? What does it mean for a player to be safer on the ball when under pressure?
There's an increasing focus on mixing tactics with physical data: players focusing on sprint distance will soon be a thing of the past. It's more likely that coaches will start pinning up things like 'recovery sprints' in the dressing room after a match. The change is partly about what actually adds to player fatigue (frequent accelerations and decelerations can be more onerous than the distance at high speed), and partly about what's meaningful in a game. The interaction between physical output and time spent in- and out-of-possession, as well as time winning/drawing/losing, will be better explained too.
When it comes to possession, things seem more open-ended. The buzz of 'line breaks' is still around, but it risks being a false idol. There's little use in fizzing a line-break through to a teammate who can't control the pass or lay it off to anyone.
I think there are a few areas of fertile ground here: 1) the interaction between a pass and what happens next 2) body orientation 3) first touch. Watch players like Youri Tielemans and Yui Hasegawa (work your way back down the alphabet afterwards). Maybe the concept of 'usage' will get explored and refined (ported over from the NBA, 'usage rate' currently refers to ending a possession through a shot or turnover).
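For anyone who hasn't met 'usage' before, a rough sketch of the idea as described above: the share of a team's possessions that a player ends with a shot or a turnover. The possession log, player names and ending labels below are all invented for illustration, and a proper version would only count possessions while the player was on the pitch.

```python
from collections import Counter

# Hypothetical, hand-made log of one team's possessions:
# (player who ended the possession, how it ended).
possessions = [
    ("Player A", "shot"),
    ("Player B", "turnover"),
    ("Player A", "turnover"),
    ("Player C", "out_of_play"),
    ("Player A", "shot"),
    ("Player B", "shot"),
    ("Player C", "turnover"),
    ("Player D", "foul_won"),
]

# Endings that count as the player 'using' the possession (per the definition above).
USED = {"shot", "turnover"}


def usage_rates(possessions):
    """Share of all team possessions that each player ended via a shot or turnover."""
    total = len(possessions)
    used = Counter(player for player, ending in possessions if ending in USED)
    return {player: count / total for player, count in used.items()}


rates = usage_rates(possessions)
for player, rate in sorted(rates.items()):
    print(f"{player}: {rate:.1%} of team possessions")
```

The refinement hinted at in the text would be deciding what else should count as 'using' a possession in football, which this sketch deliberately leaves open.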
Meanwhile pitch control models, using tracking data, are exactly what they sound like. They put a number on who has control of what part of the field and have already been used to aid player positioning/decision-making. They could become a great feedback loop for player development. That said, there's a difference between 'in the moment' decisions and 'full match' decisions. A player's assessment will change based on how tired they are/expect to be or what the feel of the game is.
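As a feel for what 'putting a number on control' means, here is about the crudest possible version: chop the pitch into a grid and give each cell to whichever team has the nearest player. To be clear, this is a stand-in, not how the published models work; they account for things like player velocity and time to reach the ball. The pitch dimensions and player positions are assumptions for the example.

```python
import math

PITCH_LENGTH, PITCH_WIDTH = 105.0, 68.0  # metres; a common pitch size, assumed here


def control_share(home, away, step=1.0):
    """Fraction of grid cells whose nearest player belongs to the home team.

    home, away: lists of (x, y) player positions in metres.
    step: side length of each square grid cell in metres.
    """
    cols = int(PITCH_LENGTH / step)
    rows = int(PITCH_WIDTH / step)
    home_cells = 0
    for row in range(rows):
        for col in range(cols):
            cell = ((col + 0.5) * step, (row + 0.5) * step)  # cell centre
            nearest_home = min(math.dist(cell, p) for p in home)
            nearest_away = min(math.dist(cell, p) for p in away)
            if nearest_home < nearest_away:
                home_cells += 1
    return home_cells / (rows * cols)


# Hypothetical positions: three players per side, home team pushed up the pitch.
home_players = [(40.0, 20.0), (55.0, 34.0), (40.0, 48.0)]
away_players = [(80.0, 20.0), (90.0, 34.0), (80.0, 48.0)]

share = control_share(home_players, away_players)
print(f"Home team is nearest to {share:.0%} of the pitch")
```

Even this toy version shows why positioning feedback is possible: move one player a few metres and the number changes.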
Speaking of players, we've not mentioned transfers yet. Putting together a set of statistical profiles or filters is basically a norm nowadays. If nothing else, they help to focus the deeper work done by scouts, video analysts, and background checks. Squad composition, or 'philosophy' more generally, is a surprisingly open question. Teams may 'want to play this way', without a clear sense of why. Meanwhile, some clubs take a strategic overview of the league or talent pool available and tailor their strategy around that.
'League adjustment' (whether a good performance in one league translates to another) is still an issue, but clubs will often have a feel for the relative quality of competitions. (There are large exceptions to this though, particularly when money is unevenly distributed). There's a statistical modelling approach to league adjustment, but some clubs are just smart in the way they focus their attention.
The largest area of study, though, will be about competitions outside Europe's elite men's football. Primarily, I'm referring to women's football, but this would equally apply to lower-tier leagues in major European countries or to the whole pyramid elsewhere. How much does set-piece effectiveness change depending on the technical skill of the takers? Is it easier to benefit from a high press in leagues where players may lack composure? What effect does it have on receivers of line-breaking passes if the pitches are low quality and the ball is bouncing?
Fortunately for fans and unfortunately for researchers, the landscape is always shifting, presenting slightly new problems. Because every team is always trying to counteract and outwit every other team.
State of play - accessing data
This is what the FBref update has caused most concern about. Unfortunately, the state of play is a little bit 'it depends'. WhoScored has a good array of data, as both stats and chalkboards, including for the WSL but not for the NWSL*. FotMob might be your best bet for team and player xG totals over a season, as well as a bunch of other stats, but with a bit less control over what you dig into. There seems to be a raft of score apps using data as a pitch to users nowadays, so it might be worth scouting around if you're curious. Both those highlighted services are decent for checking out players' season histories too.
*For NWSL, your best bet might be American Soccer Analysis (who also have stuff on MLS and USL).
If you're interested in trying analysis of your own, I do think there's something to be said for copying data into a spreadsheet by hand. Now, I know that sounds a bit 'when I was young we didn't have X and we did just fine', but think of it as how hand-writing notes aids memory retention. From personal experience, you will rarely know a set of teams or players more closely than when you're putting their stats into Excel week-by-week. If you get a community together, that can lighten the load. But make sure you focus the collection at least semi-smartly, and collect a little bit before committing to collecting a full season.
Then, if you're really eager to dive into 'analytics', there are some public datasets to explore. Statsbomb have released event data covering everything from full league seasons to one-off cup finals. Impect released a season's worth of data. Skillcorner have released a set of data that covers tracking data and less traditional event data.
With all of these, I'd advise starting small on something you're curious about. Shot maps are always tempting, but I've always found them a little dissatisfying. I've tended to get more joy out of pass maps with a particular filter applied to them. Working with an unfamiliar set of data is like doing press-ups with one arm in a sling. Libraries like mplsoccer (Python), ggsoccer (R), d3-soccer (JS/D3) can do some of the heavy lifting when plotting events visually. Also check out Friends of Tracking.
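As an example of 'a pass map with a particular filter', here is a small sketch that picks out completed passes played into the final third. The sample events are hand-made and only loosely shaped like event data from a public provider (nested type name, start location, end location, and an 'outcome' that is missing when the pass succeeds). Those field names and the 120-unit pitch length are assumptions, so check the documentation for whichever dataset you actually load.

```python
FINAL_THIRD_X = 80  # two thirds of an assumed 120-unit pitch length

# Hypothetical events; field names are an assumption, not a provider's spec.
events = [
    {"type": {"name": "Pass"}, "location": [55, 40],
     "pass": {"end_location": [85, 30]}},
    {"type": {"name": "Pass"}, "location": [60, 20],
     "pass": {"end_location": [90, 25], "outcome": {"name": "Incomplete"}}},
    {"type": {"name": "Pass"}, "location": [90, 50],
     "pass": {"end_location": [100, 40]}},
    {"type": {"name": "Shot"}, "location": [105, 38]},
    {"type": {"name": "Pass"}, "location": [70, 60],
     "pass": {"end_location": [82, 55]}},
]


def passes_into_final_third(events):
    """Return (start, end) pairs for completed passes that enter the final third."""
    selected = []
    for event in events:
        if event["type"]["name"] != "Pass":
            continue
        details = event["pass"]
        if "outcome" in details:  # assumption: no outcome key means completed
            continue
        start_x = event["location"][0]
        end_x = details["end_location"][0]
        # Keep passes that start outside the final third and finish inside it.
        if start_x < FINAL_THIRD_X <= end_x:
            selected.append((tuple(event["location"]), tuple(details["end_location"])))
    return selected


entries = passes_into_final_third(events)
for start, end in entries:
    print(f"{start} -> {end}")
```

From here, the start and end points are what you'd hand to a plotting library to draw as arrows on a pitch.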
State of play - accessing insight
To quote, well, myself, "You can think analytically without using data, and you can use data without thinking analytically."
It's easy to get carried away with the data itself, but the data is just a representation of football. So there are three things to understand: 1) football itself 2) the manner in which the data represents the football 3) the statistical techniques of working with data.
If you're already a mathematical person, you'll have a headstart on the third of these. The second is fairly easy to pick up with a bit of time and cross-referencing data with video footage. (A tip on that one: some of Statsbomb's open data is from FIFA World Cups, and FIFA have some World Cup games on YouTube).
The first is complicated by the fact that football itself is always evolving (although the general themes stay the same). Back in the mid-2010s, set-pieces were seen as a very inefficient method of chance creation; the recent focus on training routines has made that particular data-led analysis seem outdated.
As I say, the principles are generally consistent. If you're really interested, I'd recommend Spielverlagerung's Tactical Theory posts - starting with the oldest, because they're the broadest. The Athletic's 'How Football Works' series is quite good too.
My other advice would be to get a means of watching football matches (whether live or old games) on something where you can easily rewind by a couple of seconds. A really important part of my learning about football was spent watching clips, pausing, rewinding, watching again, rewinding, watching again, to try and work out why Player A did that instead of that.
In terms of data's interaction with football, I'd recommend the Hudl Performance Insight conference (previous editions at the Hudl Statsbomb channel). Also, Pysport's talks and output from the DTAI Sports Analytics Lab. Statsbomb's research on gender-aware modelling is also kind of a must-read. Previously-mentioned American Soccer Analysis are also a nice hub for some interesting work.
We're getting into the realm of research papers now, and if that's your thing then my starting point would be Jan Van Haaren's annual list. I will do one extra 'individual paper' recommendation, and that's this one on rest defence by Forcher et al. Rest defence is (or was, recently) a hot tactical topic, so this is an unusual case of data work being developed alongside tactical discussions at roughly the same pace. The citations/references are also good.
Post-script
Hope is not lost, even if data in tabular form is now harder to come by.
There's also this blog/newsletter, which you can subscribe to. Some representative lines from the latest post:
The idea of data staff being 'embedded' with a team's performance staff seems to be slowly growing.
A tracker in the ball and on boots (such as PlayerMaker; again, the BBC urge strikes) and you've got tracking data that can be used... anywhere?
I don't know what to do with this information other than file it under 'interesting'. I feel a similar way about La Liga putting out actual research papers.