'Analytics history' - there is so much to write about and share

This piece has been re-written about four times.

At one point, it opened by using The Verge's fun tech-history podcast series, Version History, as an analogy. At another point, it was bookended by references to 'The Dark Ages' and 'The Renaissance', talking about how the naming of time periods is as much about the era in which the names were coined as it is about the periods of history themselves.

Instead, you get the eventual fruit of trial and error, a muddle when things don't fit neatly into place. That's fitting, really, because that's what the history of 'football analytics' is really like.

The popularly-told version tends to go a little like this:

A long time ago, in a galaxy far, far away (1950s Swindon), RAF accountant Charles Reep collected some data. He was opinionated and wrong, but somewhat influential. Later, Prozone and Opta were founded in England. Opta 'created' expected goals, and Liverpool used it to hire Klopp, sign Salah, and win the Champions League and Premier League. (Brighton and Brentford also used data, because their owners were into betting, but no-one there talks on the record).

The Get Goalside version - for example, in 'The path to now' - has tended to go like this:

We have football datavis dating back to 1920s Hungary, a generation before Charles Reep began collecting his data. He was overly critical of possession football, but had decent ideas, and his dataset was used to create an xG-lite model in a 1990s paper. A full generation later we got Twitter Analytics, xG popularity exploded, coinciding with its role in Liverpool hiring Klopp, signing Salah, etc etc. The nerds won, but history rolls on.

Neither version is good.

By an extremely strange coincidence, one of the issues with both versions traces back to the very same city held up as the Face of Analytics: Liverpool. But not the Liverpool of Klopp, Salah, not even the red side of Liverpool at all. We're talking Everton. And we're talking the 1970s.

Early in that decade, an agreement between Dr Vaughan Lancaster-Thomas and Toffees chairman John Moores allowed Thomas Reilly (coincidentally, an Everton fan) to help monitor the players' physical and running-based metrics. Reilly - by all accounts an athletic man himself - took measurements of Everton players that would nowadays be considered routine, but back then most certainly weren't.

"[T]he statistics they gathered from pre-season were useful in a few ways," Joe Royle, Everton striker at the time, told The Athletic in 2020. "If you had a player who got injured during the season, did rehab and got back to what they thought was full fitness, they could say, 'Well, this is what you were doing pre-season' and compare the numbers." This was the early '70s![1]

In the middle of the decade, Liverpool Polytechnic (the establishment that Vaughan-Thomas was connected with, later named Liverpool John Moores University) launched 'the first BSc (Hons) degree in sports science' [quoting from LJMU themselves].

By that time, Mohamed Salah's birth was still almost two decades away. Yet, coincidence again: this period - the late '60s to early '70s, the era of Bobby Charlton and Gerd Müller - was also when two statistical papers using Charles Reep's shot location and passing sequence data were published, co-authored by Bernard Benjamin (1968 and 1971) and Richard Pollard (1971).[2]

Those papers would create a ripple that is, at the very least, detectable to a medium-grade Google Scholar search. For example, there's debate about the statistical pattern that goal-scoring follows, with two seemingly unrelated pieces of work in the early '80s arguing for a different tack to the Reep-involved papers.[3]

Later that decade, now-Professor Thomas Reilly helped convene the first World Congress of Science and Football. It would be held every four years (the year following a men's World Cup), and would be a place for research and researchers of all football codes: association, rugby (both types), American, Aussie rules, Gaelic, futsal. The congress - and subsequent editions - would have a focus on various features, from the 'measure lactic acid' end of sports science, to what'd now be called 'event data' studies, to psychology, to kinematics. The first edition, in 1987, even coincided with a burgeoning interest in computer-aided analysis systems.[4]

In the Get Goalside world of football analytics, there's always been a special place for the Forums put on by Opta [later Stats Perform, and later, really, superseded by StatsBomb, now part of Hudl]. The presenter alumni of the Forums would now count multiple heads of analytics departments in their number, not to mention others who work elsewhere in the game. (Including, I suppose, myself).

Besides providing 'outsiders' with the opportunity to rub shoulders with folks currently working at clubs, the Forums were also a hub for work and for connections of all kinds. That's probably, really, their biggest benefit. The Science and Football congresses seem to have had a similar effect, a quarter of a century earlier. If you find an interesting football statistics paper from the '90s or 2000s, chances are that a paper from a Science and Football congress will be in its citations.

And there are interesting papers. To throw out some examples from the sixth Science and Football congress (2007), there's 'Analysis of actions ending with shots at goal in the Women’s European Football Championship (England 2005)', by Józef Bergier, Andrzej Soroka and Tomasz Buraczewski. And: 'Match analyses of Australian international female soccer players using an athlete tracking device', by Adam Hewitt, Robert Withers, and Keith Lyons.[5] The year 2007 wasn't just ahead of the trend in terms of sports science on women's football; it was early days for GPS tracking devices full stop.

There's a wealth of interesting stuff from Spanish researchers too. For example, 'Use of the polar coordinates technique to study interactions among professional soccer players' (2002), by Carlos Lago Peñas and M. Teresa Anguera Argilaga, which plots interactions between players in a 'passing sonar' style well before it became popular online. Or, 'Análisis de las posesiones de balón en fútbol: frecuencia, duración y transición' (2008) by Julen Castellano, looking at the number of possessions per game and their durations, and comparing them with studies stretching back to the '90s.

The Spanish aren't singled out for any greater reason than some of the work was easy to trace online through the citation trail. The contents of Science and Football congresses (easier to find online than the full books of papers, published after the events) reveal a range of other nationalities, judging by the imperfect metric of surnames and countries cited in paper titles. The Get Goalside assumption holds: there've always been folk who've wanted to apply some rigorous counting to football.[6]

Why, then, is this rich seam of statistical exploration into football not better known?

Partly, because studies were damn hard to do before modern companies were able to take the task of data collection off researchers' hands, and that naturally limited what could be explored.[7] Take the following passage, from a paper at 2007's sixth Science and Football congress, on turn demands in the Premier League by Bloomfield, Polman, and O'Donoghue:

The on-field activity of 55 FA Premier League soccer players was recorded from Sky Television’s PlayerCam facility for approximately 15 minutes each. The 15 minutes recorded was reduced to approximately 5 minutes per player by only including video sequences where the player was in possession of the ball, [et cetera...]

Much of the work (that I've been able to read) from the 'pre-Opta era' has a similar problem, and similar sections devoted to the mode of data collection. Often there are theoretical justifications for the choices behind the method. Say what you like about the specs of modern data collection companies, at least they give you a framework to start from. It may not be an ideal one - it may even be one that diverts attention from interesting areas of research - but in many cases it will have been better than starting from scratch.

For the history of analytics to be told in the mainstream, it needed a narrative, a cause-and-effect, a directional arc. That's part of the success of Moneyball; the book (more so than the movie) tells a neat tale of data's entry into sport, from box scores to Bill James to his adherents to Billy Beane.

And the Liverpool story is the closest that football got to that. Against the odds (remembering that the club were in a tough spot in 2010), they won trophies with a manager tainted by a catastrophic season and talismanic players whose purchases were derided; they did so in part because of a faith in analytics, in expected goals; and between xG and Michael Edwards, the go-between for the data-heads and the manager, the story traces back to Opta (who popularised xG) and Prozone (where Edwards worked in the 2000s).

The line about history being written by the victors is only half-right. Sometimes history is what's actively remembered, sometimes it's shaped by what is deliberately suppressed, and sometimes by what's merely forgotten. Evolution follows a similar pattern; it led to opposable thumbs and complex brains, and also the platypus.

A building at Liverpool John Moores University is named after the late Professor Thomas Reilly. There's no significant reason why the history of football data became so synonymous with expected goals and not with the work of Reilly's precursors, peers, and intellectual descendants. The separation of 'sports science' history and 'analytics' history walks, talks, and sounds like a platypus.

Ending on 'platypus' - maybe this could've done with one more draft...


Footnotes

[1] The Athletic's article on Thomas Reilly's time at Everton, which inevitably has a certain word in the headline: 'How ‘private boffins’ helped Everton become sports science pioneers' - Kudos to The Athletic for the article.

[2] Charles Reep's data papers - 'Skill and Chance in Association Football' in 1968 (Benjamin, Reep); 'Skill and chance in ball games' in 1971 (Benjamin, Pollard, Reep). Like Reep, Benjamin had worked in the RAF, joining towards the end of the Second World War as a statistician, and would become president of the Royal Statistical Society (among many other things). Pollard has had a life worth reading about too.

[3] 1980s debate about whether goals follow a Poisson distribution - 'Is goal scoring a Poisson distribution?', 1981 (Colwell, Gillett) in The Mathematical Gazette; 'Modelling association football scores', 1982 (Maher) in Statistica Neerlandica. Pollard, who co-authored the '71 paper with Benjamin and Reep, responded to the discussions.

[4] The World Congresses on Science and Football - I'd been working on this post for a while before digging into these and ideally I'd spend a lot more time reading the proceedings from them. But I know a rabbit-hole when I see one, and unfortunately this would be a big one.

[5] Keith Lyons - This post owes a large debt to the late Keith Lyons, whose blog digitised and/or collected together a number of PDFs, including Reilly's 1975 PhD thesis. Other work is harder to come by. To take one example, an intriguing paper that Lyons references in a blog - 'Human Factors in Sports Systems: An empirical investigation of events in team games' (1983) - is available online, but paywalled by Sage Journals. Working through the references of Reilly's PhD thesis, things quickly become harder to trace (though obviously, in both cases, access to an academic library would help). It would have been a lot harder for me to stumble on the lines of enquiry in this piece were it not for Lyons' commitment to recording his memories, and the work of his peers.

[6] The 'Get Goalside' theory of nerds being into football for as long as football has existed - It's hardly a take worth a victory lap: association football, as a sport, is a century-and-a-half old. It predates a dozen American states' admission to the Union; predates the unification of both Germany and Italy, countries that have won eight men's World Cups between them. It was codified closer to the time of Beethoven than the era of World Cups themselves.

[7] The difficulty of a decent dataset in the 'pre-Opta era' - It feels pretty reasonable to say that a significant factor in Charles Reep's footprint on history is the size of the dataset he developed over the years. It also seems reasonable to assume that this was helped by his contact with Wolves boss Stan Cullis in the early 1950s, shortly after his data collection started. It's hard to overstate the effect that a show of belief from people in the profession can have on an 'outsider'. If you'd like a look yourself, Keith Lyons (see above) gathered together the data tables from the '68 Benjamin and Reep paper in a GitHub repository, stored as CSVs: https://github.com/2622NSW/Reep-and-Benjamin

"There is an opportunity to extend these stories and provide thick description of a pivotal moment of sport analytics in England. It requires a comprehensive, co-operative story-making effort. The outcome could be an inclusive and participatory account that is reflective and critical. [...] There is so much to write about and share."

-- Keith Lyons (1952-2020), writing in 2019