"With what shall I fetch it..."
"In a bucket, dear Henry, dear Henry, dear Henry / In a bucket, dear Henry, dear Henry, in a bucket."
"But there's a hole in my bucket-"
A centuries-old children's song there, telling the story of Henry's battle to get a helpful answer out of ChatGPT.
AI inevitably got an outing at this week's Opta Forum, the annual event held by one of football's major data providers. Chief Scientist Patrick Lucey opened the day by weaving from the GPT news du jour to how machine learning applies to Opta's data and products, all tied together with a neat overarching point about how OpenAI's models actually function.
These things, and the ones created by OpenAI's competitors, are large language models (LLMs), although with things like DALL-E and GPT-4's image recognition they're not really limited to words anymore. Sport, though, Lucey suggested, might not be an easy place to slot these LLMs in. They might be 'language' models, but they're generalists, and sport has a very specific language of its own. Even human coaches sometimes have to learn the new vocabulary of a club and staff when they arrive at a new job.
ChatGPT prompt: "Write half a paragraph about how, thankfully, the data provider Opta is here to save the day with their suite of data and product development, in a lightly but friendly needling tone"
ChatGPT output: "Well, well, well, it looks like we have a hero in the data world! Opta, the trusty data provider, has arrived to save us all from drowning in a sea of information overload. With their impressive suite of data and product development, they're like a beacon of light shining through the fog of confusion. Thank goodness for Opta!"
Ok, that's the last time I let ChatGPT write any copy. That was significantly worse than I was expecting.
It's a little funny that LLMs are entering public consciousness at this point in football data's evolution. I think you could argue that both large language models and tracking data are at the early stages of exponential curves in usage. Versions of them have been around for years, but now the tech is usable enough and available enough, with enough people experienced in using it, for take-up to rocket.
As well as this, both offer a tantalising possibility of opening up a whole new world to people. Search engines once did this, but then search engine optimisation (SEO) utterly borked Google; not only does GPT offer to bring the world back to your fingertips, but if you don't understand any of it you can ask it to explain in more detail, or simpler detail, or in the style of a pirate. (Actually, maybe be careful about pirate styles).
Tracking data, when paired with a good event data source, does the same thing for football analytics. You no longer have to stick to on-ball events - the equivalent, perhaps, of going direct to websites you know in a browser - instead you can go "hey, is our young centre-forward lurking in the central defender's blind spot as much as we've been telling them to?". And get an answer. (With a lot of data engineering).
But with the promise of so much comes a lot of potholes.
I've been working on a project that I hope Get Goalside will see soon, using our little Chat-3PO as a helper on some areas I don't know a lot about. There are times when it can be extremely helpful. But there can be times when, like the hole in the bucket song, you go round and round with it always seeming to be on the verge of getting the point, without ever actually getting there. Part of this was because of the limitations of LLMs, but part of it was just me losing sight of what I was actually trying to do.
Perhaps counterintuitively, in the world of seemingly limitless information, knowing when something is 'good enough' becomes a skill, because it's one way to stay focused on the real problem at hand.
The stat 'progressive passes' is a good example of this, I think. Definitions vary, but they generally revolve around a pass moving the ball a certain distance closer to goal. The current Premier League leaders, out of players who've played 10 or more games, are Oleksandr Zinchenko, Kevin De Bruyne, Rodri, Kyle Walker, and Thiago Alcantara (stats from FBref). Five undeniably good footballers. But you might notice that at least three of them play in deeper areas of the field, on teams who are dominant in possession. Maybe they're playing against defences who are set up in a way that makes it easier to achieve these passes. What you could do is keep tinkering with the definition, trying to capture every possibility and iron out potential issues and weird tactical interference... or you could accept the imperfection and just keep in mind that it might do 70% of what you want it to, but at least it's quick and understandable.
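To show how small a 'good enough' definition can be, here's a minimal sketch of one. Everything specific in it is my own assumption for illustration: the 105 x 68 metre pitch, the goal location, the 10-metre threshold and the function names. It isn't FBref's definition, or any provider's.

```python
import math

# Assumed pitch model: 105 x 68 metres, attacking towards x = 105.
# The centre of the opposition goal sits at (105, 34).
GOAL = (105.0, 34.0)


def distance_to_goal(x, y):
    """Straight-line distance in metres from a point to the goal centre."""
    return math.hypot(GOAL[0] - x, GOAL[1] - y)


def is_progressive(start, end, min_gain=10.0):
    """True if the pass ends at least `min_gain` metres closer to goal.

    `min_gain` is an illustrative threshold, not an official one.
    """
    gain = distance_to_goal(*start) - distance_to_goal(*end)
    return gain >= min_gain


# A 15-metre pass straight up the middle counts; a 5-metre one doesn't.
forward_pass = is_progressive((50.0, 34.0), (65.0, 34.0))
short_pass = is_progressive((50.0, 34.0), (55.0, 34.0))
# A switch of play from one flank to the other gains no distance to goal.
switch = is_progressive((50.0, 10.0), (50.0, 58.0))
```

It knows nothing about who is pressing, where the defensive block is or which team has the ball most, which is exactly the imperfection described above. But it fits on one screen and anyone can read it.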
This is why one of the research presentations, from Guillaume Hacques, appealed to me a lot. The title was 'Destabilising a Set Defence: Identification of Symmetry-Breaking Collective Movements', but you could boil it down to the difference in the direction that the two teams are travelling in. Combine all the player locations together and you have a team's centre of mass; if one centre of mass is moving to one wing while the other is moving to the opposite, there's something destabilising happening somewhere. There'll probably be some refinement that you could do, I'm sure, but it seemed right in that sweet spot in terms of bang for your buck and focus on the question at hand.
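Here's a toy version of that boiled-down idea, as I understood it. To be clear, this is my own sketch and not the method from the presentation: the function names, the two-frame comparison and the `min_shift` threshold are all assumptions, and positions are just (x, y) pairs in metres with y running across the width of the pitch.

```python
def centroid(positions):
    """Average (x, y) of a list of player positions: the team's centre of mass."""
    xs = [p[0] for p in positions]
    ys = [p[1] for p in positions]
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def lateral_shift(frame_before, frame_after):
    """How far the team's centre of mass moved across the pitch (y axis)."""
    return centroid(frame_after)[1] - centroid(frame_before)[1]


def moving_to_opposite_wings(att_before, att_after, def_before, def_after,
                             min_shift=1.0):
    """True if the two centres of mass shift towards opposite wings.

    `min_shift` (metres) is an illustrative threshold to ignore tiny drifts.
    """
    a = lateral_shift(att_before, att_after)
    d = lateral_shift(def_before, def_after)
    opposite_directions = a * d < 0
    return opposite_directions and abs(a) >= min_shift and abs(d) >= min_shift


# Two players per side keeps the example readable; real frames would have more.
attack_t0 = [(50.0, 30.0), (60.0, 40.0)]
attack_t1 = [(52.0, 33.0), (62.0, 43.0)]   # attack drifts one way
defence_t0 = [(70.0, 30.0), (80.0, 40.0)]
defence_t1 = [(70.0, 28.0), (80.0, 38.0)]  # defence drifts the other way

flagged = moving_to_opposite_wings(attack_t0, attack_t1, defence_t0, defence_t1)
```

With tracking data you'd run something like this over consecutive frames (or a smoothed window of them) and look at the moments it flags. The refinement I mentioned would live in how you choose the window and the threshold.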
Part of the reason why events like the Forum, and the wider analytics community, are so useful is that they surface ideas like these. Because they're hard to come up with. Football's a complex, and always subtly evolving, game and human interpretation is too. Even the companies producing the data don't always (some might say tend not to) know how best to use the data they're producing. StatsBomb*, for one, have been very open about this in their releases of new datasets: their excitement at what the data could hold, and the potential that might lie within it, is all part of the buzz of the announcements.
*Or '[redacted competitor]', if you prefer.
Let's compare tracking data to proto-Ultron one more time. Compared to the old world, it feels like you can ask ChatGPT and tracking data anything you want and be able to get plausible answers out of them. I should know: I have spent far, far too long getting plausible (but ultimately unhelpful) answers out of ChatGPT.
As far as my project was concerned, the most useful thing that OpenAI could have added to the tool was something that, every now and then, might say "let's talk about what you're actually trying to do here". (Although this would've also relied on the LLM having the capacity for digesting information).
I suspect that the same is going to be true with a bunch of tracking data work. To an extent, this is always the question that data people have had to ask, and it's always been up to data people in the professional game to do that questioning and parsing of language and intent.
But I also suspect that the level of questioning is going to grow. There's more data available, of course, but nowadays, and in future, there's going to be more questioning from coaches too. As players grow up with data and become more empowered/encouraged to look at it, their queries might be in the mix as well.
We're going to see that good problem-solving isn't just about the data or the tech. It's about recognising at what point 'dear Liza' should've just lent Henry a knife.**
Thanks for reading.
**Henry's third problem is that the straw, which Liza suggested he mend his bucket with, is too long. All his subsequent problems are about sharpening his own knife.