‘Staturday’ Weekly Column #1: Where do we start? Analytics Primer, Part 1
By Ryan Hobart, 2 years ago
Have you ever wanted to learn more about analytics in hockey? Have you ever been curious about what stories might lie in data about your favourite hockey team that you may not be aware of? If so, welcome to Staturday!
This is the initial post of this now weekly column. The goal is to find interesting stories in the publicly available data about the Toronto Maple Leafs, Toronto Six, Team Canada, or any of the other teams this site has the pleasure of covering.
What data do we use?
Primarily, publicly available hockey “advanced” statistics are based on shot attempts (aka Corsi). This is the base data set. There’s no reason to limit the discussion to this though. We can also talk about player production (we’ll get to the many ways we can look at this), shot location, passing data, zone entry data, time-on-ice, who the player is playing with or against, and even luck. We can also do fancier things, like putting some or all of these different stats into a model to give us a more nuanced answer.
To summarize, shot attempts were traditionally broken into two types of statistics: Corsi (all shot attempts, including those that miss the net or are blocked), and Fenwick (shot attempts not including those that are blocked). You’ll find that Corsi has pretty well won out over Fenwick, but there are some who believe Fenwick is better, at least in some cases.
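To make the difference concrete, here is a minimal Python sketch of the two counts. The event labels are hypothetical names I made up for illustration; real data sources use their own naming.

```python
# Hypothetical event labels, chosen for illustration only.
SHOT_ATTEMPT_TYPES = {"goal", "shot_on_goal", "missed_shot", "blocked_shot"}


def corsi(events):
    """Count all shot attempts, including those that miss or are blocked."""
    return sum(1 for e in events if e in SHOT_ATTEMPT_TYPES)


def fenwick(events):
    """Count shot attempts, leaving out the blocked ones."""
    return sum(
        1 for e in events
        if e in SHOT_ATTEMPT_TYPES and e != "blocked_shot"
    )


# A made-up sequence of events for one team; the faceoff is not a shot attempt.
events = [
    "shot_on_goal", "blocked_shot", "missed_shot",
    "goal", "faceoff", "blocked_shot",
]
corsi_total = corsi(events)
fenwick_total = fenwick(events)
```

In this made-up sequence, the two blocked shots count toward Corsi but not toward Fenwick.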
Typically we see this expressed as either a percentage, like Corsi For % (shortened to CF%), or as a rate, like Corsi For/Against per 60 minutes of ice time (shortened to CF60 or CA60). CF% is calculated by dividing the CF60 by the sum of the CF60 and CA60. The raw counts give the same answer, because both are measured over the same ice time. So if a player was on ice for 6 shot attempts for, and 4 shot attempts against, CF% would be 6 ÷ (6 + 4) = 6 ÷ 10 = 0.6 = 60%.
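Here is the same arithmetic as a short Python sketch. The function names and the 20 minutes of ice time are my own assumptions for illustration, not anything taken from a stats site.

```python
def per_60(count, toi_minutes):
    """Convert a raw count into a rate per 60 minutes of ice time."""
    if toi_minutes <= 0:
        raise ValueError("time on ice must be positive")
    return count * 60 / toi_minutes


def cf_pct(cf, ca):
    """Share of shot attempts that went in the team's favour (0 to 1)."""
    total = cf + ca
    if total == 0:
        return None  # no shot attempts either way, so CF% is undefined
    return cf / total


# The example from the text: 6 attempts for, 4 against.
from_counts = cf_pct(6, 4)

# The same player over an assumed 20 minutes of ice time, using rates.
cf60 = per_60(6, 20)
ca60 = per_60(4, 20)
from_rates = cf_pct(cf60, ca60)
```

Both `from_counts` and `from_rates` come out to 0.6, since converting to a per-60 rate scales the "for" and "against" numbers by the same amount.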
The other primary thing we can do to Corsi and Fenwick is only look at when the teams are even in strength. It isn’t fair to give a team a better CF% just because they get more powerplays than the other team, or maybe have a better powerplay than the other team. Even strength CF% is typically what we’re talking about when we’re looking at these stats.
My main go-to site for this is the website Natural Stat Trick, so check them out to get started.
For some, production is the be-all and end-all of their analytical process. Even at high levels of hockey analysis, the number of points a player scores has a big impact on how valuable they’re seen to be.
It’s important to look at this the right way, though. For instance, when we talk about production for forwards, we should really only talk about “primary” points, which are just goals and primary assists. Secondary assists aren’t counted. The reason for this is that they have been shown to be a matter more of luck than skill. For defenders, this is less of a problem.
The website IcyData nicely breaks out these production stats for easy consumption, so check them out.
Similar to Corsi and Fenwick, often we want to look at points scoring as a rate. For instance, if someone scores 20 points in 500 minutes of ice-time, that can be seen as being just as impressive as scoring 100 points in 2500 minutes of ice-time, since both work out to 2.4 points per 60 minutes. This is expressed as points per 60 minutes of ice-time (shortened to P60 or Points/60).
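As a quick check of the two examples above, here is a small Python sketch of the P60 calculation. The function name is just one I picked for illustration.

```python
def points_per_60(points, toi_minutes):
    """Points scored per 60 minutes of ice time."""
    if toi_minutes <= 0:
        raise ValueError("time on ice must be positive")
    return points * 60 / toi_minutes


# The two players from the example in the text.
player_a = points_per_60(20, 500)
player_b = points_per_60(100, 2500)
```

Both players land on the same rate of 2.4, which is the whole point of looking at production per 60 rather than as a raw total.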
The most basic and common “model” that’s used, though it comes in many forms, is called Expected Goals (aka xG). This is an idea used in many sports, including basketball and soccer. Essentially, you look not only at how many shot attempts you made, or allowed, you also look at where they came from. The closer the shot is to the net, and the closer to the center of the ice the shot is, the more likely it is to be a goal. Typically, you give each shot attempt a number that suggests how much you would expect that shot to be a goal, depending ONLY on location.
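To illustrate the location-only idea, here is a toy Python sketch. To be clear, this is not any published xG model: the logistic form is my own assumption, and the coefficients are made up and not fitted to any real data. They only encode the direction described above, where closer and more central shots are worth more.

```python
import math

# Made-up coefficients for illustration only; not fitted to real data.
INTERCEPT = -1.0
DISTANCE_COEF = -0.05  # per foot of distance from the net
LATERAL_COEF = -0.04   # per foot away from the centre line of the ice


def shot_xg(distance_ft, lateral_ft):
    """Toy chance (between 0 and 1) that a shot from this spot is a goal."""
    z = (
        INTERCEPT
        + DISTANCE_COEF * distance_ft
        + LATERAL_COEF * abs(lateral_ft)
    )
    return 1 / (1 + math.exp(-z))  # logistic function keeps it in (0, 1)


def total_xg(shots):
    """Add up the toy xG of every shot attempt in a list of (distance, lateral)."""
    return sum(shot_xg(distance, lateral) for distance, lateral in shots)


# Two hypothetical shots: one from in tight, one from far out and wide.
close_central = shot_xg(10, 0)
far_wide = shot_xg(45, 20)
team_xg = total_xg([(10, 0), (45, 20)])
```

Adding up the per-shot values gives a team or player total, which is how you end up with numbers like "expected goals for" over a game or a season.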
The website Evolving Hockey has one of these xG models built into their Skater Tables, so that’s a good place to start looking. However, this model builds in more factors than just shot location (though that is the most impactful factor). If you want the theory on how that model is created, you can find that in this paper written by the creators. The image below shows the factors that go into the model:
Why does this matter?
This is obviously the critical question. Why would any of this be important to discuss? I wish I could have put this at the start, but unfortunately I had to define the terms before the importance could be explained.
Generally, we can show that Corsi predicts future goals. The way we do this depends on the person looking at it, and I won’t burden you with the process in this primer post. Instead, I’ll show you the results.
A former co-writer at The Leafs Nation, draglikepull, on his website hockey.greatapes, wrote up a comparison between Corsi and some popular Expected Goals models to show how good they are at predicting future results. The results are broken up as being since 2007 (when shot attempts data started being tracked), since 2009 (since shot location data significantly improved), and since 2013 (more recent). I’ll show you just the most recent data, but you can find all of the results and process at this link.
The numbers in the table are the r^2 or “fit” of the model versus goals scored. This is how predictivity is often calculated for publicly available stuff. With professional implementations, there are more sophisticated ways to calculate predictivity, but we don’t see that often at the public level.
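For anyone curious what an r^2 calculation looks like, here is a minimal Python sketch using the squared Pearson correlation between two lists of numbers. This is one common way to compute it and is my assumption, not necessarily the exact method used in the linked analysis. The inputs would be something like a stat from one stretch of games and goals from a later stretch.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length, 2+ values each")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("r^2 is undefined when one list never changes")
    return covariance ** 2 / (spread_x * spread_y)


# Sanity checks on toy numbers (not real hockey data):
perfect_fit = r_squared([1, 2, 3], [2, 4, 6])    # ys rise in lockstep with xs
no_fit = r_squared([1, 2, 3, 4], [1, -1, -1, 1])  # no linear relationship
```

A value near 1 means the stat lines up closely with the goals that followed, and a value near 0 means it tells you almost nothing.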
You can see that all of Corsi (CF), Expected Goals (xGF) and Scoring Chances (SCF) respectably predict future goals, and all at around the same level. This is why we still use all of them: each model is equally valuable but might say different things about a particular team or player. In these variances we can find nuances about players and teams as they compare to each other.
Hopefully you come away from part 1 of this primer with the following understandings:
- What Corsi, Fenwick, and Expected Goals are
- The different ways we can look at scoring stats
- Why Corsi and Expected Goals are important
I will continuously use these base concepts over each of the Staturday posts, so it’s important that you understand them.
In the next primer, we’re going to look at some models that are more complicated than “Expected Goals”, as well as some of the more scouting-based tracking data that is publicly available. I might use these models and data in specific circumstances where it makes sense over the course of Staturday posts, but I don’t intend to overuse them, as their value isn’t as well tested as the more fundamental stuff we talked about today.
That said, enjoy your week, and I’ll see you next Staturday.