Here we go again: Advanced Stats and Analytics Primer, Part 1
By Ryan Hobart, 1 year ago
As we get ready to enter the 2021-22 NHL season, I thought it would be a good idea to re-run a primer on what advanced stats and analytical tools we have to use. So, if you have ever wanted to learn more about analytics in hockey, this is the post for you.
This is essentially a reworking of a previous post I did here. If you’ve read that one already and have a good memory of it, you won’t learn anything new here. This is mostly intended for those who are looking to learn more about this area as we head into the new season. But, if you want a refresher, this is also a good place to be.
What data do we use?
Player production stats are by far the most common stats used in hockey, because they’re so abundantly accessible. There’s a bunch of different, more flavourful ways to look at this as well, which I’ll get into further down. However, primarily, publicly available hockey “advanced” statistics are based on shot attempts (a.k.a. “Corsi”). A shot attempt is any time a player flings the puck toward the net, whether that shot misses, is blocked by a player, is stopped by the goalie, or goes in the net. This is the base data set and is the starting point for how to gain a more analytical approach to hockey.
We like shot attempts more than goals or points because there’s way more data to work with. Statistical analysis is usually more valuable if you can put more data into your workspace. Because goals and points happen less often, the stats we glean from those numbers have a lot more variance, or “noise”. This noise means that it’s hard for production to be repeatable, because as we all know, there are so many variables that go into whether a puck goes in the net or not, some of which are outside of a player’s control. With a ton of data, the effect of that noise gets diminished, which is why we look at shot attempts.
We can also look at more interesting stuff like shot location, passing data, zone entry data, time-on-ice, teammates and competition effects, and even luck.
We can also do fancier things, like putting some or all of these different stats into a model to give us a more nuanced answer.
Shot attempts were traditionally broken into two types of statistics: Corsi (all shot attempts, including those that miss the net or are blocked), and Fenwick (shot attempts not including those that are blocked). You’ll find that Corsi has pretty well won out over Fenwick, but there are some who believe Fenwick is better, at least in some cases.
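To make the difference concrete, here is a minimal Python sketch that counts Corsi and Fenwick from a list of shot-attempt outcomes. The outcome labels and the function name are my own, made up for illustration; real data feeds use their own naming.

```python
from collections import Counter

# Hypothetical outcome labels, chosen for this example only.
CORSI_OUTCOMES = {"goal", "saved", "missed", "blocked"}
FENWICK_OUTCOMES = CORSI_OUTCOMES - {"blocked"}  # Fenwick drops blocked attempts


def count_attempts(outcomes):
    """Return (corsi, fenwick) counts for a list of shot-attempt outcomes."""
    counts = Counter(outcomes)
    corsi = sum(counts[o] for o in CORSI_OUTCOMES)
    fenwick = sum(counts[o] for o in FENWICK_OUTCOMES)
    return corsi, fenwick


attempts = ["saved", "blocked", "missed", "goal", "blocked", "saved"]
corsi, fenwick = count_attempts(attempts)
print(corsi, fenwick)  # 6 4
```

The only thing separating the two numbers is the two blocked attempts, which count toward Corsi but not Fenwick.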
Typically we see this expressed as either a percentage, like Corsi For % (shortened to CF%), or as a rate, like Corsi For per 60 minutes of ice time (shortened to CF60, or CA60 for ‘against’ instead of ‘for’). CF% is calculated by dividing CF60 by the sum of CF60 and CA60, which is the same as dividing the raw count of attempts for by the total attempts for and against. So, if a player was on ice for 6 shot attempts for, and 4 shot attempts against, CF% would be 6 ÷ (6 + 4) = 6 ÷ 10 = 0.6 = 60%.
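As a rough sketch of that arithmetic in Python (the function names are mine, and the 20 minutes of ice time is a made-up figure just to show the rate conversion):

```python
def cf_percent(cf, ca):
    """Share of all shot attempts that went in the player's (or team's) favour."""
    total = cf + ca
    if total == 0:
        return None  # no attempts either way, so the percentage is undefined
    return cf / total


def per_60(count, minutes):
    """Convert a raw count into a rate per 60 minutes of ice time."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return count * 60 / minutes


# The example from the text: 6 attempts for, 4 against.
print(cf_percent(6, 4))  # 0.6

# Assumed 20 minutes of ice time, purely for illustration.
cf60 = per_60(6, 20)  # 18.0
ca60 = per_60(4, 20)  # 12.0
print(cf_percent(cf60, ca60))  # 0.6
```

Because the for and against rates share the same ice time, the percentage comes out the same whether you start from raw counts or from per-60 rates.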
The other primary thing we can do to Corsi and Fenwick is only look at when the teams are even in strength. It isn’t fair to give a team a better CF% just because they get more powerplays than the other team, or maybe have a better powerplay than the other team. Even strength CF% is typically what we’re talking about when we’re looking at these stats.
My main go-to site for this is the website Natural Stat Trick, so check them out to get started.
For some, production is the be-all and end-all of their analytical process. Even at high levels of hockey analysis, the number of points a player scores has a big impact on how valuable they’re seen to be.
It’s important to look at this the right way, though. For instance, when we talk about production for forwards, we should really only talk about “primary” points, which are just goals and primary assists. Secondary assists aren’t counted. The reason for this is that they have been shown to be a matter more of luck than skill. For defenders, this is less of a problem.
The website IcyData nicely breaks out these production stats for easy consumption, so check them out.
Similar to Corsi and Fenwick, often we want to look at points scoring as a rate. For instance, if someone scores 20 points in 500 minutes of ice-time, that can be seen as being just as impressive as scoring 100 points in 2500 minutes of ice-time. This is expressed as points per 60 minutes of ice-time (shortened to P60 or Points/60).
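Here is that comparison worked out as a small Python sketch (the function name is my own):

```python
def points_per_60(points, minutes):
    """Points scored per 60 minutes of ice time."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return points * 60 / minutes


# The two players from the example above score at the same rate.
print(points_per_60(20, 500))    # 2.4
print(points_per_60(100, 2500))  # 2.4
```

Both come out to 2.4 points per 60 minutes, which is why the rate treats them as equally productive despite the gap in raw totals.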
The most basic and common “model” that’s used, though it comes in many forms, is called Expected Goals (aka xG). This is an idea used in many sports, including basketball and soccer. Essentially, you look not only at how many shot attempts you made or allowed, but also at where they came from. The closer the shot is to the net, and the closer to the center of the ice the shot is, the more likely it is to be a goal. Typically, you give each shot attempt a number that suggests how likely that shot is to become a goal, depending ONLY on location.
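To give a feel for the idea, here is a toy location-only xG sketch in Python. Everything specific in it is an assumption for illustration: the logistic form, the coordinate convention, and especially the coefficients, which are made up and not taken from any published model. A real model would fit these values to historical shot data.

```python
import math

# Made-up coefficients for illustration only; not from any real xG model.
INTERCEPT = -1.0
DISTANCE_COEF = -0.05  # effect per foot of distance from the net
ANGLE_COEF = -0.02     # effect per degree away from the centre line


def shot_xg(x_ft, y_ft):
    """Toy goal probability for one shot attempt.

    x_ft: how far out from the net the shot is, measured along the centre line.
    y_ft: how far to either side of the centre line the shot is.
    """
    distance = math.hypot(x_ft, y_ft)
    angle = abs(math.degrees(math.atan2(y_ft, x_ft)))
    z = INTERCEPT + DISTANCE_COEF * distance + ANGLE_COEF * angle
    return 1 / (1 + math.exp(-z))  # logistic function keeps the value in (0, 1)


def total_xg(shots):
    """Sum the per-shot values to get expected goals for a set of attempts."""
    return sum(shot_xg(x, y) for x, y in shots)


close_and_central = shot_xg(10, 0)
far_and_wide = shot_xg(40, 30)
print(close_and_central > far_and_wide)  # True
```

The point is only the shape of the calculation: each attempt gets a value between 0 and 1 that falls as the shot moves farther from the net or away from the middle of the ice, and those values are added up.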
The website Evolving Hockey has one of these xG models built into their Skater Tables, so that’s a good place to start looking. However, this model builds in more factors than just shot location (though that is the most impactful factor). If you want the theory on how that model is created, you can find that in this paper written by the creators. The image below shows the factors that go into the model:
Why does this matter?
This is obviously the critical question. Why would any of this be important to discuss? I wish I could have put this at the start, but unfortunately I had to define the terms before the importance could be explained.
Generally, we can show that Corsi predicts future goals. The way we do this depends on the person looking at it, and I won’t bore you with the process in this primer post. Instead, I’ll show you the results.
A former co-writer at The Leafs Nation, draglikepull, on his website hockey.greatapes, wrote up a comparison between Corsi and some popular Expected Goals models to show how good they are at predicting future results. The results are broken up as being since 2007 (when shot attempts data started being tracked), since 2009 (since shot location data significantly improved), and since 2013 (more recent). I’ll show you just the most recent data, but you can find all of the results and process at this link.
The numbers in the table are the r^2, or “fit”, of the model versus goals scored. This is how predictivity is often calculated for publicly available stuff. With professional implementations, there are more sophisticated ways to calculate predictivity, but we don’t see that often at the public level.
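For anyone curious what an r^2 calculation looks like, here is a minimal Python sketch. It assumes a simple straight-line fit between one stat and later goals, in which case r^2 is the squared correlation between the two; the linked analysis may have done it differently. The numbers in the example are invented, not real team data.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r^2 is undefined when one list is constant")
    return (sxy * sxy) / (sxx * syy)


# Invented numbers: earlier CF% for five teams, and their later goals-for share.
past_cf_pct = [0.47, 0.49, 0.50, 0.52, 0.55]
future_gf_pct = [0.46, 0.51, 0.49, 0.53, 0.54]
print(round(r_squared(past_cf_pct, future_gf_pct), 2))
```

A value near 1 means the earlier stat lines up closely with later goals, and a value near 0 means it tells you almost nothing.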
You can see that all of Corsi (CF), Expected Goals (xGF) and Scoring Chances (SCF) respectably predict future goals, and all at around the same level. This is why we still use all of them: each model is equally valuable but might say different things about a particular team or player. In these variances we can find nuances about players and teams as they compare to each other.
Hopefully you come away from part 1 of this primer with the following understandings:
- What is Corsi, Fenwick, and Expected Goals
- What different ways can we look at scoring stats
- Why Corsi and Expected Goals are important
I will continuously use these base concepts over each of the Staturday posts, so it’s important that you understand them.
In the next part, we’re going to look at some models that are more complicated than “Expected Goals”, as well as some other interesting ways to visualize this data, and a very interesting project for manually tracking additional player data like passes and zone entries. These models and data I might use in specific circumstances where it makes sense over the course of Staturday posts, but I don’t intend to overuse them, as their value isn’t as well tested as the more fundamental stuff we talked about today.