Bell curves and bias: how percentiles can strip away context in hockey stats

Jon Steitzer
18 days ago
I’m sorry if talking about hockey analytics bores you so much that you’ve immediately unbookmarked this site and thrown your laptop in the trash so you never risk seeing someone mention Corsi or expected goals again. To be fair, those of you who anger easily at stats might actually agree with some of what I’m saying, because I think most people look at stats wrong, and that is largely down to not accounting for context in percentile stats.
Below is probably the most frequent example of percentiles in a player card:
I’ll save my thoughts on WAR stats for another day, as I don’t put much stock in what is largely a black box of weighted numbers applied universally to all players. Instead we’ll focus on my main issue, which is percentiles as the means of determining who is good and who is bad. While I agree that Chris Tanev is good, I wouldn’t put him in the top 10 percent of NHL players like the projected WAR% number leads you to believe. The pinned tweet statement acknowledges that the numbers are neither definitive nor predictive, which only adds to the “for entertainment purposes only” aspect of the numbers, but again, this isn’t about WAR player cards, it’s about percentiles in general.
I’m going to step away from the player cards and instead look at some specific stats and how percentiles apply to them. While I’m sure many people have happily moved on from Corsi For% as their means of evaluating on-ice performance, it is one of the easiest ways of demonstrating what I’m talking about.
The graph below shows the 5v5 Corsi For% for all NHL forwards this season with 300 minutes played. The green represents the 90th percentile, the red represents the 10th percentile, and the yellow represents the 25th, 50th, and 75th percentiles. The bars are counts of the number of players with that CF% rounded to the nearest whole number:
As you can see, the bell curve demonstrates the issue with reading percentile-driven data as a ranking. What we see are a few extreme numbers, with 7 players below 40% CF% and 9 players above 60% CF% out of 370 skaters. There is only a 4 percentage point gap between the 10th and 25th percentiles and a 5 percentage point gap between the 75th and 90th percentiles. 50% of NHL forwards fall between 47% and 53%. Given that these are percentiles of a percentage-driven stat, it’s really not a surprise that the bell curve is pretty strong here. But when people interpret this data, the 80th percentile is going to be treated as the top of the league while the 60th percentile is considered somewhat average, and that is a difference of 2.8 percentage points.
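To make the compression concrete, here is a rough sketch of how percentiles map back to CF% values. It assumes CF% is approximately normal, centred on 50, with the middle half of players between 47 and 53 as quoted above. That model, and the names `cf_model`, `value_at`, and `gap`, are my own illustration and not the actual league data.

```python
from statistics import NormalDist

# Assumption: CF% is roughly normal and centred on 50, and the middle 50%
# of players sit between 47 and 53 (the figures quoted in the article).
IQR_HALF_WIDTH = 3.0
Z_75 = NormalDist().inv_cdf(0.75)  # z-score of the 75th percentile, about 0.674
cf_model = NormalDist(mu=50.0, sigma=IQR_HALF_WIDTH / Z_75)


def value_at(percentile: float) -> float:
    """CF% at a given percentile (strictly between 0 and 100) under the model."""
    return cf_model.inv_cdf(percentile / 100.0)


def gap(low_pct: float, high_pct: float) -> float:
    """Difference in CF% points between two percentiles under the model."""
    return value_at(high_pct) - value_at(low_pct)


if __name__ == "__main__":
    for pct in (10, 25, 50, 60, 75, 80, 90):
        print(f"{pct:>3}th percentile -> {value_at(pct):.1f} CF%")
    print(f"60th to 80th percentile gap: {gap(60, 80):.1f} points")
    print(f"79th to 99th percentile gap: {gap(79, 99):.1f} points")
```

Under this simplified model, the same 20 percentile step covers only a couple of CF% points in the middle of the curve and several times that at the tail, which is why a percentile rank alone says little about how far apart two players really are.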
Going back to WAR as an example, 95% of the league last season fell in the WAR range of -1.8 to 2. That’s 846 skaters in a range of 3.8 WAR, with the difference between one percentile and the next being anywhere from 0.01 to 0.09 through that 95% of the league. That makes for shaky analysis when valuing Sam Steel in the 74th percentile of WAR at 0.59 and contrasting him with Jakob Chychrun, who had 0.11 WAR and was in the 49th percentile of the league.
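The arithmetic behind that is quick to check. This is just a back-of-the-envelope sketch using the numbers quoted above; the variable names are mine, and spreading the range evenly across percentiles is a simplification.

```python
# Figures quoted in the article.
WAR_LOW, WAR_HIGH = -1.8, 2.0   # range containing 95% of the league
LEAGUE_SHARE_PCT = 95           # percentile points covered by that range

# Average WAR per percentile point if the range were spread evenly
# (a simplification; the real spacing varies from 0.01 to 0.09).
avg_war_per_point = (WAR_HIGH - WAR_LOW) / LEAGUE_SHARE_PCT

# (percentile, WAR) for the two players being compared.
steel = (74, 0.59)
chychrun = (49, 0.11)

percentile_gap = steel[0] - chychrun[0]
war_gap = steel[1] - chychrun[1]
war_per_point_between = war_gap / percentile_gap

if __name__ == "__main__":
    print(f"Average WAR per percentile point: {avg_war_per_point:.3f}")
    print(f"Steel vs Chychrun: {percentile_gap} percentile points, {war_gap:.2f} WAR")
    print(f"WAR per percentile point between them: {war_per_point_between:.3f}")
```

A 25 point percentile gap sounds large, but here it amounts to less than half a win.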
The NHL Edge data is certainly interesting, as it gives us a lot of context around the speed of the skaters, how strong their shots are, and so on. But it too is presented with percentiles, and if a player falls below the 50th percentile they may be judged harshly. This summary of the top speeds of forwards in the NHL this season, rounded to the nearest 0.5 mph, shows how little variance there is, and the percentiles certainly overstate any concern that should exist.
The 25th-75th percentile range being only a difference of one mile per hour not only shows the ridiculousness of percentiles but maybe quells some of the concerns around speed in the NHL. There are a few players, like Pat Maroon, who legitimately warrant concerns about how fast they are, but generally we are looking at a league that is capable of hitting similar top speeds. The difference between 100th percentile Owen Tippett and 75th percentile Jack Eichel is less than 1 mph as well.
So while percentiles are perfectly fine for a quick glance, they are only a first step, and that’s something I saw put well by Garret Hohl of Hockey Graphs recently in regard to player cards. For a quick and dirty summary, you can see who is better than others at something, but if you are doing analysis, it probably shouldn’t end there.
In a sport like hockey, where context and the number of variables keep the eye test relevant and an important partner to statistical analysis, it doesn’t make sense to add barriers to understanding players by presenting only the percentile. After all, most of the time you are already heat mapping the value to capture the percentile, so doesn’t it make more sense to just provide the actual value?
Data from Puckalytics and Patrick Bacon’s Wins Above Replacement dashboard
