Descriptive first, prescriptive later

How to build out analytic layers

So far we’ve discussed generally what analytics and deep analysis are all about (be less wrong), and how to begin getting there (use divergent and integrative thinking).

This article looks at how a team can create a fruitful vein of data throughout an organization.

First, be descriptive

The steps, or hierarchy, for effectively leveraging data are: descriptive - predictive - prescriptive, with each step building on the one before.

First, you want to have enough clean information that you can more effectively describe reality. This can be accomplished by scraping and collating public data, then arranging it in easily accessed and parsable databases. Think: Natural Stat Trick.
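As a minimal sketch of that “collate into parsable databases” step: load scraped season rows into a queryable SQLite table. The player names, fields, and numbers here are invented for illustration.

```python
import sqlite3

# Hypothetical scraped rows: (name, season, even-strength minutes, even-strength points)
rows = [
    ("Player A", 2021, 980, 41),
    ("Player B", 2021, 1105, 28),
]

con = sqlite3.connect(":memory:")  # a real pipeline would persist to disk
con.execute("""
    CREATE TABLE skater_seasons (
        name TEXT, season INTEGER, es_minutes REAL, es_points INTEGER
    )
""")
con.executemany("INSERT INTO skater_seasons VALUES (?, ?, ?, ?)", rows)
con.commit()

# Once collated, descriptive queries are one line away, e.g. points per 60:
for name, p60 in con.execute(
    "SELECT name, 60.0 * es_points / es_minutes FROM skater_seasons"
):
    print(name, round(p60, 2))
```

The point isn’t the tooling; it’s that once the data is clean and in one place, every descriptive question becomes a short query.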

Of course, this can also mean simply counting things that happen and creating proprietary databases yourself. See: Corey Sznajder’s tracking project.

Aside from the obvious benefits of having quick, easy access to data, the major reason to develop a set of descriptive stats is to help understand base rates and establish your priors. 

This is the rock upon which your “data analysis church” is built. Descriptive stats are not only the raw material for decision making and model building, they help create boundaries and context around which assumptions, theories, and tests can be constructed.

Descriptive stats can help answer basic, but vital, questions like:

  • How often does “X” happen?

  • What is the average incidence of “N” in circumstances “X, Y, and Z”?

  • What is the variance of this result over time or in this particular population sample?

To get less abstract, a good descriptive database can tell you that, in the NHL, an elite level of offensive contribution is 3.00 even-strength points per hour, while 1.80 ESP/60 is merely “good”, and anything around 1.00 or below is near the bottom of the league amongst forwards.
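Those benchmarks are easy to operationalize once the descriptive layer exists. A small sketch using the rough cutoffs above for forwards (the tier labels are my own shorthand):

```python
def es_points_per_60(es_points, es_minutes):
    """Even-strength points per 60 minutes of even-strength ice time."""
    return 60.0 * es_points / es_minutes

def tier(esp60):
    # Cutoffs follow the article's rough benchmarks for forwards.
    if esp60 >= 3.0:
        return "elite"
    if esp60 >= 1.8:
        return "good"
    if esp60 <= 1.0:
        return "near bottom of the league"
    return "middle of the pack"

# e.g. a forward with 55 ES points in 1100 ES minutes:
print(tier(es_points_per_60(55, 1100)))  # -> elite
```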

This all seems rather obvious, but it’s a first hurdle a lot of teams have stumbled at in the very recent past. Effectively creating, assembling, cleaning, sorting, and parsing vast swaths of information is fundamental, but it can be a significant challenge as well.

Next, get predictive

Once you have a good descriptive base, the next step is testing things for correlations, causation, and impacts.

This is the process of using more advanced statistical methods to try to predict the future. With descriptive stats, you look at what happened in the past. In the predictive step, you try to take those results and determine what is more likely to happen in the future.

This is where things can get messy. Predictive models tend to be based on a particular theory or insight, and then use a variety of weighted factors to try to elicit bundled outputs or probabilities.
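To make “weighted factors eliciting probabilities” concrete, here is a toy logistic model. The factors, weights, and numbers are invented for illustration and are not drawn from any real hockey model:

```python
import math

# Invented weights: a "model" in this sense is just factors, weights,
# and a link function that squashes the result into a probability.
WEIGHTS = {"intercept": -3.0, "nhle": 0.08, "drafted_under_18": 1.2}

def p_star(nhle, drafted_under_18):
    """Toy probability that a prospect becomes a star (illustrative only)."""
    z = (WEIGHTS["intercept"]
         + WEIGHTS["nhle"] * nhle
         + WEIGHTS["drafted_under_18"] * (1 if drafted_under_18 else 0))
    return 1.0 / (1.0 + math.exp(-z))  # logistic link

print(round(p_star(30, True), 2))
```

In practice the weights would be fit to historical data (e.g. by logistic regression) rather than set by hand, but the structure — inputs in, weighted sum, probability out — is the same.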

This is also where a lot of conventional or traditional thinking violently collides with “fancy stats” for a couple of reasons:

1.) Models built on seemingly arcane statistical tools like regression appear to be “black boxes”, whose assumptions and weights are mostly opaque to the layman.

2.) The outputs from these models that fly in the face of what seems both widely known and plainly obvious tend to get the most attention.

These factors combine to create the perception of a black box packed with random numbers that is arbitrarily tuned to spit out apparent nonsense.

And it’s true that a given model’s outputs won’t always accord with reality. But it helps to keep in mind that:

  • Models are never 100% “right”, but the best models are “useful”

  • In the quest to be non-consensus right, a model’s unconventional results could prove to be its most valuable insights

  • It is vital to understand the assumptions and methods that underpin a model, so that you can also understand its potential blind spots and limitations.

For example, Byron Bader’s prospecting model, built on age and NHLe factors, is notoriously ambivalent about defenseman Adam Fox. This seems like a “miss” that invalidates the model, because Fox entered the league as a fully formed star (and the player posted incredible results in college beforehand).

However, Bader discussed this apparent oversight on Twitter recently.

Bader actually wrote that Fox was going to be a star a few years ago. Then he built a model that claims Fox wasn’t likely to become one! Nevertheless, for my money Bader’s prospecting tools are some of the most useful currently available in the public domain.

This apparent contradiction reiterates that models are like special goggles you put on to help display a particular aspect of reality. Their utility depends on what they illuminate, how they do it, and where their potential blind spots are. Kind of like a pair of infrared glasses at night (just don’t wear them in the sunshine).

Mount Everest, or getting prescriptive

You have your well-sourced and well-managed descriptive databases. You have established priors, base rates, and data context. You’ve used them to create a handful of usefully predictive models.

The last step is to leverage it all to become prescriptive. Prescriptive analysis can take large volumes of data and spit out specific tactics or courses of action.

I’ll illustrate with a hypothetical.

Let’s say you’re the head amateur scout of an NHL organization. You want to build a strong analytical backbone that will help guide your department’s decisions moving forward.

Descriptive layer - Create databases for individual prospect counting stats, demographic and attribute profiles (age, league, size, etc.), as well as relevant microstats if possible.

Predictive layer - Create prospect probability and draft position value models that help forecast the potential value of a given pick or the probability a certain player will become an NHL star.

Prescriptive layer - Leverage your data and models to tune an AI that will create recommended ranked draft listings using all available draftable players. Deep dive into recommendations that seem counter to convention or your own intuitions. Create a system where you can feed in scenarios to see how/if the recommendations change.
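A toy version of that prescriptive layer: take the scores a predictive model produced, rank the available players, and re-run under different scenarios. All names and probabilities here are invented:

```python
# Hypothetical output of the predictive layer: P(becomes an NHL star).
star_prob = {
    "Player A": 0.35,
    "Player B": 0.22,
    "Player C": 0.41,
}

def ranked_list(available, scores):
    """Recommend a draft ranking: highest modelled upside first."""
    return sorted(available, key=lambda p: scores[p], reverse=True)

print(ranked_list(star_prob, star_prob))

# Scenario feeding: suppose Player C is taken by another team —
# re-run and see how the recommendation changes.
available = [p for p in star_prob if p != "Player C"]
print(ranked_list(available, star_prob))
```

A real system would fold in draft-position value, team needs, and uncertainty, but the shape is the same: data and models in, a ranked course of action out.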

To summarize, a full, integrated stack of data and analytics tools will look something like this (via Alteryx):

What I’m reading:

Alison Lukan on the secret strength of Oliver Bjorkstrand’s game. Bjorkstrand is an intriguing player in that his underlying number profile resembles that of a star, but he has mostly been deployed as a second or third line player in Columbus so far.

Bjorkstrand was a big scorer in junior and collects points at an efficient clip as well. Alison does a deep dive into Bjorkstrand’s defensive metrics here, however, to show that the player seems to have a huge impact south of the red line as well.

Ryan Clark on how the Seattle Kraken are building a league-leading analytics staff ($ - paywall alert). Will the newest NHL team’s large, diverse, and talented analytics department give them a leg up when they finally hit the ice? (Spoiler: almost certainly yes.)

Loserpoints argues that standings points above replacement (SPAR) is a more intuitive metric than wins above replacement (WAR).

Recommended tools - MyMind. I collect a lot of references and links in my virtual travels. MyMind is a new tool that acts as a kind of convenient “dumping ground” for useful stuff. It also includes an easy AI-powered search feature that helps sort through whatever you put there.