NHL Milestone 1

September 17, 2024

1. Data Acquisition

Objective

In this first section, we explain how to load NHL data into the memory of a Jupyter notebook. The data consists of all the play-by-play game events for every NHL game of seasons 2016-2017 to 2023-2024, including both regular-season and playoff games. The data is loaded by a Python class, NHLDataProvider, which is explained in the following sections.

1.1 The NHL API and understanding GAME_ID

In brief, to get the play-by-play data of a game, we send a GET request to the following endpoint, where “GAME_ID” is the unique identifier of an NHL game.

https://api-web.nhle.com/v1/gamecenter/{GAME_ID}/play-by-play

Note that the endpoint above replaces the old API endpoint, GET https://statsapi.web.nhl.com/api/v1/game/ID/feed/live, following the NHL's API update.
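As an illustration of the request described above, here is a minimal sketch that builds the endpoint URL and downloads the JSON payload using only the standard library. The function names `build_play_by_play_url` and `fetch_play_by_play` are hypothetical and are not part of the NHLDataProvider class; the timeout value is also an assumption.

```python
import json
import urllib.request

# Endpoint template taken from the section above.
BASE_URL = "https://api-web.nhle.com/v1/gamecenter/{game_id}/play-by-play"


def build_play_by_play_url(game_id: str) -> str:
    """Return the play-by-play URL for one game (hypothetical helper)."""
    return BASE_URL.format(game_id=game_id)


def fetch_play_by_play(game_id: str, timeout: float = 10.0) -> dict:
    """Send the GET request and decode the JSON response (hypothetical helper)."""
    url = build_play_by_play_url(game_id)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_play_by_play("2023020001")` would request the first regular-season game of 2023-2024, assuming the API is reachable from the notebook.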

1.2 Finding all game IDs of a season

To obtain all the gameIds for each season, our NHLDataProvider class uses the following important private methods:

The flow-diagram explains the working of these methods:

find_game_ids_flowdg

1.3 Understanding the NHLDataProvider Class and how it provides data

The NHLDataProvider class exposes public methods that return raw data according to the user’s needs. The data is read from the cache when it is available there; otherwise it is downloaded through the NHL API and then returned.

The initializer of the NHLDataProvider class sets up the cache location as shown in the snippet below:

Class Init

When the user needs the raw play-by-play data, the class provides two public methods as follows:

Furthermore, when a game is not in the cache, its data is retrieved through the API, as shown in the following snippet:

API call

Combined with the information from Section 1.2, the following flow-chart summarizes the Data Acquisition pipeline:

Final flow-chart

2. Interactive Debugging Tool

Description of the Debugging Tool

We have implemented 3 widgets for selecting the game ID: the first slider sets the year, the second sets the type of game (regular season or playoffs), and the third sets the specific game number. Once the user selects a game, the home team and away team are displayed. The user can then choose a specific event in the game to see its coordinates on the rink, along with other relevant data obtained from the JSON files.
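The three widget values map naturally onto a game ID. The sketch below assumes the commonly documented 10-digit layout: a 4-digit season start year, a 2-digit game type ("02" for regular season, "03" for playoffs), and a 4-digit game number. The helper name `compose_game_id` is hypothetical and is not taken from our tool's code.

```python
# Assumed game-type codes in the 10-digit NHL game ID.
GAME_TYPE_CODES = {"regular": "02", "playoffs": "03"}


def compose_game_id(season_start_year: int, game_type: str, game_number: int) -> str:
    """Build a game ID from the three widget selections (hypothetical helper)."""
    type_code = GAME_TYPE_CODES[game_type]
    # Zero-pad the year and game number to 4 digits each.
    return f"{season_start_year:04d}{type_code}{game_number:04d}"
```

For playoff games, the last four digits are generally understood to encode the round, the series, and the game within the series, so the "game number" slider would need to account for that layout.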

Screenshots of the Tool

Debugger Tool 1 Debugger Tool 2 Debugger Tool 3

Screenshots of our Code

Debugger Code 1 Debugger Code 2 Debugger Code 3 Debugger Code 4 Debugger Code 5 Debugger Code 6

3. Tidy Data

3.1 Snippets of our final dataframes

3.3 Suggestion of 3 possible additional features

4. Simple Visualizations

4.1 Shot types

Important note: for this analysis, we decided to drop shot types that account for less than 0.1% of all shots, because they carry too little data to be meaningful, especially when compared to other shot types. The shot types dropped were “between-legs” and “cradle”, with 0.06% and 0.005% usage, respectively.
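The filtering rule above can be sketched as follows. This is a minimal illustration, assuming the shot types are available as a flat list of strings; the function name `drop_rare_shot_types` and the `min_share` parameter are hypothetical.

```python
from collections import Counter


def drop_rare_shot_types(shot_types, min_share=0.001):
    """Remove shot types used in less than `min_share` of all shots.

    Returns the filtered list and the sorted names of the dropped types.
    """
    counts = Counter(shot_types)
    total = sum(counts.values())
    # Keep a type only if its share of all shots reaches the threshold.
    kept = {t for t, n in counts.items() if n / total >= min_share}
    dropped = sorted(set(counts) - kept)
    return [t for t in shot_types if t in kept], dropped
```

With `min_share=0.001` (0.1%), a type seen once in 2,000 shots (0.05%) would be dropped, while a type seen twice (0.1%) would be kept.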

2023-2024 Season Shot Types Bar plot

Above, we can see 2 bar plots presenting shot-type data for the 2023-2024 season. Each bar represents a shot type used by NHL players.

Bar plots suit this kind of categorical data well: the shot types are clearly separated, and the height of each bar makes it easy to compare them.

The top bar plot indicates how many shots and goals were made during the 2023-2024 Season. Note that bar height includes goals (red portion) and the sum of shot-on-goal and missed-shots (blue portion). We can clearly see that the wrist shot is the most used shot type.

The bottom bar plot indicates the percentage of goals for each shot type. From this second bar plot, we can draw interesting conclusions:

4.2 Goal conversion rate vs distance

Statistics used

As in the previous section, we decided to use 3 types of events to calculate the “Goal Conversion Rate”:

The “Goal Conversion Rate” is the number of “Goals”, divided by the sum of “Goals”, “Shots-On-Goal” and “Missed-Shots”.

Each of the 3 events contains the coordinates of the origin of the shot (XCoord, YCoord). The coordinates are rounded to an integer value (feet). The events also contain a field called “eventOwnerTeamId”, which tells us which team took the shot.

NHL Coordinate system

The following image indicates the dimensions of an official NHL ice rink: NHL ice rink layout

We were able to deduce the following information:

Challenge of calculating the distance

The distance of a shot is “simply” the Euclidean distance between the origin of the shot and the center of the target goal. However, there is a challenge here: we have to find which of the 2 goals is the target goal!

For seasons 2020 and later, this was quite easy: each game contains a field “homeTeamDefendingSide”. With this information, we could deduce which teamId was on which side at the beginning of the game. For seasons 2019 and earlier, we found that we could look at the first “shot-on-goal” event whose zone was “Offensive”. For this shot-on-goal, we looked at the XCoord value. If it was positive, the team taking the shot was aiming at the goal on the right side of the rink; if it was negative, the team was aiming at the goal on the left side.

Once we knew which side the home team was defending at the beginning of the game, we had to be careful to switch the goal side at each period!
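The reasoning in this subsection can be sketched as follows. This is an illustration rather than our exact code: it assumes the goal centers sit at x = ±89 feet and y = 0 (goal line 11 feet from the end boards on a 200-foot rink), that the defending side is given as "left" or "right", that the sides alternate at every period, and that the first offensive-zone shot used for inference occurs in period 1. All function names are hypothetical.

```python
import math

GOAL_X = 89.0  # Assumed distance (feet) from center ice to each goal line.


def infer_home_defending_side(first_shot_x: float, shooter_is_home: bool) -> str:
    """Fallback for older seasons: infer the home side from the first offensive shot."""
    shooter_attacks_right = first_shot_x > 0
    # If the home team attacks right, it defends left (and vice versa).
    return "left" if shooter_is_home == shooter_attacks_right else "right"


def target_goal_x(home_side_p1: str, shooter_is_home: bool, period: int) -> float:
    """X coordinate of the goal the shooter is aiming at in a given period."""
    home_defends_left = home_side_p1 == "left"
    if period % 2 == 0:
        # Teams switch sides every period, so even periods are flipped.
        home_defends_left = not home_defends_left
    if shooter_is_home:
        return GOAL_X if home_defends_left else -GOAL_X
    return -GOAL_X if home_defends_left else GOAL_X


def shot_distance(x: float, y: float, goal_x: float) -> float:
    """Euclidean distance from the shot origin to the center of the target goal."""
    return math.hypot(goal_x - x, y)
```

For instance, if the home team defends the left goal in period 1, a home shot from (80, 0) in period 1 is 9 feet from the target goal at x = 89. Shootouts do not follow the period-parity rule and would need separate handling.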

Invalid events

When we parsed the data for seasons 2018, 2019 and 2020, we discovered that some events had bad data. For example, the coordinates were sometimes missing or had NaN values. Since this represented less than 1% of the events, we decided to simply drop those events to complete our analysis.
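Dropping events with missing or NaN coordinates can be sketched as below. The example assumes each event is a flat dictionary with "xCoord" and "yCoord" keys, which may differ from the actual structure of our dataframes; the function names are hypothetical.

```python
import math


def has_valid_coords(event: dict) -> bool:
    """True when both coordinates are present and are not NaN."""
    coords = (event.get("xCoord"), event.get("yCoord"))
    return all(
        isinstance(value, (int, float)) and not math.isnan(value)
        for value in coords
    )


def drop_invalid_events(events):
    """Return the valid events and the share of events that were dropped."""
    valid = [event for event in events if has_valid_coords(event)]
    dropped_share = 1 - len(valid) / len(events) if events else 0.0
    return valid, dropped_share
```

Returning the dropped share makes it easy to check that the discarded events stay below the expected threshold before continuing the analysis.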

Results

Goal Conversion Rate for Seasons 2018, 2019 and 2020

As we can clearly see, the goal conversion rate decreases as the distance to the goal increases. The results are very similar from one season to another. The reasons are simple:

The methodology we used to produce this graph is:

Filtering shots “too far”

When we started to analyze the results, we found that shots taken from more than 70 feet away had a “noisy” goal conversion rate. Sometimes, the rate seemed to be higher than for shots taken from much closer to the goal. After digging, we understood that

So, we decided to remove those goals from our final result graph, because they introduce noise and don’t give meaningful information.
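A minimal sketch of this filtering step, combined with a per-distance conversion rate, is shown below. The 70-foot cut-off comes from the text above; the 5-foot bin width, the `(distance, is_goal)` input format, and the function name `conversion_rate_by_distance` are assumptions for illustration and may not match the binning used for our figure.

```python
from collections import defaultdict


def conversion_rate_by_distance(shots, bin_width=5.0, max_distance=70.0):
    """Goal conversion rate per distance bin, ignoring shots beyond max_distance.

    `shots` is an iterable of (distance_in_feet, is_goal) pairs.
    """
    attempts = defaultdict(int)
    goals = defaultdict(int)
    for distance, is_goal in shots:
        if distance > max_distance:
            continue  # Too far: excluded because the rate is noisy there.
        # Label each bin by its lower edge, e.g. 12.5 ft falls in bin 10.0.
        bin_start = int(distance // bin_width) * bin_width
        attempts[bin_start] += 1
        goals[bin_start] += int(is_goal)
    return {b: goals[b] / attempts[b] for b in sorted(attempts)}
```

Bins with very few attempts can still be noisy, so a minimum-attempts threshold per bin could be a useful refinement.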

4.3 Shot vs distance and shot-type

We have used a similar methodology as in the previous section:

Using this methodology really helps to reduce the noise of our figures.

Goal Conversion Rate for Season 2023, per shot type Goal Conversion Rate for Season 2023, per shot type

We first started to work only with the top graph, which shows a line for each shot type. We can see some patterns emerging, but it is not very clear where multiple lines cross.

Then, we decided to create a 1D heat-map of the goal conversion rate as a function of shot distance. The heat-map uses colors from white to red, and the red saturates when the goal conversion rate is higher than 25%. With these heat-maps, the patterns for the different shot types are easier to see.
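The white-to-red scale with saturation at 25% can be illustrated with a small color-mapping function. This is a sketch of the idea only, assuming a linear ramp between white and pure red; the function name `rate_to_rgb` is hypothetical and our figures may use a different colormap.

```python
def rate_to_rgb(rate: float, saturation: float = 0.25):
    """Map a goal conversion rate to an (R, G, B) color from white to red.

    Rates at or above `saturation` all map to pure red.
    """
    # Clip the rate into [0, saturation] so high rates saturate.
    clipped = min(max(rate, 0.0), saturation)
    intensity = clipped / saturation  # 0.0 = white, 1.0 = pure red.
    # Green and blue fade out as the rate grows; red stays at full strength.
    fade = round(255 * (1.0 - intensity))
    return (255, fade, fade)
```

In matplotlib, a similar effect can be obtained by passing `vmin=0` and `vmax=0.25` to `imshow` together with a white-to-red colormap such as `"Reds"`.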

5. Advanced Visualizations: Shot Maps

5.1 Offensive Shot Maps from 2016 to 2020

For the advanced visualizations, we decided to include missed shots in our calculations to get a more complete picture of offensive performance.

Select the season to display