How To Use DuckDB to Analyze NBA Data
Plus a chart on which teams are making fans stay in their seats the longest
Howdy everyone. Today’s premium edition of The F5 begins with a data viz showing which teams are making fans stay in their seats the longest. After that I’ve got a tutorial on how to use DuckDB to work with NBA play-by-play data.
If any of that interests you consider a paid subscription to The F5 for the low cost of one Tecate per month. Paying subscribers to The F5 receive an extra email every Friday with additional charts and analysis plus tutorials on how to work with and visualize NBA data in R. In addition, you get access to the full archive of tutorials and first dibs on any data product I create like The Best of Yesterday web app.
Know someone that might be interested in learning how to code in R and/or work with NBA data? Think about buying them a gift subscription.
A friend of The F5 recently reached out and asked if I knew which team averages the longest games. I had never really thought about it before but I knew how to find the answer.
For the last couple of weeks I’ve been noodling around in the NBA’s play-by-play data which contains real time stamps for every in-game action. I’ve used that data to reveal what time games actually start and why I think games are too long.
So my friend’s query was easy enough to answer because I knew where to look. The chart below shows each team’s average game length this season. Teams are sorted from to longest to shortest average game lengths (overtimes are not included in these calculations).
The average Orlando Magic game takes about 2 hours and 20 minutes to finish. That’s a long time to be forced to watch one of the worst shooting teams of all time.
Meanwhile, the Chicago Bulls finish four quarters of basketball just under 2 hours and 11 minutes on average. Very kind of Chicago to not subject fans to more Bulls basketball than necessary.
I’ll file this under, “cool but useless”, but hopefully it answers my friend’s original question.
Do you have a question for the The F5? Leave a comment, send me an email, or slide into my DMs. I’ll try my best to give you an answer.
How To Use DuckDB and R to Analyze NBA Data
I’m not a smart man. So when I tell you DuckDB changed my life I really do mean that. DuckDB has saved me so much time that it’s hard for me to believe I used to live without it.
DuckDB is an in-process SQL OLAP database management system. Don’t worry if you don’t understand what that means. All you need to know is that DuckDB just works and once you start using it you’ll never go back.
I use DuckDB most often when I’m working with NBA play-by-play (PBP) data. PBP files can be relatively large (100MB) and loading even one of them into your R session can be a pain (let alone 25 of them). But with DuckDB it’s a breeze. That’s because you can use DuckDB to filter and manipulate the data before it ever even touches your R session. It honestly feels like magic.
I’m going to show you how I used DuckDB to calculate the average real time length of NBA games for each season since 2020-21. It’s a simple example but I think it does a good job of illustrating why DuckDB is a helpful tool for any data professional.
So let’s get right into it.
Start by loading (or installing if you haven’t already) the following packages.
#remove the # if you need to install these
#install.packages("tidyverse")
#install.packages("duckdb")
#install.packages("DBI")
# load packages
library(tidyverse)
library(duckdb)
library(DBI)
The next thing we’re going to do is download a few seasons of NBA play-by-play (PBP) data. For that, we’re going to use Vladislav Shufinskiy’s GitHub repo where he’s already assembled the data for us. He’s even written a nifty function that pulls the data from the repo and saves in your working directory
Before we download the data, we need to create a folder called [data] with a subfolder called [cdnnba] inside our working directory. Once you’ve done that you can run the following code: