The Dodgers are a professional baseball team that plays in Major League Baseball. The team owns a 56,000-seat stadium and is interested in increasing attendance at home games. At the moment, team management would like to know whether bobblehead promotions increase attendance. This is a case study based on Miller (2014, Chapter 2).
include_graphics(c("los_angeles-dodgers-stadium.jpg",
"Los-Angeles-Dodgers-Promo.jpg",
"adrian_bobble.jpg"))
Figure 1: 56,000-seat Dodgers stadium (left), shirts and caps (middle), bobblehead (right)
The 2012 season data in the events table of the SQLite database data/dodgers.sqlite contain, for each of the 81 home games, the
We will use R, RStudio, and R Markdown for the next three weeks to fit statistical models to various data and analyze them. Read Wickham and Grolemund (2017), available online, to learn how to use R, RStudio, and R Markdown to interact with R and conduct various predictive analyses. All materials for the next three weeks will be available on Google Drive.
Connect to data/dodgers.sqlite. Read the table events into a variable in R.
Read Baumer, Kaplan, and Horton (2017, Chapters 1, 4, 5, 15) (Second edition online) for getting data from and writing them to various SQL databases.
Because we do not want to hassle with user permissions, we will use SQLite for practice. I recommend PostgreSQL for real projects.
Open the RStudio terminal and connect to the database dodgers.sqlite with sqlite3. Explore it (there is only one table, events, at this time) with the following commands:
.help
.databases
.tables
.schema <table_name>
.headers on
.mode column
SELECT ...
.quit
Databases are well suited to storing and retrieving large data sets, especially when the tables are indexed on the variables/columns that we search and match on extensively.
R (like Python) lets you read from and write to databases seamlessly. For fast analysis, keep the data in a database, index the tables for fast retrieval, and use R or Python to fit models to the data.
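The workflow described above (store, index, retrieve, then model in R) can be sketched as follows. This is only an illustration under stated assumptions: it uses an in-memory SQLite database rather than data/dodgers.sqlite, and the table name toy_events, the index name idx_toy_month, the column attend, and all values are made up.

```r
library(DBI)
library(RSQLite)

# In-memory database, so this sketch never touches data/dodgers.sqlite.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical toy table; names and values are illustrative only.
toy <- data.frame(
  month       = c("APR", "APR", "MAY", "JUN"),
  day_of_week = c("Tuesday", "Friday", "Tuesday", "Sunday"),
  attend      = c(30000, 42000, 35000, 50000)
)
dbWriteTable(con, "toy_events", toy)

# Index the column we expect to filter on most often.
dbExecute(con, "CREATE INDEX idx_toy_month ON toy_events (month)")

# Let the database do the filtering; only the matching rows come back to R.
april <- dbGetQuery(con, "SELECT * FROM toy_events WHERE month = 'APR'")
print(april)

dbDisconnect(con)
```

On a table this small the index makes no measurable difference; it only pays off on large tables that are queried repeatedly on the indexed column.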
# Ctrl-shift-i
# library(RPostgreSQL)
library(RSQLite)  # if the package is not installed, install it once via Tools > Install Packages...
library(dplyr)    # provides tbl(), collect(), mutate(), and %>%; tbl() on a connection also needs dbplyr installed

# See Modern Data Science with R for other ways to connect to a database.
con <- dbConnect(SQLite(), "../data/dodgers.sqlite")
## dbListTables(con)

# Pull the events table into R, then put the day and month factors in calendar order.
events <- tbl(con, "events") %>%
  collect() %>%
  mutate(day_of_week = factor(day_of_week, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")),
         month = factor(month, levels = c("APR", "MAY", "JUN", "JUL", "AUG", "SEP", "OCT")))
# events %>% distinct(month)
# events$day_of_week %>% class()
# events$day_of_week %>% levels()
# events
How many games were played on each day of the week and in each month?
Check the order of the levels of the day_of_week and month factors. If necessary, put them in logical order.
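As a hedged illustration of what checking and fixing the level order looks like, here is a small sketch on made-up data. Only the column names day_of_week and month come from the text; the rows are invented and far smaller than the real events table.

```r
# Hypothetical toy data; the rows are invented for illustration.
toy <- data.frame(
  day_of_week = c("Tuesday", "Monday", "Sunday", "Tuesday", "Friday"),
  month       = c("MAY", "APR", "APR", "JUN", "MAY"),
  stringsAsFactors = TRUE
)

# By default the levels are sorted alphabetically, not in calendar order.
levels(toy$day_of_week)

day_levels   <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                  "Friday", "Saturday", "Sunday")
month_levels <- c("APR", "MAY", "JUN", "JUL", "AUG", "SEP", "OCT")

# Re-create the factors with the levels in calendar order.
toy$day_of_week <- factor(toy$day_of_week, levels = day_levels)
toy$month       <- factor(toy$month, levels = month_levels)

# Counts per day, per month, and cross-tabulated; empty levels show up as 0.
games_by_day   <- table(toy$day_of_week)
games_by_month <- table(toy$month)
games_by_day_month <- table(toy$day_of_week, toy$month)

print(games_by_day)
print(games_by_month)
```

Setting the levels explicitly also fixes the order in which tables and plots list the categories.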
How many times were bobblehead promotions run on each week day?
How did the attendance vary across week days? Draw boxplots. On which day of week was the attendance the highest on average?
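One possible way to compare attendance across days is sketched below with base R. The attendance figures are simulated, and the column name attend is an assumption, since the text above does not show the schema of the events table.

```r
set.seed(1)

day_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")

# Hypothetical toy data: four simulated games per day of the week.
toy <- data.frame(
  day_of_week = factor(rep(day_levels, each = 4), levels = day_levels)
)
toy$attend <- round(rnorm(nrow(toy), mean = 40000, sd = 5000))

# One box per day; the factor level order controls the order on the x-axis.
boxplot(attend ~ day_of_week, data = toy,
        xlab = "Day of week", ylab = "Attendance")

# Average attendance per day, and the day with the highest average.
mean_by_day <- tapply(toy$attend, toy$day_of_week, mean)
print(sort(mean_by_day, decreasing = TRUE))
best_day <- names(which.max(mean_by_day))
print(best_day)
```

The boxplot shows medians, while the question asks about averages, so the two views can disagree when a day has a few unusually large or small crowds.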
Is there an association between attendance and
Is there an association between attendance and temperature?
Regress attendance on month, day of the week, and bobblehead promotion.
Is there any evidence for a relationship between attendance and other variables? Why or why not?
Does the bobblehead promotion have a statistically significant effect on the attendance?
Do the month and day-of-week variables help to explain attendance?
How many additional fans is a bobblehead promotion alone expected to draw to a home game? Give a 90% confidence interval.
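The mechanics of reading a coefficient and its 90% confidence interval from a fitted linear model can be sketched as follows. The data are simulated, with a built-in promotion effect of 10,000, so the numbers say nothing about the Dodgers. The column name attend and the NO/YES coding of bobblehead are assumptions.

```r
set.seed(42)

day_levels   <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                  "Friday", "Saturday", "Sunday")
month_levels <- c("APR", "MAY", "JUN", "JUL", "AUG", "SEP", "OCT")

# Hypothetical toy data: one simulated game per month/day combination.
toy <- expand.grid(
  month       = factor(month_levels, levels = month_levels),
  day_of_week = factor(day_levels, levels = day_levels)
)
toy$bobblehead <- factor(rep(c("NO", "NO", "NO", "YES"), length.out = nrow(toy)),
                         levels = c("NO", "YES"))
toy$attend <- 38000 + 10000 * (toy$bobblehead == "YES") +
  rnorm(nrow(toy), sd = 5000)

fit <- lm(attend ~ month + day_of_week + bobblehead, data = toy)

# Estimated change in attendance for YES versus NO, holding month and day fixed.
effect <- coef(fit)["bobbleheadYES"]
print(effect)

# 90% confidence interval for that coefficient.
ci90 <- confint(fit, parm = "bobbleheadYES", level = 0.90)
print(ci90)
```

Because NO is the first factor level, it is the reference category, and the coefficient bobbleheadYES is the expected difference from a game without the promotion.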
How well does the model fit the data? Why? Comment on the residual standard error and \(R^2\). Plot observed attendance against predicted attendance.
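A minimal sketch of where these fit statistics live in R, again on simulated data (the column name attend and the NO/YES coding of bobblehead are assumptions):

```r
set.seed(42)

day_levels   <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                  "Friday", "Saturday", "Sunday")
month_levels <- c("APR", "MAY", "JUN", "JUL", "AUG", "SEP", "OCT")

# Hypothetical toy data: one simulated game per month/day combination.
toy <- expand.grid(
  month       = factor(month_levels, levels = month_levels),
  day_of_week = factor(day_levels, levels = day_levels)
)
toy$bobblehead <- factor(rep(c("NO", "NO", "NO", "YES"), length.out = nrow(toy)),
                         levels = c("NO", "YES"))
toy$attend <- 38000 + 10000 * (toy$bobblehead == "YES") +
  rnorm(nrow(toy), sd = 5000)

fit <- lm(attend ~ month + day_of_week + bobblehead, data = toy)
s   <- summary(fit)

# Residual standard error: typical size of a prediction error, in fans.
print(s$sigma)

# R^2 and adjusted R^2: share of the variation in attendance explained by the model.
print(s$r.squared)
print(s$adj.r.squared)

# Observed versus predicted attendance; points near the dashed line are well predicted.
plot(fitted(fit), toy$attend,
     xlab = "Predicted attendance", ylab = "Observed attendance")
abline(0, 1, lty = 2)
```

The residual standard error is in the same units as attendance, so it is easiest to judge against the typical crowd size and the stadium capacity.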
Predict the number of attendees at a typical home game on a Wednesday in June if a bobblehead promotion is offered. Give a 90% prediction interval.
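The prediction step can be sketched as below, under the same caveats: the data are simulated, and the column name attend and the NO/YES coding of bobblehead are assumptions rather than facts about the events table.

```r
set.seed(42)

day_levels   <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                  "Friday", "Saturday", "Sunday")
month_levels <- c("APR", "MAY", "JUN", "JUL", "AUG", "SEP", "OCT")

# Hypothetical toy data: one simulated game per month/day combination.
toy <- expand.grid(
  month       = factor(month_levels, levels = month_levels),
  day_of_week = factor(day_levels, levels = day_levels)
)
toy$bobblehead <- factor(rep(c("NO", "NO", "NO", "YES"), length.out = nrow(toy)),
                         levels = c("NO", "YES"))
toy$attend <- 38000 + 10000 * (toy$bobblehead == "YES") +
  rnorm(nrow(toy), sd = 5000)

fit <- lm(attend ~ month + day_of_week + bobblehead, data = toy)

# The new game must use the same factor levels as the training data.
new_game <- data.frame(
  month       = factor("JUN", levels = month_levels),
  day_of_week = factor("Wednesday", levels = day_levels),
  bobblehead  = factor("YES", levels = c("NO", "YES"))
)

# Point prediction (fit) with a 90% prediction interval (lwr, upr).
pred <- predict(fit, newdata = new_game, interval = "prediction", level = 0.90)
print(pred)
```

A prediction interval covers a single future game and is therefore wider than the confidence interval for the mean attendance of such games, which `interval = "confidence"` would return.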
Include all variables and conduct a full regression analysis of the problem. Submit your R Markdown and HTML files to the course homepage on Moodle.