The Dodgers is a professional baseball team and plays in the Major Baseball League. The team owns a 56,000-seat stadium and is interested in increasing the attendance of their fans during home games.At the moment the team management would like to know if bobblehead promotions increase the attendance of the team’s fans? This is a case study based on Miller (2014 Chapter 2).
include_graphics(c("los_angeles-dodgers-stadium.jpg",
"Los-Angeles-Dodgers-Promo.jpg",
"adrian_bobble.jpg"))
Figure 1: 56,000-seat Dodgers stadium (left), shirts and caps (middle), bobblehead (right)
The 2012 season data in the events
table of SQLite database data/dodgers.sqlite
contain for each of 81 home play the
We will use R
, RStudio
, R Markdown
for the next three weeks to fit statistical models to various data and analyze them. Read Wickham and Grolemund (2017) online
R
and RStudio
,R Markdown
to interact with R
and conduct various predictive analyses.All materials for the next three weeks will be available on Google drive.
Connect to data/dodgers.sqlite
. Read table events
into a variable in R
.
Read Baumer, Kaplan, and Horton (2017, Chapters 1, 4, 5, 15) (Second edition online) for getting data from and writing them to various SQL databases.
Because we do not want to hassle with user permissions, we will use SQLite for practice. I recommend PostgreSQL
for real projects.
Open RStudio
terminal, connect to database dodgers.sqlite
with sqlite3
. Explore it (there is only one table, events
, at this time) with commands
.help
.databases
.tables
.schema <table_name>
.headers on
.mode column
SELECT ...
.quit
Databases are great to store and retrieve large data, especially, when they are indexed with respect to variables/columns along with we do search and match extensively.
R
(likewise, Python
) allows one to seeminglessly read from and write to databases. For fast analysis, keep data in a database, index tables for fast retrieval, use R
or Python
to fit models to data.
# Ctrl-shift-i
#library(RPostgreSQL)
library(RSQLite) ## if package is not on the computer, then install it only once using Tools > Install packages...
con <- dbConnect(SQLite(), "../data/dodgers.sqlite") # read Modern Data Science with R for different ways to connect a database.
## dbListTables(con)
tbl(con, "events") %>%
collect() %>%
mutate(day_of_week = factor(day_of_week, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")),
month = factor(month, levels = c("APR","MAY","JUN","JUL","AUG","SEP","OCT"))) %>%
mutate_if(is.character, factor) %>%
mutate(temp = round((temp- 32)*5/9)) -> events
# events %>% distinct(month)
# events$day_of_week %>% class()
# events$day_of_week %>% levels()
# events
# summary(events)
events %>%
count(bobblehead, fireworks)
## # A tibble: 3 x 3
## bobblehead fireworks n
## <fct> <fct> <int>
## 1 NO NO 56
## 2 NO YES 14
## 3 YES NO 11
Table 1 and 2 summarize the number of games played on each weekday and month.
events %>%
count(day_of_week, month) %>%
pivot_wider(names_from = day_of_week, values_from = n) %>%
pander(caption = "(\\#tab:monthweekday) Number of games played in each weekday and month")
month | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday | Sunday |
---|---|---|---|---|---|---|---|
APR | 1 | 2 | 2 | 1 | 2 | 2 | 2 |
MAY | 3 | 3 | 2 | 1 | 3 | 3 | 3 |
JUN | 1 | 1 | 1 | 1 | 2 | 2 | 1 |
JUL | 3 | 3 | 2 | NA | 1 | 1 | 2 |
AUG | 2 | 2 | 3 | 1 | 3 | 2 | 2 |
SEP | 1 | 1 | 1 | 1 | 2 | 3 | 3 |
OCT | 1 | 1 | 1 | NA | NA | NA | NA |
events %>%
ggplot(aes(day_of_week)) +
geom_bar(aes(fill=month))
Figure 2: Barplot of counts of games for each weekday and month
Figure 3 shows your friend’s (very good) suggestion of headmap of total attendance versus weekday and month. The colors chabge from bright yellow to dark red as attendance increases. Default heatmap shuffles rows and columns so as to bring together weekdays and months with similar attendance. Here we see May, Aug, and Jul together within the months and Saturday, Friday, Sunday within the weekdays. Learn more about xtabs (cross-table) heatmap by typing ?xtabs
and ?heatmap
in the R console.
xtabs(attend ~ day_of_week + month, events) %>%
heatmap()
Figure 3: Heatmap of attendance versus weekday and month.
In Figure 4, I have added one more aes (colour) to capture day_night information. To avoid overplotting, I replaced geom_point()
with geom_jitter()
. These plots were also illuminating. So let us thank your friend who suggested this one, too.
sum_attend <- events %>%
group_by(day_of_week, month, day_night) %>%
summarize(mean_attend = mean(attend),
total_attend = sum(attend), .groups = "drop")
sum_attend %>%
ggplot(aes(day_of_week, month, month)) +
geom_jitter(aes(size = mean_attend, col = day_night), width = .1, height = .1, alpha=0.7) +
scale_size(labels = scales::comma) +
labs(title = "Average attendance", size = "attendance", col = "part of day",
x = "Weekday", y = "Month")
sum_attend %>%
ggplot(aes(day_of_week, month)) +
geom_jitter(aes(size = total_attend, col = day_night), width = .1, height = .1, alpha=0.7) +
labs(title = "Total attendance", size = "attendance", col = "part of day",
x = "Weekday", y = "Month") +
scale_size(labels = scales::comma) +
guides(col = guide_legend(order = 1),
shape = guide_legend(order = 2))
Figure 4: Average and total attendances on each weekday and month in each part of day
day_of_week
and month
factors. If necessary, put them in the logical order.levels(events$day_of_week)
## [1] "Monday" "Tuesday" "Wednesday" "Thursday" "Friday" "Saturday"
## [7] "Sunday"
levels(events$month)
## [1] "APR" "MAY" "JUN" "JUL" "AUG" "SEP" "OCT"
events %>%
count(day_of_week, bobblehead) %>%
pivot_wider(names_from = bobblehead, values_from = n) %>%
replace_na(list(YES = 0)) %>%
mutate(Total = YES + NO) %>%
select(-NO) %>%
rename(Bobblehead = YES)
## # A tibble: 7 x 3
## day_of_week Bobblehead Total
## <fct> <dbl> <dbl>
## 1 Monday 0 12
## 2 Tuesday 6 13
## 3 Wednesday 0 12
## 4 Thursday 2 5
## 5 Friday 0 13
## 6 Saturday 2 13
## 7 Sunday 1 13
events %>%
ggplot(aes(day_of_week, attend)) +
geom_boxplot()
events %>%
slice_max(order_by = attend, n=5)
## # A tibble: 5 x 12
## month day attend day_of_week opponent temp skies day_night cap shirt
## <fct> <dbl> <dbl> <fct> <fct> <dbl> <fct> <fct> <fct> <fct>
## 1 APR 10 56000 Tuesday Pirates 19 Clear Day NO NO
## 2 AUG 21 56000 Tuesday Giants 24 Clear Night NO NO
## 3 JUL 1 55359 Sunday Mets 24 Clear Night NO NO
## 4 JUN 12 55279 Tuesday Angels 19 Clou… Night NO NO
## 5 AUG 7 55024 Tuesday Rockies 27 Clear Night NO NO
## # … with 2 more variables: fireworks <fct>, bobblehead <fct>
events %>%
ggplot(aes(day_night, attend)) +
geom_boxplot()
t.test(x=events$attend[events$day_night=="Day"],
y=events$attend[events$day_night=="Night"])
##
## Welch Two Sample t-test
##
## data: events$attend[events$day_night == "Day"] and events$attend[events$day_night == "Night"]
## t = 0.42851, df = 23.62, p-value = 0.6722
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3531.652 5380.397
## sample estimates:
## mean of x mean of y
## 41793.27 40868.89
Since p-value (0.67) is large (greater than 0.05), we cannot reject null, which means there is no statistical difference between average attendance of games played in day and night.
events %>%
ggplot(aes(skies, attend)) +
geom_boxplot()
t.test(x=events$attend[events$skies=="Clear"],
y=events$attend[events$skies=="Cloudy"])
##
## Welch Two Sample t-test
##
## data: events$attend[events$skies == "Clear"] and events$attend[events$skies == "Cloudy"]
## t = 1.2868, df = 27.664, p-value = 0.2088
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1741.315 7617.103
## sample estimates:
## mean of x mean of y
## 41729.21 38791.32
We do not see a statisticall significant difference between the average attendance of the games played under clear and cloudy skies.
events %>%
ggplot(aes(temp, attend)) +
geom_jitter() +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
\[ attend = \beta_0 + \beta_1 temp + \beta_2 (temp - 23)^+ + \varepsilon_i \]
lm(attend ~ temp + pmax(0, temp - 23), data = events) %>% summary()
##
## Call:
## lm(formula = attend ~ temp + pmax(0, temp - 23), data = events)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17115.9 -5194.3 422.1 4789.0 15982.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15253.3 7363.9 2.071 0.041631 *
## temp 1303.4 360.2 3.618 0.000525 ***
## pmax(0, temp - 23) -2240.1 612.7 -3.656 0.000463 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7727 on 78 degrees of freedom
## Multiple R-squared: 0.1544, Adjusted R-squared: 0.1327
## F-statistic: 7.12 on 2 and 78 DF, p-value: 0.001445
\[ attend = \beta_0 + \beta_1 temp + \beta_2 (temp-23)^+ + \varepsilon_i \]
events %>%
ggplot(aes(temp, attend)) +
geom_jitter() +
geom_smooth(se = FALSE) +
geom_smooth(se = FALSE, method = "lm",
formula = y ~ x + pmax(x-23,0), col = "red")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
There is statistically significant relation between attendance and temperature.
Regress attendance on month
, day of the week
, and bobblehead
promotion.
lmod1 <- lm(attend ~ month + day_of_week + bobblehead, data = events)
lmod1 %>% summary()
##
## Call:
## lm(formula = attend ~ month + day_of_week + bobblehead, data = events)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10786.5 -3628.1 -516.1 2230.2 14351.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33909.16 2521.81 13.446 < 0.0000000000000002 ***
## monthMAY -2385.62 2291.22 -1.041 0.30152
## monthJUN 7163.23 2732.72 2.621 0.01083 *
## monthJUL 2849.83 2578.60 1.105 0.27303
## monthAUG 2377.92 2402.91 0.990 0.32593
## monthSEP 29.03 2521.25 0.012 0.99085
## monthOCT -662.67 4046.45 -0.164 0.87041
## day_of_weekTuesday 7911.49 2702.21 2.928 0.00466 **
## day_of_weekWednesday 2460.02 2514.03 0.979 0.33134
## day_of_weekThursday 775.36 3486.15 0.222 0.82467
## day_of_weekFriday 4883.82 2504.65 1.950 0.05537 .
## day_of_weekSaturday 6372.06 2552.08 2.497 0.01500 *
## day_of_weekSunday 6724.00 2506.72 2.682 0.00920 **
## bobbleheadYES 10714.90 2419.52 4.429 0.0000359 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6120 on 67 degrees of freedom
## Multiple R-squared: 0.5444, Adjusted R-squared: 0.456
## F-statistic: 6.158 on 13 and 67 DF, p-value: 0.0000002083
Check F-statistic’s p-value. If it is less than 0.05, then there is relation between attendance and predictors.
Does the bobblehead promotion have a statistically significant effect on the attendance?
Do month and day of week variables help to explain the number of attendants?
How many fans are expected to be drawn alone by a bobblehead promotion to a home game? Give a 90% confidence interval.
How good does the model fit to the data? Why? Comment on residual standard error and R\(^2\). Plot observed attendance against predicted attendance.
Predict the number of attendees to a typical home game on a Wednesday in June if a bobblehead promotion is extended. Give a 90% prediction interval.
Include all variables and conduct a full regression analysis of the problem. Submit your R markdown
and html
files to course homepage on moodle.