The Dodgers is a professional baseball team and plays in the Major Baseball League. The team owns a 56,000-seat stadium and is interested in increasing the attendance of their fans during home games.At the moment the team management would like to know if bobblehead promotions increase the attendance of the team’s fans? This is a case study based on Miller (2014 Chapter 2).
include_graphics(c("los_angeles-dodgers-stadium.jpg",
"Los-Angeles-Dodgers-Promo.jpg",
"adrian_bobble.jpg"))
Figure 1: 56,000-seat Dodgers stadium (left), shirts and caps (middle), bobblehead (right)
The 2012 season data in the events
table of SQLite database data/dodgers.sqlite
contain for each of 81 home play the
We will use R
, RStudio
, R Markdown
for the next three weeks to fit statistical models to various data and analyze them. Read Wickham and Grolemund (2017) online
R
and RStudio
,R Markdown
to interact with R
and conduct various predictive analyses.All materials for the next three weeks will be available on Google drive.
Connect to data/dodgers.sqlite
. Read table events
into a variable in R
.
Read Baumer, Kaplan, and Horton (2017, Chapters 1, 4, 5, 15) (Second edition online) for getting data from and writing them to various SQL databases.
Because we do not want to hassle with user permissions, we will use SQLite for practice. I recommend PostgreSQL
for real projects.
Open RStudio
terminal, connect to database dodgers.sqlite
with sqlite3
. Explore it (there is only one table, events
, at this time) with commands
.help
.databases
.tables
.schema <table_name>
.headers on
.mode column
SELECT ...
.quit
Databases are great to store and retrieve large data, especially, when they are indexed with respect to variables/columns along with we do search and match extensively.
R
(likewise, Python
) allows one to seeminglessly read from and write to databases. For fast analysis, keep data in a database, index tables for fast retrieval, use R
or Python
to fit models to data.
# Ctrl-shift-i
#library(RPostgreSQL)
library(RSQLite) ## if package is not on the computer, then install it only once using Tools > Install packages...
con <- dbConnect(SQLite(), "../data/dodgers.sqlite") # read Modern Data Science with R for different ways to connect a database.
## dbListTables(con)
tbl(con, "events") %>%
collect() %>%
mutate(day_of_week = factor(day_of_week, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")),
month = factor(month, levels = c("APR","MAY","JUN","JUL","AUG","SEP","OCT"))) %>%
mutate_if(is.character, factor) %>%
mutate(temp = round((temp- 32)*5/9)) -> events
# events %>% distinct(month)
# events$day_of_week %>% class()
# events$day_of_week %>% levels()
# events
# summary(events)
events %>%
count(bobblehead, fireworks)
## # A tibble: 3 x 3
## bobblehead fireworks n
## <fct> <fct> <int>
## 1 NO NO 56
## 2 NO YES 14
## 3 YES NO 11
Table 1 and 2 summarize the number of games played on each weekday and month.
events %>%
count(day_of_week, month) %>%
pivot_wider(names_from = day_of_week, values_from = n) %>%
pander(caption = "(\\#tab:monthweekday) Number of games played in each weekday and month")
month | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday | Sunday |
---|---|---|---|---|---|---|---|
APR | 1 | 2 | 2 | 1 | 2 | 2 | 2 |
MAY | 3 | 3 | 2 | 1 | 3 | 3 | 3 |
JUN | 1 | 1 | 1 | 1 | 2 | 2 | 1 |
JUL | 3 | 3 | 2 | NA | 1 | 1 | 2 |
AUG | 2 | 2 | 3 | 1 | 3 | 2 | 2 |
SEP | 1 | 1 | 1 | 1 | 2 | 3 | 3 |
OCT | 1 | 1 | 1 | NA | NA | NA | NA |
events %>%
ggplot(aes(day_of_week)) +
geom_bar(aes(fill=month))
Figure 2: Barplot of counts of games for each weekday and month
Figure 3 shows your friend’s (very good) suggestion of headmap of total attendance versus weekday and month. The colors chabge from bright yellow to dark red as attendance increases. Default heatmap shuffles rows and columns so as to bring together weekdays and months with similar attendance. Here we see May, Aug, and Jul together within the months and Saturday, Friday, Sunday within the weekdays. Learn more about xtabs (cross-table) heatmap by typing ?xtabs
and ?heatmap
in the R console.
xtabs(attend ~ day_of_week + month, events) %>%
heatmap()
Figure 3: Heatmap of attendance versus weekday and month.
In Figure 4, I have added one more aes (colour) to capture day_night information. To avoid overplotting, I replaced geom_point()
with geom_jitter()
. These plots were also illuminating. So let us thank your friend who suggested this one, too.
sum_attend <- events %>%
group_by(day_of_week, month, day_night) %>%
summarize(mean_attend = mean(attend),
total_attend = sum(attend), .groups = "drop")
sum_attend %>%
ggplot(aes(day_of_week, month, month)) +
geom_jitter(aes(size = mean_attend, col = day_night), width = .1, height = .1, alpha=0.7) +
scale_size(labels = scales::comma) +
labs(title = "Average attendance", size = "attendance", col = "part of day",
x = "Weekday", y = "Month")
sum_attend %>%
ggplot(aes(day_of_week, month)) +
geom_jitter(aes(size = total_attend, col = day_night), width = .1, height = .1, alpha=0.7) +
labs(title = "Total attendance", size = "attendance", col = "part of day",
x = "Weekday", y = "Month") +
scale_size(labels = scales::comma) +
guides(col = guide_legend(order = 1),
shape = guide_legend(order = 2))
Figure 4: Average and total attendances on each weekday and month in each part of day
day_of_week
and month
factors. If necessary, put them in the logical order.levels(events$day_of_week)
## [1] "Monday" "Tuesday" "Wednesday" "Thursday" "Friday" "Saturday"
## [7] "Sunday"
levels(events$month)
## [1] "APR" "MAY" "JUN" "JUL" "AUG" "SEP" "OCT"
events %>%
count(day_of_week, bobblehead) %>%
pivot_wider(names_from = bobblehead, values_from = n) %>%
replace_na(list(YES = 0)) %>%
mutate(Total = YES + NO) %>%
select(-NO) %>%
rename(Bobblehead = YES)
## # A tibble: 7 x 3
## day_of_week Bobblehead Total
## <fct> <dbl> <dbl>
## 1 Monday 0 12
## 2 Tuesday 6 13
## 3 Wednesday 0 12
## 4 Thursday 2 5
## 5 Friday 0 13
## 6 Saturday 2 13
## 7 Sunday 1 13
events %>%
ggplot(aes(day_of_week, attend)) +
geom_boxplot()
events %>%
slice_max(order_by = attend, n=5)
## # A tibble: 5 x 12
## month day attend day_of_week opponent temp skies day_night cap shirt
## <fct> <dbl> <dbl> <fct> <fct> <dbl> <fct> <fct> <fct> <fct>
## 1 APR 10 56000 Tuesday Pirates 19 Clear Day NO NO
## 2 AUG 21 56000 Tuesday Giants 24 Clear Night NO NO
## 3 JUL 1 55359 Sunday Mets 24 Clear Night NO NO
## 4 JUN 12 55279 Tuesday Angels 19 Clou… Night NO NO
## 5 AUG 7 55024 Tuesday Rockies 27 Clear Night NO NO
## # … with 2 more variables: fireworks <fct>, bobblehead <fct>
events %>%
ggplot(aes(day_night, attend)) +
geom_boxplot()
t.test(x=events$attend[events$day_night=="Day"],
y=events$attend[events$day_night=="Night"])
##
## Welch Two Sample t-test
##
## data: events$attend[events$day_night == "Day"] and events$attend[events$day_night == "Night"]
## t = 0.42851, df = 23.62, p-value = 0.6722
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3531.652 5380.397
## sample estimates:
## mean of x mean of y
## 41793.27 40868.89
Since p-value (0.67) is large (greater than 0.05), we cannot reject null, which means there is no statistical difference between average attendance of games played in day and night.
events %>%
ggplot(aes(skies, attend)) +
geom_boxplot()
t.test(x=events$attend[events$skies=="Clear"],
y=events$attend[events$skies=="Cloudy"])
##
## Welch Two Sample t-test
##
## data: events$attend[events$skies == "Clear"] and events$attend[events$skies == "Cloudy"]
## t = 1.2868, df = 27.664, p-value = 0.2088
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1741.315 7617.103
## sample estimates:
## mean of x mean of y
## 41729.21 38791.32
We do not see a statisticall significant difference between the average attendance of the games played under clear and cloudy skies.
events %>%
ggplot(aes(temp, attend)) +
geom_jitter() +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
\[ attend = \beta_0 + \beta_1 temp + \beta_2 (temp - 23)^+ + \varepsilon_i \]
lm(attend ~ temp + pmax(0, temp - 23), data = events) %>% summary()
##
## Call:
## lm(formula = attend ~ temp + pmax(0, temp - 23), data = events)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17115.9 -5194.3 422.1 4789.0 15982.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15253.3 7363.9 2.071 0.041631 *
## temp 1303.4 360.2 3.618 0.000525 ***
## pmax(0, temp - 23) -2240.1 612.7 -3.656 0.000463 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7727 on 78 degrees of freedom
## Multiple R-squared: 0.1544, Adjusted R-squared: 0.1327
## F-statistic: 7.12 on 2 and 78 DF, p-value: 0.001445
\[ attend = \beta_0 + \beta_1 temp + \beta_2 (temp-23)^+ + \varepsilon_i \]
events %>%
ggplot(aes(temp, attend)) +
geom_jitter() +
geom_smooth(se = FALSE) +
geom_smooth(se = FALSE, method = "lm",
formula = y ~ x + pmax(x-23,0), col = "red")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
There is statistically significant relation between attendance and temperature.
Regress attendance on month
, day of the week
, and bobblehead
promotion.
lmod1 <- lm(attend ~ month + day_of_week + bobblehead, data = events)
events$month %>% levels()
## [1] "APR" "MAY" "JUN" "JUL" "AUG" "SEP" "OCT"
events2 <- events %>%
mutate(month = relevel(month, "JUN"),
day_of_week = relevel(day_of_week, "Sunday"))
lm(attend ~ month + day_of_week + bobblehead, data = events2) %>% summary()
##
## Call:
## lm(formula = attend ~ month + day_of_week + bobblehead, data = events2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10786.5 -3628.1 -516.1 2230.2 14351.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47796.4 2671.9 17.889 < 0.0000000000000002 ***
## monthAPR -7163.2 2732.7 -2.621 0.010830 *
## monthMAY -9548.9 2526.9 -3.779 0.000337 ***
## monthJUL -4313.4 2767.9 -1.558 0.123862
## monthAUG -4785.3 2594.6 -1.844 0.069557 .
## monthSEP -7134.2 2763.5 -2.582 0.012029 *
## monthOCT -7825.9 4232.6 -1.849 0.068876 .
## day_of_weekMonday -6724.0 2506.7 -2.682 0.009197 **
## day_of_weekTuesday 1187.5 2594.7 0.458 0.648672
## day_of_weekWednesday -4264.0 2501.4 -1.705 0.092895 .
## day_of_weekThursday -5948.6 3339.3 -1.781 0.079380 .
## day_of_weekFriday -1840.2 2426.8 -0.758 0.450944
## day_of_weekSaturday -351.9 2417.6 -0.146 0.884691
## bobbleheadYES 10714.9 2419.5 4.429 0.0000359 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6120 on 67 degrees of freedom
## Multiple R-squared: 0.5444, Adjusted R-squared: 0.456
## F-statistic: 6.158 on 13 and 67 DF, p-value: 0.0000002083
lmod1 %>% summary()
##
## Call:
## lm(formula = attend ~ month + day_of_week + bobblehead, data = events)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10786.5 -3628.1 -516.1 2230.2 14351.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33909.16 2521.81 13.446 < 0.0000000000000002 ***
## monthMAY -2385.62 2291.22 -1.041 0.30152
## monthJUN 7163.23 2732.72 2.621 0.01083 *
## monthJUL 2849.83 2578.60 1.105 0.27303
## monthAUG 2377.92 2402.91 0.990 0.32593
## monthSEP 29.03 2521.25 0.012 0.99085
## monthOCT -662.67 4046.45 -0.164 0.87041
## day_of_weekTuesday 7911.49 2702.21 2.928 0.00466 **
## day_of_weekWednesday 2460.02 2514.03 0.979 0.33134
## day_of_weekThursday 775.36 3486.15 0.222 0.82467
## day_of_weekFriday 4883.82 2504.65 1.950 0.05537 .
## day_of_weekSaturday 6372.06 2552.08 2.497 0.01500 *
## day_of_weekSunday 6724.00 2506.72 2.682 0.00920 **
## bobbleheadYES 10714.90 2419.52 4.429 0.0000359 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6120 on 67 degrees of freedom
## Multiple R-squared: 0.5444, Adjusted R-squared: 0.456
## F-statistic: 6.158 on 13 and 67 DF, p-value: 0.0000002083
Check F-statistic’s p-value. If it is less than 0.05, then there is relation between attendance and predictors.
bobblehead
promotion have a statistically significant effect on the attendance?Test \(H_0: \beta_\text{BobbleheadYES} = 0\). Under \(H_0\), t-stat in the summary has t-distribution with degrees of freedom equal to (number of samples - numbr of parameters estimated). We check directly p-value for the t-test. If p-value is small (< 0.05), then we reject the null hypothesis and conclude that BobbleHead is important in increasing the attendance in the games. Since p-value (0.0000359) is less than 5%, we reject the null. Therefore we conclude it is a good idea to use bobblehead to boos the number of fans coming to stadium to watch the game.
month
and day of week
variables help to explain the number of attendants?Is there a relation between month and attendance (after we account for the effects of day_of_week and bobblehead)?
# lmod2 <- lm(attend ~ skies, data = events)
# lmod2 %>% summary()
# lmod2 <- lm(attend ~month, data = events)
# lmod2 %>% summary()
small <- update(lmod1, . ~ . - month)
anova(small, lmod1)
## Analysis of Variance Table
##
## Model 1: attend ~ day_of_week + bobblehead
## Model 2: attend ~ month + day_of_week + bobblehead
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 73 3129721926
## 2 67 2509574563 6 620147363 2.7594 0.01858 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
H_0: the small model is correct. If p-value is small (for example, less than 5%), as always we reject the null hypothesis (in this case, null says that the small model is correct). Here, p-value = 01858 < 5%, so it is small. We reject the small model. Therefore, we conclude that month and attendance are related (while day_of_week and bobbleheadYES are still in the model).
How many fans are expected to be drawn alone by a bobblehead promotion to a home game? Give a 90% confidence interval.
How good does the model fit to the data? Why? Comment on residual standard error and R\(^2\). Plot observed attendance against predicted attendance.
Is day of week important? (Does day_of_week provide new explanation while the other predictors are still present in the model?)
small <- update(lmod1, . ~ . - day_of_week)
anova(small, lmod1)
## Analysis of Variance Table
##
## Model 1: attend ~ month + bobblehead
## Model 2: attend ~ month + day_of_week + bobblehead
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 73 3085413762
## 2 67 2509574563 6 575839199 2.5623 0.02704 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We reject the small model because p-value (0.02704) is small (less than 0.05). SO conclude that day of week still contributes important information to pur understanding of attandance while the others two predcitors are in the model.
Variable selection
smallest <- update(lmod1, . ~ . - day_of_week - month)
anova(smallest, small, lmod1)
## Analysis of Variance Table
##
## Model 1: attend ~ bobblehead
## Model 2: attend ~ month + bobblehead
## Model 3: attend ~ month + day_of_week + bobblehead
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 79 3642937151
## 2 73 3085413762 6 557523389 2.4808 0.03157 *
## 3 67 2509574563 6 575839199 2.5623 0.02704 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
small
##
## Call:
## lm(formula = attend ~ month + bobblehead, data = events)
##
## Coefficients:
## (Intercept) monthMAY monthJUN monthJUL monthAUG
## 38519.6 -2603.6 6561.4 2147.8 1658.4
## monthSEP monthOCT bobbleheadYES
## 435.4 -1816.0 12867.3
small2 <- update(lmod1, . ~ . - month)
anova(smallest, small2, lmod1)
## Analysis of Variance Table
##
## Model 1: attend ~ bobblehead
## Model 2: attend ~ day_of_week + bobblehead
## Model 3: attend ~ month + day_of_week + bobblehead
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 79 3642937151
## 2 73 3129721926 6 513215226 2.2836 0.04583 *
## 3 67 2509574563 6 620147363 2.7594 0.01858 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(smallest, lmod1)
## Analysis of Variance Table
##
## Model 1: attend ~ bobblehead
## Model 2: attend ~ month + day_of_week + bobblehead
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 79 3642937151
## 2 67 2509574563 12 1133362588 2.5215 0.008408 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It is wise to compare all nested models pairwise with anova()
If p-value is slightly above 5%, we may call anova to be inconclusive. Use cross-validation to decide between two models:
We will use the full model because anova analysis showed that all predictors were important
predict(lmod1,
newdata = data.frame(
month = "JUN",
day_of_week = "Wednesday",
bobblehead = "YES"),
interval = "prediction",
level = 0.90)
## fit lwr upr
## 1 54247.32 42491.1 66003.55
Include all variables and conduct a full regression analysis of the problem. Submit your R markdown
and html
files to course homepage on moodle.