Week 5: Advertising and Promotion


The Dodgers is a professional baseball team and plays in the Major Baseball League. The team owns a 56,000-seat stadium and is interested in increasing the attendance of their fans during home games.At the moment the team management would like to know if bobblehead promotions increase the attendance of the team’s fans? This is a case study based on Miller (2014 Chapter 2).

include_graphics(c("los_angeles-dodgers-stadium.jpg",
                 "Los-Angeles-Dodgers-Promo.jpg",
                 "adrian_bobble.jpg"))
56,000-seat Dodgers stadium (left),   shirts and caps (middle),  bobblehead (right)56,000-seat Dodgers stadium (left),   shirts and caps (middle),  bobblehead (right)56,000-seat Dodgers stadium (left),   shirts and caps (middle),  bobblehead (right)

Figure 1: 56,000-seat Dodgers stadium (left), shirts and caps (middle), bobblehead (right)

The 2012 season data in the events table of SQLite database data/dodgers.sqlite contain for each of 81 home play the

Prerequisites

We will use R, RStudio, R Markdown for the next three weeks to fit statistical models to various data and analyze them. Read Wickham and Grolemund (2017) online

All materials for the next three weeks will be available on Google drive.

March 1: Exploratory data analysis

  1. Connect to data/dodgers.sqlite. Read table events into a variable in R.

    • Read Baumer, Kaplan, and Horton (2017, Chapters 1, 4, 5, 15) (Second edition online) for getting data from and writing them to various SQL databases.

    • Because we do not want to hassle with user permissions, we will use SQLite for practice. I recommend PostgreSQL for real projects.

    • Open RStudio terminal, connect to database dodgers.sqlite with sqlite3. Explore it (there is only one table, events, at this time) with commands

      • .help
      • .databases
      • .tables
      • .schema <table_name>
      • .headers on
      • .mode column
      • SELECT ...
      • .quit
    • Databases are great to store and retrieve large data, especially, when they are indexed with respect to variables/columns along with we do search and match extensively.

    • R (likewise, Python) allows one to seeminglessly read from and write to databases. For fast analysis, keep data in a database, index tables for fast retrieval, use R or Python to fit models to data.

# Ctrl-shift-i
#library(RPostgreSQL)
library(RSQLite)  ## if package is not on the computer, then install it only once using Tools > Install packages...
con <- dbConnect(SQLite(), "../data/dodgers.sqlite") # read Modern Data Science with R for different ways to connect a database.

## dbListTables(con)

tbl(con, "events") %>% 
  collect() %>% 
  mutate(day_of_week = factor(day_of_week, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")),
         month = factor(month, levels = c("APR","MAY","JUN","JUL","AUG","SEP","OCT"))) %>% 
  mutate_if(is.character, factor) %>% 
  mutate(temp = round((temp- 32)*5/9)) -> events
  
# events %>% distinct(month)
# events$day_of_week %>% class()
# events$day_of_week %>% levels()
# events
# summary(events)
events %>% 
  count(bobblehead, fireworks)
## # A tibble: 3 x 3
##   bobblehead fireworks     n
##   <fct>      <fct>     <int>
## 1 NO         NO           56
## 2 NO         YES          14
## 3 YES        NO           11
  1. What are the number of plays on each week day and in each month of a year?

Table 1 and 2 summarize the number of games played on each weekday and month.

events %>% 
  count(day_of_week, month) %>% 
  pivot_wider(names_from = day_of_week, values_from = n) %>% 
  pander(caption = "(\\#tab:monthweekday) Number of games played in each weekday and month")
Table 1: Number of games played in each weekday and month
month Monday Tuesday Wednesday Thursday Friday Saturday Sunday
APR 1 2 2 1 2 2 2
MAY 3 3 2 1 3 3 3
JUN 1 1 1 1 2 2 1
JUL 3 3 2 NA 1 1 2
AUG 2 2 3 1 3 2 2
SEP 1 1 1 1 2 3 3
OCT 1 1 1 NA NA NA NA
events %>% 
  ggplot(aes(day_of_week)) +
  geom_bar(aes(fill=month))
Barplot of counts of games for each weekday and month

Figure 2: Barplot of counts of games for each weekday and month

Figure 3 shows your friend’s (very good) suggestion of headmap of total attendance versus weekday and month. The colors chabge from bright yellow to dark red as attendance increases. Default heatmap shuffles rows and columns so as to bring together weekdays and months with similar attendance. Here we see May, Aug, and Jul together within the months and Saturday, Friday, Sunday within the weekdays. Learn more about xtabs (cross-table) heatmap by typing ?xtabs and ?heatmap in the R console.

xtabs(attend ~ day_of_week + month, events) %>% 
  heatmap()
Heatmap of attendance versus weekday and month.

Figure 3: Heatmap of attendance versus weekday and month.

In Figure 4, I have added one more aes (colour) to capture day_night information. To avoid overplotting, I replaced geom_point() with geom_jitter(). These plots were also illuminating. So let us thank your friend who suggested this one, too.

sum_attend <- events %>% 
  group_by(day_of_week, month, day_night) %>% 
  summarize(mean_attend = mean(attend),
            total_attend = sum(attend), .groups = "drop")

sum_attend %>% 
  ggplot(aes(day_of_week, month, month)) +
  geom_jitter(aes(size = mean_attend, col = day_night), width = .1, height = .1, alpha=0.7) +
  scale_size(labels = scales::comma) +
  labs(title = "Average attendance", size = "attendance", col = "part of day",
       x = "Weekday", y = "Month")

sum_attend %>% 
  ggplot(aes(day_of_week, month)) +
  geom_jitter(aes(size = total_attend, col = day_night), width = .1, height = .1, alpha=0.7) +
  labs(title = "Total attendance", size = "attendance", col = "part of day",
       x = "Weekday", y = "Month") +
  scale_size(labels = scales::comma) +
  guides(col = guide_legend(order = 1), 
         shape = guide_legend(order = 2))
Average and total attendances on each weekday and month in each part of dayAverage and total attendances on each weekday and month in each part of day

Figure 4: Average and total attendances on each weekday and month in each part of day

  1. Check the orders of the levels of the day_of_week and month factors. If necessary, put them in the logical order.
levels(events$day_of_week)
## [1] "Monday"    "Tuesday"   "Wednesday" "Thursday"  "Friday"    "Saturday" 
## [7] "Sunday"
levels(events$month)
## [1] "APR" "MAY" "JUN" "JUL" "AUG" "SEP" "OCT"
  1. How many times were bobblehead promotions run on each week day?
events %>% 
  count(day_of_week, bobblehead) %>% 
  pivot_wider(names_from = bobblehead, values_from = n) %>% 
  replace_na(list(YES = 0)) %>% 
  mutate(Total = YES + NO) %>% 
  select(-NO) %>% 
  rename(Bobblehead = YES)
## # A tibble: 7 x 3
##   day_of_week Bobblehead Total
##   <fct>            <dbl> <dbl>
## 1 Monday               0    12
## 2 Tuesday              6    13
## 3 Wednesday            0    12
## 4 Thursday             2     5
## 5 Friday               0    13
## 6 Saturday             2    13
## 7 Sunday               1    13
  1. How did the attendance vary across week days? Draw boxplots. On which day of week was the attendance the highest on average?
events %>% 
  ggplot(aes(day_of_week, attend)) +
  geom_boxplot()

events %>% 
  slice_max(order_by = attend, n=5)
## # A tibble: 5 x 12
##   month   day attend day_of_week opponent  temp skies day_night cap   shirt
##   <fct> <dbl>  <dbl> <fct>       <fct>    <dbl> <fct> <fct>     <fct> <fct>
## 1 APR      10  56000 Tuesday     Pirates     19 Clear Day       NO    NO   
## 2 AUG      21  56000 Tuesday     Giants      24 Clear Night     NO    NO   
## 3 JUL       1  55359 Sunday      Mets        24 Clear Night     NO    NO   
## 4 JUN      12  55279 Tuesday     Angels      19 Clou… Night     NO    NO   
## 5 AUG       7  55024 Tuesday     Rockies     27 Clear Night     NO    NO   
## # … with 2 more variables: fireworks <fct>, bobblehead <fct>
  1. Is there an association between attendance and
    • whether the game is played in day light or night?
    • Between attendance and whether skies are clear or cloudy?
events %>% 
  ggplot(aes(day_night, attend)) +
  geom_boxplot()

t.test(x=events$attend[events$day_night=="Day"],
       y=events$attend[events$day_night=="Night"])
## 
##  Welch Two Sample t-test
## 
## data:  events$attend[events$day_night == "Day"] and events$attend[events$day_night == "Night"]
## t = 0.42851, df = 23.62, p-value = 0.6722
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3531.652  5380.397
## sample estimates:
## mean of x mean of y 
##  41793.27  40868.89

Since p-value (0.67) is large (greater than 0.05), we cannot reject null, which means there is no statistical difference between average attendance of games played in day and night.

events %>% 
  ggplot(aes(skies, attend)) +
  geom_boxplot()

t.test(x=events$attend[events$skies=="Clear"],
       y=events$attend[events$skies=="Cloudy"])
## 
##  Welch Two Sample t-test
## 
## data:  events$attend[events$skies == "Clear"] and events$attend[events$skies == "Cloudy"]
## t = 1.2868, df = 27.664, p-value = 0.2088
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1741.315  7617.103
## sample estimates:
## mean of x mean of y 
##  41729.21  38791.32

We do not see a statisticall significant difference between the average attendance of the games played under clear and cloudy skies.

  1. Is there an association between attendance and temperature?
    • If yes, is there a positive or negative association?
    • Do the associations differ on clear and cloud days or day or night times?
events %>% 
  ggplot(aes(temp, attend)) +
  geom_jitter() +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

\[ attend = \beta_0 + \beta_1 temp + \beta_2 (temp - 23)^+ + \varepsilon_i \]

lm(attend ~ temp + pmax(0, temp - 23), data = events) %>% summary()
## 
## Call:
## lm(formula = attend ~ temp + pmax(0, temp - 23), data = events)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17115.9  -5194.3    422.1   4789.0  15982.0 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         15253.3     7363.9   2.071 0.041631 *  
## temp                 1303.4      360.2   3.618 0.000525 ***
## pmax(0, temp - 23)  -2240.1      612.7  -3.656 0.000463 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7727 on 78 degrees of freedom
## Multiple R-squared:  0.1544, Adjusted R-squared:  0.1327 
## F-statistic:  7.12 on 2 and 78 DF,  p-value: 0.001445

\[ attend = \beta_0 + \beta_1 temp + \beta_2 (temp-23)^+ + \varepsilon_i \]

events %>% 
  ggplot(aes(temp, attend)) +
  geom_jitter() +
  geom_smooth(se = FALSE) +
  geom_smooth(se = FALSE, method = "lm", 
              formula = y ~ x + pmax(x-23,0), col = "red") 
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

There is statistically significant relation between attendance and temperature.

Next time: A linear regression model

Regress attendance on month, day of the week, and bobblehead promotion.

lmod1 <- lm(attend ~ month + day_of_week + bobblehead, data = events)
  1. Is there any evidence for a relationship between attendance and other variables? Why or why not?
lmod1 %>% summary()
## 
## Call:
## lm(formula = attend ~ month + day_of_week + bobblehead, data = events)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10786.5  -3628.1   -516.1   2230.2  14351.0 
## 
## Coefficients:
##                      Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)          33909.16    2521.81  13.446 < 0.0000000000000002 ***
## monthMAY             -2385.62    2291.22  -1.041              0.30152    
## monthJUN              7163.23    2732.72   2.621              0.01083 *  
## monthJUL              2849.83    2578.60   1.105              0.27303    
## monthAUG              2377.92    2402.91   0.990              0.32593    
## monthSEP                29.03    2521.25   0.012              0.99085    
## monthOCT              -662.67    4046.45  -0.164              0.87041    
## day_of_weekTuesday    7911.49    2702.21   2.928              0.00466 ** 
## day_of_weekWednesday  2460.02    2514.03   0.979              0.33134    
## day_of_weekThursday    775.36    3486.15   0.222              0.82467    
## day_of_weekFriday     4883.82    2504.65   1.950              0.05537 .  
## day_of_weekSaturday   6372.06    2552.08   2.497              0.01500 *  
## day_of_weekSunday     6724.00    2506.72   2.682              0.00920 ** 
## bobbleheadYES        10714.90    2419.52   4.429            0.0000359 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6120 on 67 degrees of freedom
## Multiple R-squared:  0.5444, Adjusted R-squared:  0.456 
## F-statistic: 6.158 on 13 and 67 DF,  p-value: 0.0000002083

Check F-statistic’s p-value. If it is less than 0.05, then there is relation between attendance and predictors.

  1. Does the bobblehead promotion have a statistically significant effect on the attendance?

  2. Do month and day of week variables help to explain the number of attendants?

  3. How many fans are expected to be drawn alone by a bobblehead promotion to a home game? Give a 90% confidence interval.

  4. How good does the model fit to the data? Why? Comment on residual standard error and R\(^2\). Plot observed attendance against predicted attendance.

  5. Predict the number of attendees to a typical home game on a Wednesday in June if a bobblehead promotion is extended. Give a 90% prediction interval.

Project (will be graded)

Include all variables and conduct a full regression analysis of the problem. Submit your R markdown and html files to course homepage on moodle.

Bibliography

Baumer, B. S., D. T. Kaplan, and N. J. Horton. 2017. Modern Data Science with R. Chapman & Hall/CRC Texts in Statistical Science. CRC Press. https://books.google.com.tr/books?id=NrddDgAAQBAJ.
Miller, T. W. 2014. Modeling Techniques in Predictive Analytics with Python and R: A Guide to Data Science. FT Press Analytics. Pearson Education. https://books.google.com.tr/books?id=PU6nBAAAQBAJ.
Wickham, H., and G. Grolemund. 2017. R for Data Science. O’Reilly Media. https://books.google.com.tr/books?id=aZRYrgEACAAJ.