STATS 330 Assignment1

Question 1

Part 1

bikeData <- read.csv("bike_df.csv")

bikeData <- mutate(bikeData, day_type2 = ifelse(day_type == "Weekday", 0, 1))

Part 2

plot(log(bike_count) ~ as.numeric(as.Date(date)), data = bikeData,
     main = "Bike Count by Day (2023-2024)", 
     xlab = "Date", ylab = "Bike Count")

boxplot(log(bike_count) ~ day_type, data = bikeData,
        main = "Bike Count on Weekdays vs Saturday vs Sunday",
        xlab = "Day Type", ylab = "Bike Count")

boxplot(log(bike_count) ~ day_type2, data = bikeData,
        main = "Bike Count on Weekdays vs Weekends",
        xlab = "Day Type (0 = Weekday, 1 = Weekend)", ylab = "Bike Count")

boxplot(log(bike_count) ~ restr_day, data = bikeData,
        main = "Bike Count on Non-restricted vs Restricted car days",
        xlab = "Restriction (0 = Non-restricted, 1 = Restricted)", ylab = "Bike Count")

boxplot(log(bike_count) ~ sunny, data = bikeData,
        main = "Bike Count on Cloudy vs Sunny days",
        xlab = "Weather (0 = Cloudy, 1 = Sunny)", ylab = "Bike Count")

Part 3

Plots examining interaction between explanitory variables:

bikeGLM_inter1 <- glm(bike_count~restr_day*sunny, family = poisson, data = bikeData)
interact_plot(bikeGLM_inter1, pred = restr_day, modx = sunny)

bikeGLM_inter2 <- glm(bike_count~restr_day*day_type, family = poisson, data = bikeData)
interact_plot(bikeGLM_inter2, pred = restr_day, modx = day_type)

bikeGLM_inter3 <- glm(bike_count~sunny*day_type, family = poisson, data = bikeData)
interact_plot(bikeGLM_inter3, pred = sunny, modx = day_type)

Code for initial and final models, along with final model checks and coefficient back-transformations:

bikeData$day_type <- factor(bikeData$day_type)
bikeData$day_type <- relevel(bikeData$day_type, ref = "Weekday")

#Initial model
bikeGLM_initial <- glm(bike_count~(as.numeric(as.Date(date))+day_type2+restr_day+sunny)^2, family = poisson, data = bikeData)
summary(bikeGLM_initial)

## 
## Call:
## glm(formula = bike_count ~ (as.numeric(as.Date(date)) + day_type2 + 
##     restr_day + sunny)^2, family = poisson, data = bikeData)
## 
## Coefficients:
##                                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                          3.744e-01  1.155e+00   0.324 0.745886    
## as.numeric(as.Date(date))            1.322e-04  5.857e-05   2.257 0.023978 *  
## day_type2                            3.775e+00  9.424e-01   4.006 6.18e-05 ***
## restr_day                            3.412e+00  9.372e-01   3.641 0.000272 ***
## sunny                                1.154e+00  1.147e+00   1.006 0.314190    
## as.numeric(as.Date(date)):day_type2 -1.646e-04  4.775e-05  -3.448 0.000565 ***
## as.numeric(as.Date(date)):restr_day -1.360e-04  4.750e-05  -2.864 0.004188 ** 
## as.numeric(as.Date(date)):sunny     -3.266e-05  5.811e-05  -0.562 0.574078    
## day_type2:restr_day                 -7.605e-02  2.035e-02  -3.737 0.000186 ***
## day_type2:sunny                      7.510e-01  2.483e-02  30.253  < 2e-16 ***
## restr_day:sunny                     -2.229e-02  2.469e-02  -0.903 0.366601    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 27076.4  on 729  degrees of freedom
## Residual deviance:  3695.7  on 719  degrees of freedom
## AIC: 7801.1
## 
## Number of Fisher Scoring iterations: 4

#ANOVA of "day_type"
bikeGLM_day_type <- glm(bike_count~day_type, family = poisson, data = bikeData)
anova(bikeGLM_day_type, test = "Chisq")

## Analysis of Deviance Table
## 
## Model: poisson, link: log
## 
## Response: bike_count
## 
## Terms added sequentially (first to last)
## 
## 
##          Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                       729      27076              
## day_type  2    14575       727      12501 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#Final model
bikeGLM_final <- glm(bike_count~day_type+restr_day+sunny, family = quasipoisson, data = bikeData)
summary(bikeGLM_final)

## 
## Call:
## glm(formula = bike_count ~ day_type + restr_day + sunny, family = quasipoisson, 
##     data = bikeData)
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.69496    0.02170  124.21   <2e-16 ***
## day_typeSaturday  0.68788    0.02254   30.52   <2e-16 ***
## day_typeSunday    1.39104    0.01842   75.52   <2e-16 ***
## restr_day         0.69970    0.01665   42.01   <2e-16 ***
## sunny             0.89909    0.02029   44.30   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 2.800424)
## 
##     Null deviance: 27076.4  on 729  degrees of freedom
## Residual deviance:  2015.4  on 725  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 4

modelcheck(bikeGLM_final)

round(exp(confint(bikeGLM_final)), 2)

## Waiting for profiling to be done...

##                  2.5 % 97.5 %
## (Intercept)      14.19  15.45
## day_typeSaturday  1.90   2.08
## day_typeSunday    3.88   4.17
## restr_day         1.95   2.08
## sunny             2.36   2.56

I began by predicting “bike_count” using the main effects and all two-way interactions between explanatory variables, using “day_type2” instead of “day_type”. I also re-leveled “day_type” to use weekdays as the baseline because I though this seemed more intuitive.

I found multiple of the interaction terms to be significant (day_type2:restr_day, day_type2:sunny, as.numeric(as.Date(date)):day_type2, and as.numeric(as.Date(date)):restr_day). However, we know that restricted days are randomly chosen and sunny days will fall on random days of the week, as well as weekdays/weekends not changing their frequency based on the date, so I removed these interaction terms because they are not logical. After doing so, I found the main effect of “date” to be insignificant, and so removed this too.

I then swapped “day_type2” for “day_type” and found a significant increase in the variation explained which prompted me to keep this in my final model. This was further supported by an ANOVA of “day_type” that showed a significant difference between levels.

Initially, I used a poisson distribution because “bike_count” is a count variable, but switched to quasipoisson instead due to overdispersion (residual deviance of 2015 on 725 degrees of freedom). I then did a model check and found the QQ plot, histogram of residuals, and Cook’s Distances to all appear reasonable.

My final model equation is as follows: \[log(𝜇_i) = \beta_0 + \beta_1×DayType_i + \beta_2\times Restricted_i + \beta_3 \times Sunny_i\] We have evidence of significant main effects from “day_type”, “restr_day”, and “sunny” on the total count of bikes each day. Using cloudy weekdays with no car restrictions as the baseline, we estimate with 95% confidence that the baseline average is between 14.19 and 15.45 bikes per day. Compared to this, our confidence intervals show average increases of:

Between 190% and 208% on Saturdays
Between 388% and 417% on Sundays
Between 195% and 208% on restricted car days
Between 236% and 256% on sunny days

Question 2

For both plots, my first suggestion would be to log “bikecount” as it is a count variable and clearly right skewed. Next, I would suggest to add a theme such as “theme_minimal()” to make it look a bit more professional. Finally, for the second plot, I would examine the relationships between “bikecount~restrday”, and “bikecount~sunny” separately. This will provide a more clear understanding of each, and because the car restriction days are randomly chosen, we can infer that any interaction between weather and restricted days will be purely circumstantial.

Question 3C

The first issue I notice with the model fit is the distribution. The AI is using a guassian distribution, but we are working with count data so a poisson or quasipoisson will be more appropriate. Next, I see that it is examining an interaction between “sunny” and “restrday”, but we know that restricted car days are randomly chosen, so there shouldn’t be any interaction between this variable and the others. I would suggest to use a poisson or quasipoisson distribution and only the main effects of each variable.

Question 3D

The issues with this output mostly stem from the initial model setup. I believe a poisson or quasipoisson distribution would be more appropriate and the interaction between “sunny” and “restrday” should be removed. The AI has also modeled the equation using Y and instead of 𝜇 and an error term at the end. In this context it would be more appropriate to use log(𝜇) instead of Y and to not include the error term because we are fitting a line to sample count data.

Question 3E

Again, most of the problems in this output stem from the initial setup. Given a normal distribution, the interpretation of coefficients would be correct. This is far from the case, however, since a normal distribution is not appropriate. Ideally, a poisson or quasipoisson distribution would be used, the coefficients would represent proportional changes in Y, and the interaction term would be removed.