library(dplyr)   # for the pipe, group_by(), summarise(), mutate()
library(mgcv)    # for gam(), used later
lobData <- read.csv("lobster.csv")
lobGrouped <- lobData %>%
group_by(size) %>%
summarise(
y = sum(survived),
n = n(),
p = y / n
) %>%
arrange(size)
print(lobGrouped)
## # A tibble: 11 × 4
## size y n p
## <int> <int> <int> <dbl>
## 1 27 0 5 0
## 2 30 1 10 0.1
## 3 33 3 22 0.136
## 4 36 7 21 0.333
## 5 39 12 22 0.545
## 6 42 17 29 0.586
## 7 45 13 18 0.722
## 8 48 12 17 0.706
## 9 51 7 8 0.875
## 10 54 6 6 1
## 11 57 1 1 1
plot(lobGrouped$size, lobGrouped$p,
xlab = "Carapace size (mm)",
ylab = "Proportion survived",
main = "Size vs Probability of survival")
We see a clear upward trend in the probability of survival as carapace size increases, and over the observed range of sizes the trend appears approximately linear.
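Since the logistic model assumes linearity on the logit scale rather than the probability scale, one informal check is to plot the empirical logits against size. A sketch, rebuilding the grouped counts from the table printed above (0.5 is added to each cell so the groups with 0% or 100% survival stay finite):

```r
# Grouped counts transcribed from the printed tibble
lob <- data.frame(
  size = seq(27, 57, by = 3),
  y = c(0, 1, 3, 7, 12, 17, 13, 12, 7, 6, 1),
  n = c(5, 10, 22, 21, 22, 29, 18, 17, 8, 6, 1)
)
# Empirical logit with a 0.5 continuity correction in each cell
empLogit <- log((lob$y + 0.5) / (lob$n - lob$y + 0.5))
plot(lob$size, empLogit,
     xlab = "Carapace size (mm)",
     ylab = "Empirical logit of survival",
     main = "Checking linearity on the logit scale")
abline(lm(empLogit ~ lob$size), lty = 2)
```

If the points scatter evenly around the dashed least-squares line, a linear logit specification is plausible.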
lobFit <- glm(cbind(y, n - y) ~ size, data = lobGrouped, family = binomial)
dev_res <- residuals(lobFit, type = "deviance")
pear_res <- residuals(lobFit, type = "pearson")
fitted_vals <- fitted(lobFit)
plot(fitted_vals, dev_res,
xlab = "Fitted values (survival probability)",
ylab = "Deviance residuals",
main = "Deviance residuals vs Fitted values",
pch = 19, col = "red")
abline(h = 0, lty = 2)
plot(fitted_vals, pear_res,
xlab = "Fitted values (survival probability)",
ylab = "Pearson residuals",
main = "Pearson residuals vs Fitted values",
pch = 19, col = "blue")
abline(h = 0, lty = 2)
lobGrouped <- lobGrouped %>%
mutate(
fitted = fitted(lobFit),
expectedSuccesses = round(n * fitted, 2),
expectedFailures = round(n * (1 - fitted), 2)
)
tibble(Expected_successes = lobGrouped$expectedSuccesses, Expected_failures = lobGrouped$expectedFailures)
## # A tibble: 11 × 2
## Expected_successes Expected_failures
## <dbl> <dbl>
## 1 0.34 4.66
## 2 1.17 8.83
## 3 4.24 17.8
## 4 6.31 14.7
## 5 9.59 12.4
## 6 16.9 12.1
## 7 12.9 5.14
## 8 13.9 3.09
## 9 7.12 0.88
## 10 5.62 0.38
## 11 0.96 0.04
summary(lobFit)
##
## Call:
## glm(formula = cbind(y, n - y) ~ size, family = binomial, data = lobGrouped)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.89597 1.38501 -5.701 1.19e-08 ***
## size 0.19586 0.03415 5.735 9.77e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 52.1054 on 10 degrees of freedom
## Residual deviance: 4.5623 on 9 degrees of freedom
## AIC: 32.24
##
## Number of Fisher Scoring iterations: 4
1 - pchisq(lobFit$deviance, lobFit$df.residual)
## [1] 0.8706732
The deviance residual plot and the Pearson residual plot appear almost identical. Although the two residual types are calculated differently, both measure each group's discrepancy from its fitted value, and for a well-fitting model the two definitions agree closely. In addition, the small group sizes combined with limited variability mean that neither residual type reveals any structure beyond the other, so the plots end up looking essentially the same.
Our final model is: \[\text{logit}(\mu_i) = \beta_0 + \beta_1 \times \text{Size}_i\] Some groups have expected successes or failures below five, so the data are mildly sparse. The residuals have approximately constant variance and lie between -1 and 1, and the chi-square goodness-of-fit test on the residual deviance returns a non-significant p-value (0.87), so we have no evidence of lack of fit. Together this suggests the model is suitable for representing the data, with the caveat that the data are somewhat sparse.
We have evidence of a strong positive relationship between carapace size and the probability of survival. With 95% confidence, a one mm increase in carapace size is associated with an increase of between 13.8% and 30.1% in the odds of survival.
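The interval quoted above can be reproduced from the coefficient table: exponentiating the Wald 95% confidence limits for the size coefficient converts them to multiplicative changes in the odds. A sketch using the estimate and standard error reported by `summary(lobFit)`:

```r
beta_hat <- 0.19586   # size coefficient from summary(lobFit)
se_hat   <- 0.03415   # its standard error
# Wald 95% CI on the log-odds scale, then exponentiate to the odds-ratio scale
ci <- exp(beta_hat + c(-1, 1) * qnorm(0.975) * se_hat)
round(100 * (ci - 1), 1)   # percent change in odds per mm: 13.8 to 30.1
```

The same interval comes from `exp(confint.default(lobFit))`; note that `exp(confint(lobFit))` would instead give a profile-likelihood interval, which differs slightly.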
lobGAM <- gam(cbind(y, n - y) ~ s(size), data = lobGrouped, family = binomial)
plot(lobGAM)
Here we have fitted a GAM with a smooth term for size; the estimated smooth is approximately linear, supporting the linear logit specification used in the binomial model above.
authorModel <- glm(p ~ size, data = lobGrouped, family = gaussian)
summary(authorModel)
##
## Call:
## glm(formula = p ~ size, family = gaussian, data = lobGrouped)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.948038 0.086867 -10.91 1.72e-06 ***
## size 0.035569 0.002017 17.63 2.75e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.004029348)
##
## Null deviance: 1.288767 on 10 degrees of freedom
## Residual deviance: 0.036264 on 9 degrees of freedom
## AIC: -25.646
##
## Number of Fisher Scoring iterations: 2
plot(lobGrouped$size, lobGrouped$p,
xlab = "Carapace size (mm)",
ylab = "Proportion survived",
main = "Size vs Probabbility of survival")
new_size <- seq(min(lobGrouped$size), max(lobGrouped$size), length.out = 100)
linearProb <- predict(authorModel, newdata = data.frame(size = new_size), type = "response")
binomialProb <- predict(lobFit, newdata = data.frame(size = new_size), type = "response")
lines(new_size, linearProb, col = "red", lwd = 2)
lines(new_size, binomialProb, col = "darkgreen", lwd = 2)
legend("bottomright",
legend = c("Author model", "My model"),
col = c("red", "darkgreen"),
lty = 2, lwd = 2)
1 - pchisq(lobFit$deviance, lobFit$df.residual)
## [1] 0.8706732
1 - pchisq(authorModel$deviance, authorModel$df.residual)
## [1] 1
From the plot alone it is hard to say which model fits better, since both track the data closely. Comparing the fits numerically, the author's model has a much smaller residual deviance (0.036 compared to 4.56, both on 9 degrees of freedom) and a larger goodness-of-fit p-value (essentially 1 compared to 0.87). This suggests the author's model fits the data better; however, the two deviances are on different scales (Gaussian versus binomial), so they are not directly comparable, and a p-value this close to 1 can indicate a model that fits the data suspiciously well and should be investigated further.
The residual deviance of a binomial model is closely approximated by a chi-squared distribution when the model is correct, observations are independent, and the expected successes and failures are at least five in every group. This holds for only 4 of the 11 groups, so we have some sparsity, though perhaps not enough to make the chi-square approximation unusable.
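Using the expected counts printed earlier, we can count how many groups satisfy the rule of thumb; a quick sketch with those values transcribed:

```r
# Expected successes/failures transcribed from the table above
expSucc <- c(0.34, 1.17, 4.24, 6.31, 9.59, 16.9, 12.9, 13.9, 7.12, 5.62, 0.96)
expFail <- c(4.66, 8.83, 17.8, 14.7, 12.4, 12.1, 5.14, 3.09, 0.88, 0.38, 0.04)
sum(expSucc >= 5 & expFail >= 5)   # groups satisfying the rule: 4
sum(expSucc < 5 | expFail < 5)     # groups violating it: 7
```

With 7 of 11 groups below the threshold, the simulation check that follows is a sensible way to verify the chi-square approximation directly.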
set.seed(486)
runs <- 1000
simDeviance <- numeric(runs)
for (i in 1:runs) {
y.sim <- rbinom(n = nrow(lobGrouped), size = lobGrouped$n, prob = fitted(lobFit))
simulatedfit <- glm(cbind(y.sim, n - y.sim) ~ size, data = lobGrouped, family = binomial)
simDeviance[i] <- deviance(simulatedfit)
}
hist(simDeviance, breaks = 30, probability = TRUE,
main = "Simulated distribution of residual deviances",
xlab = "Residual deviance")
curve(dchisq(x, df = lobFit$df.residual),
col = "red", lwd = 2, add = TRUE)
legend("topright", legend = c("Predicted chi-square with 9 df"),
col = "red", lwd = 2, lty = 1)
It does appear reasonable to compare the logistic model's residual deviance to a chi-square distribution, since the histogram of simulated deviances closely follows the chi-squared density with the same degrees of freedom.
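The simulated deviances can also be used directly as an empirical reference distribution, sidestepping the chi-square approximation altogether. A self-contained sketch (rebuilding the grouped data from the table printed earlier, and re-running the parametric bootstrap):

```r
set.seed(486)
lob <- data.frame(
  size = seq(27, 57, by = 3),
  y = c(0, 1, 3, 7, 12, 17, 13, 12, 7, 6, 1),
  n = c(5, 10, 22, 21, 22, 29, 18, 17, 8, 6, 1)
)
fit <- glm(cbind(y, n - y) ~ size, data = lob, family = binomial)
simDev <- replicate(1000, {
  # Simulate survival counts from the fitted model, then refit and record the deviance
  y.sim <- rbinom(nrow(lob), size = lob$n, prob = fitted(fit))
  deviance(glm(cbind(y.sim, lob$n - y.sim) ~ lob$size, family = binomial))
})
# Empirical p-value: proportion of simulated deviances at least as large as the observed one
mean(simDev >= deviance(fit))
# Chi-square approximation, for comparison
1 - pchisq(deviance(fit), df.residual(fit))
```

If the two p-values are close, as the histogram suggests they should be, the chi-square approximation is adequate despite the sparse groups.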