library(dplyr)  # for the grouping pipeline below
lobData <- read.csv("lobster.csv")
lobGrouped <- lobData %>%
  group_by(size) %>%
  summarise(
    y = sum(survived),  # number surviving at each size
    n = n(),            # number observed at each size
    p = y / n           # observed survival proportion
  ) %>%
  arrange(size)
lobFit <- glm(cbind(y, n - y) ~ size, data = lobGrouped, family = binomial)
targetProb <- 0.9
samples <- 1000
sizes <- numeric(samples)
set.seed(486)
for (i in 1:samples) {
  # Simulate new counts from the fitted model, then refit and invert
  simObservations <- rbinom(n = nrow(lobGrouped), size = lobGrouped$n,
                            prob = predict(lobFit, type = "response"))
  simModel <- glm(cbind(simObservations, lobGrouped$n - simObservations) ~ size,
                  family = binomial, data = lobGrouped)
  simCoefs <- coef(simModel)
  sizes[i] <- (log(targetProb / (1 - targetProb)) - simCoefs[1]) / simCoefs[2]
}
parametricCI <- quantile(sizes, c(0.025, 0.975))
parametricCI
## 2.5% 97.5%
## 47.81241 56.36594
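The quantity solved for inside each bootstrap iteration comes from inverting the fitted logistic model. Writing the intercept and slope as $\beta_0$ and $\beta_1$:

```latex
\log\frac{p}{1-p} = \beta_0 + \beta_1 \cdot \text{size}
\quad\Longrightarrow\quad
\text{size} = \frac{\log\!\big(p/(1-p)\big) - \beta_0}{\beta_1}
```

With the target probability $p = 0.9$, the first term in the numerator is $\log(0.9/0.1) = \log 9 \approx 2.197$, which is the `log(targetProb / (1 - targetProb))` expression in the loop.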
samplesNP <- 1000
sizesNP <- numeric(samplesNP)
for (i in 1:samplesNP) {
  # Resample grouped rows with replacement, then refit and invert.
  # (The loop index is named i to avoid shadowing the y column used in the formula.)
  simSamples <- lobGrouped[sample(1:nrow(lobGrouped), replace = TRUE), ]
  simModelNP <- glm(cbind(y, n - y) ~ size, family = binomial, data = simSamples)
  simCoefsNP <- coef(simModelNP)
  sizesNP[i] <- (log(targetProb / (1 - targetProb)) - simCoefsNP[1]) / simCoefsNP[2]
}
nonparametricCI <- quantile(sizesNP, c(0.025, 0.975))
nonparametricCI
## 2.5% 97.5%
## 47.98382 54.45976
From the two bootstrap confidence intervals, we can conclude with 95% confidence that the carapace size at which 90% of lobsters survive lies between about 47.8cm and 56.4cm (parametric) or between about 48.0cm and 54.5cm (non-parametric).
The parametric and non-parametric bootstrap simulations produce very similar confidence intervals because, although their resampling mechanisms differ, they rest on the same underlying model and data. The parametric bootstrap treats the estimated coefficients as true and generates new observations directly from the fitted model. The non-parametric bootstrap instead uses the data itself, re-sampling observations with replacement to produce new sample data. Since the coefficients used by the parametric bootstrap are estimated from the very data the non-parametric bootstrap re-samples, the two methods are bound to converge towards similar results.
writingData <- read.csv("handwriting-train.csv")
variables <- setdiff(names(writingData), "gender")
par(mfrow = c(2, 2))
for (v in variables) {
  boxplot(writingData[[v]] ~ writingData$gender,
          main = paste("Boxplot of", v, "by Gender"),
          xlab = "Gender", ylab = v,
          col = c("lightblue", "lightpink"))
}
These box plots reveal a slight difference in medians between males and females for chaincode1, curvature1, curvature4, and tortuosity1, and skewed distributions for chaincode4, curvature2, and curvature4. Other than this, there are no clear relationships between gender and the other explanatory variables.
library(mgcv)  # for gam() and s()
gam.fit <- gam(I(gender == "M") ~ s(chaincode1) + s(chaincode2) + s(chaincode3) +
                 s(chaincode4) + s(curvature1) + s(curvature2) + s(curvature3) +
                 s(curvature4) + s(direction1) + s(tortuosity1),
               family = binomial, data = writingData)
par(mfrow = c(2, 2))
plot(gam.fit)
It appears that chaincode1 and curvature1 may call for quadratic terms in a final model, but the curves are not dramatic, and chaincode1 shows no clear trend beyond being non-linear.
fit.all <- glm(I(gender == "M") ~ (chaincode1 + chaincode2 + chaincode3 + chaincode4 +
                 curvature1 + curvature2 + curvature3 + curvature4 + direction1 +
                 tortuosity1)^2 + I(curvature1^2) + I(direction1^2),
               family = binomial, data = writingData)
The given model fits the main effects of all ten explanatory variables, all pairwise interactions between them, and quadratic terms for curvature1 and direction1.
betterFit <- step(fit.all, direction = "backward", trace = 0)
formula(betterFit)
## I(gender == "M") ~ chaincode1 + chaincode2 + chaincode3 + chaincode4 +
## curvature1 + curvature2 + curvature3 + direction1 + tortuosity1 +
## I(curvature1^2) + chaincode1:chaincode3 + chaincode1:direction1 +
## chaincode2:chaincode3 + chaincode2:curvature2 + chaincode3:curvature1 +
## chaincode3:curvature3 + chaincode4:curvature1 + curvature1:direction1 +
## curvature1:tortuosity1 + curvature2:curvature3 + curvature3:direction1 +
## direction1:tortuosity1
predProbs <- predict(betterFit, type = "response")
predClass <- ifelse(predProbs > 0.5, "M", "F")
confMat <- table(Predicted = predClass, Actual = writingData$gender)
confMat
## Actual
## Predicted F M
## F 87 28
## M 27 90
trueMale <- confMat["M", "M"]
trueFemale <- confMat["F", "F"]
falseMale <- confMat["M", "F"]
falseFemale <- confMat["F", "M"]
sensitivity <- trueMale / (trueMale + falseFemale)
specificity <- trueFemale / (trueFemale + falseMale)
errorRate <- (falseMale + falseFemale) / (trueMale + falseMale + trueFemale + falseFemale)
sensitivity
## [1] 0.7627119
specificity
## [1] 0.7631579
errorRate
## [1] 0.237069
The sensitivity and specificity are very similar to each other at 76.27% and 76.32% respectively. Their similarity is good because it means our model is not biased towards predicting either group. We also have an error rate of 23.71%, which is fairly low given the lack of clear relationships between the explanatory variables and gender that we saw in the box plots in part a.
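For reference, with M treated as the positive class, the quantities computed above are:

```latex
\text{sensitivity} = \frac{TP}{TP + FN}, \qquad
\text{specificity} = \frac{TN}{TN + FP}, \qquad
\text{error rate} = \frac{FP + FN}{TP + FP + TN + FN}
```

Here $TP = 90$, $TN = 87$, $FP = 27$, and $FN = 28$ from the confusion matrix, giving $90/118 \approx 0.7627$, $87/114 \approx 0.7632$, and $55/232 \approx 0.2371$.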
library(pROC)  # for roc() and coords()
observed <- ifelse(writingData$gender == "M", 1, 0)
roc <- roc(observed, predProbs, direction = "<")
## Setting levels: control = 0, case = 1
plot(roc)
optPoint <- coords(roc, "best", best.method = "youden", ret = c("threshold", "sensitivity", "specificity"))
optPoint
## threshold sensitivity specificity
## 1 0.4522439 0.8220339 0.7280702
optCutoff <- as.numeric(optPoint["threshold"])
optPredClass <- ifelse(predProbs > optCutoff, "M", "F")
optConfMat <- table(Predicted = optPredClass, Actual = writingData$gender)
optConfMat
## Actual
## Predicted F M
## F 83 21
## M 31 97
optTrueMale <- optConfMat["M", "M"]
optTrueFemale <- optConfMat["F", "F"]
optFalseMale <- optConfMat["M", "F"]
optFalseFemale <- optConfMat["F", "M"]
optSensitivity <- optTrueMale / (optTrueMale + optFalseFemale)
optSpecificity <- optTrueFemale / (optTrueFemale + optFalseMale)
optErrorRate <- (optFalseMale + optFalseFemale) / (optTrueMale + optFalseMale + optTrueFemale + optFalseFemale)
optSensitivity
## [1] 0.8220339
optSpecificity
## [1] 0.7280702
optErrorRate
## [1] 0.2241379
At the optimal cutoff, sensitivity increases from 76.27% to 82.20%, specificity decreases from 76.32% to 72.81%, and the overall error rate decreases from 23.71% to 22.41%.
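The `best.method = "youden"` option selects the threshold that maximizes Youden's J statistic:

```latex
J = \text{sensitivity} + \text{specificity} - 1
```

At the reported optimum, $J \approx 0.8220 + 0.7281 - 1 = 0.5501$, compared with $J \approx 0.7627 + 0.7632 - 1 = 0.5259$ at the default 0.5 cutoff.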
CV.fun <- function(train, test, cutoff) {
  # Fit the selected model on the training fold and score the held-out fold
  model <- glm(formula(betterFit), family = binomial, data = train)
  predictedProbs <- predict(model, newdata = test, type = "response")
  predictedClass <- ifelse(predictedProbs > cutoff, "M", "F")
  mean(predictedClass != test$gender)
}
k <- 10
n <- nrow(writingData)
folds <- sample(rep(1:k, length.out = n))
cvErrors <- numeric(k)
for (i in 1:k) {
  trainData <- writingData[folds != i, ]
  testData <- writingData[folds == i, ]
  cvErrors[i] <- CV.fun(trainData, testData, optCutoff)
}
cvErrorRate <- mean(cvErrors) * 100
cvErrorRate
## [1] 30.57971
Here we have performed 10-fold cross-validation and obtained an average error rate of about 30%. This is larger than the error rate on the training data (by about 8 percentage points), which indicates some overfitting in our model. This error rate is fairly high and suggests that our model could serve as a rough indication, but is not reliable for predicting gender from handwriting.
The conclusion not to use handwriting evidence to determine gender in a criminal case is made because of the contextually high error rate of our model. Our cross-validation error rate is about 30%, and we would not want to draw conclusions about suspects from a prediction with such a high chance of being wrong; a decision of that importance should not rest on so much uncertainty.