Question 1

###Q1
A1 <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
A2 <- c(3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)

B1 <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
B2 <- c(2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1)

manhattan_distanceA <- sum(abs(A1 - A2))
euclidean_distanceA <- sqrt(sum((A1 - A2)^2))
manhattan_distanceB <- sum(abs(B1 - B2))
euclidean_distanceB <- sqrt(sum((B1 - B2)^2))

Comparing Manhattan distances, we have a value of 2 between the first pair of points and 2 between the second pair. Comparing Euclidean distances, we have a value of 2 between the first pair and about 1.414 between the second pair. The two pairs therefore have the same Manhattan distance but different Euclidean distances. This is because the Manhattan distance sums the absolute differences across variables, so two differences of 1 contribute the same amount as one difference of 2. The Euclidean distance, on the other hand, is the "straight-line" distance between points, computed with the distance formula; because differences are squared before summing, two differences of 1 (contributing 1 + 1 = 2 inside the square root) are not equivalent to one difference of 2 (contributing 4).
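As a cross-check, R's built-in `dist()` reproduces these values directly (a sketch; `rbind()` stacks each pair of points as the rows of a two-row matrix):

```r
# Cross-check the hand computations with dist()
A1 <- rep(1, 11)
A2 <- c(3, rep(1, 10))
B1 <- rep(1, 11)
B2 <- c(2, 2, rep(1, 9))

dist(rbind(A1, A2), method = "manhattan")  # 2
dist(rbind(A1, A2), method = "euclidean")  # 2
dist(rbind(B1, B2), method = "manhattan")  # 2
dist(rbind(B1, B2), method = "euclidean")  # sqrt(2), about 1.414
```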

Question 2

###Q2
distances <- dist(schoolData[2:12], "manhattan")

schoolPCO <- cmdscale(distances, k=2, eig = TRUE)
plot(schoolPCO$eig)

gof <- schoolPCO$GOF[1]
axis1Var <- schoolPCO$eig[1]/(schoolPCO$eig[1] + schoolPCO$eig[2])
axis2Var <- schoolPCO$eig[2]/(schoolPCO$eig[1] + schoolPCO$eig[2])
gof
## [1] 0.4739983
axis1Var
## [1] 0.8422748
axis2Var
## [1] 0.1577252

From the scree plot of eigenvalues, the elbow rule suggests that we should use two dimensions, because the eigenvalues drop off sharply after the second dimension and level off thereafter.

We can then see from the GOF value that the first two dimensions represent about 47.4% of the variability in the original distances. Of the variability captured by those two dimensions, the first axis accounts for about 84.2% and the second axis for about 15.8%.
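The GOF values reported by `cmdscale()` can be reconstructed from the eigenvalues themselves, following the definition in `?cmdscale`: the sum of the first k eigenvalues divided by the sum of the absolute (or the positive) eigenvalues. A sketch on synthetic data, since `schoolData` is not reproduced here:

```r
# Illustrate how cmdscale() derives its GOF values (synthetic data)
set.seed(1)
X <- matrix(rnorm(30 * 5), ncol = 5)
pco <- cmdscale(dist(X, method = "manhattan"), k = 2, eig = TRUE)

# GOF[1]: first k eigenvalues over the sum of absolute eigenvalues
manual_gof1 <- sum(pco$eig[1:2]) / sum(abs(pco$eig))
# GOF[2]: same numerator over the sum of positive eigenvalues only
manual_gof2 <- sum(pco$eig[1:2]) / sum(pmax(pco$eig, 0))

all.equal(manual_gof1, pco$GOF[1])  # TRUE
all.equal(manual_gof2, pco$GOF[2])  # TRUE
```

With non-Euclidean distances such as Manhattan, some eigenvalues can be negative, which is why the two GOF variants differ.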

Question 3

###Q3
# cbind() coerces everything to character here, so the axes are
# converted back to numeric when plotting
pointsPCO <- as.data.frame(cbind(schoolData$location, schoolPCO$points))
colnames(pointsPCO) <- c("location", "axis1", "axis2")
ggplot(pointsPCO, aes(x = as.numeric(axis1), y = as.numeric(axis2))) +
  geom_point() +
  labs(x = "Axis 1", y = "Axis 2") +
  facet_wrap(~location)

schoolCounts <- summarize(group_by(schoolData, location), numSchools = n())
schoolCounts
## # A tibble: 4 × 2
##   location      numSchools
##   <chr>              <int>
## 1 City                  38
## 2 Suburban              32
## 3 Town                  19
## 4 Village.Rural         20

There are visible differences between location types. City schools spread mainly along axis 1, while the more rural location types show relatively more spread along axis 2. It is also worth noting that we have considerably fewer observations for towns and villages than for cities and suburbs, which could contribute to the difference in the apparent spread of the data.
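One way to put numbers on the visual differences is to compare group centroids on the two ordination axes. A sketch using base R's `aggregate()`, with toy data standing in for `pointsPCO` (which is not reproduced here):

```r
# Toy data standing in for pointsPCO: two groups separated on axis 1
toy <- data.frame(
  location = rep(c("City", "Town"), each = 4),
  axis1 = c(2.1, 1.8, 2.4, 2.0, -0.5, -0.2, -0.8, -0.1),
  axis2 = c(0.1, -0.2, 0.0, 0.3, 1.2, 0.9, 1.5, 1.1)
)

# Mean position of each group on each axis
centroids <- aggregate(cbind(axis1, axis2) ~ location, data = toy, FUN = mean)
centroids
```

Applied to the real `pointsPCO`, separated centroids would support the visual impression of group differences before testing it formally in question four.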

Question 4

###Q4
schoolPermAn <- adonis2(schoolData[,2:12]~schoolData[,1], method="bray")
schoolPermAn
## Permutation test for adonis under reduced model
## Permutation: free
## Number of permutations: 999
## 
## adonis2(formula = schoolData[, 2:12] ~ schoolData[, 1], method = "bray")
##           Df SumOfSqs      R2      F Pr(>F)  
## Model      3  0.16805 0.07423 2.8063  0.018 *
## Residual 105  2.09594 0.92577                
## Total    108  2.26399 1.00000                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here we have a significant p-value (p = 0.018) for a difference between location groups. We saw some visual separation between groups in question three, and the PERMANOVA indicates that the group differences are larger than we would expect by chance. One caveat is that PERMANOVA can also be sensitive to differences in group dispersion, so some of the signal may reflect the differences in spread noted in question three rather than differences in group centroids alone.
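The permutation logic behind `adonis2()` can be sketched in base R for a one-way design: compute a pseudo-F statistic from the distance matrix, then compare it to its distribution under shuffled group labels. This is a simplification (vegan handles general model formulas, and question four uses Bray-Curtis rather than the Euclidean distances used here):

```r
# Minimal sketch of a one-way PERMANOVA on toy data
set.seed(42)
grp <- rep(c("a", "b"), each = 10)
Y <- matrix(rnorm(20 * 4), ncol = 4)
Y[grp == "b", ] <- Y[grp == "b", ] + 1  # shift group b's centroid

# Anderson's pseudo-F from squared pairwise distances
pseudo_F <- function(d, g) {
  d2 <- as.matrix(d)^2
  n <- length(g)
  SST <- sum(d2) / (2 * n)                 # total sum of squares
  SSW <- sum(sapply(split(seq_len(n), g),  # within-group sum of squares
                    function(idx) sum(d2[idx, idx]) / (2 * length(idx))))
  SSA <- SST - SSW                         # among-group sum of squares
  a <- length(unique(g))
  (SSA / (a - 1)) / (SSW / (n - a))
}

d <- dist(Y)
obs <- pseudo_F(d, grp)
perm <- replicate(999, pseudo_F(d, sample(grp)))  # shuffle labels
p_value <- (sum(perm >= obs) + 1) / 1000
p_value  # small, since the groups genuinely differ
```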

Question 5

###Q5
combined <- rowSums(schoolData[2:12])

cor(combined, as.numeric(pointsPCO$axis1))
## [1] 0.9977324
cor(combined, as.numeric(pointsPCO$axis2))
## [1] 0.0209863

The combined scores are very strongly correlated with axis 1 and very weakly correlated with axis 2. This would be a reasonable simplification if we had decided to use only one dimension to represent our data, but since we found two dimensions to be meaningful, this method would discard potentially important variability. Additionally, since axis 2 appears to be more prominent among rural schools, collapsing to the combined score could bias our representation towards the patterns seen in urban areas.
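The strong correlation with axis 1 is not a coincidence: when rows differ mainly along a single overall "size" gradient, the Manhattan distances between rows are roughly proportional to differences in row sums, so axis 1 of the PCoA tracks the row sums almost perfectly. A toy demonstration:

```r
# Rows differ mainly in overall magnitude, plus a little noise
set.seed(7)
X <- matrix(rep(1:20, 5), ncol = 5) + matrix(rnorm(100, sd = 0.1), ncol = 5)

pco <- cmdscale(dist(X, method = "manhattan"), k = 2, eig = TRUE)

# Axis 1 is nearly a linear function of the row sums
# (sign of the axis is arbitrary, so take the absolute correlation)
abs(cor(rowSums(X), pco$points[, 1]))  # close to 1
```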