Question 1

#Performing PCA and creating a scree plot
foodData <- read.csv("NHANES-DR.csv")
foodData_pca <- prcomp(foodData[-c(1:6)], center = T, scale. = T)
screeplot(foodData_pca, xlab = "Component", main = "Food Data Screeplot")

#Calculating the percent of the variability explained by the first four PCs
variability_explained <- foodData_pca$sdev^2
percent_variability <- (sum(variability_explained[1:4])/sum(variability_explained))*100
print(percent_variability)
## [1] 31.71472

I would select the first four principal components by following the elbow rule. It appears that after this,the variances level off almost completely. The first four principal components account for 31.71% of the variability.

Question 2

#Plotting the first two PCs
ggplot(foodData_pca$x, aes(x = PC1, y = PC2)) +
  geom_point(size = 2, alpha = 0.5) +
  labs(title = "PCA Score Plot",
       x = "PC1",
       y = "PC2")

#Extracting the most significant scores for PC1 and PC2, then looking at their values for the original data as well as those values' standard deviations from the mean
foodDataSD <- as.data.frame(scale(foodData))
slice(foodData, which.max(abs(foodData_pca$x[,1])))
##    SEQN RIAGENDR RIDAGEYR RIDRETH3 INDFMPIR DR1TNUMF F_CITMLB F_OTHER F_JUICE
## 1 64111        2       33        4     0.14       35        0    0.91    0.38
##   V_DRKGR V_REDOR_TOMATO V_REDOR_OTHER V_STARCHY_POTATO V_STARCHY_OTHER V_OTHER
## 1       0          0.805             0             4.66               0   1.995
##   V_LEGUMES G_WHOLE G_REFINED PF_MEAT PF_ORGAN PF_POULT PF_SEAFD_HI
## 1         0    0.53    23.685    2.88        0        0           0
##   PF_SEAFD_LOW PF_EGGS PF_SOY PF_NUTSDS PF_LEGUMES D_MILK D_YOGURT D_CHEESE
## 1            0    4.95   0.35     3.195          0   1.98        0     5.59
##      OILS SOLID_FATS ADD_SUGARS A_DRINKS
## 1 104.865     169.68     103.19        0
slice(foodDataSD, which.max(abs(foodData_pca$x[,1])))
##        SEQN  RIAGENDR   RIDAGEYR  RIDRETH3  INDFMPIR DR1TNUMF   F_CITMLB
## 1 -1.039456 0.9824542 0.04766291 0.3700793 -1.285242 4.222663 -0.3947041
##   F_OTHER     F_JUICE    V_DRKGR V_REDOR_TOMATO V_REDOR_OTHER V_STARCHY_POTATO
## 1 0.59203 -0.01081407 -0.4072184       1.911102    -0.4716314         10.86923
##   V_STARCHY_OTHER  V_OTHER  V_LEGUMES    G_WHOLE G_REFINED   PF_MEAT
## 1      -0.3777452 2.644576 -0.4333269 -0.2872391  5.620775 0.9310391
##      PF_ORGAN   PF_POULT PF_SEAFD_HI PF_SEAFD_LOW  PF_EGGS   PF_SOY PF_NUTSDS
## 1 -0.08509004 -0.7760222  -0.2113481   -0.3557993 7.147881 1.102103  2.070335
##   PF_LEGUMES   D_MILK   D_YOGURT D_CHEESE     OILS SOLID_FATS ADD_SUGARS
## 1 -0.4331311 1.081654 -0.3588566 6.581212 5.274667   6.626011   7.109581
##     A_DRINKS
## 1 -0.2852756
slice(foodData, which.max(abs(foodData_pca$x[,2])))
##    SEQN RIAGENDR RIDAGEYR RIDRETH3 INDFMPIR DR1TNUMF F_CITMLB F_OTHER F_JUICE
## 1 66075        1       69        6      2.1     19.5      0.7    0.08       0
##   V_DRKGR V_REDOR_TOMATO V_REDOR_OTHER V_STARCHY_POTATO V_STARCHY_OTHER V_OTHER
## 1   1.015              0          1.66                0               0   0.285
##   V_LEGUMES G_WHOLE G_REFINED PF_MEAT PF_ORGAN PF_POULT PF_SEAFD_HI
## 1     1.725   9.895     1.105       0        0        0       16.48
##   PF_SEAFD_LOW PF_EGGS PF_SOY PF_NUTSDS PF_LEGUMES D_MILK D_YOGURT D_CHEESE
## 1        1.395    1.16      0     1.335       6.96      0        0        0
##    OILS SOLID_FATS ADD_SUGARS A_DRINKS
## 1 74.49       2.74       2.53    0.165
slice(foodDataSD, which.max(abs(foodData_pca$x[,2])))
##         SEQN  RIAGENDR RIDAGEYR RIDRETH3    INDFMPIR  DR1TNUMF F_CITMLB
## 1 -0.3404591 -1.017723 1.529888 1.638294 -0.08891297 0.9675898 1.001518
##      F_OTHER    F_JUICE  V_DRKGR V_REDOR_TOMATO V_REDOR_OTHER V_STARCHY_POTATO
## 1 -0.6499705 -0.6116456 3.263571     -0.8449923      9.280361       -0.7252688
##   V_STARCHY_OTHER    V_OTHER V_LEGUMES  G_WHOLE G_REFINED    PF_MEAT
## 1      -0.3777452 -0.2502543  6.195277 7.945575  -1.32357 -0.7617063
##      PF_ORGAN   PF_POULT PF_SEAFD_HI PF_SEAFD_LOW  PF_EGGS     PF_SOY PF_NUTSDS
## 1 -0.08509004 -0.7760222    28.42708    0.8231485 1.125835 -0.2579996 0.6358148
##   PF_LEGUMES     D_MILK   D_YOGURT   D_CHEESE     OILS SOLID_FATS ADD_SUGARS
## 1   6.228568 -0.9401291 -0.3588566 -0.8726539 3.347155  -1.412866  -1.067524
##    A_DRINKS
## 1 -0.142923

The individuals with the most extreme value for PC1 and PC2 are observation number 64111 and 66075, respectively. Observation 64111 is a 33 year old white female with a family income to poverty ratio of 0.14. She seems to have the most significantly abnormal values for V_STARCHY_POTATO with SD=10.87, G_REFINED with SD=5.62, PF_SEAFD_LOW with SD=7.15, D_CHEESE with SD=6.58, OILS with SD=5.27, SOLID_FATS with SD=6.63, and ADD_SUGARS with SD=7.1. Some of these most notable consumption values come from the 4.66 cups of white potatoes, 5.59 cups of cheese, and 103.19 tsp of added sugars consumed.

Observation 66075 is a 69 year old asian male with a family income to poverty ratio of 2.1. He has the most abnormal values for V_REDOR_OTHER with SD=9.28, G_WHOLE with SD=7.95, PF_SEAFD_HI with SD=28.43, and PF_LEGUMES with SD=6.22. The most noticeable of these values by far comes from the 16.48oz of seafood high in n-3 fatty acids consumed.

Based on both the plot and these standard deviations, I would consider both of these points outliers. Our most extreme point for PC1 can be seen on the plot to be a considerable distance from all other points, and far from the spread of the vast majority of our data. A number of very significant standard deviations from our original variable means confirms that this individual is highly abnormal. Similarly, our extreme point for PC2 has one point scoring similarly on PC2, but otherwise is far from the rest of the data. In addition to this, the standard deviation for this observation’s “PF_SEAFD_HI” value is extremely high (28.43) and gives further evidence that this is an outlier.

Question 3

#Creating new data points with negative values by flipping the signs of some of our existing data values
existing_scores = as.data.frame(foodData_pca$x)
existing_scores$Type <- "Existing"
new_data <- -foodData[c(1 : 10), ]
new_scores <- as.data.frame(predict(foodData_pca, new_data))
new_scores$Type <- "New"

#Printing out illegal point scores
print(new_scores[, 1:4])
##          PC1        PC2         PC3        PC4
## 1  10.627635  0.3801105  2.00903859 -1.5250660
## 2   7.274084 -0.3649827 -0.07871487 -1.4138712
## 3   7.088933  0.3718255  0.99610706 -2.8546767
## 4   7.033425 -3.1768868  1.00635822  0.7358451
## 5   6.191414 -0.9891829  2.10191401 -1.5493542
## 6   4.729674 -0.1021937  0.85400490 -1.0103104
## 7   7.947578  0.7187352  1.75697555 -0.8766522
## 8   9.127800  0.5568375  1.70848467 -2.8038714
## 9   7.309081 -0.5584193  1.74225015 -4.3578476
## 10  6.296151 -0.2709154  1.45392101 -0.2016658
#Combining the old and new data points, then plotting them together in different colors
combined_pcs <- rbind(existing_scores, new_scores)
ggplot(combined_pcs, aes(x = PC1, y = PC2, color = Type)) +
  geom_point(size = 2, alpha = 0.5) +
  scale_color_manual(values = c("Existing" = "black", "New" = "red")) +
  labs(title = "PCA Score Plot",
       x = "PC1",
       y = "PC2",
       color = "Data Type")

From plotting a number of our original observations with flipped sign values, it does appear that the sharp wedge shape is a result of non- negativity constraints. All of the points with negative values show up outside of the area covered by our original data, which goes to show that the sign values are the factor constraining the points to that area of the plot.

Question 4

#Scaling the rotation matrix of for our first four PCs
scaled_pca <-list(rotation= foodData_pca$rotation[, 1:4], x=foodData_pca$x[, 1:4], sdev=foodData_pca$sdev[1:4]) 
for(i in 1:4){
  scaled_pca$x[,i]<-scaled_pca$x[,i]/scaled_pca$sdev[i];
  scaled_pca$rotation[,i]<-scaled_pca$rotation[,i]*scaled_pca$sdev[i]
}

#Performing a verimax rotation on our scaled PC rotation coefficients
rotated_pca <- varimax(scaled_pca$rotation)

#Printing out rounded versions of both the scaled and rotated matrices
print(round(scaled_pca$rotation, 3))
##                     PC1    PC2    PC3    PC4
## F_CITMLB         -0.051  0.290 -0.126  0.252
## F_OTHER          -0.041  0.420 -0.151  0.402
## F_JUICE          -0.072 -0.001 -0.070  0.181
## V_DRKGR          -0.090  0.429 -0.275 -0.116
## V_REDOR_TOMATO   -0.435  0.120 -0.022  0.073
## V_REDOR_OTHER    -0.036  0.417 -0.196  0.004
## V_STARCHY_POTATO -0.321 -0.143 -0.185 -0.290
## V_STARCHY_OTHER  -0.055  0.153 -0.082 -0.044
## V_OTHER          -0.334  0.434 -0.164 -0.191
## V_LEGUMES        -0.375  0.433  0.800 -0.078
## G_WHOLE          -0.111  0.418 -0.175  0.308
## G_REFINED        -0.728 -0.236  0.037  0.138
## PF_MEAT          -0.381 -0.122  0.023 -0.231
## PF_ORGAN         -0.008 -0.020  0.000 -0.086
## PF_POULT         -0.191 -0.008 -0.161 -0.178
## PF_SEAFD_HI      -0.015  0.303 -0.163 -0.094
## PF_SEAFD_LOW     -0.086  0.213 -0.178 -0.284
## PF_EGGS          -0.322 -0.003 -0.076 -0.141
## PF_SOY           -0.067  0.203 -0.130  0.129
## PF_NUTSDS        -0.261  0.337 -0.321  0.049
## PF_LEGUMES       -0.375  0.434  0.800 -0.078
## D_MILK           -0.122 -0.075  0.053  0.583
## D_YOGURT         -0.010  0.249 -0.023  0.359
## D_CHEESE         -0.556 -0.317  0.016  0.226
## OILS             -0.676  0.186 -0.310 -0.191
## SOLID_FATS       -0.742 -0.377 -0.024  0.228
## ADD_SUGARS       -0.532 -0.347 -0.097  0.047
## A_DRINKS         -0.187  0.011 -0.047 -0.393
print(round(rotated_pca$loadings, 3))
## 
## Loadings:
##                  PC1    PC2    PC3    PC4   
## F_CITMLB                 0.253         0.318
## F_OTHER                  0.335         0.495
## F_JUICE                                0.178
## V_DRKGR                  0.529              
## V_REDOR_TOMATO   -0.371  0.196  0.152  0.102
## V_REDOR_OTHER            0.440         0.110
## V_STARCHY_POTATO -0.350  0.119 -0.106 -0.307
## V_STARCHY_OTHER          0.186              
## V_OTHER          -0.168  0.552  0.158       
## V_LEGUMES                       0.982       
## G_WHOLE                  0.382         0.405
## G_REFINED        -0.761         0.128       
## PF_MEAT          -0.375               -0.252
## PF_ORGAN                                    
## PF_POULT         -0.188  0.160        -0.168
## PF_SEAFD_HI              0.349              
## PF_SEAFD_LOW             0.343        -0.218
## PF_EGGS          -0.299  0.144        -0.133
## PF_SOY                   0.216         0.179
## PF_NUTSDS        -0.172  0.484         0.141
## PF_LEGUMES                      0.982       
## D_MILK           -0.175 -0.183         0.547
## D_YOGURT                 0.135         0.409
## D_CHEESE         -0.639 -0.176         0.146
## OILS             -0.590  0.510        -0.126
## SOLID_FATS       -0.837 -0.159         0.136
## ADD_SUGARS       -0.628 -0.110              
## A_DRINKS         -0.147  0.162        -0.376
## 
##                  PC1   PC2   PC3   PC4
## SS loadings    3.110 2.170 2.045 1.552
## Proportion Var 0.111 0.078 0.073 0.055
## Cumulative Var 0.111 0.189 0.262 0.317

From comparing the scaled loadings with the rotated ones, it appears that the rotated values are more interpretable. Loading scores in the rotated matrix are genreally similar, but with slightly more extreme values than the scaled ones. We can look at a couple high loadings as examples. G_REFINED’s loading on PC1 goes from 0.728 to 0.761, V_OTHER’s loading on PC2 goes from 0.434 to 0.552, PF_LEGUMES’s loading on PC3 goes from 0.800 to 0.982, and G_WHOLE’s loading on PC4 goes from 0.308 to 0.405. There are a few instances of loadings becoming less significant, such as D_MILK’s loading on PC4 going from 0.583 to 0.547, but these are less common and have generally smaller effects than the other way around.

The most interpretable PCs are 1 and 3. PC1 has high loadings for refined grains, high fat foods, and added sugar, and appears to correspond to a typical unhealthy and highly processed diet. PC3 has extremely high loadings for both categories of legumes and seems to correspond directly to legume consumption.

There would not be any benefit to reflecting components as the only thing this would change is the sign of the loadings.

Question 5

#Performing LDA on the data
foodData_lda <- lda(RIDRETH3~., data = foodData[-c(1, 2, 3, 5, 6)])

#Performing 10 fold cross validation
hold<-sample(1:10, nrow(foodData), replace=TRUE)
predictions<-matrix(NA, ncol=ncol(foodData_lda$scaling), nrow=nrow(foodData))
for(i in 1:10){
  train_data <- foodData[hold != i,]
  test_data <-foodData[hold == i,]
  mod<-lda(RIDRETH3~.,data=train_data)
 for(j in 1:ncol(foodData_lda$scaling)){
  mod.predict<-predict(mod, newdata=test_data, dimen=j)
  predictions[hold==i,j]<-mod.predict$class
 }
}

#Summing up and printing the number of correct predictions using each number of LD components
for(k in 1:ncol(predictions)){
  print(sum(predictions[,k]==foodData$RIDRETH3, na.rm = TRUE))
}
## [1] 2390
## [1] 2791
## [1] 2892
## [1] 2890
## [1] 2902
#Using the decided upon three dimensional model to predict LD scores for our data
predicted_scores <- predict(foodData_lda, newdata=foodData, dimen=3)

#Producing a correlation matrix between scores for useful components and original foods
round(cor(foodData[-c(1:6)], predicted_scores$x), 2)
##                    LD1   LD2   LD3
## F_CITMLB         -0.17 -0.13 -0.10
## F_OTHER          -0.22 -0.13 -0.03
## F_JUICE           0.06  0.44  0.13
## V_DRKGR          -0.09 -0.05 -0.54
## V_REDOR_TOMATO   -0.02 -0.21  0.22
## V_REDOR_OTHER    -0.12 -0.18 -0.26
## V_STARCHY_POTATO  0.33 -0.09 -0.07
## V_STARCHY_OTHER   0.00  0.17  0.03
## V_OTHER          -0.20 -0.24 -0.17
## V_LEGUMES        -0.39  0.08  0.30
## G_WHOLE          -0.01 -0.28 -0.34
## G_REFINED        -0.26 -0.06  0.21
## PF_MEAT          -0.04 -0.13  0.04
## PF_ORGAN          0.00  0.05  0.04
## PF_POULT          0.13  0.41 -0.08
## PF_SEAFD_HI      -0.11  0.03 -0.23
## PF_SEAFD_LOW     -0.05  0.17 -0.32
## PF_EGGS          -0.04  0.02  0.14
## PF_SOY           -0.17 -0.11 -0.31
## PF_NUTSDS         0.09 -0.31 -0.18
## PF_LEGUMES       -0.39  0.08  0.30
## D_MILK           -0.27 -0.36  0.26
## D_YOGURT         -0.23 -0.19 -0.07
## D_CHEESE          0.11 -0.25  0.46
## OILS              0.20 -0.10 -0.17
## SOLID_FATS        0.25 -0.23  0.32
## ADD_SUGARS        0.45 -0.18  0.22
## A_DRINKS          0.20 -0.23 -0.06
#Making a confusion matrix for the chosen number of components
table(predicted_scores$class, foodData$RIDRETH3)
##    
##        1    2    3    4    6    7
##   1  237  136  101   64   67   17
##   2    1    3    0    3    1    0
##   3  485  312 1620  753  337  152
##   4  261  293  569 1170  224  113
##   6   57   31  111   74  281   13
##   7    0    0    0    0    0    0
# Generating a pairs plot of scores with a legend
unique_groups <- unique(foodData$RIDRETH3)
colors <- rainbow(length(unique_groups))
color_map <- setNames(colors, unique_groups)
pairs(predicted_scores$x, col = color_map[as.character(foodData$RIDRETH3)], pch = 19, main = "LDA Scores Plots")
legend("topright", legend = sort(unique_groups), col = colors, pch = 19, title = "Colors By Class", cex = 0.7, xpd = TRUE, horiz = TRUE, inset = c(0.02, -0.05))

I’ve chosen to select the first three components because the number of correct classifications levels off quickly after past three.

It appears that dark fruit juices, green vegetables, both types of legumes, poultry, cheese and added sugars are some of categories that help the most in classification since these have some of the highest correlations with LD components.

The confusion matrix is not very convincing as far as our ability to classify successfully with three components. We see relatively high correct classification rates for a couple groups, but extremely low rates for all other. Notably, we see a significant under representation of groups 2 and 7. This tells us that our model represents trends only.