INTRO
San Bernardino and the surrounding Southern California counties currently do not meet the ozone requirements set by the National Ambient Air Quality Standards (NAAQS). In The American Lung Association’s 2015 report “State of the Air”, San Bernardino also tops the list of most ozone-polluted counties. Exposure to ozone levels above the standards can cause significant harmful health effects in both adults and children. Air Quality Index (AQI) forecasts, provided by the US Environmental Protection Agency (EPA), could play a role in a local air quality management system. Here, we use Linear Discriminant Analysis, Quadratic Discriminant Analysis, Logistic Regression, and Principal Component Analysis to classify daily AQI from meteorological conditions.
DATA
Meteorological data are provided by the University of California Statewide Integrated Pest Management Program (UC IPM). AQI data are based on daily maximum ozone records (in ppb) collected at the monitoring station in Upland, CA from January 2001 to December 2004. They are classified into five stages, details of which may be viewed here.
The meteorological variables are as follows:
- Daily maximum air temperature (Fahrenheit),
- Daily reference evapotranspiration (inches),
- Daily total precipitation (inches),
- Daily maximum relative humidity (%),
- Daily maximum soil temperature (Fahrenheit),
- Daily global radiation (Watts/m²),
- Daily west-east wind (m/s), and
- Daily south-north wind (m/s).
Additionally, we have data on the day of the week and the month of the year.
The data are split into two sets, training (2001-2003) and validation (2004), with the former having 1095 observations and the latter 366.
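The split itself is a simple year-based subset; a minimal sketch, assuming the raw data frame oz1 carries a year field (not shown in the original code):

train.idx    <- oz1$year <= 2003     # hypothetical 'year' column
oz.train.raw <- oz1[train.idx, ]     # 2001-2003: 1095 daily observations
oz.valid.raw <- oz1[!train.idx, ]    # 2004: 366 daily observations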
We first view the data and check for any missing values. We impute using k-nearest neighbor.
library(Hmisc)   # describe() is assumed to come from Hmisc here
library(Amelia)  # missmap()
library(DMwR)    # knnImputation()

str(oz1); describe(oz1); head(oz1)

# Check for missing values
missmap(oz1, main = "Missing Values vs. Observed")

# Impute via k-nearest neighbors (the exclusion of a "medv" column is kept
# from the original code; it has no effect if no such column exists)
oz <- knnImputation(oz1[, !names(oz1) %in% "medv"])
# Month vs. Max Temperature plot
# (the original used 'col = family', which is undefined; colouring by AQI
#  class here assumes an aqi column in oz)
plot(oz$temp.max, oz$month,
     xlab = "Max Temperature", ylab = "Month",
     pch = 19, col = as.numeric(as.factor(oz$aqi)))
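The chi-squared tests that follow operate on a data frame of indicator variables, oz_dummy, whose construction is not shown. A minimal sketch, assuming the month and weekend groupings implied by the variable names used later (weekend, may.june, jul.aug, sep.oct, nov.apr):

# Hypothetical construction of the indicator data frame; the groupings are
# inferred from the variable names used in the analysis below
oz_dummy <- data.frame(
  aqi      = oz$aqi,                                           # assumes an aqi column in oz
  weekend  = as.integer(oz$day %in% c("Saturday", "Sunday")),  # assumes a day-of-week column
  may.june = as.integer(oz$month %in% c(5, 6)),                # assumes month stored as 1-12
  jul.aug  = as.integer(oz$month %in% c(7, 8)),
  sep.oct  = as.integer(oz$month %in% c(9, 10)),
  nov.apr  = as.integer(oz$month %in% c(11, 12, 1, 2, 3, 4))
)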

# Chi-squared Test of Independence
# Subset all binary variables
n <- ncol(oz_dummy)
for (i in 2:n) {
  print(i)
  print(table(oz_dummy$aqi, oz_dummy[, i]))
  print(chisq.test(table(oz_dummy$aqi, oz_dummy[, i])))
}
# independent of AQI: column 5 (sep.oct), so drop it
oz_dummy2 <- data.frame(oz_dummy[, -5])
library(car)  # scatterplotMatrix()

# Scatterplot matrix of the numeric variables
oz_integer <- oz[, sapply(oz, is.numeric)]
scatterplotMatrix(oz_integer, by.groups = TRUE)

library(MVN)  # mardiaTest(), hzTest() (older versions of the MVN package)

# Multivariate tests of normality
result <- mardiaTest(oz_integer, qqplot = TRUE); result
#    Mardia's Multivariate Normality Test
# ---------------------------------------
#    data : oz_integer
#
#    g1p            : 18.80226
#    chi.skew       : 3431.412
#    p.value.skew   : 0
#
#    g2p            : 90.26179
#    z.kurtosis     : 40.18335
#    p.value.kurt   : 0
#
#    chi.small.skew : 3443.169
#    p.value.small  : 0
#
#    Result : Data are not multivariate normal.
# ---------------------------------------

result2 <- hzTest(oz_integer, qqplot = TRUE); result2
#   Henze-Zirkler's Multivariate Normality Test
# ---------------------------------------------
#   data : oz_integer
#
#   HZ      : 7.947721
#   p-value : 0
#
#   Result  : Data are not multivariate normal.
# ---------------------------------------------
Methods
Principal Component Analysis
Now we attempt to reduce the dimensionality of our data, or to construct an alternative set of model inputs, through principal component analysis. The correlation plot is given below; further below, the principal components are shown in the output.
oz.pca <- prcomp(oz_integer, center = TRUE, scale. = TRUE)
print(oz.pca)
# Standard deviations:
# [1] 1.8966949 1.1832895 1.0585143 0.6338274 0.6135414 0.2887376 0.1427663
#
# Rotation:
#                   PC1         PC2         PC3         PC4        PC5         PC6         PC7
# temp.max  0.456529683  0.25568689  0.15095961 -0.37710262 -0.3230251 -0.65990989 -0.14623053
# soil.max  0.477657559 -0.14445429  0.07664501 -0.36133000 -0.3689124  0.68995291 -0.04910183
# solar     0.485081705  0.00856376 -0.03101141  0.23878958  0.5678974  0.05417936 -0.61741180
# eto       0.496099306  0.17738651 -0.07926773  0.09301567  0.3579506 -0.00323261  0.76114723
# rh.max   -0.006794696 -0.78700168  0.04047397 -0.45953746  0.3328360 -0.22080303  0.09075007
# wind.u   -0.268173138  0.44925866  0.50421763 -0.50454770  0.4270235  0.18667690 -0.01576933
# wind.v   -0.100210396  0.24700782 -0.84155695 -0.44251420  0.1256715  0.04399794 -0.08472875
The biplot below shows the observations encircled by group, along with the directions of the variables that carry significant weight in each PC.
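The figure itself is not reproduced here, but a biplot of this kind can be drawn directly from the prcomp object. One possibility, using the ggbiplot package and colouring by AQI class (both assumptions; the original plotting code is not shown):

library(ggbiplot)  # assumed plotting package; not part of the original code

ggbiplot(oz.pca,
         groups  = as.factor(oz$aqi),  # colour/encircle observations by AQI class (assumes an aqi column)
         ellipse = TRUE,               # ellipse around each class
         circle  = TRUE)               # unit correlation circle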
Analyzing the PCs, we decided the best course of action was simply to drop the wind.u variable, since it was highly correlated with rh.max, solar, and the ozone-season month variables. The wind.v variable, which was not as highly correlated with the other variables as wind.u, serves a similar purpose. Using PCA for data reduction thus allowed us to drop one variable.
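The correlations motivating this choice can be checked directly from the numeric variables; a quick sketch:

# Correlations of the two wind components with the remaining numeric variables
round(cor(oz_integer)[, c("wind.u", "wind.v")], 2)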
Logistic Regression
Multinomial logistic regression is used when the outcome variable has more than two nominal levels, as in this case.
After experimenting with the dummy variables, we noticed that the interaction of eto with weekend, and of rh.max with jul.aug, provided the best classification rates, giving a 15.8% misclassification rate on the training data.
library(nnet)  # multinom()

oz_combined <- data.frame(oz_dummy2, oz_integer)
oz_combined$aqi <- relevel(as.factor(oz_combined$aqi), ref = "2")  # AQI class 2 as the reference level

test <- multinom(aqi ~ rh.max + jul.aug + soil.max + wind.v + temp.max + eto +
                   weekend + solar + may.june + rh.max*jul.aug + eto*weekend,
                 data = oz_combined)
exp(coef(test))
summary(test)
# weights: 56 (39 variable)
# final value 404.957928
# stopped after 100 iterations
# Residual Deviance: 809.9159
# AIC: 887.9159

# p-values
#   (Intercept)       rh.max jul.aug     soil.max       wind.v temp.max eto weekend solar    nov.apr
# 3           0 2.450356e-05       0 1.632255e-02 0.0001363624        0   0       0     0 0.48402314
# 4           0 4.985382e-07       0 5.406290e-03 0.0366101111        0   0       0     0 0.01763019
# 5           0 1.993009e-07       0 1.307147e-08 0.0005853189        0   0       0     0 0.00000000
#      may.june rh.max:jul.aug eto:weekend
# 3 0.039853772              0           0
# 4 0.003565009              0           0
# 5 0.000739145              0           0

# Classify the training set (2001-2003)
test2 <- predict(test, newdata = oz_train)
table.log <- table(test2, oz_train[, 1])
table.log
# test2   2   3   4   5
#     2 786  49   7   1
#     3  32  76  25  11
#     4   2  12  19   8
#     5   0   8  18  41
sum(diag(table.log)) / sum(table.log)
# [1] 0.8420091
Testing the model on the validation set, we achieve a classification rate of 85.8%.
test.predict <- predict(test, newdata = oz_test)
table.log.test <- table(test.predict, oz_test[, 1])
table.log.test
# test.predict   2   3   4   5
#            2 273  13   2   1
#            3  11  31   7   4
#            4   0   3   3   2
#            5   2   4   3   7
sum(diag(table.log.test)) / sum(table.log.test)
# [1] 0.8579235
Results
Though the data are not multivariate normal, classification models that rely on these assumptions (LDA and QDA) were still built to verify and compare the accuracy of each approach. Table 5 displays the results. The apparent error rate (AER) is the overall misclassification rate of a model when classifying the same data that were used to build it. To compute the expected error rate, E(AER), we also used the training data, but through leave-one-out cross-validation: the model parameters are estimated from all observations but one, the fitted model classifies the held-out observation, and the predicted classifications are compared with the actual AQI classes to compute the error.
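As a concrete illustration of the two error rates, MASS::lda can produce both directly; a minimal sketch, assuming oz_train carries the AQI factor in a column named aqi together with the predictors used above:

library(MASS)  # lda()

# Apparent error rate (AER): resubstitution on the training data
fit.lda <- lda(aqi ~ ., data = oz_train)
aer     <- mean(predict(fit.lda, oz_train)$class != oz_train$aqi)

# Expected error rate E(AER): leave-one-out cross-validation via CV = TRUE
fit.loo <- lda(aqi ~ ., data = oz_train, CV = TRUE)
e.aer   <- mean(fit.loo$class != oz_train$aqi)

c(AER = aer, E.AER = e.aer)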
The principal component scores were also used as inputs to multinomial logistic regression. These models performed slightly better on the training data, but were less successful on the validation data. Logistic regression on the untransformed variables performed best, with an AER of 15.8% on the training data and 14.2% on the validation data. Note that the model predicted better on the validation data than on the training data, which indicates we were successful in avoiding over-fitting and that the model generalizes well. Logistic regression likely performed best because it is not sensitive to the assumptions of normality and homogeneity of variance-covariance structure. The variables used in the final model may be seen in the R output above.
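For reference, the PCA-based model can be fit by attaching the component scores from oz.pca to the response; a minimal sketch (the number of retained components is illustrative, and the model is fit on all observations for simplicity):

# Multinomial logistic regression on principal-component scores
pc.scores     <- as.data.frame(oz.pca$x[, 1:5])  # first five PCs: an assumption
pc.scores$aqi <- oz_combined$aqi
fit.pca.mlr   <- multinom(aqi ~ ., data = pc.scores)
mean(predict(fit.pca.mlr) != pc.scores$aqi)      # apparent error rate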
Conclusion
Utilizing Principal Component Analysis and, among other techniques, multinomial logistic regression, a classification model for AQI states was built from meteorological data. The apparent error rate on the validation data was 14.2%. However, the misclassification rates for our categories of interest, AQI states 4 and 5 (states in which people’s health is at risk), were high: 62.5% and 56.25%, respectively.
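Per-class misclassification rates of this kind are computed as one minus the fraction of each true class that is classified correctly; a minimal sketch using the validation confusion matrix from above:

# Misclassification rate per true AQI class (columns of the table are the true classes)
per.class.error <- 1 - diag(table.log.test) / colSums(table.log.test)
round(per.class.error, 3)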
Further methods that could lower the apparent error rate and improve classification accuracy include decision trees and random forests, since they provide intuitive decision rules and implicitly account for interactions among variables, which may be key. Additionally, since the data are collected at successive time intervals, incorporating time-series components into the model would be appropriate.
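As a starting point for that follow-up work, a random forest on the same split might look like the following sketch (not part of the original analysis; it assumes oz_train and oz_test carry the AQI factor in a column named aqi, and uses default hyper-parameters):

library(randomForest)

set.seed(42)                                      # reproducibility
fit.rf <- randomForest(aqi ~ ., data = oz_train,  # classification forest on the 2001-2003 data
                       importance = TRUE)

rf.pred <- predict(fit.rf, newdata = oz_test)     # classify the 2004 validation days
mean(rf.pred != oz_test$aqi)                      # validation misclassification rate
varImpPlot(fit.rf)                                # which meteorological variables matter most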