Classifying Daily Air Quality

INTRO

San Bernardino and surrounding Southern California counties currently do not meet the ozone requirements set by the National Ambient Air Quality Standards (NAAQS). Additionally, in the American Lung Association's 2015 report "State of the Air", San Bernardino tops the list of most ozone-polluted counties. Exposure to ozone levels above these standards can cause significant harmful health effects in both adults and children. Air Quality Index (AQI) forecasts, provided by the US Environmental Protection Agency (EPA), could play a role in a local air quality management system. Here, we use Linear Discriminant Analysis, Quadratic Discriminant Analysis, Logistic Regression, and Principal Component Analysis to classify daily AQI based on various meteorological conditions.

 

DATA

Meteorological data are provided by the University of California Statewide Integrated Pest Management Program (UC-IPMP). AQI data are based on daily maximum ozone records (in ppb), collected from the monitoring station in Upland, CA from January 2001 to December 2004. They are classified into five stages according to the EPA's AQI categories.

The meteorological variables are as follows:

  • Daily maximum air temperature (Fahrenheit),
  • Daily reference evapotranspiration (inches),
  • Daily total precipitation (inches),
  • Daily maximum relative humidity (%),
  • Daily maximum soil temperature (Fahrenheit),
  • Daily global radiation (W/m²),
  • Daily west-east wind (m/s), and
  • Daily south-north wind (m/s).

Additionally, we have data on the day of the week, and month of the year.

The data are split into two sets, training (2001-2003) and validation (2004), with the former having 1095 observations and the latter 366.
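The year-based split can be sketched as follows; the `date` column name is an assumption about the raw data layout, but the date range reproduces the 1095/366 observation counts above.

```r
# Split daily records by year: 2001-2003 for training, 2004 for validation.
# The "date" column name is illustrative; the real data may differ.
oz1 <- data.frame(date = seq(as.Date("2001-01-01"), as.Date("2004-12-31"), by = "day"))
oz1$year <- as.integer(format(oz1$date, "%Y"))

oz_train_raw <- oz1[oz1$year <= 2003, ]  # 1095 observations
oz_test_raw  <- oz1[oz1$year == 2004, ]  # 366 observations (2004 is a leap year)
```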

We first view the data and check for any missing values, which we impute using k-nearest neighbors.

str(oz1); describe(oz1); head(oz1)
# Check for missing values (Amelia::missmap)
missmap(oz1, main = "Missing Values vs. Observed")
# Impute via k-nearest neighbors (DMwR::knnImputation)
oz <- knnImputation(oz1)
Based on the patterns in the temperature and month plot below (where January = 1, …), we decided to group the months variable into four different dummy variables: Nov-Apr, May-June, July-Aug, and Sep-Oct.
# Month vs. max temperature plot, colored by AQI class
plot(oz$temp.max, oz$month, xlab = "Max Temperature", ylab = "Month",
     pch = 19, col = oz$aqi)
[Figure: daily maximum temperature by month]
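The month grouping can be sketched with indicator columns; the numeric `month` vector (January = 1) is assumed here, and the names mirror the dummy variables used later (nov.apr, may.june, jul.aug, sep.oct).

```r
# Build season dummy variables from a numeric month value (January = 1)
month <- 1:12  # illustrative month values

oz_dummy_seasons <- data.frame(
  nov.apr  = as.integer(month %in% c(11, 12, 1, 2, 3, 4)),
  may.june = as.integer(month %in% c(5, 6)),
  jul.aug  = as.integer(month %in% c(7, 8)),
  sep.oct  = as.integer(month %in% c(9, 10))
)
```

Each month falls in exactly one group, so the four indicators sum to one per row.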
To see whether any of these month dummy variables, along with the weekend/weekday dummy variable, were significant, we conduct a chi-square test of independence between each variable and the AQI classification. We find that three of the month variables and the weekend variable are significant; that is, given the low p-values we may reject the null hypothesis and conclude there may exist some predictive association between each variable and the AQI state.
# Chi-squared test of independence for each binary (dummy) variable
n <- ncol(oz_dummy)
for (i in 2:n) {
  print(i)
  print(table(oz_dummy$aqi, oz_dummy[, i]))
  print(chisq.test(table(oz_dummy$aqi, oz_dummy[, i])))
}
# Independent: column 5 (sep.oct), so drop it
oz_dummy2 <- data.frame(oz_dummy[, -5])
Now, observing the continuous variables in a scatterplot matrix, we note that we are far from multivariate normality. Because of this, linear and quadratic discriminant analysis will be of limited use to us. From Wikipedia, "Quadratic discriminant analysis (QDA) is closely related to linear discriminant analysis (LDA), where it is assumed that the measurements from each class are normally distributed. Unlike LDA however, QDA has no assumption that the covariance of each of the classes is identical." To confirm the lack of normality we utilize the multivariate tests available in MVN: An R Package for Assessing Multivariate Normality.
# Scatterplot Matrix
oz_integer <- oz[,sapply(oz, is.numeric)]
scatterplotMatrix(oz_integer, by.groups = T)
[Figure: scatterplot matrix of the continuous variables]
# Multivariate tests of normality.
result <- mardiaTest(oz_integer, qqplot = T); result
# Mardia's Multivariate Normality Test
#---------------------------------------
#   data : oz_integer
#
#   g1p            : 18.80226
#   chi.skew       : 3431.412
#   p.value.skew   : 0
#
#   g2p            : 90.26179
#   z.kurtosis     : 40.18335
#   p.value.kurt   : 0
#
#   chi.small.skew : 3443.169
#   p.value.small  : 0
#
#   Result : Data are not multivariate normal.
#---------------------------------------
result2 <- hzTest(oz_integer, qqplot = T); result2
# Henze-Zirkler's Multivariate Normality Test
#---------------------------------------------
# data : oz_integer
#
# HZ : 7.947721
# p-value : 0
#
# Result : Data are not multivariate normal.
#---------------------------------------------

 

Methods

Principal Component Analysis

Now we attempt to reduce the dimensionality of our data, or possibly construct another avenue for model inputs, through principal component analysis. The correlation plot is given below; the principal components appear in the output that follows.

[Figure: correlation plot of the continuous variables]

oz.pca <- prcomp(oz_integer, center = TRUE, scale. = TRUE)
print(oz.pca)
#
#Standard deviations:
#[1] 1.8966949 1.1832895 1.0585143 0.6338274 0.6135414 0.2887376 0.1427663
#
#Rotation:
#                  PC1         PC2         PC3         PC4        PC5         PC6         PC7
#temp.max  0.456529683  0.25568689  0.15095961 -0.37710262 -0.3230251 -0.65990989 -0.14623053
#soil.max  0.477657559 -0.14445429  0.07664501 -0.36133000 -0.3689124  0.68995291 -0.04910183
#solar     0.485081705  0.00856376 -0.03101141  0.23878958  0.5678974  0.05417936 -0.61741180
#eto       0.496099306  0.17738651 -0.07926773  0.09301567  0.3579506 -0.00323261  0.76114723
#rh.max   -0.006794696 -0.78700168  0.04047397 -0.45953746  0.3328360 -0.22080303  0.09075007
#wind.u   -0.268173138  0.44925866  0.50421763 -0.50454770  0.4270235  0.18667690 -0.01576933
#wind.v   -0.100210396  0.24700782 -0.84155695 -0.44251420  0.1256715  0.04399794 -0.08472875

Below we can see the PCs encircling the observations, and the directions of the significant weights for each PC.

Analyzing the PCs, we decided the best course of action was simply to drop the wind.u variable, since it was highly correlated with rh.max, solar, and the ozone-season month variables. The wind.v variable, which was not as highly correlated with the other variables, served a similar purpose. Using PCA for data reduction thus allowed us to drop one variable.
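Dropping the variable is a one-liner; the small frame below only stands in for `oz_integer`'s column layout.

```r
# Stand-in frame with the same column names as oz_integer (values illustrative)
oz_integer <- data.frame(temp.max = 95, soil.max = 90, solar = 600, eto = 0.2,
                         rh.max = 70, wind.u = -1.5, wind.v = 0.8)

# Drop wind.u from the continuous predictors, as suggested by the PCA
oz_integer2 <- oz_integer[, setdiff(names(oz_integer), "wind.u")]
```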

[Figure: PCA biplot of the observations and variable loadings]

 

Logistic Regression
Multinomial logistic regression is used when the outcome variable has more than two nominal levels, as in this case.

After experimenting with the dummy variables, we noticed that the interactions of eto with weekend, and of rh.max with Jul-Aug, provided the best classification rates, resulting in a 15.8% misclassification rate on the training data.

oz_combined <- data.frame(oz_dummy2, oz_integer)
oz_combined$aqi <- as.factor(oz_combined$aqi)

# Set AQI state 2 as the reference level, then fit the multinomial model
# (nnet::multinom takes the reference level from the factor itself)
oz_combined$aqi <- relevel(oz_combined$aqi, ref = "2")
test <- multinom(aqi ~ rh.max + jul.aug + soil.max + wind.v +
                   temp.max + eto + weekend + solar +
                   may.june + rh.max:jul.aug + eto:weekend,
                 data = oz_combined)
exp(coef(test))
summary(test)
# weights: 56 (39 variable)
# final value 404.957928
# stopped after 100 iterations

#Residual Deviance: 809.9159
#AIC: 887.9159

# p-values
#   (Intercept)       rh.max jul.aug     soil.max       wind.v temp.max eto weekend solar    nov.apr
# 3           0 2.450356e-05       0 1.632255e-02 0.0001363624        0   0       0     0 0.48402314
# 4           0 4.985382e-07       0 5.406290e-03 0.0366101111        0   0       0     0 0.01763019
# 5           0 1.993009e-07       0 1.307147e-08 0.0005853189        0   0       0     0 0.00000000
#      may.june rh.max:jul.aug eto:weekend
# 3 0.039853772              0           0
# 4 0.003565009              0           0
# 5 0.000739145              0           0

# Classify the training set
test2 <- predict(test, newdata = oz_train)
table.log <- table(test2, oz_train[, 1])
table.log

# test2   2   3   4   5
#     2 786  49   7   1
#     3  32  76  25  11
#     4   2  12  19   8
#     5   0   8  18  41
sum(diag(table.log)) / sum(table.log)
# [1] 0.8420091

Testing the model on the validation set, we achieve a classification rate of 85.8%.

test.predict <- predict(test, newdata = oz_test)
table.log.test <- table(test.predict, oz_test[, 1])
table.log.test
#
# test.predict   2   3   4   5
#            2 273  13   2   1
#            3  11  31   7   4
#            4   0   3   3   2
#            5   2   4   3   7
sum(diag(table.log.test)) / sum(table.log.test)
# [1] 0.8579235

 

Results

Though the data are not multivariate normal, classification models that rely on these assumptions were still built to verify and compare the accuracy of each approach. Table 5 displays the results. The apparent error rate (AER) is the overall misclassification rate of a model when classifying the same data used to build it. To compute the expected error rate (E(AER)), we also used the training data, but through leave-one-out cross-validation: the model parameters are estimated from all observations minus one, the model then classifies the left-out observation, and the predicted classifications are compared with the actual AQI classes to compute the error.
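The AER/E(AER) computation can be sketched as below using `lda` from MASS, which supports leave-one-out cross-validation directly via `CV = TRUE`; the data here are synthetic stand-ins, not the ozone set.

```r
library(MASS)

# Synthetic stand-in: two meteorological-style predictors, three classes
set.seed(1)
df <- data.frame(aqi = factor(rep(1:3, each = 50)),
                 x1  = rnorm(150, mean = rep(c(0, 2, 4), each = 50)),
                 x2  = rnorm(150, mean = rep(c(0, 1, 2), each = 50)))

# AER: classify the same data used to fit the model
fit <- lda(aqi ~ x1 + x2, data = df)
aer <- mean(predict(fit)$class != df$aqi)

# E(AER): leave-one-out CV -- each observation is classified by a model
# fitted to the remaining n - 1 observations
fit_cv <- lda(aqi ~ x1 + x2, data = df, CV = TRUE)
e_aer  <- mean(fit_cv$class != df$aqi)
```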

[Table 5: error rates for each classification method]

The principal components were used as inputs to multinomial logistic regression; these performed slightly better on the training data but were less successful on the validation data. Logistic regression using the non-transformed data performed best, with an AER of 15.8% on the training data and 14.2% on the validation data. Note that the model predicted better on the test data than on the training data, indicating that we avoided over-fitting and that the model generalizes well. Logistic regression likely performed best because it is not sensitive to assumptions of normality or homogeneity of variance-covariance structures. The variables used in the final model appear in the R output above.

 

Conclusion

Utilizing Principal Component Analysis and, among other techniques, multinomial logistic regression, we built a classification model for AQI states from meteorological data. The apparent error rate on the validation data was 14.2%. However, the misclassification rates for our categories of interest, AQI states 4 and 5 (states in which people's health is at risk), were high: 62.5% and 56.25%, respectively.
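The class-specific rates quoted above can be recovered from the validation confusion matrix printed earlier (`table.log.test`); the counts below are copied from that output, with error computed along the rows of the printed table.

```r
# Validation confusion matrix from the multinomial logit section
# (rows = predicted AQI state, columns = actual AQI state)
cm <- matrix(c(273, 13, 2, 1,
                11, 31, 7, 4,
                 0,  3, 3, 2,
                 2,  4, 3, 7),
             nrow = 4, byrow = TRUE,
             dimnames = list(predicted = 2:5, actual = 2:5))

# Per-state error rates: states 4 and 5 come out to 62.5% and 56.25%
err <- 1 - diag(cm) / rowSums(cm)
round(err, 4)
```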

Further methods that could lower the error rate and improve classification accuracy include decision trees and random forests, since they provide intuitive decision rules and implicitly account for interactions among variables, which may be key. Additionally, since the data are collected at successive time intervals, incorporating time-series components into the model would be appropriate.
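A single classification tree (the building block of a random forest) can be sketched with `rpart`, which ships with R; the data below are synthetic stand-ins, not the ozone set.

```r
library(rpart)

# Synthetic stand-in: two predictors and a deterministic 3-level class
set.seed(2)
df <- data.frame(temp.max = rnorm(300, 85, 10),
                 rh.max   = rnorm(300, 60, 15))
df$aqi <- factor(ifelse(df$temp.max > 90, "high",
                 ifelse(df$rh.max > 70, "moderate", "low")))

# Fit a classification tree; the splits are the intuitive decision rules
# mentioned above, and nested splits capture interactions implicitly
fit  <- rpart(aqi ~ temp.max + rh.max, data = df, method = "class")
pred <- predict(fit, type = "class")
mean(pred != df$aqi)  # apparent error rate
```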
