Group members: Eric Liu (eliu7), Felix Merk (fmerk), Yijun Shao (yshao3), Ying Su (ysu1)

This week we focused on the machine learning part of the project. We experimented with different regression and classification methods to find a model that best captures the relationship between air quality and various weather features. Here is what we have so far.

First, we tried a Support Vector Machine (SVM) on the CO dataset (i.e., the CO content in the air). We divided the labels into two classes, above 0.1 and below 0.1. We started by training a maximum-margin model without a kernel function. The resulting model is shown in the following graph:

Clearly, this is not a good classifier: its training accuracy is 0.6416. We therefore applied a kernel function to map the features into a higher-dimensional space. Training on the dataset with a polynomial kernel gives the model shown in the following graph:

The training accuracy of this model is 0.6243, which is even lower than that of the first model. We therefore tried a radial basis function (RBF) kernel, which gives a new model that looks like this:

This model raises the training accuracy to 0.6655, which is still low.

Based on these models, we feel that, at least for this dataset, the support vector machine might not be a good way to model the data.
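The three SVM variants compared above can be sketched with scikit-learn's SVC. This is a minimal illustration, not our actual experiment: the data is synthetic (it is not the CO dataset), and the kernel settings (degree 3, gamma="scale") are assumptions chosen only to make the example runnable.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic stand-in data (NOT the CO dataset): two features in [0, 1]
# and a binary label that depends on them non-linearly.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(400, 2))
y = ((X[:, 0] - 0.5) ** 2 + (X[:, 1] - 0.5) ** 2 < 0.1).astype(int)

def training_accuracy(kernel, **params):
    """Fit an SVC with the given kernel and return accuracy on the training data."""
    model = SVC(kernel=kernel, **params)
    model.fit(X, y)
    return model.score(X, y)

# Kernel parameters below are illustrative assumptions, not tuned values.
results = {
    "linear": training_accuracy("linear"),
    "poly": training_accuracy("poly", degree=3),
    "rbf": training_accuracy("rbf", gamma="scale"),
}

for name, acc in results.items():
    print(f"{name:>6}: training accuracy = {acc:.4f}")
```

Comparing training accuracy across kernels this way only shows how well each model fits the data it has seen; a held-out set would be needed to judge generalization.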

For another air quality parameter, PM2.5, we used logistic regression for classification.

According to the US Environmental Protection Agency (EPA), a PM2.5 concentration above 12 µg/m³ is considered harmful to human health. We therefore divided the PM2.5 data into two classes, HIGH (> 12 µg/m³) and LOW (<= 12 µg/m³), and used logistic regression to classify the data. The data set covers 5 months of air quality readings in Houston, and the base rate of HIGH PM2.5 is about 35%. The weather features are temperature, humidity, wind speed, wind gust, pressure, and visibility. All features were scaled to the range 0 to 1 using min-max normalization.
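The labeling and normalization steps described above can be sketched as follows. The function names and the sample readings are hypothetical and only serve the illustration; the 12 ug/m3 cutoff is the one quoted above.

```python
import numpy as np

PM25_THRESHOLD = 12.0  # ug/m3, the cutoff quoted above

def label_pm25(pm25):
    """Return 1 for HIGH (> 12 ug/m3) and 0 for LOW (<= 12 ug/m3)."""
    return (np.asarray(pm25, dtype=float) > PM25_THRESHOLD).astype(int)

def min_max_normalize(features):
    """Rescale each column of a 2-D array to [0, 1]; constant columns map to 0."""
    features = np.asarray(features, dtype=float)
    lo = features.min(axis=0)
    span = features.max(axis=0) - lo
    span[span == 0] = 1.0  # avoid division by zero for constant columns
    return (features - lo) / span

# Made-up readings for illustration: columns are temperature and humidity.
weather = np.array([[60.0, 40.0], [75.0, 80.0], [90.0, 60.0]])
pm25 = [8.0, 12.0, 15.5]

normalized = min_max_normalize(weather)
labels = label_pm25(pm25)
print(normalized)
print(labels)
```

Note that a reading of exactly 12 falls into the LOW class, matching the "<=" in the class definition.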

First, we implemented logistic regression using stochastic gradient descent (SGD) with L2 regularization. After tuning the parameters (i.e., the threshold, α, and λ), we reached a training accuracy of 68.3%. The confusion matrix is as follows:

               True HIGH    True LOW
Predict HIGH         259         139
Predict LOW          525        1170

The recall of this model is clearly low (about 33%). Because LOW observations far outnumber HIGH observations, the data set is imbalanced, which may push the model toward predicting the majority label and thus lower the recall.
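For readers who want to see the mechanics, here is a minimal sketch of logistic regression trained by SGD with an L2 penalty, where alpha plays the role of the learning rate α, lam the regularization strength λ, and threshold the decision cutoff. The data is synthetic, and the default parameter values are assumptions for illustration, not the values we tuned.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_sgd(X, y, alpha=0.1, lam=0.001, epochs=50, seed=0):
    """Train logistic regression by SGD with an L2 penalty on the weights.

    alpha is the learning rate and lam the L2 strength (assumed values).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):  # visit samples in random order
            err = sigmoid(X[i] @ w + b) - y[i]
            w -= alpha * (err * X[i] + lam * w)  # log-loss gradient + L2 term
            b -= alpha * err                     # bias is not regularized
    return w, b

def predict(X, w, b, threshold=0.5):
    """Label a sample HIGH (1) when its predicted probability reaches the threshold."""
    return (sigmoid(X @ w + b) >= threshold).astype(int)

# Synthetic, imbalanced stand-in data (NOT the Houston data set).
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(500, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0.0, 0.2, 500) > 0.9).astype(int)

w, b = fit_logistic_sgd(X, y)
recalls = {}
for threshold in (0.5, 0.35):
    pred = predict(X, w, b, threshold)
    recalls[threshold] = float((pred[y == 1] == 1).mean())
    print(f"threshold={threshold}: recall={recalls[threshold]:.3f}")
```

Lowering the threshold can only keep or raise the recall, since every sample predicted HIGH at the stricter cutoff is still predicted HIGH at the looser one; the cost usually shows up as lower precision.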

To address the imbalance, we used the LogisticRegressionCV classifier in sklearn.linear_model, which can weight the classes inversely proportional to their frequencies in the data set and thereby balance the data. We again used L2 regularization and set the number of cross-validation folds to 5. The accuracy of the fitted model is 62%, and the confusion matrix is as follows:

               True HIGH    True LOW
Predict HIGH         484         494
Predict LOW          300         815

Although the accuracy is lower than that of the first model, the recall increases from 33% to 62%, and the precision is now about 50%. The MSE for the cross-validation fit is 0.38.
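The accuracy, recall, and precision figures quoted above follow directly from the two confusion matrices. A small sketch of the arithmetic, treating HIGH as the positive class (the helper name is ours, for illustration only):

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, recall, and precision with HIGH as the positive class."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn),       # share of true HIGH that was caught
        "precision": tp / (tp + fp),    # share of predicted HIGH that was right
    }

# Counts copied from the two confusion matrices above.
sgd_model = metrics(tp=259, fp=139, fn=525, tn=1170)
balanced_model = metrics(tp=484, fp=494, fn=300, tn=815)

for name, m in (("SGD", sgd_model), ("balanced", balanced_model)):
    print(name, {k: round(v, 3) for k, v in m.items()})
```

For the SGD model this gives roughly 68.3% accuracy, 33.0% recall, and 65.1% precision; for the class-weighted model roughly 62.1% accuracy, 61.7% recall, and 49.5% precision. So the gain in recall comes with a drop in precision as well as in accuracy.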

To further explore logistic regression, we introduced polynomial terms into the model. With squared terms, the cross-validation accuracy increases to 65%, with 63% recall, 46% precision, and an MSE of 0.35. Adding cubic terms brings the accuracy to 66%, with 64% recall, 53% precision, and an MSE of 0.34. Overall, the polynomial terms slightly increase the accuracy and recall and lower the MSE, while the precision dips with the squared terms before rising with the cubic terms. Does this mean that the polynomial model is better for this application? We still need more evidence to support this, and we hope to reach a conclusion in the next blog post.

               Accuracy(%)   Recall(%)   Precision(%)   MSE
Linear                  62          62             50   0.38
Square Poly             65          63             46   0.35
Cubic Poly              66          64             53   0.34
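A comparison like the one in the table can be sketched with a scikit-learn pipeline that expands the features with PolynomialFeatures and then fits LogisticRegressionCV with class_weight="balanced" and cv=5. This is a sketch under assumptions: the data is synthetic with a 35% positive rate, max_iter=2000 is an arbitrary choice, and the metrics are computed on the training data rather than by cross-validation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (NOT the Houston data set): six features in [0, 1].
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(600, 6))
score = 2.0 * X[:, 0] - 1.5 * X[:, 1] ** 2 + X[:, 2] * X[:, 3] + rng.normal(0.0, 0.5, 600)
y = (score > np.quantile(score, 0.65)).astype(int)  # about 35% HIGH

results = {}
for degree in (1, 2, 3):
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        # L2 is the default penalty; "balanced" weights classes inversely
        # proportional to their frequencies.
        LogisticRegressionCV(cv=5, class_weight="balanced", max_iter=2000),
    )
    pred = model.fit(X, y).predict(X)
    results[degree] = {
        "accuracy": accuracy_score(y, pred),
        "recall": recall_score(y, pred),
        "precision": precision_score(y, pred),
        "mse": float(np.mean((pred - y) ** 2)),
    }

for degree, m in results.items():
    print(degree, {k: round(v, 3) for k, v in m.items()})
```

When the MSE is computed on 0/1 predictions, as in this sketch, it equals one minus the accuracy, which is consistent with the accuracy and MSE columns in the table above.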

We also tested the correlation between changes in pollution and various weather parameters.

The output correlation matrix for the Newark NO2 readings was:

               NO2 (ppm)
temp           -0.01204190
hum             0.09903328
windspeedmph   -0.09779279

For each of the different parameters:

Temperature:

t = -0.53709, df = 1989, p-value = 0.5913

alternative hypothesis: true correlation is not equal to 0

95 percent confidence interval:

-0.05594222 0.03190488

Humidity:

t = 4.4385, df = 1989, p-value = 9.554e-06

alternative hypothesis: true correlation is not equal to 0

95 percent confidence interval:

0.05534415 0.14234392

Windspeed:

t = -4.3824, df = 1989, p-value = 1.235e-05

alternative hypothesis: true correlation is not equal to 0

95 percent confidence interval:

-0.14111646 -0.05409528

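These results are somewhat surprising: although the correlations are weak (|r| of about 0.1), the p-values indicate that humidity and windspeed are significantly correlated with the amount of NO2 pollution in the air, whereas temperature shows no significant correlation.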
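The t statistics and confidence intervals reported above can be reproduced from the correlation coefficients and degrees of freedom alone, using the standard Pearson correlation test: t = r * sqrt(df / (1 - r^2)), with a confidence interval from the Fisher z-transform. The sketch below uses only the Python standard library; the p-value it returns is a normal approximation to the t distribution (reasonable at df = 1989), so it is an assumption of this sketch and will differ slightly from the exact values listed.

```python
import math
from statistics import NormalDist

def correlation_test(r, df, confidence=0.95):
    """Return (t, confidence interval, approximate two-sided p-value) for Pearson's r."""
    n = df + 2                                   # sample size
    t = r * math.sqrt(df / (1.0 - r * r))        # test statistic
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    half_width = z_crit / math.sqrt(n - 3)       # standard error in Fisher z space
    z = math.atanh(r)                            # Fisher z-transform of r
    ci = (math.tanh(z - half_width), math.tanh(z + half_width))
    p_approx = math.erfc(abs(t) / math.sqrt(2.0))  # normal approximation
    return t, ci, p_approx

# Coefficients copied from the NO2 correlation matrix above.
coefficients = {
    "temp": -0.01204190,
    "hum": 0.09903328,
    "windspeedmph": -0.09779279,
}

tests = {name: correlation_test(r, df=1989) for name, r in coefficients.items()}
for name, (t, ci, p) in tests.items():
    print(f"{name}: t={t:.4f}, CI=({ci[0]:.5f}, {ci[1]:.5f}), p~{p:.3g}")
```

An interval that excludes zero, as for humidity and windspeed, corresponds to a significant correlation at the 5% level; the temperature interval straddles zero.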

In the upcoming week, we hope to generalize these findings to a larger dataset. Our goal is to append all of the files for the same pollutant of interest, convert all readings to the same unit, calculate the average percent change (taken against the average of all readings from the same station), and correlate that with the same weather parameters.
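One way this plan could look in code is sketched below with pandas. Everything here is hypothetical: the column names (station, value, unit, humidity), the two small in-memory tables standing in for files, and the reading of "percent change" as the percent deviation of each reading from its station's mean. The only fixed fact used is that 1 ppb equals 0.001 ppm.

```python
import pandas as pd

# Hypothetical stand-ins for two pollutant files with different units.
file_a = pd.DataFrame({
    "station": ["A", "A", "A"],
    "value": [0.020, 0.030, 0.040],
    "unit": ["ppm", "ppm", "ppm"],
    "humidity": [40.0, 55.0, 70.0],
})
file_b = pd.DataFrame({
    "station": ["B", "B", "B"],
    "value": [10.0, 20.0, 30.0],
    "unit": ["ppb", "ppb", "ppb"],
    "humidity": [30.0, 60.0, 90.0],
})

# Step 1: append the files for the same pollutant.
readings = pd.concat([file_a, file_b], ignore_index=True)

# Step 2: convert all readings to the same unit (ppm).
TO_PPM = {"ppm": 1.0, "ppb": 1e-3}
readings["ppm"] = readings["value"] * readings["unit"].map(TO_PPM)

# Step 3: percent change of each reading against its station's mean.
station_mean = readings.groupby("station")["ppm"].transform("mean")
readings["pct_change"] = 100.0 * (readings["ppm"] - station_mean) / station_mean

# Step 4: correlate the percent change with a weather parameter.
correlation = readings["pct_change"].corr(readings["humidity"])
print(readings[["station", "ppm", "pct_change"]])
print(f"correlation with humidity: {correlation:.3f}")
```

Normalizing against each station's own mean puts stations with very different baseline pollution levels on a comparable scale before the readings are pooled.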

Here are some scatter plots for the data mentioned above:

For the O3 readings, similar results were observed:

              Change (ppm)
Temperature   -0.014148312
Humidity       0.001023036
Wind Speed    -0.092476728