Second Update: Machine Learning Work

Group members: Eric Liu (eliu7), Felix Merk (fmerk), Yijun Shao (yshao3), Ying Su (ysu1)

This week we focused on the machine learning part of the project. We experimented with different regression and classification methods, looking for a model that best captures the relationship between air quality and different weather features. Here is what we have so far.

First, we used the Support Vector Machine (SVM) method to analyze the CO dataset (i.e., the CO content in the air). We divided the labels into two classes: above 0.1 and below 0.1. We first trained a maximum-margin model without any kernel function. The algorithm gives us the model shown in the following graph:

[Figure: svm1]

Apparently it is not a decent classifier: the training accuracy is only 0.6416. We therefore applied a kernel function to map the features into a higher-dimensional space. Training with a polynomial kernel gives us the model shown in the following graph:

[Figure: svm2]

The training accuracy of this model is 0.6243, even lower than that of the first one. We then tried a radial basis function (RBF) kernel, which gives us a new model that looks like this:

[Figure: svm3]

This model increases the training accuracy to 0.6655, but it is still low.

Based on these models, we suspect that, at least for this dataset, the support vector machine may not be a good way to build a model.
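The kernel comparison above can be sketched with scikit-learn's SVC. The data here are synthetic stand-ins (a noisy circular boundary), not our CO dataset; the point is only to show the three kernels side by side and compare training accuracy.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))                      # stand-in for the weather features
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)  # stand-in for the CO > 0.1 labels

acc = {}
for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel).fit(X, y)
    acc[kernel] = clf.score(X, y)  # training accuracy, as reported in the post
print(acc)
```

On data with a non-linear boundary like this, the RBF kernel typically beats the linear (no-kernel) model, which mirrors the ordering we observed.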

For another air quality parameter, pm2.5, we used logistic regression for the classification.

According to the US Environmental Protection Agency (EPA), pm2.5 concentrations above 12 µg/m³ are considered harmful to human health. We therefore divided the pm2.5 data into two classes, HIGH (> 12 µg/m³) and LOW (≤ 12 µg/m³), and used logistic regression to classify them. The dataset covers five months of air quality in Houston, and the base rate of HIGH pm2.5 is about 35%. The weather features are temperature, humidity, windspeed, windgust, pressure, and visibility, all normalized into the 0–1 range with min-max normalization.
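The preprocessing above can be sketched as follows; the column names and values are illustrative stand-ins, not our real Houston data.

```python
import pandas as pd

df = pd.DataFrame({
    "temperature": [60.0, 75.0, 90.0],
    "humidity":    [30.0, 55.0, 80.0],
    "pm25":        [8.0, 14.0, 35.0],
})

features = ["temperature", "humidity"]
# min-max normalization: (x - min) / (max - min) maps each feature into [0, 1]
df[features] = (df[features] - df[features].min()) / (df[features].max() - df[features].min())

# HIGH (1) if pm2.5 > 12 ug/m3, LOW (0) otherwise, per the EPA threshold
df["label"] = (df["pm25"] > 12).astype(int)
print(df)
```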

First, we implemented logistic regression with the SGD method and L2 regularization. After tuning the parameters (i.e., the threshold, α, and λ), we obtained a training accuracy of 68.3%, with the following confusion matrix:

                True HIGH    True LOW
Predict HIGH       259          139
Predict LOW        525         1170

It is easy to see that the recall of this model is quite low (about 33%). Since there are far more LOW observations than HIGH observations, the dataset is imbalanced, which pushes the model toward predicting the majority label and results in the low recall.
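These rates follow directly from the confusion matrix above (rows are predictions, columns are truth):

```python
# Entries from the first confusion matrix
tp, fp = 259, 139   # predicted HIGH: truly HIGH / truly LOW
fn, tn = 525, 1170  # predicted LOW:  truly HIGH / truly LOW

accuracy = (tp + tn) / (tp + fp + fn + tn)  # ≈ 0.683, as reported
recall = tp / (tp + fn)                     # ≈ 0.33: fraction of true HIGH days caught
precision = tp / (tp + fp)

print(round(accuracy, 3), round(recall, 3), round(precision, 3))
```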

To address the imbalance, we switched to the LogisticRegressionCV classifier in sklearn.linear_model, which lets us weight classes inversely proportional to their frequencies in the dataset, balancing the data. We kept L2 regularization and set the number of cross-validation folds to 5. The fitted model's accuracy is 62%, with the following confusion matrix:

                True HIGH    True LOW
Predict HIGH       484          494
Predict LOW        300          815
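A minimal sketch of this balanced setup: X and y are synthetic stand-ins with roughly our 35% base rate, and class_weight="balanced" applies the inverse-frequency weighting described above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))  # stand-in for the six weather features
# stand-in labels with roughly a 35% positive (HIGH) rate
y = (X[:, 0] + 0.5 * rng.normal(size=400) > 0.4).astype(int)

# class_weight="balanced" reweights samples inversely to class frequency
clf = LogisticRegressionCV(cv=5, penalty="l2", class_weight="balanced")
clf.fit(X, y)
print(clf.score(X, y))
```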

Although the accuracy is lower than that of the first model, the recall increases from 33% to 62%, and the precision is now about 50%. The MSE of the cross-validation fit is 0.38.

To explore logistic regression further, we introduced polynomial terms into the model. Adding squared terms raises the cross-validation accuracy to 65%, with 63% recall, 46% precision, and an MSE of 0.35. Adding cubic terms as well brings the accuracy to 66%, with 64% recall, 53% precision, and an MSE of 0.34. Overall, the polynomial terms slightly improve the accuracy and precision of the model without introducing extra error. But does that mean the polynomial model is the better model for this application? We still need more evidence, and we hope to reach a conclusion in the next blog post.

              Accuracy (%)   Recall (%)   Precision (%)    MSE
Linear             62            62             50         0.38
Square Poly        65            63             46         0.35
Cubic Poly         66            64             53         0.34
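One way to generate the squared and cubic terms is scikit-learn's PolynomialFeatures; this is a sketch of the idea, not necessarily how we implemented it, and the two input features are illustrative.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[0.2, 0.5]])  # e.g. one row of normalized temperature, humidity

square = PolynomialFeatures(degree=2, include_bias=False)
X_sq = square.fit_transform(X)  # columns: x1, x2, x1^2, x1*x2, x2^2
cubic = PolynomialFeatures(degree=3, include_bias=False)
X_cu = cubic.fit_transform(X)   # adds the degree-3 terms as well

print(X_sq)
print(X_cu.shape)
```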


We also tested the correlation between changes in pollution and various weather parameters.

The output correlation matrix for the Newark NO2 readings was:

                  NO2 (ppm)
temp            -0.01204190
hum              0.09903328
windspeedmph    -0.09779279

For each of the different parameters:

Temperature:
t = -0.53709, df = 1989, p-value = 0.5913
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.05594222   0.03190488

Humidity:

t = 4.4385, df = 1989, p-value = 9.554e-06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.05534415 0.14234392

Windspeed:

t = -4.3824, df = 1989, p-value = 1.235e-05
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.14111646 -0.05409528
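The t statistics reported above can be reproduced from the correlation coefficient r and the degrees of freedom via t = r·sqrt(df / (1 − r²)); here for the humidity correlation:

```python
import math

# humidity correlation with NO2 and degrees of freedom from the test output above
r, df = 0.09903328, 1989
t = r * math.sqrt(df / (1 - r ** 2))
print(round(t, 4))  # matches the reported t = 4.4385
```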

These results are somewhat surprising: they indicate that temperature is likely not correlated with the amount of NO2 pollution in the air (p = 0.59), while humidity and windspeed show weak but statistically significant correlations.

In the upcoming week, we hope to be able to generalize these findings over a larger dataset. Our goal is to append all of the files with the same pollutant of interest, convert all readings to the same unit, then calculate the average percent change (taken against an average of all readings from the same station), and correlate that with the same weather parameters.
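A sketch of that plan in pandas; the station names and readings here are hypothetical, and the unit-conversion step is omitted.

```python
import pandas as pd

# stand-in for several appended per-station files of one pollutant
df = pd.DataFrame({
    "station": ["Newark", "Newark", "Houston", "Houston"],
    "no2_ppm": [0.020, 0.030, 0.010, 0.014],
})

# percent change of each reading against the average of all readings
# from the same station
station_mean = df.groupby("station")["no2_ppm"].transform("mean")
df["pct_change_vs_mean"] = 100 * (df["no2_ppm"] - station_mean) / station_mean
print(df)
```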

Here are some scatter plots for the data mentioned above:

[Figure: corre1]

[Figure: Picture2]

[Figure: corre3]

For the O3 readings, similar results were observed:

                Change (ppm)
Temperature     -0.014148312
Humidity         0.001023036
Wind Speed      -0.092476728

[Figure: corre4]

[Figure: corre5]


First Update: A Change in Course and some cool Graphs

Group members: Eric Liu (eliu7), Felix Merk (fmerk), Yijun Shao (yshao3), Ying Su (ysu1)

It has been a while since our first blog post.  First things first: our project has shifted.  The tweet data fell through, and without tweets we had no way of knowing what mood people were in.  We even tried collecting the data ourselves.  The issue was that even though we could get thousands of tweets for a geographic location, we had no way of getting them over a time frame longer than a few weeks, so we could not filter out seasonal variation in either weather or mood.  So we scrapped everything to do with sentiment and tweets and moved to pollution.  Pollution is great because it matters: it affects people's health and has serious environmental consequences.  Pollution is also great because we found a great dataset: OpenAQ.  So now we have our data, OpenAQ and Weather Underground, and an idea.  That catches us up to where we were a week ago.

We set out this week with two goals in mind.  First, we wanted to show that our data showed something.  We wanted to be able to throw up a graph and say: Here, see, we have something.  If we couldn't manage this, we would have to go with a backup plan.  Second, although we knew we wanted to model our data with multiple linear regression, we needed to find a way to apply weights in this model.  Weights are necessary because, in all likelihood, and as later investigation showed, all factors are not created equal.

So, visualization first.  We tried graphing everything against everything for one collection station in Newark.  Naturally, some graphs turned out like this (NO2 vs. Temperature):

[Figure: NO2 (ppm) vs. T (F)]

Which is to be expected when you just throw up every weather factor and pollution measurement.  However, others give us much more promising graphs (NO2 vs Wind):

[Figure: NO2 (ppm) vs. Wind (mph)]

Hey! This means we have found something interesting and even have some graphs to show our TA when we meet up with him.

So, on to our modeling. We've decided to use multiple linear regression to model the relationship between a specific air quality measure (e.g., pm2.5) and a series of independent weather variables (e.g., temperature, humidity, wind_speed, pressure, etc.).  An important note: these variables may very well not be independent.  For example, humidity and temperature are likely fairly correlated.  This potential double counting will be addressed in the more refined model by combining correlated features into a single factor.  For simplicity of explaining our approach, we'll treat our "factors" as independent.

We will first use Stochastic Gradient Descent (SGD) to determine the individual weights associated with each weather feature, using 90% of our collected data for training and 10% for testing. Beyond directly using SGD to fit weights on all the weather parameters (the full model), it is possible that some weather parameters have little effect on the specific air quality measure, and we want to remove those uninformative variables from our model and keep only the relevant ones.
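A minimal sketch of this setup with scikit-learn's SGDRegressor and a 90/10 split; X and y are synthetic stand-ins for the normalized weather features and the pollution values.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 4))  # stand-in for normalized weather features
# stand-in target: a known linear combination plus noise
y = X @ np.array([3.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=500)

# 90% training data, 10% test data, as described above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0
)
model = SGDRegressor(max_iter=2000, random_state=0).fit(X_train, y_train)
print(model.coef_)                 # the per-feature weights learned by SGD
print(model.score(X_test, y_test)) # R^2 on the held-out 10%
```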

So how do we determine which parameters are important and which are uninformative? We will apply "forward selection" for our variable selection. Specifically, we start with models with only one variable (e.g., wind_speed) and calculate the proportion of explained variation (PVE) and the Akaike Information Criterion (AIC) of each model. We choose the best model (highest PVE, lowest AIC). In the second round of selection, we add another variable (e.g., temperature) to this model to form a two-predictor model, and we compare the PVEs and AICs of all possible two-predictor models (i.e., wind_speed + temperature, wind_speed + humidity, wind_speed + pressure, etc.). We then choose the best model from the second round. If this model has a higher AIC than the single-variable model from the first round, indicating that it probably overfits the data, we stop adding variables; otherwise, we perform a third round of selection, adding one more variable and again comparing the PVEs and AICs of the new models. We stop iterating as soon as adding another variable increases the AIC; if that never happens, we end up adding all the variables to the model.
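The forward-selection loop above can be sketched as follows, using ordinary least squares and the Gaussian AIC (n·ln(RSS/n) + 2k) on synthetic data; for brevity the sketch tracks AIC only, not PVE, and the variable names mirror the post.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
data = {
    "wind_speed": rng.uniform(size=n),
    "temperature": rng.uniform(size=n),
    "pressure": rng.uniform(size=n),   # uninformative by construction
}
y = 2.0 * data["wind_speed"] - 1.0 * data["temperature"] + 0.1 * rng.normal(size=n)

def aic(cols):
    """Gaussian AIC of an OLS fit on the given columns (plus intercept)."""
    X = np.column_stack([data[c] for c in cols] + [np.ones(n)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    return n * np.log(rss / n) + 2 * X.shape[1]

selected, remaining = [], set(data)
while remaining:
    # best single addition to the current model
    best = min(remaining, key=lambda c: aic(selected + [c]))
    if selected and aic(selected + [best]) >= aic(selected):
        break  # AIC stopped improving: adding more would likely overfit
    selected.append(best)
    remaining.remove(best)

print(selected)
```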

Toward the end of this week we discussed some finer points of this approach, including some very helpful input from our TA, and added some factors to consider.  We could use the BIC instead of the AIC to assess model fit, potentially getting better results thanks to stronger protection against overfitting.  Other analyses that could make a large impact, or otherwise be useful, include using z-scores to check whether our findings carry much relevance and using MDL as a potentially different approach.

If the above models do not give a decent accuracy, we might introduce polynomial terms into the regression function; for example, some variables might need a squared or even cubic term in the model.  Furthermore, we plan to divide pollution into different zones, potentially based on health hazard, allowing us to use an SVM or another classifier.  This could provide further insight and other interesting ways of visualizing our results.

To explore our data statistically, we calculated the change in pollution (in ppm or other units) relative to the previous pollution reading. Next, we plan to test the correlation between different weather events and the change in the pollution reading. Because this involves finding correlations between categorical variables (weather events) and continuous variables (pollution), we plan to use a heterogeneous correlation matrix to determine their relationship.
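A minimal sketch of the change computation with hypothetical readings: each value is differenced against the previous reading, and the change can then be correlated with a weather variable (the heterogeneous correlation matrix itself is not shown here).

```python
import pandas as pd

readings = pd.Series([0.020, 0.024, 0.021, 0.027])  # hypothetical NO2 (ppm)
wind = pd.Series([5.0, 3.0, 8.0, 2.0])              # hypothetical wind (mph)

change = readings.diff()  # change vs. the previous pollution reading
print(change.tolist())
print(change.corr(wind))  # Pearson correlation; the leading NaN is dropped
```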

Progress has been strong, so we're looking forward to seeing where we stand next week!