Third update: Smarter Machines and Better Algorithms

Group members: Eric Liu (eliu7), Felix Merk (fmerk), Yijun Shao (yshao3), Ying Su (ysu1)

This week we refined our machine learning algorithms and our models. Building on the deeper insights we got from our data last week, and improving on what we had already developed, we’ve gotten much more nuanced and accurate results.

SVM didn’t seem too promising last week. However, we made some changes and now get much better results. We did this through a few crucial steps. First, we increased the number of features we use. Second, to guard against overfitting, we used k-fold cross-validation: we create 5 folds and test the accuracy on each held-out fold for each pollutant. We use the average pollution per unit as the label boundary, which means a reading above average is labeled 1 and anything else is labeled 0. For each pollutant dataset, we get a prediction accuracy for the held-out fold. For PM10 and PM2.5, accuracies hit a low of 50% but range up to 75%. Clearly SVM isn’t the best choice here, or there may be some specific refining necessary. For other pollutants our numbers are generally above 70% and bordering on 80% accuracy. For example, for NO2 in one dataset:

{'rbf': 0.76449912126537789, 'linear': 0.75395430579964851, 'poly': 0.73637961335676627}
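The evaluation described above can be sketched roughly as follows. This is a minimal illustration, not our actual script: the data is synthetic, the helper name kernel_accuracies is made up for this example, and the feature scaling step is an assumption that the post does not mention.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: 4 weather features per row (assumption).
X = rng.normal(size=(300, 4))
# Hypothetical pollutant reading driven by two of the features plus noise.
pollutant = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=1.0, size=300)

# Label 1 if the reading is above the average, else 0.
y = (pollutant > pollutant.mean()).astype(int)


def kernel_accuracies(X, y, kernels=("rbf", "linear", "poly"), folds=5):
    """Return the mean held-out accuracy over `folds` folds for each SVM kernel."""
    results = {}
    for kernel in kernels:
        # Scaling is added here as an assumption; SVMs are sensitive to feature scale.
        model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
        scores = cross_val_score(model, X, y, cv=folds, scoring="accuracy")
        results[kernel] = float(scores.mean())
    return results


accuracies = kernel_accuracies(X, y)
print(accuracies)
```

The returned dictionary has the same shape as the output shown above: one mean accuracy per kernel.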

To get a clearer picture of the relationship between weather features and air quality, we created a scatter-plot matrix that makes it easy to check the pairwise correlations. We examined the impact of 7 features (weather icon, temperature, humidity, wind speed, wind gust, pressure and visibility (vism)) on the pm2.5 content (label HIGH for content > 12 µg/m3 and label LOW for content <= 12 µg/m3). On the plots, the blue dots are for LOW pm2.5 content, and the red dots are for HIGH content.


First of all, it can be seen that the labels are unbalanced (there are about twice as many blue dots as red dots), so we need to take the imbalance of the data into account for the classification. Secondly, we can see that the data distribution is relatively random for four of the features (temperature, humidity, pressure and visibility), while the labels tend to separate along the axes of the other three features (weather icon, wind speed and wind gust), especially wind speed. This implies that we should pay more attention to the latter three features during the classification.
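To make the imbalance concrete, here is a small sketch of the HIGH/LOW labeling and of one common way to weight the classes. The readings are made up for illustration, and the function names are hypothetical. The weighting formula, n_samples / (n_classes * class_count), is the one sklearn uses for class_weight="balanced".

```python
from collections import Counter

PM25_THRESHOLD = 12.0  # cutoff used in the post


def label_pm25(value, threshold=PM25_THRESHOLD):
    """Label a reading HIGH if it is strictly above the threshold, else LOW."""
    return "HIGH" if value > threshold else "LOW"


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Made-up readings with roughly the 2:1 LOW:HIGH ratio we saw in the plots.
readings = [5.1, 8.0, 12.0, 14.2, 9.7, 20.3, 7.4, 15.5, 6.6]
labels = [label_pm25(v) for v in readings]
counts = Counter(labels)
weights = balanced_class_weights(labels)

print(counts)
print(weights)
```

With a 2:1 split, each HIGH sample ends up counting twice as much as each LOW sample during fitting.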


Logistic regression on pm2.5 data
For logistic regression, we used the LogisticRegressionCV classifier with L2 regularization from sklearn.linear_model, which allows us to compensate for the imbalance of the data during fitting. To test the intuition we got from the scatter-plot matrix, we first performed logistic regression on single features (weather icon, wind speed, wind gust and temperature). The accuracies for the four models are 0.53, 0.61, 0.46 and 0.58, respectively. The low accuracy of the wind gust model might be because many rows have a NaN wind gust value, which was treated as 0 during data treatment. This is possible to fix, but will need some thought. The receiver operating characteristic (ROC) curves were also plotted for the four models, and the wind speed model has the largest area under the curve. Generally these results agree with our hypothesis that wind speed is an important feature for the pm2.5 value.
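A single-feature fit of this kind might look like the sketch below. It rests on several assumptions: the data is synthetic, class_weight="balanced" is our guess at how the imbalance is compensated, the train/test split is illustrative, and single_feature_scores is a name invented for this example. L2 is the default penalty of LogisticRegressionCV, so it is not set explicitly.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 600

# Synthetic stand-ins for two of the weather features (assumption).
wind_speed = rng.gamma(shape=2.0, scale=5.0, size=n)
temperature = rng.normal(loc=15.0, scale=8.0, size=n)

# Assume stronger wind lowers the chance of a HIGH pm2.5 label;
# temperature is unrelated to the label in this toy data.
p_high = 1.0 / (1.0 + np.exp(0.25 * (wind_speed - 6.0)))
y = (rng.random(n) < p_high).astype(int)


def single_feature_scores(x, y):
    """Fit logistic regression on one feature; return (accuracy, ROC AUC) on held-out data."""
    X = x.reshape(-1, 1)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0, stratify=y
    )
    # L2 regularization is the default penalty; "balanced" reweights the classes.
    clf = LogisticRegressionCV(cv=5, class_weight="balanced", max_iter=1000)
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
    return float(acc), float(auc)


wind_acc, wind_auc = single_feature_scores(wind_speed, y)
temp_acc, temp_auc = single_feature_scores(temperature, y)
print("wind_speed:", wind_acc, wind_auc)
print("temperature:", temp_acc, temp_auc)
```

Comparing the area under the curve alongside accuracy is useful here because accuracy alone can be misleading on unbalanced labels.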

[Figure: Weather ROC Log 2]

To explore the wind speed feature further with logistic regression, we introduced polynomial terms into the model. The accuracies of the models with a squared wind_speed term and a cubic wind_speed term are 0.61 and 0.64, respectively, which is not a dramatic improvement over the model with only the linear wind speed term (0.61). The figure also shows that the ROC curves of the polynomial models almost completely overlap that of the linear model, suggesting that the polynomial terms did not noticeably improve the model.
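As a rough sketch of the polynomial comparison, the snippet below fits degree 1, 2 and 3 models on a single synthetic wind speed feature and reports the cross-validated area under the ROC curve for each. The data, the pipeline layout and the name auc_by_degree are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(2)
n = 600

# Synthetic wind speed as a single-column feature matrix (assumption).
wind_speed = rng.gamma(shape=2.0, scale=5.0, size=n).reshape(-1, 1)
p_high = 1.0 / (1.0 + np.exp(0.25 * (wind_speed[:, 0] - 6.0)))
y = (rng.random(n) < p_high).astype(int)


def auc_by_degree(X, y, degrees=(1, 2, 3)):
    """Return the mean 5-fold ROC AUC for each polynomial degree."""
    results = {}
    for degree in degrees:
        model = make_pipeline(
            # Expands x into x, x^2, ..., x^degree (no constant column).
            PolynomialFeatures(degree=degree, include_bias=False),
            StandardScaler(),
            LogisticRegression(class_weight="balanced", max_iter=1000),
        )
        scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
        results[degree] = float(scores.mean())
    return results


results = auc_by_degree(wind_speed, y)
print(results)
```

One possible reason for overlapping curves: a ROC curve depends only on how the model ranks the samples, so if the fitted polynomial stays monotonic in wind speed over the observed range, it ranks them the same way the linear model does.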
Models with multiple (linear) features were also compared. Their accuracies are: 0.63 for the model with all 7 features, 0.58 for the model with all 7 features except wind_speed, 0.62 for the model with 3 features (weather_icon, wind_speed, wind_gust) and 0.61 for the model with 2 features (weather_icon, wind_speed). The ROC curves for those models are shown in the figure below. The results again indicate that wind_speed is the most important feature for the pm2.5 model; adding the other features improves the accuracy only slightly, from 0.61 to 0.63, and the area under the curve from 0.67 to 0.68.

[Figure: Weather ROC log]
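The feature-subset comparison can be sketched as a simple ablation loop. Everything below is illustrative: the data is synthetic (only wind_speed and weather_icon carry signal, by construction), and subset_auc is a hypothetical helper, not part of our code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

FEATURES = ["weather_icon", "temperature", "humidity", "wind_speed",
            "wind_gust", "pressure", "visibility"]

rng = np.random.default_rng(3)
n = 800

# Synthetic feature matrix, one column per name in FEATURES (assumption).
X = rng.normal(size=(n, len(FEATURES)))
ws = FEATURES.index("wind_speed")
icon = FEATURES.index("weather_icon")

# By construction, only wind_speed (strongly) and weather_icon (weakly) drive the label.
logit = -1.2 * X[:, ws] + 0.4 * X[:, icon] - 0.7
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)


def subset_auc(X, y, names):
    """Mean 5-fold ROC AUC of logistic regression restricted to the named features."""
    cols = [FEATURES.index(name) for name in names]
    model = make_pipeline(
        StandardScaler(),
        LogisticRegression(class_weight="balanced", max_iter=1000),
    )
    scores = cross_val_score(model, X[:, cols], y, cv=5, scoring="roc_auc")
    return float(scores.mean())


subsets = {
    "all 7": FEATURES,
    "all but wind_speed": [f for f in FEATURES if f != "wind_speed"],
    "icon + speed + gust": ["weather_icon", "wind_speed", "wind_gust"],
    "icon + speed": ["weather_icon", "wind_speed"],
}
aucs = {label: subset_auc(X, y, names) for label, names in subsets.items()}
print(aucs)
```

Dropping one feature at a time and watching the score fall is a quick way to check which feature a model actually relies on.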

We tested another machine learning tool this week: Random Forests. Originally we ran into some issues: the data we tried to work on had duplicates in it, and our results were giving 100% accuracy. Furthermore, we had to convert values such as sunny or cloudy into numbers. For the Random Forests we included all elements except date and wind gusts. Wind gusts have too many NaNs, so they would likely be counterproductive, and date wasn’t that relevant to our main factor of interest (the weather), so we left it for later. Once we had a clean set of data, we ran a random forest. It took some tweaking, but it runs well. The cool thing about Random Forests is that they return a numeric value, so we have a precise prediction to measure against. On average, the Random Forest was 32.42% off with its predictions for pm25 (which is equivalent to 0.0048 ppm25). Given that the variance in ppm from day to day is fairly high, this is pretty good. Further mathematical analysis will be necessary to confirm this, however. Also, the methods could be further refined, and tests on other pollutants could give better results. Given how much the SVM varied between different pollutants, this could be crucial.
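The cleaning and evaluation steps can be sketched like this. It is a toy version under stated assumptions: the data is synthetic, the ICON_CODES mapping is hypothetical, and mean absolute percentage error is our reading of "percent off", not necessarily the exact metric we computed. Exact duplicate rows are dropped before splitting, because a row that appears in both the training and test sets can inflate the score.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical mapping from weather descriptions to numbers.
ICON_CODES = {"sunny": 0, "cloudy": 1, "rain": 2}

rng = np.random.default_rng(4)
n = 500

# Synthetic stand-ins for the real columns (assumption).
icons = rng.choice(list(ICON_CODES), size=n)
icon_num = np.array([ICON_CODES[name] for name in icons], dtype=float)
wind_speed = rng.gamma(2.0, 5.0, size=n)
temperature = rng.normal(15.0, 8.0, size=n)
pm25 = 20.0 / (1.0 + 0.2 * wind_speed) + 1.5 * icon_num + rng.normal(scale=1.0, size=n)
pm25 = np.clip(pm25, 1.0, None)  # keep targets positive so percent error is defined

data = np.column_stack([icon_num, wind_speed, temperature, pm25])
data = np.vstack([data, data[:50]])  # simulate 50 duplicated rows

# Drop exact duplicate rows before splitting into train and test.
deduped = np.unique(data, axis=0)

X, y = deduped[:, :-1], deduped[:, -1]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)
pred = forest.predict(X_test)

# Mean absolute percentage error: average of |prediction - truth| / truth, in percent.
mape = float(np.mean(np.abs(pred - y_test) / y_test) * 100.0)
print(len(data), len(deduped), mape)
```

A percentage error like this is easier to interpret next to a baseline, for example the error from always predicting the training mean.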

This coming week we plan on finishing up our machine learning, specifically looking at things like how well the same algorithms hold up across different geographic locations (after normalization). We also plan on focusing on our visualization and making concrete progress in that space.

