Completion

For now, we have concluded our project. Before we present our findings, our group would like to thank the OpenAQ team for collecting the air quality data and being so quick to respond to our inquiries, as well as Weather Underground for their weather API. Of course, we would also like to thank Professor Kraska and all of the CS1951A TAs for all their help and insight during the semester.

Below, our findings:

Classification and Prediction of Outdoor Air Quality Based on Weather Parameters

1. Abstract

Fine particulate pollution in the air is linked to severe cardiovascular and respiratory diseases, so air pollution is closely tied to human and animal health. Modeling air pollution levels can mitigate the adverse effects of high pollutant concentrations, but most current modeling predicts only at low resolution over long periods of time. Predicting air pollution in the immediate future would allow people to minimize their exposure to airborne pollutants. We therefore used bulk air quality data and historical weather data to model how pollution levels correlate with weather conditions.

We cleaned the collected air quality data from several cities, then merged it with historical weather data from Weather Underground. Using this data, we trained several machine learning classifiers: logistic regression, linear SVM, and random forest. From the resulting models, we found that NO2 and CO could be predicted most accurately, while the other pollutants were predictable to a lesser degree. The most important factors for predicting air pollution were wind speed and humidity. These results suggest that air quality and weather parameters are indeed correlated, and that weather parameters can serve as predictors for air pollution.

2. Data Collection and Integration

First, we collected air quality data from https://openaq.org/#/sources and the OpenAQ API. Using the API, we obtained historical air quality readings for San Francisco, Houston, Boston, and New York. We separated these files by pollutant (NO2, PM2.5, etc.) and removed duplicates from the data.
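
A rough sketch of this collection step is below; the endpoint and result fields follow the public OpenAQ v1 measurements API as we recall it, and the file name is illustrative rather than taken from our actual script.

```python
# Illustrative collection step: pull readings for one city/pollutant pair
# from the OpenAQ v1 measurements API and write a de-duplicated CSV.
import csv
import requests

def fetch_measurements(city, parameter, limit=10000):
    """Download readings for one city/pollutant pair, e.g. ('Houston', 'no2')."""
    resp = requests.get(
        "https://api.openaq.org/v1/measurements",
        params={"city": city, "parameter": parameter, "limit": limit},
    )
    resp.raise_for_status()
    return resp.json()["results"]

def save_unique(rows, path):
    """Write one CSV per city/pollutant, dropping exact duplicate readings."""
    seen = set()
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["utc_date", "location", "value", "unit"])
        for r in rows:
            key = (r["date"]["utc"], r["location"], r["value"])
            if key not in seen:
                seen.add(key)
                writer.writerow([r["date"]["utc"], r["location"], r["value"], r["unit"]])

save_unique(fetch_measurements("Houston", "no2"), "houston_no2_raw.csv")
```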


Then, we used the Weather Underground API (https://www.wunderground.com) to match the collected air quality readings with historical weather data. The Weather Underground API returns the parameters weather condition, temperature, humidity, wind speed, wind gust, visibility, etc., which we saved into the same row as the pollution readings. “Getweather.py” is our script for querying the weather API and generating a per-city CSV file of weather data.
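
A minimal sketch of what “Getweather.py” does is below, assuming the old Weather Underground per-day history endpoint and the observation field names we remember (tempi, hum, wspdi, and so on); the exact fields and key handling in our script may have differed.

```python
# Sketch of the per-day history call: one request per day returns all hourly
# observations for that day, which are flattened into a per-city weather CSV.
import csv
import requests

API_KEY = "YOUR_WUNDERGROUND_KEY"   # placeholder

def fetch_day(state, city, yyyymmdd):
    """One API call per day returns every hourly observation for that day."""
    url = (f"http://api.wunderground.com/api/{API_KEY}/"
           f"history_{yyyymmdd}/q/{state}/{city}.json")
    resp = requests.get(url)
    resp.raise_for_status()
    return resp.json()["history"]["observations"]

def write_weather_csv(state, city, dates, path):
    """Save one row per hourly observation for the requested days."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp", "icon", "temp_f", "humidity",
                         "wind_speed", "wind_gust", "pressure", "visibility"])
        for d in dates:
            for obs in fetch_day(state, city, d):
                ts = f"{d[:4]}-{d[4:6]}-{d[6:]} {obs['date']['hour']}:{obs['date']['min']}"
                writer.writerow([ts, obs["icon"], obs["tempi"], obs["hum"],
                                 obs["wspdi"], obs["wgusti"],
                                 obs["pressurei"], obs["visi"]])

write_weather_csv("MA", "Boston", ["20160101", "20160102"], "boston_weather.csv")
```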

After generating the first two files, we used “merge.py” to merge the two datasets and write them to a combined CSV file per city and pollutant, such as “bostonno2.csv”.
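
“Merge.py” essentially performs an hourly join between the two files; a minimal pandas sketch is below, with illustrative column names rather than the exact headers in our CSVs.

```python
# Align each pollution reading with the weather observation from the same hour
# and write the combined city/pollutant file.
import pandas as pd

def merge_city_pollutant(aq_path, weather_path, out_path):
    aq = pd.read_csv(aq_path, parse_dates=["utc_date"])
    wx = pd.read_csv(weather_path, parse_dates=["timestamp"])

    aq["hour"] = aq["utc_date"].dt.floor("H")
    wx["hour"] = wx["timestamp"].dt.floor("H")

    merged = aq.merge(wx.drop(columns=["timestamp"]), on="hour", how="inner")
    merged.to_csv(out_path, index=False)

merge_city_pollutant("boston_no2_raw.csv", "boston_weather.csv", "bostonno2.csv")
```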

For the primary machine learning component, this was the extent of the data cleaning necessary.

However, when we investigated the relationship between percent change and weather, we found that additional processing was necessary. After generating the paired weather data, we parsed the date for each row and sorted the rows by date. We then computed the change between consecutive rows and divided it by the set average to yield percent change.

3. Hypothesis

Even prior to obtaining the data, we predicted that there was a relationship between air quality and weather on a day-to-day basis. Our hypothesis was that long-term averages of air pollution would primarily be determined by external environmental factors (Hochadel et al.), but that deviations from that average on the time scale of a day would be primarily influenced by weather.

Before building any model of the relationship between weather features and air quality using machine learning methods, we first drew a scatterplot matrix to examine the correlations among the weather features, as well as possible correlations between air quality and weather features. Figure 1 shows the scatterplot matrix of PM2.5 measurements against weather parameter pairs.

Figure 1. Scatterplot matrix of correlation between PM2.5 concentration and weather parameter pairs. Blue dots represent LOW (<= 12 µg/m3), red dots represent HIGH (> 12 µg/m3).
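
Figure 1 can be reproduced roughly with pandas' scatter_matrix; the sketch below assumes illustrative file and column names for the merged Houston PM2.5 data.

```python
# Scatterplot matrix of the weather features, colored by the HIGH/LOW PM2.5 label.
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

df = pd.read_csv("houstonpm25.csv")
features = ["temp_f", "humidity", "wind_speed", "wind_gust", "pressure", "visibility"]

# Color each point by the EPA threshold of 12 ug/m3: red = HIGH, blue = LOW.
colors = df["value"].gt(12).map({True: "red", False: "blue"})
scatter_matrix(df[features], c=colors, alpha=0.4, figsize=(10, 10), diagonal="hist")
plt.show()
```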

Except for the wind speed and wind gust pair, there is no obvious correlation between any two weather features, so we can treat the weather features as approximately independent of each other. With respect to the correlation between the PM2.5 measurements (labeled HIGH or LOW) and the weather features, we found that the higher the wind speed/gust, the higher the probability of finding a LOW label (blue dots) rather than a HIGH label (red dots). There is no other particularly strong correlation between PM2.5 and the remaining features. We therefore hypothesize that wind speed is an important feature influencing the PM2.5 content of the air. A similar trend can also be found for the other pollutants, e.g. CO, SO2, and NO2, indicating that wind speed is an important feature for most of them.

Thus, we aim to verify the hypothesis that air pollution is correlated with weather and further examine the hypothesized strong link between wind speed and air quality. To do so, we employed various machine learning methods to determine the accuracy of prediction.

4. Methodology and Results

For the machine learning part of the project, we used four different methods to build models for the relationship between air quality and weather features: multiple regression, logistic regression, SVM and random forest.

4.1 Multiple Linear Regression

We performed multiple linear regression to model the relationship between a specific pollutant (e.g. PM2.5, CO) and a series of weather variables (i.e. temperature, humidity, wind speed, pressure, etc.). L2 regularization and cross-validation were used to prevent overfitting.

We first built the multiple linear regression model based on the linear predictor function and performed stochastic gradient descent (SGD) to minimize the error term. The features were normalized into the 0~1 range using min-max normalization. With all seven features (weather condition, temperature, humidity, wind speed, wind gust, pressure and visibility) included in the model, the accuracy for CO prediction was 49% (with both predicted values and test y values rounded to 1 decimal place). Even with quadratic or cubic terms added to the model, the accuracy did not improve significantly (still below 55%). The accuracy of prediction on the other pollutants (e.g. PM2.5, SO2, NO2) using multiple linear regression was even lower; for PM2.5, it fell below 5%.

Since it is possible that air quality is more strongly related to some weather features than to others, we also tried the Lasso regression model in the sklearn package. Lasso is used to estimate sparse coefficients in cases where a solution with fewer features is preferred, and it uses L1 regularization. With the Lasso model, the accuracy for CO prediction (with 5-fold cross-validation) was below 37%.

The likely reason for the low accuracy of the multiple regression models is that the distribution of the air quality values is broad and the variance is large, so it is hard to fit a single curve to those values. It therefore seemed more feasible to perform classification on the air quality readings by labeling each one LOW or HIGH.
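
A condensed sketch of these regression experiments is below, using scikit-learn's SGDRegressor and LassoCV; the file and column names are assumptions, and the "rounded-match accuracy" mirrors the rounding-based metric described above.

```python
# Min-max scaling, an SGD-fitted linear model with an L2 penalty, and a
# 5-fold cross-validated Lasso on the CO data.
import pandas as pd
from sklearn.linear_model import SGDRegressor, LassoCV
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("houstonco.csv")
features = ["icon_code", "temp_f", "humidity", "wind_speed",
            "wind_gust", "pressure", "visibility"]
X, y = df[features], df["value"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)

# Multiple linear regression fitted by stochastic gradient descent (L2 penalty).
sgd = make_pipeline(MinMaxScaler(), SGDRegressor(penalty="l2", max_iter=1000))
sgd.fit(X_train, y_train)
rounded_match = (sgd.predict(X_test).round(1) == y_test.round(1)).mean()
print("rounded-match accuracy:", rounded_match)

# Lasso (L1) favors a sparse subset of features; scored here by 5-fold CV R^2.
lasso = make_pipeline(MinMaxScaler(), LassoCV(cv=5))
print("Lasso R^2:", cross_val_score(lasso, X, y, cv=5).mean())
```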

4.2 Logistic Regression

To perform classification on the air quality data, we first needed to categorize our training data. Taking PM2.5 as an example: according to the US Environmental Protection Agency (EPA), PM2.5 content above 12 µg/m3 is considered harmful to human health. So we divided the PM2.5 data into two categories: HIGH (> 12 µg/m3) and LOW (<= 12 µg/m3). The PM2.5 data set is based on five months of PM2.5 measurements in Houston, and the base rate of HIGH PM2.5 is about 35%. For the logistic regression model, all weather features were normalized into the 0~1 range using min-max normalization.

We first used the SGD method with L2 regularization to implement the logistic regression, and after tuning the parameters (i.e. threshold, α, and λ), we obtained a training accuracy of 68.3%, with the following confusion matrix:

                     True HIGH        True LOW
Predict HIGH            259               139
Predict LOW             525              1170

 

Although the accuracy is a decent number, the recall of this model is very low (about 33%). Since the number of LOW observations is much larger than the number of HIGH observations, the data set is unbalanced, which pushes the model toward predicting the majority label and results in the low recall. To address the imbalance, we used the LogisticRegressionCV classifier in sklearn.linear_model, which allows us to weight classes inversely proportional to their frequencies in the data set. Still using L2 regularization, and setting the number of cross-validation folds to 5, the accuracy of the model with all 7 weather features is 63%, and the confusion matrix is as follows:

                     True HIGH        True LOW
Predict HIGH            484               494
Predict LOW             300               815

Although the accuracy is lower than in the first model, the recall increases from 33% to 62%, and the precision is now about 50%.
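
The balanced classifier can be sketched as follows; class_weight="balanced" is scikit-learn's built-in way of weighting classes inversely to their frequencies. Column names are assumptions, and note that sklearn's confusion_matrix puts true labels on the rows, the transpose of the tables above.

```python
# Balanced logistic regression on the HIGH/LOW PM2.5 labels, with a confusion
# matrix and recall/precision for comparison with the SGD model.
import pandas as pd
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("houstonpm25.csv")
features = ["icon_code", "temp_f", "humidity", "wind_speed",
            "wind_gust", "pressure", "visibility"]
X = MinMaxScaler().fit_transform(df[features])
y = (df["value"] > 12).astype(int)            # HIGH = 1, LOW = 0

clf = LogisticRegressionCV(cv=5, penalty="l2", class_weight="balanced").fit(X, y)
pred = clf.predict(X)
print(confusion_matrix(y, pred))
print("recall:", recall_score(y, pred), "precision:", precision_score(y, pred))
```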

To validate our original hypothesis that wind speed is an important weather feature for air quality prediction, we compared how each individual weather feature influences the air quality. Taking the PM2.5 predictions as an example, we first built four single-feature models (weather icon, wind speed, wind gust and temperature); their accuracies are 0.53, 0.61, 0.46 and 0.58, respectively. The likely reason for the low accuracy of the wind gust model is that many rows have a NaN wind gust value, which we treated as 0 during data preparation. The receiver operating characteristic (ROC) plots of the four models (Figure 2) also show that the wind speed model has the largest area under the curve.


Figure 2. Comparison of ROC plots for models with single feature

To explore the wind speed feature further with logistic regression, we introduced polynomial terms into the model. The accuracies of the models with a quadratic wind speed term and a cubic wind speed term are 0.61 and 0.64, respectively, which is not a dramatic improvement over the model with only a linear wind speed term (0.61). Figure 2 also shows that the ROC curves of the polynomial models lie on top of the linear model's curve, suggesting that the polynomial terms did not noticeably improve the model.

We also compared models with multiple (linear) features. The accuracies are: 0.63 with all 7 features, 0.58 with 6 features (all except wind speed), 0.62 with 3 features (weather condition, wind speed, wind gust), and 0.61 with 2 features (weather icon, wind speed). The ROC curves for these models are shown in Figure 3. The results again indicate that wind speed is the most important weather feature for the PM2.5 model; adding the other features only slightly improves the accuracy and the area under the curve (from 0.67 to 0.68).


Figure 3. Comparison of ROC plots for models with different number of features
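
The comparisons behind Figures 2 and 3 amount to refitting the same classifier on different column subsets and comparing cross-validated AUC; a sketch with assumed column names:

```python
# Compare feature subsets for the PM2.5 classifier by cross-validated AUC.
import pandas as pd
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("houstonpm25.csv")
y = (df["value"] > 12).astype(int)
all7 = ["icon_code", "temp_f", "humidity", "wind_speed",
        "wind_gust", "pressure", "visibility"]
subsets = {
    "wind speed only": ["wind_speed"],
    "all 7 features": all7,
    "6 (no wind speed)": [c for c in all7 if c != "wind_speed"],
    "icon + wind speed": ["icon_code", "wind_speed"],
}
for name, cols in subsets.items():
    X = MinMaxScaler().fit_transform(df[cols])
    clf = LogisticRegressionCV(cv=5, penalty="l2", class_weight="balanced")
    auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: AUC = {auc:.2f}")
```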

4.3 Support Vector Machine

We first tested our model on CO, because it is easy to label. However, when we simply trained a multi-label SVM classifier on the raw data, we could not get a good classifier: the accuracy was roughly 50%.

Therefore, we relabeled the data using the average air quality value as the boundary. We also used kernel functions to lift the features into a higher dimension and make the model more expressive. We found that the radial basis SVM gave the most accurate model (below are the decision boundaries on the training data for 2 features).

Figure 4. SVM decision boundaries on the training data for the linear, polynomial, and radial basis SVM models (left to right).
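
A sketch of the SVM experiments: labels split at the pollutant's mean value, the three kernels from Figure 4, and the 5-fold cross-validation described next. File and column names are assumptions.

```python
# Train SVMs with three kernels on above/below-average labels and report the
# best and mean accuracy over the five held-out folds.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

df = pd.read_csv("houstonco.csv")
features = ["icon_code", "temp_f", "humidity", "wind_speed",
            "wind_gust", "pressure", "visibility"]
X = MinMaxScaler().fit_transform(df[features])
y = (df["value"] > df["value"].mean()).astype(int)   # 1 = above-average pollution

for kernel in ("linear", "poly", "rbf"):
    scores = cross_val_score(SVC(kernel=kernel), X, y, cv=5)
    print(kernel, scores.max(), scores.mean())
```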

Then, to avoid overfitting, we used k-fold cross-validation. We created 5 folds and measured the accuracy on each held-out fold for each pollutant, choosing the model with the highest accuracy across the left-out folds. The results are quite good for O3, NO2 and CO, and somewhat worse for PM2.5 and PM10; overall, the accuracies of these models fall between 70% and 85%. To check whether these classifiers are truly good, we plotted ROC curves for models trained on different numbers of features (below is the ROC curve for the O3 SVM model).


Figure 5. ROC plots for SVM models with different number of features

As the ROC curves show, the model trained with the radial basis kernel is genuinely a good classifier. However, the biggest weakness of these models is that they can only predict whether the air quality value is above average or not; they cannot give the exact value of the air quality parameter. Therefore, we kept the SVM approach as a backup plan.

4.4 Random Forest

Having already gained some insight into the structure and nature of the data, a regression forest seemed to have great potential. Not only was the ability to predict actual values enticing, but the ability to handle dependence between features, robustness under noise, and the capacity to deal with features of different weights made the regression forest a definite candidate.

Originally we assumed the features to be independent. This simplification makes life easier and is reasonable as we didn’t see any excessively strong correlations between the features. However, weather features conceivably are correlated, so, instead of trying to identify these potential correlations directly, which is outside of the scope of our paper, we used the Random Forest model to make the question a non-issue.

Random forests are more robust under noise than AdaBoost (Breiman 16) and generally handle noise very well. Weather is extremely noisy: most weather data (unlike climate data) is considered de facto random. More precisely, even though the system is deterministic, small changes in the initial input give vastly different outputs: the system is chaotic (Lorenz). Pollution levels also have high variance in our own tests. This does not necessarily mean there is a high level of noise in the correlation itself: weather and pollution could be so strongly correlated that random variations in one are directly reflected in the other. But the data we are dealing with has a high propensity for noise, so using a regression algorithm that can cope with it makes sense.

Different weights between features are probably the most obvious concern in this case. From our previous algorithms, and simply from looking at the graphs, certain features, such as wind speed, matter a lot, while others, such as visibility, don't. Below is a pie chart showing how the regression forest handles this for NO2 and CO:


Figure 6. Importance of each weather parameter to the air qualities (NO2 and CO content in air)

Note how wind speed carries almost 50% of the weight for NO2. Models such as Naive Bayes clearly can't handle this well.

The final regression forest algorithm didn't need too much fine tuning, though it needed some; most of the work went into testing different random forests, cleaning the data, and interpreting results. Originally we used a modified random forest that split the data based on homogeneity, which is how one builds a standard classification tree and is generally meant for discrete predictions. Since we had enough data this worked fine, but we soon realized that a regression forest was the correct route. Regression forests work by calculating the SSE (sum of squared errors) between predictions and actual values at each candidate split point and choosing the split that minimizes it. This way we make continuous predictions instead of discrete ones, and the leaves return distributions for a continuous output variable.
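
As a toy illustration of that split criterion (not the scikit-learn internals), choosing the best threshold for a single feature looks like this:

```python
# Pick the threshold whose two leaves minimize the summed squared error when
# each leaf predicts its own mean.
import numpy as np

def best_sse_split(x, y):
    """Return (threshold, sse) for the best split of feature x against target y."""
    best_t, best_sse = None, np.inf
    for t in np.unique(x)[:-1]:                 # candidate thresholds
        left, right = y[x <= t], y[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_t, best_sse = t, sse
    return best_t, best_sse
```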

The nice thing about a regression forest is that it does a lot of the fine tuning itself; the not-so-nice part is that it is somewhat computationally intensive, so some optimization went into this. First, we split the algorithm over multiple cores. Luckily the scikit ensemble library already contained the necessary functions, so this boiled down to telling the algorithm to use all 4 logical cores (2 physical cores, hyperthreaded). Other than that, we ran the data on subsets and pre-cleaned it so that when running the same forest multiple times we didn't spend computation time cleaning on every iteration. Computation time and memory became an even larger issue when we gained access to the entire pollution data set as a single file, which we handled by working on subsets.
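
The final forest boils down to a few scikit-learn calls: n_jobs=-1 spreads the trees over all logical cores, and feature_importances_ is the data behind Figure 6. Column names and hyperparameters here are assumptions, not our exact settings.

```python
# Regression forest on a city/pollutant file, with held-out R^2 and importances.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("bostonno2.csv")
features = ["icon_code", "temp_f", "humidity", "wind_speed",
            "pressure", "visibility"]          # wind gust left out (many NaNs)
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["value"], test_size=0.2, random_state=0)

forest = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=0)
forest.fit(X_train, y_train)

print("R^2:", r2_score(y_test, forest.predict(X_test)))
print(dict(zip(features, forest.feature_importances_)))
```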

Data cleaning for the random forest had some unique aspects. Weather conditions were converted to numerals: weather in the data was signified by strings such as “mostlycloudy”, “hazy” or “tstorms”, which were converted into numbers using a purpose-written Excel function. We used Excel at this stage because we were working with smallish data sets (on average 1,800 readings), which allowed for very quick turnaround when throwing up visualizations and calculating things like variance and mean in order to understand what was going on. The random forest's weakness of being mostly a black box was thereby mitigated through Excel.

This insight turned out to be crucial. Originally, having set up the random forest, we got a forest that perfectly predicted pollution levels. That of course seemed impossible, and it was: the dataset contained duplicate readings. We had manually split the data into a training and testing set, and it turned out that the original data set contained each reading on average three times. We still don't know why, but this happened multiple times; it was very easy to fix and would not have been easy to find without Excel. The rest of the cleaning was fairly standard, such as removing null values and splitting the features from the pollution values.
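
The same cleaning is easy to express in pandas (we actually did the string-to-number mapping with an Excel function); the mapping below is illustrative, not the exact encoding we used.

```python
# De-duplicate readings, encode the weather condition strings, and handle NaNs.
import pandas as pd

ICON_CODES = {"clear": 0, "partlycloudy": 1, "mostlycloudy": 2,
              "cloudy": 3, "hazy": 4, "fog": 5, "rain": 6, "tstorms": 7}

def clean(path):
    df = pd.read_csv(path)
    df = df.drop_duplicates()                    # the triplicated readings
    df["icon_code"] = df["icon"].map(ICON_CODES)
    df["wind_gust"] = df["wind_gust"].fillna(0)  # NaN gusts treated as 0
    df = df.dropna(subset=["value"])             # drop rows with no pollution reading
    return df
```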

Interpreting results was also interesting with the random forest. We first wrote some simple checks by hand to see where we stood: we calculated variance and mean, and even produced a TSV file of predictions so we could look at the numbers directly and get a feel for them. The R-squared value is especially relevant: R-squared indicates how good a predictor is relative to the variance in the data. Different pollutants have quite different variances, so R-squared allows us to compare the effectiveness of random forests across pollutants. Along the way we also calculated the mean squared error and some other statistics.

In terms of results, the regression forest did exceedingly well on nitrogen dioxide and carbon monoxide while falling somewhat flat on PM2.5 and ozone: R^2 values were on average .95, .90, .56, and .43, respectively. Values of .56 and .43 aren't exactly bad, and they still show that weather can predict pollution, but they are comparatively worse. Possible reasons are discussed in the conclusion section.

4.5 Percent Change

Given the results of the other machine learning methods, we found it difficult to predict PM2.5, O3, and SO2. In line with our hypothesis, we therefore wanted to see whether we could predict pollution as a function of change instead. Furthermore, it made sense to calculate change as percent change against the average value of the set, rather than percent change from the previous value, because of the number of small and zero pollution readings.

To accomplish this, the weather/air pollution data was combined on identical dates. Because some readings occurred on the same day, all of the pollution readings for one day were averaged together to ensure that the units of change per unit time were the same for each data point. The entire file was then sorted on the date, which was parsed to a Python 'datetime' object, and each row was given an additional 'change' value representing the difference between the current row and the previous row. Each change value was then divided by the average of every pollution reading in the dataset to yield percent change.
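
A compact sketch of that transformation in pandas, with assumed file and column names:

```python
# Average readings per day, sort by date, take day-to-day differences, and
# divide by the dataset mean to get percent change.
import pandas as pd

df = pd.read_csv("bostonpm25.csv", parse_dates=["utc_date"])
daily = (df.assign(day=df["utc_date"].dt.date)
           .groupby("day", as_index=False)["value"].mean()
           .sort_values("day"))
daily["change"] = daily["value"].diff()
daily["pct_change"] = daily["change"] / df["value"].mean()
```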

We then trained a random forest on this data with the same parameters as the previous model. Although this model was highly accurate on each individual dataset (R^2 ~ 0.97 for PM2.5), it was entirely inaccurate when we attempted to extend it to data from different cities. This was probably due in large part to the small size of the data, which was unavoidable because the majority of the air quality readings were restricted to the past few months. Though this was not as much of an issue for the first set of models, since multiple readings occurred per day, it left fewer than two hundred data points for the percent-change calculation. That is certain to cause overfitting, especially with a random forest, which explains why the model was so inaccurate on additional data.

5. Conclusions

Our results suggest that there is a correlation between air quality and weather, and that certain weather parameters can be used to predict air quality in the near future. They support our original hypothesis, including the correlation we saw between wind speed and air quality. According to the machine learning models, the most important factor in predicting PM2.5, PM10, CO and NO2 is wind speed, while humidity was the most important factor for O3. These were the expected results given the original correlation matrix. Ozone is most likely an outlier because of the close correlation between ozone and the weather conditions tied to precipitation; since humidity is the parameter most closely tied to precipitation, it makes sense that humidity would be the strongest predictor of ozone levels.

6. Challenges and Future Direction

At the beginning, we had an issue with gathering the weather data. Because the Weather Underground API limits the number of calls per minute, it would have been extremely time consuming to pair all of the air quality readings with weather. However, we found that we could largely avoid this limit by making a single API call per day and reading the hourly observations from the response JSON. Another issue was the previously noted difficulty of fitting a model for PM2.5, O3, and SO2. The models generated for these pollutants were less accurate, so we tried to see whether predicting day-to-day change would allow for greater accuracy. However, it turned out that the collected data previously used to generate models was insufficient once averaged per day; there was not enough data to generate an accurate model, and overfitting occurred.

We hoped to resolve this issue by using the entirety of the bulk air quality data from OpenAQ. We obtained this data by contacting the OpenAQ team, who sent us the entire historical air quality archive. However, due to the size of the file and time constraints, we were unable to fit a percent-change model on the entire dataset.

However, given more time, we would test both sets of machine learning models on the bulk dataset. Another future direction is increasing the usability of the web app. Though we currently predict the air quality for a specific input location, we could also store our predictions and test them against actual air quality readings. Both of these areas would allow us to refine our model and increase the accuracy of prediction.

7. Deliverables

At this point, we have completed about 115% of the work proposed in our proposal, including the visualization of the data, machine learning model building and analysis, data prediction, and the implementation of a webpage that lets a user interact with the models and see predicted air quality for the next few days.

 

8. Works Cited:

Breiman, L. (2001). Random Forests. Berkeley Statistics Department, 16.

Hochadel, M., et al. (2006). Predicting long-term average concentrations of traffic-related air pollutants using GIS-based information.

Lorenz, E. (1963). Deterministic Nonperiodic Flow. Journal of the Atmospheric Sciences (AMS), 20(2), 130–141.

Petzoldt, K. et al. (1994). Correlation between stratospheric temperature, total ozone, and tropospheric weather systems.

 

Third update: Smarter Machines and Better Algorithms

Group members: Eric Liu (eliu7), Felix Merk (fmerk), Yijun Shao (yshao3), Ying Su (ysu1)

This week we refined our Machine Learning algorithms and our models.  Based on the deeper insights we got from our data last week, and by improving on what we had developed, we’ve gotten much more nuanced and accurate results.

SVM didn't seem too promising last week. However, we made some changes and got much better results. We did this through a few crucial steps. We increased the number of features we use. To guard against overfitting, we use k-fold cross-validation: we create 5 folds and test accuracy on each held-out fold for each pollutant. We use the average pollution per unit as the boundary for the labels, meaning that a reading above average is labeled 1, otherwise 0. For each pollutant's dataset, we get a prediction accuracy on the left-out fold. For PM10 and PM2.5, accuracies hit a low of 50% but range up to 75%. Clearly SVM isn't the best choice here, or some specific refinement may be necessary. For the other pollutants our numbers are generally above 70% and bordering on 80% accuracy. For example, for NO2 in one dataset:

{'rbf': 0.76449912126537789, 'linear': 0.75395430579964851, 'poly': 0.73637961335676627}

To get a clearer picture of the relationship between weather features and air quality, we created a scatter-plot matrix to make it easy to check the pairwise correlations. We examined the impact of 7 features (weather icon, temperature, humidity, wind speed, wind gust, pressure and visibility) on the PM2.5 content (labeled HIGH for content > 12 µg/m3 and LOW for content <= 12 µg/m3). On the plots, the blue dots are LOW PM2.5 readings, and the red dots are HIGH.


First of all, the labels are unbalanced (there are about twice as many blue dots as red dots), so we need to take this imbalance into account for classification. Secondly, the data distribution looks fairly random for four features: temperature, humidity, pressure and visibility, while the labels tend to separate along the axes of the other three features: weather icon, wind speed and wind gust, especially wind speed. This implies that we should pay more attention to the latter three features during classification.

 

Logistic regression on pm2.5 data
For logistic regression, we used the LogisticRegressionCV classifier with L2 regularization in sklearn.linear_model, which allows us to compensate for the imbalance in the data during fitting. To test the intuition gained from the scatter-plot matrix, we first performed logistic regression on single features (weather icon, wind speed, wind gust and temperature). The accuracies for the four models are: 0.53, 0.61, 0.46 and 0.58, respectively. The likely reason for the low accuracy of the wind gust model is that many rows have a NaN wind gust value, which were treated as 0 during data preparation. This should be possible to fix, but will need some thought. The receiver operating characteristic (ROC) plots were also drawn for the four models (below), and the wind speed model has the largest area under the curve. Generally these results agree with our hypothesis that wind speed is an important feature for the PM2.5 value.


To explore the wind speed feature further with logistic regression, we introduced polynomial terms into the model. The accuracies of the models with a squared wind_speed term and a cubic wind_speed term are 0.61 and 0.64, respectively; they are not dramatically improved compared to the model with only the linear wind speed term (0.61). The figure also shows that the ROC curves of the polynomial models lie on top of the linear model's curve, suggesting that the polynomial terms did not noticeably improve the model.
We also compared models with multiple (linear) features. The accuracies are: 0.63 with all 7 features, 0.58 with all 7 features except wind_speed, 0.62 with 3 features (weather_icon, wind_speed, wind_gust), and 0.61 with 2 features (weather_icon, wind_speed). The ROC curves for those models are shown in the figure below. The results again indicate that wind_speed is the most important feature for the PM2.5 model; adding the other features improves the accuracy from 0.61 to 0.63 and the area under the curve from 0.67 to 0.68.


We tested another machine learning tool this week: random forests. Originally we ran into some issues (specifically, the data we tried to work on contained duplicates), so our results were showing 100% accuracy. Furthermore, we had to convert values such as sunny or cloudy into numbers. For the random forests we included all features except date and wind gusts: wind gusts have too many NaNs, so they would likely be counterproductive, and date would need to be handled later and wasn't that relevant to our main factor of interest (the weather). With a clean set of data, we ran a random forest. It took some tweaking, but it runs well. The nice thing about random forests is that they return a value, so we have a precise prediction to measure against. On average, the random forest was 32.42% off with its predictions for PM2.5 (equivalent to an error of about 0.0048 in the PM2.5 units). Given that the day-to-day variance in the readings is fairly high, this is pretty good, though further mathematical analysis will be necessary to confirm it. The method could also be refined further, and tests on other pollutants could give better results; given how much the SVM results varied between pollutants, this could be crucial.

This coming week we plan on finishing up our machine learning, specifically looking at things like how well the same algorithms generalize between different geographic locations (after normalization). We also plan on focusing on our visualization and making concrete progress in that space.

Second Update: Machine Learning Work

Group members: Eric Liu (eliu7), Felix Merk (fmerk), Yijun Shao (yshao3), Ying Su (ysu1)

We focus on the machine learning part this week. Playing around with different regression and classification methods, we are trying to find a proper model to best illustrate the relationship between the air quality and different weather features. Now let’s take a look at what we have gotten so far.

First, we tried the Support Vector Machine (SVM) method on the CO dataset (i.e., the CO content in air). We divided the labels into two parts, above 0.1 and below 0.1. First we simply trained the model using the maximum-margin method without a kernel function. The algorithm gives us the model shown in the following graph:


Apparently, this is not a decent classifier, and the training accuracy is 0.6416. Therefore, we considered using a kernel function to map the features into a higher dimension. We used a polynomial kernel to train the model, which gives us the model shown in the following graph:


The training accuracy for this model is 0.6243, even lower than the first one. Therefore, we tried a radial basis kernel function to train the model, and we get a new model that looks like this:


This model increases the training accuracy to 0.6655, but it is still low.

Based on these models, we feel that, at least for this dataset, the support vector machine method might not be a good way to build a data model.

For another air quality parameter, pm2.5, we used the logistic regression to do the classification.

According to the US Environmental Protection Agency (EPA), PM2.5 content above 12 ug/m3 is considered harmful to human health. So we divided the PM2.5 data into two classes: HIGH (> 12 ug/m3) and LOW (<= 12 ug/m3), and used logistic regression to classify the data. The data set is based on the air quality over 5 months in Houston, and the base rate of HIGH PM2.5 is about 35%. The weather features include temperature, humidity, wind speed, wind gust, pressure and visibility. All features were normalized into the 0~1 range using min-max normalization.

Firstly, we used the SGD method with L2 regularization to implement the logistic regression, and after tuning the parameters (i.e. threshold, α, and λ), we obtained a training accuracy of 68.3%, with the following confusion matrix:

                             True HIGH        True LOW

Predict HIGH            259                  139

Predict LOW             525                  1170

The recall of this model is quite low (about 33%). Since the number of LOW observations is much larger than the number of HIGH observations, the data set is unbalanced, which pushes the model toward predicting the majority label and results in the low recall.

To address the imbalance, we chose the LogisticRegressionCV classifier in sklearn.linear_model, which allows us to adjust weights inversely proportional to class frequencies in the data set, balancing the data. We still use L2 regularization and set the number of cross-validation folds to 5. The accuracy of the model fit is 62%, and the confusion matrix is as follows:

                             True HIGH        True LOW

Predict HIGH            484                 494

Predict LOW             300                  815

Although the accuracy is lower than in the first model, the recall increases from 33% to 62%, and the precision is now about 50%. The MSE for the cross-validation fit is 0.38.

To explore the logistic regression further, we introduced polynomial terms into the model. With a squared term, the cross-validation accuracy increases to 65%, the recall is 63%, the precision is 46%, and the MSE is 0.35. Further introducing cubic terms brings the accuracy to 66% with 64% recall, 53% precision, and an MSE of 0.34. Generally, the polynomial terms help to slightly increase the accuracy and precision of the model without introducing extra error. But does that mean the polynomial model is the better model for this application? We still need further evidence, and hopefully we will reach a conclusion in the next blog post.

                              Accuracy(%)       Recall(%)       Precision(%)      MSE

Linear                           62                     62                    50                  0.38

Square Poly                 65                     63                    46                  0.35

Cubic Poly                   66                     64                     53                 0.34

 

In addition, we also tested the correlation between changes in pollution and various weather parameters.

The output correlation matrix for the Newark NO2 readings was:

                                     NO2 (ppm)
temp                          -0.01204190
hum                           0.09903328
windspeedmph      -0.09779279

For each of the different parameters:

Temperature:
t = -0.53709, df = 1989, p-value = 0.5913
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.05594222   0.03190488

Humidity:

t = 4.4385, df = 1989, p-value = 9.554e-06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.05534415 0.14234392

Windspeed:

t = -4.3824, df = 1989, p-value = 1.235e-05
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.14111646 -0.05409528

These results are somewhat surprising: although the correlations of humidity and wind speed with NO2 are statistically significant (very small p-values), the correlation coefficients themselves are weak (|r| < 0.1), and temperature shows no significant correlation at all.
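
The tests above were run with R's cor.test; the same coefficients and two-sided p-values can be reproduced in Python with scipy, assuming illustrative file and column names:

```python
# Pearson correlation and p-value of each weather parameter against NO2.
import pandas as pd
from scipy.stats import pearsonr

df = pd.read_csv("newarkno2.csv")
for col in ["temp", "hum", "windspeedmph"]:
    r, p = pearsonr(df[col], df["value"])
    print(f"{col}: r = {r:.4f}, p = {p:.2e}")
```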

In the upcoming week, we hope to be able to generalize these findings over a larger dataset. Our goal is to append all of the files with the same pollutant of interest, convert all readings to the same unit, then calculate the average percent change (taken against an average of all readings from the same station), and correlate that with the same weather parameters.

Here are some scatter plots for the data mentioned above:


For the O3 readings, similar results were observed:

                                         Change (ppm)
Temperature                -0.014148312
Humidity                      0.001023036
Wind Speed                 -0.092476728


 

First Update: A Change in Course and some cool Graphs

Group members: Eric Liu (eliu7), Felix Merk (fmerk), Yijun Shao (yshao3), Ying Su (ysu1)

Since our first blog post it's been a while. First things first: our project has shifted. Tweet data fell through, and without tweets we had no way of knowing the mood people were in. We even tried collecting data ourselves. The issue became that even though we could get thousands of tweets for a geographic location, we had no way of getting them over a longer time frame than a few weeks, so we had no way to filter out seasonal variation in either weather or mood. So we scrapped anything to do with sentiment and tweets and moved to pollution. Pollution is great because it matters: it affects people's health and has serious environmental consequences. Pollution is also great because we found a great data set: OpenAQ. So now we have our data, OpenAQ and Weather Underground, and an idea. This catches us up to where we were a week ago.

We set out this week with two goals in mind. Firstly, we wanted to show that our data showed something. We wanted to be able to throw up a graph and say: here, see, we have something. If we couldn't manage this we would have to go with a backup plan. Secondly, although we knew that we wanted to use a multiple linear regression model, we needed to find a way to apply weights in this model. Weights are necessary because, in all likelihood, and as further investigation showed, all factors are not created equal.

So, visualization first.  We tried graphing everything against everything, for one collection station in Newark.  Naturally some graphs turned out like this (NO2 vs Temperature):

NO2(ppm) vs T (F)

Which is to be expected when you just throw up every weather factor and pollution measurement. However, others gave us much more promising graphs (NO2 vs Wind):

NO2(ppm) vs Wind (mph)

Hey! This means we have found something interesting and even have some graphs to show our TA when we meet up with him.

So, on to our modeling. We've decided to use multiple linear regression to model the relationship between a specific air quality measure (e.g. pm25) and a series of independent weather variables (e.g. temperature, humidity, wind_speed, pressure, etc.). Important note: these variables may very well not be independent. For example, humidity and temperature are likely fairly correlated. This potential double counting will be addressed in the more refined model by creating a combined weather/humidity factor. For simplicity of explaining our approach, we'll treat our "factors" as independent.

We will first use Stochastic Gradient Descent (SGD) to determine the individual weights associated with each weather feature, using 90% of our collected data as training data and 10% as test data. Beyond directly using SGD to determine the weights on all the weather parameters (the full model), it is possible that some weather parameters do not have much effect on the specific air quality measure, so we want to remove those uninformative variables from our model and keep only the relevant ones.

So how do we determine which parameters are important and which are uninformative? We will apply "forward selection" for variable selection. Specifically, we start with models with only one variable (e.g. wind_speed), and calculate the proportion of explained variation (PVE) and the Akaike Information Criterion (AIC) of each model. We then choose the best model (highest PVE and lowest AIC) and, in the second round of selection, add another variable (e.g. temperature) to form a two-predictor model, comparing the PVEs and AICs of all possible two-predictor models (i.e. wind_speed + temperature, wind_speed + humidity, wind_speed + pressure, etc.). We then choose the best model from the second round; if it has a higher AIC than the single-variable model from the first round, indicating that it probably overfits the data, we stop adding variables. Otherwise, we perform a third round of selection by adding one more variable and comparing the PVEs and AICs of those new models. We stop iterating when adding another variable increases the AIC; otherwise we eventually include all the variables in the model. A sketch of this loop is below.
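
A sketch of that forward-selection loop, scoring candidate models by AIC with statsmodels OLS (column names are assumptions; the PVE of each fit could be read off the same results object via .rsquared):

```python
# Greedy forward selection: keep adding the variable that lowers AIC the most,
# and stop as soon as adding a variable no longer improves the AIC.
import statsmodels.api as sm

def forward_select(df, target, candidates):
    candidates = list(candidates)
    chosen, best_aic = [], float("inf")
    while candidates:
        # Try adding each remaining variable and keep the one with the lowest AIC.
        trials = []
        for var in candidates:
            X = sm.add_constant(df[chosen + [var]])
            trials.append((sm.OLS(df[target], X).fit().aic, var))
        aic, var = min(trials)
        if aic >= best_aic:          # adding another variable no longer helps
            break
        chosen.append(var)
        candidates.remove(var)
        best_aic = aic
    return chosen

# e.g. forward_select(df, "pm25", ["wind_speed", "temperature", "humidity", "pressure"])
```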

Toward the end of this week we discussed some finer points of this approach, including some very helpful input from our TA, and we've added some factors for consideration. We could use the BIC instead of the AIC for assessing model fit, for potentially better results because of stronger protection against overfitting. Other analysis that could make a large impact, or otherwise be useful, includes using z-scores to assess whether our findings carry much relevance, and using MDL for a potentially different approach.

If the above models do not give a decent accuracy, we might want to introduce polynomials to the regression function; for example, some variables might need a square or even a cubic term in the model. Furthermore, we have plans to divide pollution into different zones, potentially based on health hazard, allowing us to use SVM or another classifier. This could provide us with further insight and other interesting ways of visualizing our results.

With regard to statistically exploring our data, we calculated the change in pollution (in ppm or other units) relative to the previous pollution reading. We then plan to test the correlation between different weather events and the change that occurs in the pollution reading. Because this involves correlating categorical variables (weather) with continuous variables (pollution), we plan on using a heterogeneous correlation matrix to determine their relationship.

Progress has been strong, so we're looking forward to where we stand next week!

Midterm Report

Authors: Eric Liu (eliu7), Felix Merk (fmerk), Yijun Shao (yshao3), Ying Su (ysu1)

1.  An introduction that discusses the data you are analyzing, and the question or questions you are investigating. You should be able to explain what your data looks like (words are fine, but visualizations are often better).

We are analyzing tweet data and weather data to find relationships between people's emotions, their location, and the weather. First, we get our raw data from Twitter and http://www.wunderground.com/. For the Twitter data, since we are only able to get real-time streaming data, we have to manually collect data in different cities every day. Second, we clean the tweet and weather data and merge them together; the final data is a CSV file with columns for location, time, weather (including icon, temperature, humidity, etc.) and tweet text. Third, we use the approach from the machine learning lab to predict labels (1 or 0) for each tweet. We then did some trials using the labeled tweets and weather data as training data, and used both Naive Bayes and logistic regression to predict the relationship between weather and the label. As a next step, we want to explore the data further with other machine learning techniques, e.g. topic modeling, to predict people's emotions more precisely. Eventually, we hope to use the results for a recommendation system; for example, we could recommend music of different genres to people based on their emotions in a given weather.

2. At least one visualization that tests an interesting hypothesis, along with an explanation about why you thought this was an interesting hypothesis to investigate.

Our hypothesis is that we can estimate a person's mood by analyzing the sentiment of the tweets he/she publishes, and we suppose that people's moods can be influenced by weather. Of course, people in different countries, states or cities may be in different moods even with the same weather. So the purpose of our current work is to investigate the relationship between people's mood, their location, and the local weather.

Figure 1

Figure 1 shows the mood distribution across different cities. The vertical axis is the ratio between the number of positive tweets (label 1) and the number of negative tweets (label 0); the higher the ratio, the "happier" the people in that city.

Figure 2

Figure 2 adds the weather factor into consideration. Since we have a limited amount of data (tweets from only five days), only two weather categories (clear and cloudy) appear in this figure, and we still lack tweet data for Houston when the weather is clear. The number shown on each tile (which appears on hover) is again the ratio of positive to negative tweets.

3.  At least one result that came from applying machine learning algorithms to your data, along with an accompanying interpretation.

(1) We used both the training and test tweet data provided in the machine learning lab as our training data, and labeled our own tweets using the support vector machine method.

(2) Using the labeled results of part (1), we were able to use a portion of our collected data as a training data set in order to generate a prediction for sentiment given certain weather conditions. We used both the naive Bayes and logistic methods on a set of test data to predict sentiment labels, with the temperature and humidity as features.

Classification results for part (2):

Logistic classifier:

Training accuracy: 0.627422156889

CITY: chicago

TWEET COUNT: 14966

MEAN: 0.627422224127

STD: 0.000205408220047

Features Coefficients:

[[ 0.01721465 -0.02651785 -0.001266  ]]

[ 0.00203613]

 

Naive Bayes:

Training accuracy: 0.372577843111

CITY: chicago

TWEET COUNT: 14966

MEAN: 0.372577775873

STD: 0.000205408220047

Features Coefficients:

[[ 3.85861908  2.14773783  4.00537603]]

[-0.46613567]

From the results, it can be seen that the accuracy of the prediction is pretty low. One possible reason might be that our training data set is not large enough; it is also possible that the data itself is not suitable for classification with the above classifiers. We will continue to explore improving the classification accuracy if we are able to get a larger (historical) data set.

4.  A discussion of the following:

  • What is hardest part of the project that you’ve encountered so far?

The most difficult issue that we've encountered so far is collecting/accessing the appropriate data for our project. So far, we've been focusing on correlating weather with music/tweets, but both of those have the issue that there are no large historical repositories of data. For music streaming, this is a privacy issue, while for Twitter, it is because the company does not make historical tweets available to outside parties. Without large amounts of historical data, it is difficult to correlate a long-term trend (i.e. weather) with the tweet data that we collected.

  • What are your initial insights?

From what we can see so far, there is a certain degree of correlation between weather and the average sentiment of tweets. This can be seen in the fact that we are able to find specific tweets which reference good or terrible weather. On a larger scale, however, we are trying to see whether there is a broader correlation between weather and the mood of all tweets. With regard to this question, the connection is much more tenuous, though there appears to be some small correlation. It is difficult to know whether this is due to random variation, especially given that the data we are modeling does not span a particularly long duration.

  • Are there any concrete results you can show at this point? If not, why not?

The current results we have are: our collected data, sorted into sentiment values based on the content of each tweet; the visualized relationships between mood, location and weather based on the entirety of the data we collected; and the weather models determined from that data. These weather models are created for each city that we collected data from, and can predict, given the weather, what proportion of tweets will be negative/positive.

  • Going forward, are the current biggest problems you’re facing?

At this time, and in the future, our biggest problem will be finding and verifying a statistically significant correlation between the weather and tweet sentiment (or lack thereof). This is due to the initial issue that we had with collecting data, as that introduces much more variation into our data analysis.

  • Do you think you are on track with your project? If not, what parts do you need to dedicate more time to?

We are on track with our project given the data that we are working with. We have initial visualizations and have used machine learning to make a set of tweet sentiment predictors by location. This is exactly where we planned to be with our project.

  • Given your initial exploration of the data, is it worth proceeding with your project?

We feel that the idea of the project is good, as are the applications that we have in mind, but the fundamental lack of data regarding tweets and/or music streaming for a long duration in the past makes it very difficult to draw any conclusions. We think that it is only worth proceeding with the project if we are able to obtain a significantly larger (historical) data set.