Group members: Eric Liu (eliu7), Felix Merk (fmerk), Yijun Shao (yshao3), Ying Su (ysu1)
It's been a while since our first blog post. First things first: our project has shifted. Tweet data fell through, and without tweets we had no way of gauging the mood people were in. We even tried collecting data ourselves. The issue was that even though we could get thousands of tweets for a geographic location, we had no way of getting them over a time frame longer than a few weeks, so we had no way to filter out seasonal variation in both weather and mood. So we scrapped anything to do with sentiment and tweets and moved to pollution. Pollution is great because it matters. It affects people's health and has serious environmental consequences. Pollution is also great because we found a great data set: OpenAQ. So now we have our data, OpenAQ and Weather Underground, and an idea. This catches us up to where we were a week ago.
We set out this week with two goals in mind. Firstly, we wanted to show that our data showed something. We wanted to be able to throw up a graph and say: Here, see, we have something. If we couldn’t manage this we would have to go with a backup plan. Secondly, although we knew that we wanted to use a multiple linear regression model to model our data, we needed to find a way to apply weights in this model. Weights are necessary because in all likelihood, and as later investigation showed, not all factors are created equal.
So, visualization first. We tried graphing everything against everything, for one collection station in Newark. Naturally some graphs turned out like this (NO2 vs Temperature):
Which is to be expected when you just throw up every weather factor and pollution measurement. However, others give us much more promising graphs (NO2 vs Wind):
Hey! This means we have found something interesting and even have some graphs to show our TA when we meet up with him.
So, onto our modeling. We’ve decided to use multiple linear regression to model the relationship between a specific air quality measure (e.g. pm25) and a set of weather variables (e.g. temperature, humidity, wind_speed, pressure, etc.). Important note: these variables may very well not be independent of each other. For example, humidity and temperature are likely fairly correlated. We plan to address this potential double counting in the more refined model by creating a combined weather humidity factor. To keep the explanation of our approach simple, we’ll treat our “Factors” as independent for now.
We will first use stochastic gradient descent (SGD) to determine the weight associated with each weather feature, using 90% of our collected data as training data and 10% as test data. Fitting weights for all the weather parameters at once gives us the full model. However, some weather parameters may have little effect on a given air quality measure, so we also want to remove those uninformative variables from our model and keep only the relevant ones.
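To make the fitting step concrete, here is a minimal sketch of SGD for a linear model with a 90/10 split. It is not our actual pipeline: the data is synthetic, and the function names, learning rate, and epoch count are assumptions picked for illustration.

```python
import random

def train_test_split(rows, test_fraction=0.1, seed=0):
    """Shuffle rows and hold out the last `test_fraction` of them as test data."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[:-n_test], rows[-n_test:]

def sgd_linear_regression(rows, lr=0.01, epochs=200, seed=0):
    """Fit y = b + w . x by SGD on squared error; rows are (features, target) pairs."""
    rng = random.Random(seed)
    rows = list(rows)
    w, b = [0.0] * len(rows[0][0]), 0.0
    for _ in range(epochs):
        rng.shuffle(rows)
        for x, y in rows:
            err = b + sum(wi * xi for wi, xi in zip(w, x)) - y
            # Gradient of 0.5 * err**2 with respect to each weight and the bias.
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def mean_squared_error(rows, w, b):
    """Average squared prediction error over rows."""
    return sum((b + sum(wi * xi for wi, xi in zip(w, x)) - y) ** 2
               for x, y in rows) / len(rows)

# Synthetic, made-up data: target = 3 - 2*x0 + 0.5*x1 plus small Gaussian noise.
data_rng = random.Random(42)
data = []
for _ in range(500):
    x = [data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)]
    data.append((x, 3 - 2 * x[0] + 0.5 * x[1] + data_rng.gauss(0, 0.05)))

train, test = train_test_split(data)          # 450 training rows, 50 test rows
weights, bias = sgd_linear_regression(train)
test_mse = mean_squared_error(test, weights, bias)
```

On real weather data the features live on very different scales (pressure versus wind speed, say), so standardizing each column before running SGD would probably be needed for a fixed learning rate like this to behave.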
So how do we determine which parameters are important and which are uninformative? We will apply “forward selection” for our variable selection. Specifically, we start with models that have only one variable (e.g. wind_speed), and calculate the proportion of variance explained (PVE) and the Akaike Information Criterion (AIC) of each model. Then we choose the best model (the one with the highest PVE value and the lowest AIC value). For the second run of selection, we add another variable (e.g. temperature) to this model to form a two-predictor model, and we compare the PVEs and AICs of all the possible two-predictor models (i.e. wind_speed + temperature, wind_speed + humidity, wind_speed + pressure, etc.). Then we choose the best model from the second run. If this model has a higher AIC value than the single-variable model we obtained from the first run, which suggests that it probably overfits the data, we stop adding variables. Otherwise, we perform a third run of selection by adding one more variable to the model and comparing the PVEs and AICs of those new models. We keep iterating until adding another variable increases the AIC; if that never happens, we end up with all the variables in the model.
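The selection loop can be sketched as follows. This is a minimal illustration under stated assumptions, not our final code: it fits each candidate by ordinary least squares, uses the Gaussian form of the AIC (up to an additive constant), and starts from an intercept-only baseline rather than directly from the single-variable models. The data is synthetic and all function names are hypothetical. Within one round every candidate has the same number of parameters, so ranking by PVE and ranking by AIC give the same order; the AIC only matters for the stopping decision.

```python
import math
import random

def ols_rss(columns, y):
    """Residual sum of squares of a least-squares fit with an intercept."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Normal equations (X^T X) beta = X^T y, stored as an augmented matrix.
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    return sum((y[r] - sum(bi * xi for bi, xi in zip(beta, X[r]))) ** 2
               for r in range(n))

def aic(rss, n, n_params):
    """Gaussian AIC up to an additive constant: n * ln(RSS / n) + 2k."""
    return n * math.log(rss / n) + 2 * n_params

def forward_selection(features, y):
    """Greedily add the variable that lowers the AIC most; stop when none does."""
    n = len(y)
    mean_y = sum(y) / n
    best_aic = aic(sum((v - mean_y) ** 2 for v in y), n, 1)  # intercept only
    selected = []
    while len(selected) < len(features):
        scores = {}
        for name in features:
            if name not in selected:
                cols = [features[s] for s in selected + [name]]
                scores[name] = aic(ols_rss(cols, y), n, len(cols) + 1)
        candidate = min(scores, key=scores.get)
        if scores[candidate] >= best_aic:
            break  # adding any remaining variable no longer improves the AIC
        selected.append(candidate)
        best_aic = scores[candidate]
    return selected, best_aic

# Synthetic, made-up data: pm25 depends on wind_speed and temperature only.
rng = random.Random(0)
n_obs = 200
features = {}
for name in ("wind_speed", "temperature", "pressure"):
    features[name] = [rng.gauss(0, 1) for _ in range(n_obs)]
pm25 = []
for w, t in zip(features["wind_speed"], features["temperature"]):
    pm25.append(10 - 3 * w + t + rng.gauss(0, 0.5))

selected, final_aic = forward_selection(features, pm25)
```

In this toy setup wind_speed should be picked first and temperature second. Whether the pure-noise pressure column sneaks in afterwards depends on the random draw, which is a known weakness of AIC-based stopping.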
Toward the end of this week we’ve discussed some more fine-tuned aspects of this approach, including some very helpful input from our TA, and we’ve added a few things to consider. We could use the BIC instead of the AIC for assessing model fit, which might give better results because it penalizes extra parameters more heavily and so protects better against overfitting. Other analyses that could make a large impact, or otherwise be useful, are using z-scores to check whether our findings are actually significant and using minimum description length (MDL) as a potentially different approach.
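For reference, the standard definitions of the two criteria make the difference concrete. The symbols are notation introduced here for illustration: $k$ is the number of fitted parameters, $n$ the number of observations, and $\hat{L}$ the maximized likelihood of the model.

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L}
\qquad
\mathrm{BIC} = k\ln n - 2\ln\hat{L}
```

Each extra parameter costs 2 under the AIC and $\ln n$ under the BIC, so the BIC is the stricter of the two as soon as $n \ge 8$. With the number of readings we have, it would tend to stop forward selection earlier than the AIC does.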
If the above models do not give decent accuracy, we might introduce polynomial terms into the regression function; for example, some variables might need a squared or even cubic term in the model. Furthermore, we have plans to divide pollution into different zones, potentially based on health hazard, allowing us to use an SVM or another classifier. This could provide us with further insight and other interesting ways of visualizing our results.
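Adding polynomial terms keeps the model linear in its weights; only the feature set grows. A minimal sketch, assuming each reading is stored as a dict of named features (the helper name and the `^` naming scheme are hypothetical, and cross terms are left out):

```python
def add_polynomial_terms(row, degree=2):
    """Return a copy of row extended with powers 2..degree of each feature."""
    expanded = dict(row)
    for name, value in row.items():
        for power in range(2, degree + 1):
            expanded[f"{name}^{power}"] = value ** power
    return expanded

# Made-up example reading.
reading = {"wind_speed": 3.0, "temperature": 20.0}
expanded = add_polynomial_terms(reading, degree=3)
```

The expanded rows can then go through the same regression and variable selection as before, which also lets the selection step decide whether a squared or cubic term is worth keeping.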
With regard to statistically exploring our data, we calculated the change in pollution (ppm/other units) relative to the previous pollution reading. Next, we plan to test the correlation between different weather events and the change that occurs in the pollution reading. Because we are looking for correlation between categorical variables (weather) and continuous variables (pollution), we plan on using a heterogeneous correlation matrix to determine their relationship.
Progress has been strong, so we’re looking forward to seeing where we stand next week!
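As a rough illustration of the idea, the sketch below computes reading-to-reading changes and correlates them with a single yes/no weather event. Note the swap: this is a plain point-biserial correlation (Pearson correlation with a 0/1 variable), a simpler stand-in for the heterogeneous correlation matrix we plan to use, and it only handles two-valued events. The readings and the rain flags are made up.

```python
import math

def differences(readings):
    """Change of each reading relative to the previous one."""
    return [b - a for a, b in zip(readings, readings[1:])]

def pearson(xs, ys):
    """Pearson correlation; with a 0/1 variable this is the point-biserial r."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up consecutive NO2 readings and a made-up "was it raining" flag.
no2 = [30.0, 32.0, 31.0, 25.0, 20.0, 22.0, 24.0, 18.0]
rain = [0, 0, 0, 1, 1, 0, 0, 1]

delta = differences(no2)   # one fewer entry than no2
rain_now = rain[1:]        # align each change with the weather at the later reading
r = pearson(rain_now, delta)
```

A strongly negative `r` here would mean that readings tend to drop when it rains. In this toy data that is true by construction, so it says nothing about what the real data will show.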