Authors: Eric Liu (eliu7), Felix Merk (fmerk), Yijun Shao (yshao3), Ying Su (ysu1)
1. An introduction that discusses the data you are analyzing, and the question or questions you are investigating. You should be able to explain what your data looks like (words are fine, but visualizations are often better).
We analyze tweets together with weather data to look for relationships between people’s emotions, their location, and the weather. First, we collect raw data from Twitter and from http://www.wunderground.com/. Because we can only access Twitter’s real-time stream, we have to collect tweets for different cities manually every day. Second, we clean the tweet and weather data and merge them into a single CSV file with columns for location, time, weather (icon, temperature, humidity, etc.), and tweet text. Third, we use the approach we learned in the machine learning lab to predict a sentiment label (1 or 0) for each tweet. We then ran some trials that use the labeled tweets and weather data as training data, applying both Naive Bayes and logistic regression to predict the label from the weather. As a next step, we want to explore the data further with other machine learning techniques, e.g. topic modeling, to predict people’s emotions more precisely. Eventually, we hope to use our results in a recommendation system, for example one that recommends music of different genres based on people’s emotions under specific weather conditions.
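The merge step described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not our actual cleaning script: the column names (city, date, icon, temperature, humidity, text), the join key (city plus calendar date), and every sample row are assumptions made up for the example.

```python
import csv
import io

# Invented sample data; column names and values are assumptions, not real records.
TWEETS_CSV = """city,date,text
Boston,2015-04-01,Loving this sunshine
Boston,2015-04-02,Stuck inside again
Houston,2015-04-01,Traffic is awful today
"""

WEATHER_CSV = """city,date,icon,temperature,humidity
Boston,2015-04-01,clear,12.0,40
Boston,2015-04-02,cloudy,9.5,71
"""


def merge_tweets_with_weather(tweets_file, weather_file):
    """Join each tweet to the weather observation for the same city and date.

    Tweets with no matching weather observation are dropped (an inner join).
    """
    weather = {}
    for row in csv.DictReader(weather_file):
        weather[(row["city"], row["date"])] = row

    merged = []
    for tweet in csv.DictReader(tweets_file):
        obs = weather.get((tweet["city"], tweet["date"]))
        if obs is None:
            continue  # no weather for this city/date, so skip the tweet
        merged.append({
            "city": tweet["city"],
            "date": tweet["date"],
            "icon": obs["icon"],
            "temperature": float(obs["temperature"]),
            "humidity": float(obs["humidity"]),
            "text": tweet["text"],
        })
    return merged


rows = merge_tweets_with_weather(io.StringIO(TWEETS_CSV), io.StringIO(WEATHER_CSV))
for row in rows:
    print(row)
```

Dropping unmatched tweets keeps the merged file free of missing weather fields, at the cost of discarding data; keeping them with empty weather columns would be an equally reasonable choice.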
2. At least one visualization that tests an interesting hypothesis, along with an explanation about why you thought this was an interesting hypothesis to investigate.
Our hypothesis is that we can estimate a person’s mood by analyzing the sentiment of the tweets they publish, and that people’s mood can be influenced by the weather. Of course, people in different countries, states, or cities may be in different moods even under the same weather. The purpose of our current work is therefore to investigate the relationship between people’s mood, their location, and the local weather.
Figure 1 shows the mood distribution across cities. The vertical axis is the ratio of the number of positive tweets (label 1) to the number of negative tweets (label 0); the higher the ratio, the “happier” the people in that city.
Figure 2 adds weather as a factor. Because we have a limited amount of data (tweets from only five days), the figure covers only two weather categories (clear and cloudy), and we still lack tweets for Houston under clear weather. The number on each tile (shown on hover) is again the ratio of positive to negative tweets.
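The ratio plotted in both figures can be computed along these lines. This is a hedged sketch rather than our plotting code: the record layout (city, weather, label) and the sample records are invented for illustration, and the function name positive_negative_ratio is hypothetical.

```python
from collections import defaultdict


def positive_negative_ratio(records, key):
    """Return {group: positives / negatives} for groups defined by key(record).

    A group with no negative tweets gets None, since the ratio is undefined.
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [negatives, positives]
    for rec in records:
        counts[key(rec)][rec["label"]] += 1

    ratios = {}
    for group, (neg, pos) in counts.items():
        ratios[group] = pos / neg if neg else None
    return ratios


# Invented sample records (label 1 = positive, 0 = negative).
sample = [
    {"city": "Houston", "weather": "cloudy", "label": 1},
    {"city": "Houston", "weather": "cloudy", "label": 0},
    {"city": "Houston", "weather": "cloudy", "label": 1},
    {"city": "Boston", "weather": "clear", "label": 1},
    {"city": "Boston", "weather": "clear", "label": 0},
    {"city": "Boston", "weather": "cloudy", "label": 0},
]

# Figure 1 style: one ratio per city.
by_city = positive_negative_ratio(sample, key=lambda r: r["city"])

# Figure 2 style: one ratio per (city, weather) tile.
by_city_weather = positive_negative_ratio(
    sample, key=lambda r: (r["city"], r["weather"])
)

print(by_city)
print(by_city_weather)
# A tile with no tweets at all, like Houston under clear weather, is simply absent.
print(by_city_weather.get(("Houston", "clear")))
```

Note that a ratio is sensitive to small counts: a tile backed by a handful of tweets can swing widely, so it may be worth showing the tweet count next to each ratio.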
3. At least one result that came from applying machine learning algorithms to your data, along with an accompanying interpretation.
(1) We used both the training and the test tweet data provided in the machine learning lab as our training data, and labeled our own tweets with a support vector machine.
(2) Using the labels from part (1), we used a portion of our collected data as a training set to predict sentiment from weather conditions. We applied both naive Bayes and logistic regression to a set of test data to predict sentiment labels, with temperature and humidity as features.
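To make the setup in part (2) concrete, here is a small from-scratch logistic regression on temperature and humidity. It is a sketch under stated assumptions and not the code that produced the numbers reported in this section: the data points are invented, the learning rate and epoch count are arbitrary, and plain batch gradient descent stands in for whatever solver a library would use.

```python
import math


def standardize(rows):
    """Rescale each feature column to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / len(col)) or 1.0
        for col, m in zip(columns, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def decision_value(x, w, b):
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def fit_logistic(X, y, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the mean log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(decision_value(xi, w, b)) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(X, w, b):
    return [1 if decision_value(xi, w, b) >= 0 else 0 for xi in X]


def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Invented (temperature, humidity) pairs with sentiment labels (1 = positive).
features = [[25, 30], [22, 35], [27, 40], [24, 45],
            [8, 80], [10, 85], [5, 75], [12, 90]]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

X = standardize(features)
w, b = fit_logistic(X, labels)
train_accuracy = accuracy(labels, predict(X, w, b))
print("weights:", w, "bias:", b)
print("training accuracy:", train_accuracy)
```

The toy data is deliberately separable, so it says nothing about how well weather predicts sentiment in real tweets; it only shows the mechanics of fitting and scoring the model.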
Classification results for part (2):
Training accuracy: 0.627422156889
TWEET COUNT: 14966
[[ 0.01721465 -0.02651785 -0.001266 ]]
Training accuracy: 0.372577843111
TWEET COUNT: 14966
[[ 3.85861908 2.14773783 4.00537603]]
From these results, the prediction accuracy is low. One possible reason is that our training set is not large enough; another is that the data is not well suited to the classifiers above. We will keep working on improving classification accuracy if we can obtain a larger (historical) data set.
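One sanity check worth running when accuracy is this low is a comparison against constant predictors. A model that always predicts label 1 scores the fraction of positive tweets, and one that always predicts label 0 scores the complement, so the two accuracies always sum to 1. The two training accuracies reported above (0.627422156889 and 0.372577843111) also sum to 1, which would be consistent with each model predicting a single label for every tweet, although it does not prove it. The sketch below shows the check on invented labels; the helper names are hypothetical.

```python
def constant_accuracy(labels, constant):
    """Accuracy of a model that predicts `constant` for every example."""
    return sum(1 for y in labels if y == constant) / len(labels)


def is_constant(predictions):
    """True when a model emitted the same label for every example."""
    return len(set(predictions)) <= 1


# Invented labels: 5 positive tweets and 3 negative tweets.
labels = [1, 1, 1, 1, 1, 0, 0, 0]

acc_always_one = constant_accuracy(labels, 1)
acc_always_zero = constant_accuracy(labels, 0)

print("always predict 1:", acc_always_one)
print("always predict 0:", acc_always_zero)
print("sum:", acc_always_one + acc_always_zero)

# Applied to real model output, this flags a degenerate classifier.
print(is_constant([1, 1, 1, 1, 1, 1, 1, 1]))
```

If a trained model turns out to be constant, its accuracy reflects only the class balance of the data, and features such as temperature and humidity have contributed nothing to the predictions.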
4. A discussion of the following:
- What is the hardest part of the project that you’ve encountered so far?
The most difficult issue so far has been collecting and accessing appropriate data for our project. We have focused on correlating weather with music and tweets, but for both there are no large historical repositories of data. For music streaming this is a privacy issue, while for Twitter it is because the company does not want outside providers of tweets. Without large amounts of historical data, it is difficult to correlate a long-term trend (i.e., weather) with the tweet data that we collected.
- What are your initial insights?
From what we can see so far, there is some correlation between weather and the average sentiment of tweets. For example, we can find specific tweets that reference good or terrible weather. On a larger scale, we are trying to see whether weather correlates with the mood of all tweets. Here the connection is much more tenuous, though there appears to be a small correlation. It is difficult to know whether this is just random variation, especially since the data we are modelling does not span a particularly long period.
- Are there any concrete results you can show at this point? If not, why not?
Our current results are: the collected data, sorted into sentiment values by tweet content; the visualized relationships between mood, location, and weather, based on all the data we collected; and the weather models that we fit to that data. A weather model is created for each city we collected data from and predicts, given the weather, what proportion of tweets will be positive or negative.
- Going forward, what are the current biggest problems you’re facing?
At this time, and going forward, our biggest problem is finding and verifying a statistically significant correlation (or the lack of one) between weather and tweet sentiment. This follows from our initial difficulty collecting data, which introduces much more variation into our analysis.
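One way to check whether a difference in sentiment between two weather categories could be random variation is a permutation test on the share of positive tweets. The sketch below is an illustration under assumptions, not an analysis we have run: the labels are invented, the number of permutations and the seed are arbitrary, and the function names are hypothetical.

```python
import random


def positive_rate(labels):
    """Fraction of tweets labeled positive (label 1)."""
    return sum(labels) / len(labels)


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in positive rates.

    Returns (observed absolute difference, estimated p-value).
    """
    rng = random.Random(seed)
    observed = abs(positive_rate(group_a) - positive_rate(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign weather categories at random
        diff = abs(positive_rate(pooled[:n_a]) - positive_rate(pooled[n_a:]))
        if diff >= observed - 1e-12:  # tolerance for floating point ties
            extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (extreme + 1) / (n_permutations + 1)


# Invented labels: mostly positive under clear skies, mostly negative under cloud.
clear = [1] * 40 + [0] * 10
cloudy = [1] * 10 + [0] * 40

observed_diff, p_value = permutation_test(clear, cloudy)
print("observed difference:", observed_diff)
print("p-value:", p_value)

# Two identical groups show no difference at all.
same_diff, same_p = permutation_test([1, 0, 1, 0], [1, 0, 1, 0])
print("identical groups:", same_diff, same_p)
```

Shuffling individual tweets treats them as independent, which tweets from the same city and day are not. With only five days of data, permuting whole days instead of single tweets would be the more honest test, and it would have very little power.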
- Do you think you are on track with your project? If not, what parts do you need to dedicate more time to?
We are on track with our project given the data that we are working with. We have primary visualizations, and we have used machine learning to build a set of tweet sentiment predictors, one per location. This is exactly where we planned to be with our project.
- Given your initial exploration of the data, is it worth proceeding with your project?
We feel that the idea of the project is good, as are the applications that we have in mind, but the fundamental lack of data regarding tweets and/or music streaming for a long duration in the past makes it very difficult to draw any conclusions. We think that it is only worth proceeding with the project if we are able to obtain a significantly larger (historical) data set.