*A look at trends in Hedonometer.org's happiness data using Julia, JuMP, and GLPK.*

#### Introduction

Hedonometer.org popped onto my radar a couple weeks ago. It's a nifty project, attempting to convert samples of words found in the Twitter Gardenhose feed into a time series of happiness.

While I'm not a computational social scientist, I must say the data does have a nice intuitive quality to it. There are obvious trends in happiness associated with major holidays, days of the week, and seasons. It seems like the sort of data that could be decomposed into trends based on those various components. The Hedonometer group has, of course, done extensive analyses of their own data which you can find on their papers page.

This post examines another approach. It follows the structure of Robert Vanderbei's excellent "Local Warming" project to separate out the Hedonometer averages into daily, seasonal, solar, and day-of-the-week trends. We'll be using Julia v0.2 with JuMP and GLPK for linear optimization, Gadfly for graphing, and a few other libraries. If you haven't installed Julia, first do that. Then add all the extra requirements in the Julia shell as shown below.

Pkg.add("IJulia") Pkg.add("DataFrames") Pkg.add("Datetime") Pkg.add("Gadfly") Pkg.add("JuMP") Pkg.add("GLPKMathProgInterface") Pkg.add("HypothesisTests")

Alternatively, if you'd rather use an IJulia notebook for reading this post, you can download it here.

#### Data

Hedonometer.org introduced an API recently, but as this post was started before that happened, we'll work with the CSV data. For the time being, you can still download it here. The ambitious reader is referred to the exercise at the end of the page regarding their API.

We use the readtable function from the DataFrames package to import the data. One oddity that this function doesn't seem to deal with yet is that the header has three columns while the rest of the CSV has two. So we skip the header and add column names in later. *x1* is the date column and *x2* is the happiness value, in Hedonometer units.

A brief inspection shows that we are not missing any data, so we don't have to merge in a list of dates with NAs.

using DataFrames tmpData = readtable("dailyFull.csv", header=false, skipstart=1) head(tmpData)

6x2 DataFrame: x1 x2 [1,] "2008-09-10" 6.01006 [2,] "2008-09-11" 6.00417 [3,] "2008-09-12" 6.01726 [4,] "2008-09-13" 6.02889 [5,] "2008-09-14" 6.03155 [6,] "2008-09-15" 5.98864

Note that the dates are strings. We vectorize the date constructor and use it to build a proper data frame.

import Datetime: date @vectorize_1arg String date data = DataFrame(date=date(tmpData[:, :x1]), value=tmpData[:, :x2]) head(data)

6x2 DataFrame: date value [1,] 2008-09-10 6.01006 [2,] 2008-09-11 6.00417 [3,] 2008-09-12 6.01726 [4,] 2008-09-13 6.02889 [5,] 2008-09-14 6.03155 [6,] 2008-09-15 5.98864

Let's start by making a plot of the raw data. This should look similar to the graph on their website, only not as pretty.

using Gadfly set_default_plot_size(24cm, 12cm) colorGradient = Scale.lab_gradient(color("crimson"), color("blue"), color("greenyellow")) plot( data, x = "date", y = "value", color = "value", Guide.xlabel("Date"), Guide.ylabel("Hedonomoeter.org Units"), Guide.title("Average Happiness on Twitter"), Scale.ContinuousColorScale(colorGradient) )

The data looks right, so we're off to a good start. Now we have to think about what sort of components we believe are important factors to this index. We'll start with the same ones as in the Vanderbei model:

- A linear happiness trend describing how our overall happiness changes over time.
- Seasonal trends accounting for mood changes with weather.
- Solar cycle trends.

We'll add to this weekly trends, as zooming into the data shows we tend to be happier on the weekends than on work days. In the next section we'll build a model to separate out the effects of these trends on the Hedonometer.org index.

#### Model

Vanderbei's model analyzes daily temperature data for a particular location using least absolute deviations (LAD). This is similar to the well-known least squares approach, but while the latter penalizes the model quadratically more for bigger errors, the former does not. In mathematical notation, the least squares model takes in a known $m \times n$ matrix $A$ and $m \times 1$ vector $y$ of observed data, then searches for a vector $x$ such that $Ax = \hat{y}$ and $\sum_i \left\lVert y_i - \hat{y}_i \right\rVert_2^2$ is minimized.

The LAD model is similar in that it takes in the same data, but instead of minimizing the sum of the squared $L^2$ norms, it minimizes the sum of the $L^1$ norms. Thus we penalize our model using simply the absolute values of its errors instead of their squares. This makes the LAD model more *robust*, that is, less sensitive to outliers in our input data.

Using a robust model with this data set makes sense becuase it clearly contains a lot of outliers. While some of them, such as December 25th, may be recurrent, we're going to ignore that detail for now. After all, not every day is Christmas.

We formulate our model below using JuMP with GLPK as the solver. The code works by defining a set of variables called *coefficientVars* that will converge to optimal values for $x$. For each observation we compute a row of the $A$ matrix that has the following components:

- Linear daily trend ($a_1$ = day number in the data set)
- Seasonal variation: $\cos(2\, \pi\, a_1 / 365.25)$ and $\sin(2\, \pi\, a_1 / 365.25)$
- Solar cycle variation: $\cos(2\, \pi\, a_1 / (10.66 \times 365.25))$ and $\sin(2\, \pi\, a_1 / (10.66 \times 365.25))$
- Weekly variation: $\cos(2\, \pi\, a_1 / 7)$ and $\sin(2\, \pi\, a_1 / 7)$

We then add a linear variable representing the residual, or error, of the fitted model for each observation. Constraints enforce that these variables always take the absolute values of those errors.

Minimizing the sum of those residuals gives us a set of eight coefficients for the model. We return these and a function that predicts the happiness level for an offset from the first data record. (Note that the first record appears to be from Wednesday, September 10, 2008.)

using JuMP using GLPKMathProgInterface function buildPredictor(d) m = Model(solver=GLPKSolverLP()) # Define a linear variable for each of our regression coefficients. # Note that by default, JuMP variables are unrestricted in sign. @defVar(m, coefficientVars[1:8]) # Residuals are the absolute values of the error comparing our # observed and fitted values. residuals = Array(Variable, nrow(d)) # This builds rows for determining fitted values. The first value is # 1 since it is multiplied by our our trend line's offset. The other # values correpond to the trends described above. Sinusoidal elements # have two variables with sine and cosine terms. function buildRowConstants(a1) [ 1, # Offset a1, # Daily trend cos(2pi*a1/365.25), # Seasonal variation sin(2pi*a1/365.25), # cos(2pi*a1/(10.66*365.25)), # Solar cycle variation sin(2pi*a1/(10.66*365.25)), # cos(2pi*a1/7), # Weekly variation sin(2pi*a1/7) # ] end # This builds a linear expression as the dot product of a row's # constants and the coefficient variables. buildRowExpression(a1) = dot(buildRowConstants(a1), coefficientVars) # Add a variable representing the absolute value of each error. @defVar(m, residuals[1:nrow(d)] >= 0) for a1 = 1:nrow(d) fitted = buildRowExpression(a1) # Linear way of representing residual >= |observed - fitted| @addConstraint(m, residuals[a1] >= fitted - d[a1,:value]) @addConstraint(m, residuals[a1] >= d[a1,:value] - fitted) end # Minimize the total sum of these residuals. @setObjective(m, Min, sum(residuals)) solve(m) # Return the model coefficients and a function that predicts happiness # for a given day, by index from the start of the data set. coefficients = getValue(coefficientVars)[:] # And we would like our model to work over vectors. predictHappiness(a1) = dot(buildRowConstants(a1), coefficients) return coefficients, predictHappiness end coefficients, predictor = buildPredictor(data) coefficients

8-element Array{Float64,1}: 5.92575 9.88082e-5 -0.00144717 -0.000969706 0.125979 -0.0184735 -0.00210457 -0.0135585

The optimal values for $x$ are output above. The second value is the change in happiness per day. We can see from this that there does seem to be a small positive trend.

We can call our predictor function to obtain the fitted happiness level for any day number starting from September 10, 2008.

predictor(1000)

6.011059321544039

#### Confidence Intervals

We now have a set of coefficients and a predictive model. That's nice, but we'd like to have some sense of a reasonable range on our model's coefficients. For instance, how certain are we that our daily trend is really even positive? To deal with these uncertanties, we use a method called bootstrapping.

Bootstrapping involves building fake observed data based on our fitted model and its associated errors. We then fit the model to our fake data to determine new coefficients. If we repeat this enough times, we may be able to generate decent confidence intervals around our model coefficients.

First step: compute the errors between the observed and fitted data. We'll construct a new data frame that contains everything we need to construct fake data.

# Compute fitted data corresponding to our observations and their associated errors. fitted = DataFrame( date = data[:, :date], observed = data[:, :value], fitted = map(x -> predictor(x), 1:nrow(data)), error = map(x -> predictor(x) - data[x, :value], 1:nrow(data)) ) describe(fitted)

date Length 2120 Type Date{ISOCalendar} NAs 0 NA% 0.0% Unique 2119 observed Min 5.881438 1st Qu. 5.97705775 Median 6.0043168 Mean 6.0123386060849064 3rd Qu. 6.041977749999999 Max 6.369217 NAs 0 NA% 0.0% fitted Min 5.952580232891465 1st Qu. 5.979249825128504 Median 6.004645249686684 Mean 6.009750200106241 3rd Qu. 6.0430211895222055 Max 6.074612040265234 NAs 0 NA% 0.0% error Min -0.32526218693298237 1st Qu. -0.012411496422179535 Median 0.0 Mean -0.002588405978665409 3rd Qu. 0.011507686354999658 Max 0.13504237942082575 NAs 0 NA% 0.0%

Note that the median for our errors is exactly zero. This is a good sign.

Now we build a function that creates a fake input data set using the fitted values with randomonly selected errors. That is, for each observation, we add a randomly selected error with replacement to its corresponding fitted value. Once we've done that for every observation, we have a complete fake data set.

function buildFakeData(fitted) errorIndexes = rand(1:nrow(fitted), nrow(fitted)) DataFrame( date = fitted[:, :date], value = map(x -> fitted[x, :fitted] - fitted[errorIndexes[x], :error], 1:nrow(fitted)) ) end # Plot some fake data to see if it looks similar. plot( buildFakeData(fitted), x = "date", y = "value", color = "value", Guide.xlabel("Date"), Guide.ylabel("Hedonomoeter.org Units"), Guide.title("Average Happiness on Twitter (Fake)"), Scale.ContinuousColorScale(colorGradient) )

Visually, the plot of an arbitrary fake data set looks a lot like our original data, but not exactly.

Now we generate 1,000 fake data sets and run them through our optimization function above. This generates 1,000 sets of model coefficients and then computes $2\sigma$ confidence intervals around them.

The following code took just under an hour on my machine. If you're intent on running it yourself, you may want to get some coffee or a beer in the meantime.

using HypothesisTests coeffData = [buildPredictor(buildFakeData(fitted))[1] for i = 1:1000] confidence2Sigma = Array((Float64, Float64), length(coefficients)) for i = 1:length(coefficients) sample::Array{Float64} sample = [x[i] for x in coeffData] confidence2Sigma[i] = ci(OneSampleTTest(sample)) end confidence2Sigma

8-element Array{(Float64,Float64),1}: (5.925619370480642,5.926250874471983) (9.833341267642056e-5,9.891872644950681e-5) (-0.0014847851702785182,-0.0014095257149699567) (-0.001010722418756523,-0.0009395274541342083) (0.12556589790988862,0.12604831225962906) (-0.018511935408237066,-0.018348960520221922) (-0.002133611417832446,-0.002058047348615918) (-0.01359168539278434,-0.01351903662344336)

#### Results

From the above output we can tell that we do seem to be trending happier, with a daily trend of 9.88082e-5 in Hedonometer units and a 95% confidence interval on that trend of 9.833341267642056e-5, and 9.891872644950681e-5. Cool!

Now we take a quick look at our model output. First, we plot the fitted happiness values for the same time period as the observed data. We can see that this resembles the same general trend minus the outliers. The width of the curve is due to weekly variation.

plot( fitted, x = "date", y = "fitted", Guide.xlabel("Date"), Guide.ylabel("Hedonomoeter.org Units"), Guide.title("Expected Happiness on Twitter"), Geom.line )

Now we take a look at what a typical week looks like in terms of its effect on our happiness. As September 10, 2008 was a Wednesday, we index Sunday starting at 5. The resulting graph highlights what I think we all already know about the work week.

daily(a1) = coefficients[6]*cos(2pi*a1/7) + coefficients[7]*sin(2pi*a1/7) plot( x = ["Sun", "Mon", "Tues", "Wed", "Thurs", "Fri", "Sat"], y = map(daily, [5, 6, 7, 1, 2, 3, 4]), Guide.xlabel("Day of the Week"), Guide.ylabel("Hedonomoeter.org Units"), Guide.title("Variation in Happiness on Twitter Due to Day of the Week"), Geom.line, Geom.point )

That's it for this analysis. We've learned that, for the being at least, we seem to be trending happier. Which is pretty cool. Of course, 6 years is not a huge amount of time for this sort of data. Let's try doing the analysis again in 10 or 20.

Will anyone still be using Twitter then?...

#### Exercises

The particularly ambitious reader may find the following exercises interesting.

##### Coding

- Hedonometer.org introduced an API while this post was already in progress. Change the code at the beginning to read from this instead of a CSV.
- The code that reruns the model using randomly constructed fake data is eligible for parallelization. Rewrite the list comprehension that calls
*buildPredictor*so it runs concurrently.

##### Modeling

- According to Google, the lunar cycle is approximately 29.53 days. Add parameters for this to the LAD model above. Does it make sense to include the lunar cycle in the model? In other words, are we lunatics?
- Some of the happier days in the Hedonometer.org data, such as Christmas, are recurring, and therefore not really outliers. How might one go about accounting for the effects of those days?
- Try the same analysis using a least-squares model. Which model is better for this data?

#### References

- Dodds, Peter Sheridan, Kameron Decker Harris, Isabel M. Kloumann, Catherine A. Bliss, and Christopher M. Danforth. "Temporal patterns of happiness and information in a global social network: Hedonometrics and Twitter." PloS one 6, no. 12 (2011): e26752.
- Vanderbei, Robert J. "Local Warming." SIAM Review 54, no. 3 (2012): 597-606.