Friday, July 18, 2014

are we getting happier?

A look at trends in Hedonometer.org's happiness data using Julia, JuMP, and GLPK.

Introduction

Hedonometer.org popped onto my radar a couple weeks ago. It's a nifty project, attempting to convert samples of words found in the Twitter Gardenhose feed into a time series of happiness.

While I'm not a computational social scientist, I must say the data does have a nice intuitive quality to it. There are obvious trends in happiness associated with major holidays, days of the week, and seasons. It seems like the sort of data that could be decomposed into trends based on those various components. The Hedonometer group has, of course, done extensive analyses of their own data which you can find on their papers page.

This post examines another approach. It follows the structure of Robert Vanderbei's excellent "Local Warming" project to separate out the Hedonometer averages into daily, seasonal, solar, and day-of-the-week trends. We'll be using Julia v0.2 with JuMP and GLPK for linear optimization, Gadfly for graphing, and a few other libraries. If you haven't installed Julia, first do that. Then add all the extra requirements in the Julia shell as shown below.

Pkg.add("IJulia")
Pkg.add("DataFrames")
Pkg.add("Datetime")
Pkg.add("Gadfly")
Pkg.add("JuMP")
Pkg.add("GLPKMathProgInterface")
Pkg.add("HypothesisTests")

Alternatively, if you'd rather use an IJulia notebook for reading this post, you can download it here.

Data

Hedonometer.org introduced an API recently, but as this post was started before that happened, we'll work with the CSV data. For the time being, you can still download it here. The ambitious reader is referred to the exercise at the end of the page regarding their API.

We use the readtable function from the DataFrames package to import the data. One oddity that this function doesn't seem to deal with yet is that the header has three columns while the rest of the CSV has two. So we skip the header and add column names in later. x1 is the date column and x2 is the happiness value, in Hedonometer units.

A brief inspection shows that we are not missing any data, so we don't have to merge in a list of dates with NAs.

using DataFrames

tmpData = readtable("dailyFull.csv", header=false, skipstart=1)
head(tmpData)
6x2 DataFrame:
                  x1      x2
[1,]    "2008-09-10" 6.01006
[2,]    "2008-09-11" 6.00417
[3,]    "2008-09-12" 6.01726
[4,]    "2008-09-13" 6.02889
[5,]    "2008-09-14" 6.03155
[6,]    "2008-09-15" 5.98864

Note that the dates are strings. We vectorize the date constructor and use it to build a proper data frame.

import Datetime: date

@vectorize_1arg String date
data = DataFrame(date=date(tmpData[:, :x1]), value=tmpData[:, :x2])
head(data)
6x2 DataFrame:
              date   value
[1,]    2008-09-10 6.01006
[2,]    2008-09-11 6.00417
[3,]    2008-09-12 6.01726
[4,]    2008-09-13 6.02889
[5,]    2008-09-14 6.03155
[6,]    2008-09-15 5.98864

Let's start by making a plot of the raw data. This should look similar to the graph on their website, only not as pretty.

using Gadfly

set_default_plot_size(24cm, 12cm)
colorGradient = Scale.lab_gradient(color("crimson"), color("blue"), color("greenyellow"))
plot(
    data, 
    x = "date", 
    y = "value", 
    color = "value", 
    Guide.xlabel("Date"), 
    Guide.ylabel("Hedonomoeter.org Units"), 
    Guide.title("Average Happiness on Twitter"),
    Scale.ContinuousColorScale(colorGradient)
)
Date 2008 2009 2010 2011 2012 2013 2014 Hedonomoeter.org Units 5.9 6.4 6.3 6.1 6.0 5.8 6.2 value 5.8 5.9 6.0 6.1 6.2 6.3 6.4 Average Happiness on Twitter

The data looks right, so we're off to a good start. Now we have to think about what sort of components we believe are important factors to this index. We'll start with the same ones as in the Vanderbei model:

  • A linear happiness trend describing how our overall happiness changes over time.
  • Seasonal trends accounting for mood changes with weather.
  • Solar cycle trends.

We'll add to this weekly trends, as zooming into the data shows we tend to be happier on the weekends than on work days. In the next section we'll build a model to separate out the effects of these trends on the Hedonometer.org index.

Model

Vanderbei's model analyzes daily temperature data for a particular location using least absolute deviations (LAD). This is similar to the well-known least squares approach, but while the latter penalizes the model quadratically more for bigger errors, the former does not. In mathematical notation, the least squares model takes in a known $m \times n$ matrix $A$ and $m \times 1$ vector $y$ of observed data, then searches for a vector $x$ such that $Ax = \hat{y}$ and $\sum_i \left\lVert y_i - \hat{y}_i \right\rVert_2^2$ is minimized.

The LAD model is similar in that it takes in the same data, but instead of minimizing the sum of the squared $L^2$ norms, it minimizes the sum of the $L^1$ norms. Thus we penalize our model using simply the absolute values of its errors instead of their squares. This makes the LAD model more robust, that is, less sensitive to outliers in our input data.

Using a robust model with this data set makes sense becuase it clearly contains a lot of outliers. While some of them, such as December 25th, may be recurrent, we're going to ignore that detail for now. After all, not every day is Christmas.

We formulate our model below using JuMP with GLPK as the solver. The code works by defining a set of variables called coefficientVars that will converge to optimal values for $x$. For each observation we compute a row of the $A$ matrix that has the following components:

  • Linear daily trend ($a_1$ = day number in the data set)
  • Seasonal variation: $\cos(2\, \pi\, a_1 / 365.25)$ and $\sin(2\, \pi\, a_1 / 365.25)$
  • Solar cycle variation: $\cos(2\, \pi\, a_1 / (10.66 \times 365.25))$ and $\sin(2\, \pi\, a_1 / (10.66 \times 365.25))$
  • Weekly variation: $\cos(2\, \pi\, a_1 / 7)$ and $\sin(2\, \pi\, a_1 / 7)$

We then add a linear variable representing the residual, or error, of the fitted model for each observation. Constraints enforce that these variables always take the absolute values of those errors.

Minimizing the sum of those residuals gives us a set of eight coefficients for the model. We return these and a function that predicts the happiness level for an offset from the first data record. (Note that the first record appears to be from Wednesday, September 10, 2008.)

using JuMP
using GLPKMathProgInterface

function buildPredictor(d)
    m = Model(solver=GLPKSolverLP())
    
    # Define a linear variable for each of our regression coefficients.
    # Note that by default, JuMP variables are unrestricted in sign.
    @defVar(m, coefficientVars[1:8])

    # Residuals are the absolute values of the error comparing our 
    # observed and fitted values.
    residuals = Array(Variable, nrow(d))

    # This builds rows for determining fitted values. The first value is
    # 1 since it is multiplied by our our trend line's offset. The other
    # values correpond to the trends described above. Sinusoidal elements
    # have two variables with sine and cosine terms.
    function buildRowConstants(a1)
        [
            1,                           # Offset
            a1,                          # Daily trend
            cos(2pi*a1/365.25),          # Seasonal variation
            sin(2pi*a1/365.25),          #
            cos(2pi*a1/(10.66*365.25)),  # Solar cycle variation
            sin(2pi*a1/(10.66*365.25)),  #
            cos(2pi*a1/7),               # Weekly variation
            sin(2pi*a1/7)                #
        ]
    end
    
    # This builds a linear expression as the dot product of a row's 
    # constants and the coefficient variables.
    buildRowExpression(a1) = dot(buildRowConstants(a1), coefficientVars)

    # Add a variable representing the absolute value of each error.
    @defVar(m, residuals[1:nrow(d)] >= 0)
    for a1 = 1:nrow(d)
        fitted = buildRowExpression(a1)

        # Linear way of representing residual >= |observed - fitted|
        @addConstraint(m, residuals[a1] >= fitted - d[a1,:value])
        @addConstraint(m, residuals[a1] >= d[a1,:value] - fitted)
    end

    # Minimize the total sum of these residuals.
    @setObjective(m, Min, sum(residuals))
    solve(m)
    
    # Return the model coefficients and a function that predicts happiness
    # for a given day, by index from the start of the data set.
    coefficients = getValue(coefficientVars)[:]
    
    # And we would like our model to work over vectors.
    predictHappiness(a1) = dot(buildRowConstants(a1), coefficients)
    return coefficients, predictHappiness
end

coefficients, predictor = buildPredictor(data)
coefficients
8-element Array{Float64,1}:
  5.92575    
  9.88082e-5 
 -0.00144717 
 -0.000969706
  0.125979   
 -0.0184735  
 -0.00210457 
 -0.0135585

The optimal values for $x$ are output above. The second value is the change in happiness per day. We can see from this that there does seem to be a small positive trend.

We can call our predictor function to obtain the fitted happiness level for any day number starting from September 10, 2008.

predictor(1000)
6.011059321544039

Confidence Intervals

We now have a set of coefficients and a predictive model. That's nice, but we'd like to have some sense of a reasonable range on our model's coefficients. For instance, how certain are we that our daily trend is really even positive? To deal with these uncertanties, we use a method called bootstrapping.

Bootstrapping involves building fake observed data based on our fitted model and its associated errors. We then fit the model to our fake data to determine new coefficients. If we repeat this enough times, we may be able to generate decent confidence intervals around our model coefficients.

First step: compute the errors between the observed and fitted data. We'll construct a new data frame that contains everything we need to construct fake data.

# Compute fitted data corresponding to our observations and their associated errors.
fitted = DataFrame(
    date = data[:, :date], 
    observed = data[:, :value],
    fitted = map(x -> predictor(x), 1:nrow(data)),
    error = map(x -> predictor(x) - data[x, :value], 1:nrow(data))
)
describe(fitted)
date
Length  2120
Type    Date{ISOCalendar}
NAs     0
NA%     0.0%
Unique  2119

observed
Min      5.881438
1st Qu.  5.97705775
Median   6.0043168
Mean     6.0123386060849064
3rd Qu.  6.041977749999999
Max      6.369217
NAs      0
NA%      0.0%

fitted
Min      5.952580232891465
1st Qu.  5.979249825128504
Median   6.004645249686684
Mean     6.009750200106241
3rd Qu.  6.0430211895222055
Max      6.074612040265234
NAs      0
NA%      0.0%

error
Min      -0.32526218693298237
1st Qu.  -0.012411496422179535
Median   0.0
Mean     -0.002588405978665409
3rd Qu.  0.011507686354999658
Max      0.13504237942082575
NAs      0
NA%      0.0%

Note that the median for our errors is exactly zero. This is a good sign.

Now we build a function that creates a fake input data set using the fitted values with randomonly selected errors. That is, for each observation, we add a randomly selected error with replacement to its corresponding fitted value. Once we've done that for every observation, we have a complete fake data set.

function buildFakeData(fitted)
    errorIndexes = rand(1:nrow(fitted), nrow(fitted))
    DataFrame(
        date = fitted[:, :date],
        value = map(x -> fitted[x, :fitted] - fitted[errorIndexes[x], :error], 1:nrow(fitted))
    )
end

# Plot some fake data to see if it looks similar.
plot(
    buildFakeData(fitted), 
    x = "date", 
    y = "value", 
    color = "value", 
    Guide.xlabel("Date"), 
    Guide.ylabel("Hedonomoeter.org Units"), 
    Guide.title("Average Happiness on Twitter (Fake)"),
    Scale.ContinuousColorScale(colorGradient)
)
Date 2008 2009 2010 2011 2012 2013 2014 Hedonomoeter.org Units 5.9 6.4 6.3 6.1 6.0 5.8 6.2 value 5.8 5.9 6.0 6.1 6.2 6.3 6.4 Average Happiness on Twitter (Fake)

Visually, the plot of an arbitrary fake data set looks a lot like our original data, but not exactly.

Now we generate 1,000 fake data sets and run them through our optimization function above. This generates 1,000 sets of model coefficients and then computes $2\sigma$ confidence intervals around them.

The following code took just under an hour on my machine. If you're intent on running it yourself, you may want to get some coffee or a beer in the meantime.

using HypothesisTests

coeffData = [buildPredictor(buildFakeData(fitted))[1] for i = 1:1000]
confidence2Sigma = Array((Float64, Float64), length(coefficients))
for i = 1:length(coefficients)
    sample::Array{Float64}
    sample = [x[i] for x in coeffData]
    confidence2Sigma[i] = ci(OneSampleTTest(sample))
end
confidence2Sigma
8-element Array{(Float64,Float64),1}:
 (5.925619370480642,5.926250874471983)          
 (9.833341267642056e-5,9.891872644950681e-5)    
 (-0.0014847851702785182,-0.0014095257149699567)
 (-0.001010722418756523,-0.0009395274541342083) 
 (0.12556589790988862,0.12604831225962906)      
 (-0.018511935408237066,-0.018348960520221922)  
 (-0.002133611417832446,-0.002058047348615918)  
 (-0.01359168539278434,-0.01351903662344336)

Results

From the above output we can tell that we do seem to be trending happier, with a daily trend of 9.88082e-5 in Hedonometer units and a 95% confidence interval on that trend of 9.833341267642056e-5, and 9.891872644950681e-5. Cool!

Now we take a quick look at our model output. First, we plot the fitted happiness values for the same time period as the observed data. We can see that this resembles the same general trend minus the outliers. The width of the curve is due to weekly variation.

plot(
    fitted,
    x = "date",
    y = "fitted",
    Guide.xlabel("Date"), 
    Guide.ylabel("Hedonomoeter.org Units"), 
    Guide.title("Expected Happiness on Twitter"),
    Geom.line
)
Date 2008 2009 2010 2011 2012 2013 2014 Hedonomoeter.org Units 5.95 6.00 6.05 6.10 Expected Happiness on Twitter

Now we take a look at what a typical week looks like in terms of its effect on our happiness. As September 10, 2008 was a Wednesday, we index Sunday starting at 5. The resulting graph highlights what I think we all already know about the work week.

daily(a1) = coefficients[6]*cos(2pi*a1/7) + coefficients[7]*sin(2pi*a1/7)
plot(
    x = ["Sun", "Mon", "Tues", "Wed", "Thurs", "Fri", "Sat"], 
    y = map(daily, [5, 6, 7, 1, 2, 3, 4]),
    Guide.xlabel("Day of the Week"), 
    Guide.ylabel("Hedonomoeter.org Units"), 
    Guide.title("Variation in Happiness on Twitter Due to Day of the Week"),
    Geom.line,
    Geom.point
)
Day of the Week Sun Mon Tues Wed Thurs Fri Sat Hedonomoeter.org Units -0.02 -0.01 0.00 0.01 0.02 Variation in Happiness on Twitter Due to Day of the Week

That's it for this analysis. We've learned that, for the being at least, we seem to be trending happier. Which is pretty cool. Of course, 6 years is not a huge amount of time for this sort of data. Let's try doing the analysis again in 10 or 20.

Will anyone still be using Twitter then?...

Exercises

The particularly ambitious reader may find the following exercises interesting.

Coding
  • Hedonometer.org introduced an API while this post was already in progress. Change the code at the beginning to read from this instead of a CSV.
  • The code that reruns the model using randomly constructed fake data is eligible for parallelization. Rewrite the list comprehension that calls buildPredictor so it runs concurrently.
Modeling
  • According to Google, the lunar cycle is approximately 29.53 days. Add parameters for this to the LAD model above. Does it make sense to include the lunar cycle in the model? In other words, are we lunatics?
  • Some of the happier days in the Hedonometer.org data, such as Christmas, are recurring, and therefore not really outliers. How might one go about accounting for the effects of those days?
  • Try the same analysis using a least-squares model. Which model is better for this data?

References