Friday, July 18, 2014

are we getting happier?

A look at trends in Hedonometer.org's happiness data using Julia, JuMP, and GLPK.

Introduction

Hedonometer.org popped onto my radar a couple weeks ago. It's a nifty project, attempting to convert samples of words found in the Twitter Gardenhose feed into a time series of happiness.

While I'm not a computational social scientist, I must say the data does have a nice intuitive quality to it. There are obvious trends in happiness associated with major holidays, days of the week, and seasons. It seems like the sort of data that could be decomposed into trends based on those various components. The Hedonometer group has, of course, done extensive analyses of their own data which you can find on their papers page.

This post examines another approach. It follows the structure of Robert Vanderbei's excellent "Local Warming" project to separate out the Hedonometer averages into daily, seasonal, solar, and day-of-the-week trends. We'll be using Julia v0.2 with JuMP and GLPK for linear optimization, Gadfly for graphing, and a few other libraries. If you haven't installed Julia, first do that. Then add all the extra requirements in the Julia shell as shown below.

Pkg.add("IJulia")
Pkg.add("DataFrames")
Pkg.add("Datetime")
Pkg.add("Gadfly")
Pkg.add("JuMP")
Pkg.add("GLPKMathProgInterface")
Pkg.add("HypothesisTests")

Alternatively, if you'd rather use an IJulia notebook for reading this post, you can download it here.

Data

Hedonometer.org introduced an API recently, but as this post was started before that happened, we'll work with the CSV data. For the time being, you can still download it here. The ambitious reader is referred to the exercise at the end of the page regarding their API.

We use the readtable function from the DataFrames package to import the data. One oddity that this function doesn't seem to deal with yet is that the header has three columns while the rest of the CSV has two. So we skip the header and add column names in later. x1 is the date column and x2 is the happiness value, in Hedonometer units.

A brief inspection shows that we are not missing any data, so we don't have to merge in a list of dates with NAs.

using DataFrames

tmpData = readtable("dailyFull.csv", header=false, skipstart=1)
head(tmpData)
6x2 DataFrame:
                  x1      x2
[1,]    "2008-09-10" 6.01006
[2,]    "2008-09-11" 6.00417
[3,]    "2008-09-12" 6.01726
[4,]    "2008-09-13" 6.02889
[5,]    "2008-09-14" 6.03155
[6,]    "2008-09-15" 5.98864

Note that the dates are strings. We vectorize the date constructor and use it to build a proper data frame.

import Datetime: date

@vectorize_1arg String date
data = DataFrame(date=date(tmpData[:, :x1]), value=tmpData[:, :x2])
head(data)
6x2 DataFrame:
              date   value
[1,]    2008-09-10 6.01006
[2,]    2008-09-11 6.00417
[3,]    2008-09-12 6.01726
[4,]    2008-09-13 6.02889
[5,]    2008-09-14 6.03155
[6,]    2008-09-15 5.98864

Let's start by making a plot of the raw data. This should look similar to the graph on their website, only not as pretty.

using Gadfly

set_default_plot_size(24cm, 12cm)
colorGradient = Scale.lab_gradient(color("crimson"), color("blue"), color("greenyellow"))
plot(
    data, 
    x = "date", 
    y = "value", 
    color = "value", 
    Guide.xlabel("Date"), 
    Guide.ylabel("Hedonomoeter.org Units"), 
    Guide.title("Average Happiness on Twitter"),
    Scale.ContinuousColorScale(colorGradient)
)

The data looks right, so we're off to a good start. Now we have to think about what sort of components we believe are important factors to this index. We'll start with the same ones as in the Vanderbei model:

  • A linear happiness trend describing how our overall happiness changes over time.
  • Seasonal trends accounting for mood changes with weather.
  • Solar cycle trends.

To these we'll add weekly trends, as zooming into the data shows we tend to be happier on weekends than on work days. In the next section we'll build a model to separate out the effects of these trends on the Hedonometer.org index.

Model

Vanderbei's model analyzes daily temperature data for a particular location using least absolute deviations (LAD). This is similar to the well-known least squares approach, but while the latter penalizes the model quadratically more for bigger errors, the former does not. In mathematical notation, the least squares model takes in a known $m \times n$ matrix $A$ and $m \times 1$ vector $y$ of observed data, then searches for a vector $x$ such that $Ax = \hat{y}$ and $\sum_i \left\lVert y_i - \hat{y}_i \right\rVert_2^2$ is minimized.

The LAD model is similar in that it takes in the same data, but instead of minimizing the sum of the squared $L^2$ norms, it minimizes the sum of the $L^1$ norms. Thus we penalize our model using simply the absolute values of its errors instead of their squares. This makes the LAD model more robust, that is, less sensitive to outliers in our input data.
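
Since the absolute value function is not linear, the LAD fit is typically expressed as a linear program by introducing a nonnegative variable $t_i$ for each observation's absolute error; this is the same trick the JuMP model below uses:

min $\sum_i t_i$
subject to $t_i \ge y_i - (Ax)_i \ \forall i$
$t_i \ge (Ax)_i - y_i \ \forall i$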

Using a robust model with this data set makes sense because it clearly contains a lot of outliers. While some of them, such as December 25th, may be recurrent, we're going to ignore that detail for now. After all, not every day is Christmas.

We formulate our model below using JuMP with GLPK as the solver. The code works by defining a set of variables called coefficientVars that the solver will drive to the optimal values of $x$. For each observation we compute a row of the $A$ matrix with the following components (combined into a single model equation below):

  • Linear daily trend ($a_1$ = day number in the data set)
  • Seasonal variation: $\cos(2\, \pi\, a_1 / 365.25)$ and $\sin(2\, \pi\, a_1 / 365.25)$
  • Solar cycle variation: $\cos(2\, \pi\, a_1 / (10.66 \times 365.25))$ and $\sin(2\, \pi\, a_1 / (10.66 \times 365.25))$
  • Weekly variation: $\cos(2\, \pi\, a_1 / 7)$ and $\sin(2\, \pi\, a_1 / 7)$
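
Putting these together with the constant offset (the leading 1 in each row of $A$), the fitted value for day $a_1$ is

$\hat{y}(a_1) = x_1 + x_2 a_1 + x_3 \cos\left(\frac{2\pi a_1}{365.25}\right) + x_4 \sin\left(\frac{2\pi a_1}{365.25}\right) + x_5 \cos\left(\frac{2\pi a_1}{10.66 \times 365.25}\right) + x_6 \sin\left(\frac{2\pi a_1}{10.66 \times 365.25}\right) + x_7 \cos\left(\frac{2\pi a_1}{7}\right) + x_8 \sin\left(\frac{2\pi a_1}{7}\right)$

where $x_1$ is a constant offset and $x_2$ is the daily trend we are most interested in.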

We then add a linear variable representing the residual, or absolute error, of the fitted model for each observation. Constraints enforce that each of these variables is at least the absolute value of its error, and the minimization drives them down to equality at the optimum.

Minimizing the sum of those residuals gives us a set of eight coefficients for the model. We return these and a function that predicts the happiness level for an offset from the first data record. (Note that the first record appears to be from Wednesday, September 10, 2008.)

using JuMP
using GLPKMathProgInterface

function buildPredictor(d)
    m = Model(solver=GLPKSolverLP())
    
    # Define a linear variable for each of our regression coefficients.
    # Note that by default, JuMP variables are unrestricted in sign.
    @defVar(m, coefficientVars[1:8])

    # This builds rows for determining fitted values. The first value is
    # 1 since it is multiplied by our trend line's offset. The other
    # values correspond to the trends described above. Sinusoidal elements
    # have two variables with sine and cosine terms.
    function buildRowConstants(a1)
        [
            1,                           # Offset
            a1,                          # Daily trend
            cos(2pi*a1/365.25),          # Seasonal variation
            sin(2pi*a1/365.25),          #
            cos(2pi*a1/(10.66*365.25)),  # Solar cycle variation
            sin(2pi*a1/(10.66*365.25)),  #
            cos(2pi*a1/7),               # Weekly variation
            sin(2pi*a1/7)                #
        ]
    end
    
    # This builds a linear expression as the dot product of a row's 
    # constants and the coefficient variables.
    buildRowExpression(a1) = dot(buildRowConstants(a1), coefficientVars)

    # Add a variable representing the absolute value of each error.
    @defVar(m, residuals[1:nrow(d)] >= 0)
    for a1 = 1:nrow(d)
        fitted = buildRowExpression(a1)

        # Linear way of representing residual >= |observed - fitted|
        @addConstraint(m, residuals[a1] >= fitted - d[a1,:value])
        @addConstraint(m, residuals[a1] >= d[a1,:value] - fitted)
    end

    # Minimize the total sum of these residuals.
    @setObjective(m, Min, sum(residuals))
    solve(m)
    
    # Return the model coefficients and a function that predicts happiness
    # for a given day, by index from the start of the data set.
    coefficients = getValue(coefficientVars)[:]
    
    # The predictor simply applies the fitted coefficients to a new day's constants.
    predictHappiness(a1) = dot(buildRowConstants(a1), coefficients)
    return coefficients, predictHappiness
end

coefficients, predictor = buildPredictor(data)
coefficients
8-element Array{Float64,1}:
  5.92575    
  9.88082e-5 
 -0.00144717 
 -0.000969706
  0.125979   
 -0.0184735  
 -0.00210457 
 -0.0135585
 

The optimal values for $x$ are output above. The second value is the change in happiness per day: about $9.9 \times 10^{-5}$ Hedonometer units, or roughly 0.036 units per year. So there does seem to be a small positive trend.

We can call our predictor function to obtain the fitted happiness level for any day number starting from September 10, 2008.

predictor(1000)
6.011059321544039

Confidence Intervals

We now have a set of coefficients and a predictive model. That's nice, but we'd like to have some sense of a reasonable range on our model's coefficients. For instance, how certain are we that our daily trend is really even positive? To deal with these uncertainties, we use a method called bootstrapping.

Bootstrapping involves building fake observed data based on our fitted model and its associated errors. We then fit the model to our fake data to determine new coefficients. If we repeat this enough times, we may be able to generate decent confidence intervals around our model coefficients.
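
In symbols: writing $\hat{y}_i$ for the fitted values and $e_i = \hat{y}_i - y_i$ for the errors computed below, each bootstrap replicate draws indices $\sigma(i)$ uniformly with replacement and refits the model to the fake observations $\tilde{y}_i = \hat{y}_i - e_{\sigma(i)}$.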

First step: compute the errors between the observed and fitted data. We'll construct a new data frame that contains everything we need to construct fake data.

# Compute fitted data corresponding to our observations and their associated errors.
fitted = DataFrame(
    date = data[:, :date], 
    observed = data[:, :value],
    fitted = map(x -> predictor(x), 1:nrow(data)),
    error = map(x -> predictor(x) - data[x, :value], 1:nrow(data))
)
describe(fitted)
date
Length  2120
Type    Date{ISOCalendar}
NAs     0
NA%     0.0%
Unique  2119

observed
Min      5.881438
1st Qu.  5.97705775
Median   6.0043168
Mean     6.0123386060849064
3rd Qu.  6.041977749999999
Max      6.369217
NAs      0
NA%      0.0%

fitted
Min      5.952580232891465
1st Qu.  5.979249825128504
Median   6.004645249686684
Mean     6.009750200106241
3rd Qu.  6.0430211895222055
Max      6.074612040265234
NAs      0
NA%      0.0%

error
Min      -0.32526218693298237
1st Qu.  -0.012411496422179535
Median   0.0
Mean     -0.002588405978665409
3rd Qu.  0.011507686354999658
Max      0.13504237942082575
NAs      0
NA%      0.0%

Note that the median of our errors is exactly zero, which is what we'd expect from a least absolute deviations fit that includes a constant offset. This is a good sign.

Now we build a function that creates a fake input data set from the fitted values and randomly selected errors. That is, for each observation we draw one of the errors at random, with replacement, and subtract it from the corresponding fitted value (recall that the errors were computed as fitted minus observed, so this amounts to adding a resampled residual). Once we've done that for every observation, we have a complete fake data set.

function buildFakeData(fitted)
    errorIndexes = rand(1:nrow(fitted), nrow(fitted))
    DataFrame(
        date = fitted[:, :date],
        value = map(x -> fitted[x, :fitted] - fitted[errorIndexes[x], :error], 1:nrow(fitted))
    )
end

# Plot some fake data to see if it looks similar.
plot(
    buildFakeData(fitted), 
    x = "date", 
    y = "value", 
    color = "value", 
    Guide.xlabel("Date"), 
    Guide.ylabel("Hedonomoeter.org Units"), 
    Guide.title("Average Happiness on Twitter (Fake)"),
    Scale.ContinuousColorScale(colorGradient)
)

Visually, the plot of an arbitrary fake data set looks a lot like our original data, but not exactly.

Now we generate 1,000 fake data sets and run them through our optimization function above. This gives us 1,000 sets of model coefficients, from which we compute 95% (roughly $2\sigma$) confidence intervals around each coefficient.

The following code took just under an hour on my machine. If you're intent on running it yourself, you may want to get some coffee or a beer in the meantime.

using HypothesisTests

coeffData = [buildPredictor(buildFakeData(fitted))[1] for i = 1:1000]
confidence2Sigma = Array((Float64, Float64), length(coefficients))
for i = 1:length(coefficients)
    sample = Float64[x[i] for x in coeffData]
    confidence2Sigma[i] = ci(OneSampleTTest(sample))
end
confidence2Sigma
8-element Array{(Float64,Float64),1}:
 (5.925619370480642,5.926250874471983)          
 (9.833341267642056e-5,9.891872644950681e-5)    
 (-0.0014847851702785182,-0.0014095257149699567)
 (-0.001010722418756523,-0.0009395274541342083) 
 (0.12556589790988862,0.12604831225962906)      
 (-0.018511935408237066,-0.018348960520221922)  
 (-0.002133611417832446,-0.002058047348615918)  
 (-0.01359168539278434,-0.01351903662344336)

Results

From the above output we can tell that we do seem to be trending happier, with a daily trend of 9.88082e-5 Hedonometer units and a 95% confidence interval on that trend of (9.833341267642056e-5, 9.891872644950681e-5). Cool!

Now we take a quick look at our model output. First, we plot the fitted happiness values for the same time period as the observed data. We can see that this resembles the same general trend minus the outliers. The width of the curve is due to weekly variation.

plot(
    fitted,
    x = "date",
    y = "fitted",
    Guide.xlabel("Date"), 
    Guide.ylabel("Hedonomoeter.org Units"), 
    Guide.title("Expected Happiness on Twitter"),
    Geom.line
)

Now we take a look at what a typical week looks like in terms of its effect on our happiness. As September 10, 2008 was a Wednesday, we index Sunday starting at 5. The resulting graph highlights what I think we all already know about the work week.

daily(a1) = coefficients[7]*cos(2pi*a1/7) + coefficients[8]*sin(2pi*a1/7)
plot(
    x = ["Sun", "Mon", "Tues", "Wed", "Thurs", "Fri", "Sat"], 
    y = map(daily, [5, 6, 7, 1, 2, 3, 4]),
    Guide.xlabel("Day of the Week"), 
    Guide.ylabel("Hedonomoeter.org Units"), 
    Guide.title("Variation in Happiness on Twitter Due to Day of the Week"),
    Geom.line,
    Geom.point
)

That's it for this analysis. We've learned that, for the time being at least, we seem to be trending happier. Which is pretty cool. Of course, 6 years is not a huge amount of time for this sort of data. Let's try doing the analysis again in 10 or 20.

Will anyone still be using Twitter then?...

Exercises

The particularly ambitious reader may find the following exercises interesting.

Coding
  • Hedonometer.org introduced an API while this post was already in progress. Change the code at the beginning to read from this instead of a CSV.
  • The code that reruns the model using randomly constructed fake data is eligible for parallelization. Rewrite the list comprehension that calls buildPredictor so it runs concurrently.
Modeling
  • According to Google, the lunar cycle is approximately 29.53 days. Add parameters for this to the LAD model above. Does it make sense to include the lunar cycle in the model? In other words, are we lunatics?
  • Some of the happier days in the Hedonometer.org data, such as Christmas, are recurring, and therefore not really outliers. How might one go about accounting for the effects of those days?
  • Try the same analysis using a least-squares model. Which model is better for this data?

Friday, June 27, 2014

preprocessing for routing problems: part 2

In the previous post, we considered preprocessing for the vehicle routing problem where the vehicles have different starting locations. Our goal was to create potentially overlapping regions for the entire US which we could later use for route construction. We defined these regions using all 5-digit zip codes in the continental US for which one of our regional headquarters is the closest, or one of $n$ closest, headquarters in terms of Euclidean distance. The resulting regions gave us some flexibility in terms of how much redundancy we allow in our coverage of the country.

This post refines those regions, replacing the Euclidean distance with a more realistic metric: estimated travel time. Doing this should give us a better sense of how much space a given driver can actually cover. It should also divide the country up more equitably among our drivers.

Our approach here will be similar to that of the last post, but instead of ranking our headquarter-zip pairs by Euclidean distance, we'll rank them by estimated travel time. The catch is that, while the former is easy to compute using the SpatialTools library, we have to request the latter from a third party. In this post, we'll use the MapQuest Route Matrix, since it provides estimates based on OpenStreetMap data to us for free, and doesn't cap the number of requests we can make.

To do this we're going to need a lot of point estimates for location-to-location travel times. In fact, building a full data set for replacing our Euclidean distance ranks would require $37,341 \times 25 = 933,525$ travel time estimates. That's a bit prohibitive. The good news is we don't need all the data points unless we generate 25 levels of redundancy. We can just request enough travel time estimates to make us reasonably certain that we've got all the necessary data. In the last post we generated regions for 1, 2, and 3 levels of redundancy, so here we'll get travel times for the 10 closest headquarters to each zip code, and take the leap of faith that the closest 3 headquarters by travel time for each zip will be among the 10 closest by Euclidean distance.

Note: I've provided the output data so you don't have to make hundreds of thousands of requests to MapQuest yourself. Instead of running the following code, please download it here. You can then run the section at the end that generates the maps.

Let's assume that you have just executed the code from the last post and have its variables in your current scope. If you need it, you can find that code here. First, we define some constants we're going to need in order to make MapQuest requests.

# Define some constants for making requests to MapQuest and determining
# when to save and what to request.
library(RCurl)
library(rjson)
library(utils)
 
MAPQUEST_API_KEY <- 'YOUR KEY HERE'
MAPQUEST_API_URL <- 'http://www.mapquestapi.com/directions/v2/routematrix?key=%s&json=%s'
ZIPS_BETWEEN_SAVE <- 250
HQ_RANK_MIN <- 1  # Min/max distance ranks for time estimates
HQ_RANK_MAX <- 10

Now we create a data frame to hold our HQ-to-zip travel estimates. The rows correspond to zip codes and the columns correspond to our headquarter locations. We initialize the data frame to contain no estimates and write it to a CSV file. Since it will take on the order of days for us to fill this file in, we're going to write it out and read it back in periodically. That way we can pick up where we left off by simply rerunning the code in case of an error or loss of network connectivity.

# Write out a blank file containing our time estimates.
TIME_CSV_PATH <- 'hqs_to_zips_time.csv'
if (!file.exists(TIME_CSV_PATH)) {    
    # Clear out everything except row and column names.
    empty <- as.data.frame(matrix(nrow=nrow(zips_deduped), ncol=nrow(largest_cities)+1))
    names(empty) <- c('zip', largest_cities$name)
    empty$zip <- zips_deduped$zip
 
    # This represents our current state.
    write.csv(empty, TIME_CSV_PATH, row.names=F)    
}

# Read in our current state in case we are starting over.
hqs_to_zips_time <- read.csv(TIME_CSV_PATH)
hqs_to_zips_time$zip <- sprintf('%05d', hqs_to_zips_time$zip)
 
# Sanity check: If we have any times = 0, set them to NA so that we re-request them.
hqs_to_zips_time[hqs_to_zips_time <= 0] <- NA

With that file created, we can start making requests to MapQuest's Route Matrix. For each zip code, we are going to request travel times for its 10 closest HQs. We'll save our time estimates data frame every 250 zip codes. Also, we're going to randomize the order of the zip codes so we fill out our data set more evenly as we go. That way we can generate maps during the process or otherwise inspect the data as we go.

# Now we start requesting travel times from MapQuest.
requests_until_save <- ZIPS_BETWEEN_SAVE
col_count <- ncol(hqs_to_zips_time)
 
# Randomize the zip code order so we fill in the map uniformly as we get more data.
# This will enable us to check on our data over time and make sure it looks right.
for (zip_idx in sample(1:nrow(zips_deduped))) {    
    z <- zips_deduped$zip[zip_idx]
    z_lat <- zips_deduped$latitude[zip_idx]
    z_lon <- zips_deduped$longitude[zip_idx]
 
    # Find PODs for this zip that are in the rank range.
    which_hqs <- which(
        hqs_to_zips_rank[,zip_idx] >= HQ_RANK_MIN &
        hqs_to_zips_rank[,zip_idx] <= HQ_RANK_MAX
    )
 
    # We're only interested in records that aren't filled in yet.
    na_pods <- is.na(hqs_to_zips_time[zip_idx, which_hqs+1])
    if (length(hqs_to_zips_time[zip_idx,2:col_count][na_pods]) < 1) {
        next
    }
 
    # Request this block of PODs and fill them all in.
    print(sprintf('requesting: zip=%s rank=[%d-%d]', z, HQ_RANK_MIN, HQ_RANK_MAX))
 
    # Construct a comma-delimited string of lat/lons containing the locations of our
    # HQs. We will use this for our MapQuest requests below: for each zip code, we
    # make one request for its distance to every HQ in our range.
    hq_locations <- paste(
        sprintf("'%f,%f'", largest_cities$lat[which_hqs], largest_cities$long[which_hqs]),
        collapse = ', '
    )
 
    # TODO: make sure we are requesting from location 1 to 2:n only
    request_json <- URLencode(sprintf(
        "{allToAll: false, locations: ['%f,%f', %s]}",
        z_lat,
        z_lon,
        hq_locations
    ))
    url <- sprintf(MAPQUEST_API_URL, MAPQUEST_API_KEY, request_json)
    result <- fromJSON(getURL(url))
 
    # If we get back 0s, they should be NA. Otherwise they'll mess up our
    # rankings and region drawing later.
    result$time[result$time <= 0] <- NA
 
    hqs_to_zips_time[zip_idx, which_hqs+1] <- result$time[2:length(result$distance)]
 
    # See if we should save our current state.
    requests_until_save <- requests_until_save - 1
    if (requests_until_save < 1) {
        print('saving current state')
        write.csv(hqs_to_zips_time, TIME_CSV_PATH, row.names=F)                
        requests_until_save <- ZIPS_BETWEEN_SAVE
    }
}
 
# Final save once we are done.
write.csv(hqs_to_zips_time, TIME_CSV_PATH, row.names=F)

Assuming you didn't actually run that code and instead opted to download the data, run the following code to read it in.

hqs_to_zips_time <- read.csv(TIME_CSV_PATH)
hqs_to_zips_time$zip <- sprintf('%05d', hqs_to_zips_time$zip)

Now we generate our ranks based on travel time instead of distance. We have to be a bit more careful this time, as we might have incomplete data. We don't want pairs with travel time of NA showing up in the rankings.

# Rank HQs by their travel time to each unique zip code location.
hqs_to_zips_rank2 <- matrix(nrow=nrow(largest_cities), ncol=nrow(zips_deduped))
for (i in 1:nrow(zips_deduped)) {
    not_na <- !is.na(hqs_to_zips_time[i,2:ncol(hqs_to_zips_time)])
    hqs_to_zips_rank2[not_na,i] <-
        rank(hqs_to_zips_time[i,2:ncol(hqs_to_zips_time)][not_na], ties.method='first')
}

We build our map for the Dallas, TX headquarters the same way as before.

# Now we draw regions for which Dallas is one of the closest 3 HQs by time.
hq_idx <- which(largest_cities$name == 'Dallas TX')
redundancy_levels <- c(3, 2, 1)
fill_alpha <- c(0.15, 0.30, 0.45)
 
map('state')
for (i in 1:length(redundancy_levels)) {
    # Find every zip for which this HQ is within n in time rank.
    within_n <- hqs_to_zips_rank2[hq_idx,] <= redundancy_levels[i]
 
    # Convex hull of zip code points.
    hull_order <- chull(
        zips_deduped$longitude[within_n], 
        zips_deduped$latitude[within_n]
    )
    hull_x <- zips_deduped$longitude[within_n][hull_order]
    hull_y <- zips_deduped$latitude[within_n][hull_order]
    polygon(hull_x, hull_y, border='blue', col=rgb(0, 0, 1, fill_alpha[i]))   
}
 
# The other HQs.
other_hqs = 1:nrow(largest_cities) != hq_idx
points(
    largest_cities$long[other_hqs],
    largest_cities$lat[other_hqs], 
    pch = 21, 
    bg = rgb(0.4, 0.4, 0.4, 0.6),
    col = 'black',
    cex = 1.5
)
 
# This HQ.
points(
    largest_cities$long[hq_idx],
    largest_cities$lat[hq_idx], 
    pch = 21, 
    bg = rgb(1, 0, 0, .85),
    col = 'black',
    cex = 1.5
)

This shows the regions for which Dallas is among the closest headquarters for 1, 2, and 3 levels of redundancy. Compare this map to the one from the previous post, and you'll see that it conforms better to the highway system. For instance, it takes into account I-20, which runs east and west across Texas, instead of pushing up into the Dakotas.

Travel time-based regions for Dallas, TX

And now our map of the US, showing the regions for each HQ as the set of zip codes for which it is the closest.

# Map of regions where every zip is served only by its closest HQ.
map('usa')
for (hq_idx in 1:nrow(largest_cities)) {
    # Find every zip for which this HQ is the closest.
    within_1 <- hqs_to_zips_rank2[hq_idx,] == 1
    within_1[is.na(within_1)] <- F
 
    # Convex hull of zip code points.
    hull_order <- chull(
        zips_deduped$longitude[within_1], 
        zips_deduped$latitude[within_1]
    )
    hull_x <- zips_deduped$longitude[within_1][hull_order]
    hull_y <- zips_deduped$latitude[within_1][hull_order]
    polygon(
        hull_x,
        hull_y,
        border = 'black',
        col = rgb(0, 0, 1, 0.25)
    )
}
 
# All HQs
points(
    largest_cities$long,
    largest_cities$lat, 
    pch = 21, 
    bg = rgb(1, 0, 0, .75),
    col = 'black',
    cex = 1.5
)

This gives us our new map. If we compare this with the original, it should better reflect the topology of the highway system. It also looks a bit less jagged.

All regions with redundancy level of 1

Exercises for the reader:

  • Some of these regions overlap, even though they are supposed to be composed only of zip codes for which a given HQ is the closest. Why is that?
  • Say we want to limit our driver to given maximum travel times. Based on our data from MapQuest, draw concentric circles representing approximate 3, 5, and 7 hour travel time regions.

Wednesday, May 28, 2014

preprocessing for routing problems: part 1

Consider an instance of the vehicle routing problem in which we have drivers that are geographically distributed, each in a unique location. Our goal is to deliver goods or services to a set of destinations at the lowest cost. It does not matter to our customers which driver goes to which destination, so long as the deliveries are made.

One can think of this problem as a collection of travelling salesman problems, where there are multiple salespeople in different locations and a shared set of destinations. We attempt to find the minimum cost schedule for all salespeople that visits all destinations, where each salesman can technically go anywhere.

We believe that sending a driver farther will result in increased cost. But, given a particularly good tour, we might do that anyway. On the other hand, there are plenty of assignments we would never consider. It would be madness to send a driver from Los Angeles to New York if we already have another person stationed near there. Thus there are a large number of scenarios that may be possible, but that we will never pursue.

Our ultimate goal is to construct a model that finds an optimal (or near-optimal) schedule. Before we get to that, we have a bit of preprocessing to do. We would like to create regions for our drivers that make some bit of sense, balancing constraints on travel time with redundant coverage of our customers. Once we have these regions, we will know where we can allow our drivers to go in the final schedule.

Let's get started in R. We'll assume that we have drivers stationed at our regional headquarters in the 25 largest US cities by population. We assume that every possible customer address will be in some five digit zip code in the continental US. We pull this information out of the standard R data sets and pare down to only unique locations, fixing a couple errors in the data along the way.

library(datasets)
library(zipcode)
data(zipcode)
 
# Alexandria, VA is not in Normandy, France.
zipcode[zipcode$zip=='22350', c('latitude', 'longitude')] <- c(38.863930, -77.055547)
 
# New York City, NY is not in Kyrgyzstan.
zipcode$longitude[zipcode$zip=='10200'] <- -zipcode$longitude[zipcode$zip=='10200']
 
# Pare down to zip codes in the continental US.
states_continental <- state.abb[!(state.abb %in% c('AK', 'HI'))]
zips_continental <- subset(zipcode, state %in% states_continental)
zips_deduped <- zips_continental[!duplicated(zips_continental[, c('latitude', 'longitude')]), ]
 
# Geographic information for top 25 cities in the country.
library(maps)
data(us.cities)
largest_cities <- subset(
    us.cities[order(us.cities$pop, decreasing=T),][1:25,],
    select = c('name', 'lat', 'long')
)

With this information we can get some sense of what we're up against. We generate a map of all the zip code locations in blue and our driver locations in red.

# Plot our corporate headquarters and every unique zip code location.
map('state')
points(zips_deduped$longitude, zips_deduped$latitude, pch=21, col=rgb(0, 0, 1, .5), cex=0.1)
points(largest_cities$long, largest_cities$lat, pch=21, bg=rgb(1, 0, 0, .75), col='black', cex=1.5)
Zip code and driver locations

So how do we go about assigning zip codes to our drivers? One option is to draw circles of a given radius around our drivers and increase that radius until we have the coverage we need.

Radius-based regions

On second thought, that doesn't work so well. By the time we have a large enough radius, there will be so much overlap that the assignments won't make much sense. It would be better if we started by assigning each zip code to the driver that is physically closest. We could then start introducing redundancy into our data by adding the second closest driver, and so on.

# Euclidean distance from each HQ to each zip code.
library(SpatialTools)
zips_to_hqs_dist <- dist2(
    matrix(c(zips_deduped$longitude, zips_deduped$latitude), ncol=2),
    matrix(c(largest_cities$long, largest_cities$lat), ncol=2)
)
 
# Rank HQs by their distance to each unique zip code location.
hqs_to_zips_rank <- matrix(nrow=nrow(largest_cities), ncol=nrow(zips_deduped))
for (i in 1:nrow(zips_deduped)) {
    hqs_to_zips_rank[,i] <- rank(zips_to_hqs_dist[i,], ties.method='first')
}

Let's see what this looks like on the map. The following shows what the region for the Dallas, TX driver would be if she were only allowed to visit zip codes for which she is the closest, second closest, and third closest. We map these as polygons using the convex hull of their respective zip code locations.

# Now we draw regions for which Dallas is one of the closest 3 HQs.
hq_idx <- which(largest_cities$name == 'Dallas TX')
redundancy_levels <- c(3, 2, 1)
fill_alpha <- c(0.15, 0.30, 0.45)
 
map('state')
for (i in 1:length(redundancy_levels)) {
    # Find every zip for which this HQ is within n in distance rank.
    within_n <- hqs_to_zips_rank[hq_idx,] <= redundancy_levels[i]
 
    # Convex hull of zip code points.
    hull_order <- chull(
        zips_deduped$longitude[within_n], 
        zips_deduped$latitude[within_n]
    )
    hull_x <- zips_deduped$longitude[within_n][hull_order]
    hull_y <- zips_deduped$latitude[within_n][hull_order]
    polygon(hull_x, hull_y, border='blue', col=rgb(0, 0, 1, fill_alpha[i]))   
}
 
# The other HQs.
other_hqs = 1:nrow(largest_cities) != hq_idx
points(
    largest_cities$long[other_hqs],
    largest_cities$lat[other_hqs], 
    pch = 21, 
    bg = rgb(0.4, 0.4, 0.4, 0.6),
    col = 'black',
    cex = 1.5
)
 
# This HQ.
points(
    largest_cities$long[hq_idx],
    largest_cities$lat[hq_idx], 
    pch = 21, 
    bg = rgb(1, 0, 0, .85),
    col = 'black',
    cex = 1.5
)
Euclidean distance-based regions for Dallas, TX

This makes a bit more sense. If we enforce a redundancy level of 1, then every zip code has exactly one person assigned to it. As we increase that redundancy level, we have more options in terms of driver assignment. And our optimization model will grow correspondingly in size.

The following produces a map of all our regions where each zip code is served only by its closest driver.

# Map of regions where every zip is served only by its closest HQ.
map('usa')
for (hq_idx in 1:nrow(largest_cities)) {
    # Find every zip for which this HQ is the closest.
    within_1 <- hqs_to_zips_rank[hq_idx,] == 1
    within_1[is.na(within_1)] <- F
 
    # Convex hull of zip code points.
    hull_order <- chull(
        zips_deduped$longitude[within_1], 
        zips_deduped$latitude[within_1]
    )
    hull_x <- zips_deduped$longitude[within_1][hull_order]
    hull_y <- zips_deduped$latitude[within_1][hull_order]
    polygon(
        hull_x,
        hull_y,
        border = 'black',
        col = rgb(0, 0, 1, 0.25)
    )
}
 
# All HQs
points(
    largest_cities$long,
    largest_cities$lat, 
    pch = 21, 
    bg = rgb(1, 0, 0, .75),
    col = 'black',
    cex = 1.5
)
All regions with redundancy level of 1

This is a good start. Our preprocessing step gives us a reasonable level of control over the assignments of drivers before we begin optimizing. So what's missing?

One immediately apparent failure is that these regions are based on Euclidean distance. Travel time is not a simple function of that. It would be much better if we could create regions using estimated time, drawing them based on the topology of the highway system. We'll explore techniques for doing so in the next post.

Friday, February 14, 2014

capturing stdout in a python child process

Lately at work I've been putting together a web management console that handles execution of some of our long running Gurobi models. This makes a lot of sense to do since the models run on Gurobi Cloud for hours or even days before achieving their desired optimality gaps. Gurobi Cloud requires a persistent connection, so we use a server to handle that instead of solving models from our desktop machines. In a sense, this is a bit of a throwback to the time sharing systems from before the PC era.

As a result, I've had to update my dormant (and, to be honest, never particularly acute) administration and systems programming skills. There are always some interesting gotchas that come up in this sort of work. This time I ran into the following conundrum: I have a cron job that is responsible for constructing model instances in a child process, handing them off to Gurobi Cloud, and saving their results. Its output is redirected to a file so I can tail it in real time and store it for later. For the web console, I also want to store the output to a database so I can easily display it. And I'd like to do this in as generic a manner as possible, so I can add other types of models (custom heuristics, for instance) and also capture anything they write to standard output and standard error.

There are a number of threads available on capturing standard output. Some of the advice I found recommended using os.dup2 to clone the stdout file descriptor into another file. The problem with this approach is it requires having an open file, which is inconvenient, and it ends up closing the existing stdout. So the output does get redirected, but then it disappears from my log. Also, I could only seem to capture it once. I wanted to do this repeatedly.

Other advice got me closer to the mark: Replace sys.stdout with an instance of StringIO, then just read its contents later. This allowed me to do redirections in serial, but once again meant that standard output was essentially shut off.
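
For reference, here's a minimal sketch of that simpler approach. While sys.stdout points at the StringIO buffer, nothing printed reaches the real standard output, which is exactly the limitation described above:

from StringIO import StringIO
import sys

capture = StringIO()
sys.stdout = capture             # every print statement now goes into the buffer...
print 'hello from the redirect'  # ...so this line never reaches the terminal
sys.stdout = sys.__stdout__      # restore the real stdout
print 'captured: %r' % capture.getvalue()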

The final solution was to subclass StringIO and override its write method. Every time write is called, it writes to both the in-memory string and a canonical reference to stdout. Example code is given below that captures stdout and stderr from a child process while still letting their messages be visible.

from StringIO import StringIO
from multiprocessing import Process, Queue
import sys

# Proxy classes that both write to and capture strings
# sent to stdout or stderr, respectively.
class TeeOut(StringIO):
    def write(self, s):
        StringIO.write(self, s)
        sys.__stdout__.write(s)

class TeeErr(StringIO):
    def write(self, s):
        StringIO.write(self, s)
        sys.__stderr__.write(s)

class Parent(object):
    def run(self):
        # Start child process.
        queue = Queue()
        child = Process(target=self.spawn, args=(queue,))
        child.start()
        child.join()

        # Do something with out and err...
        out = queue.get()
        err = queue.get()

    def spawn(self, queue):
        # Save everything that would otherwise go to stdout.
        out = TeeOut()
        sys.stdout = out

        err = TeeErr()
        sys.stderr = err

        try:
            # Do something...
            print 'out 1'
            print 'out 2'
            sys.stderr.write('error 1\n')
            print 'out 3'
            sys.stderr.write('error 2\n')

        finally:
            # Restore stdout, stderr and send their contents.
            sys.stdout = sys.__stdout__
            sys.stderr = sys.__stderr__
            queue.put(out.getvalue())
            queue.put(err.getvalue())

if __name__ == '__main__':
    Parent().run()

A few minor points:

  • StringIO is an old-style class (it doesn't inherit from object), so the write methods in TeeOut and TeeErr can't call super.
  • The sys module keeps two references to stdout: sys.stdout and sys.__stdout__. We use the latter to recover the former after processing.
  • The contents of standard out and standard error are transmitted from the child process to its parent at the end of processing. A smarter implementation might transfer chunks of them periodically, but I'm not expecting them to be large.
  • The transfer happens in a finally block. Thus any run time exception in the child will not interfere with logging.

Monday, February 3, 2014

chebyshev centers of polygons with gurobipy

A common problem in handling geometric data is determining the center of a given polygon. This is not quite as easy as it sounds, since there is no single definition of center that makes sense in all cases. For instance, sometimes computing the center of a polygon's bounding box may be sufficient. In some instances this may give a point on an edge (consider a right triangle). If the given polygon is non-convex, that point may not even be inside or on its boundary.

This post looks at computing Chebyshev centers for arbitrary convex polygons. We employ essentially the same model as in Boyd & Vandenberghe's Convex Optimization text, but using Gurobi instead of CVXOPT.

Consider a polygon defined by the intersection of a finite number of half-spaces, $Au \le b$. We assume we are given the set of vertices, $V$, in clockwise order around the polygon. $E$ is the set of edges connecting these vertices. Each edge in $E$ defines a boundary of the half-space $a_i'u \le b_i$.

$V = \{(1,1), (2,5), (5,4), (6,2), (4,1)\}$
$E = \{((1,1),(2,5)), ((2,5),(5,4)), ((5,4),(6,2)), ((6,2),(4,1)), ((4,1),(1,1))\}$

The Chebyshev center of this polygon is the center point $(x, y)$ of the maximum radius inscribed circle. That is, if we can find the largest circle that will fit inside our polygon without going outside its boundary, its center is the point we are looking for. Our decision variables are the center $(x, y)$ and the maximum inscribed radius, $r$.

In order to do this, we consider the edges independently. The long line segment below shows an arbitrary edge, $a_i'u \le b_i$. The short line connected to it is orthogonal in the direction $a$. $(x, y)$ satisfies the inequality.

The shortest path from $(x, y)$ to this edge runs in the direction of $a$; call its length $r$. If we slide the edge, keeping its slope, until it passes through $(x, y)$, its right-hand side changes from $b_i$ to $a_i'(x, y)$, and for a circle of radius $r$ around $(x, y)$ to stay inside the edge we need $b_i - a_i'(x, y) \ge r||a_i||_2$. Thus we can add a constraint of the form $a_i'u + r||a_i||_2 \le b_i$, with $u = (x, y)$, for each edge and maximize the value of $r$ as our objective function.

max $r$
subject to $(y_i-y_j)x + (x_j-x_i)y + r\sqrt{(x_j-x_i)^2 + (y_j-y_i)^2} \le (y_i-y_j)x_i + (x_j-x_i)y_i\, \forall\, ((x_i,y_i), (x_j,y_j)) \in E$
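
As a concrete example, the first edge $((1,1),(2,5))$ has $y_i - y_j = -4$ and $x_j - x_i = 1$, so its constraint reads $-4x + y + r\sqrt{17} \le -3$. Plugging in the solution reported below ($x \approx 3.2370$, $y \approx 2.7466$, $r \approx 1.7466$) satisfies this with equality up to rounding, meaning the inscribed circle touches that edge.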

As this is linear, we can solve it using any LP solver. The following code does so with Gurobi.

#!/usr/bin/env python
from gurobipy import Model, GRB
from math import sqrt

vertices = [(1,1), (2,5), (5,4), (6,2), (4,1)]
edges = zip(vertices, vertices[1:] + [vertices[0]])

m = Model()
r = m.addVar()
x = m.addVar(lb=-GRB.INFINITY)
y = m.addVar(lb=-GRB.INFINITY)
m.update()

for (x1, y1), (x2, y2) in edges:
    dx = x2 - x1
    dy = y2 - y1
    m.addConstr((dx*y - dy*x) + (r * sqrt(dx**2 + dy**2)) <= dx*y1 - dy*x1)

m.setObjective(r, GRB.MAXIMIZE)
m.optimize()

print 'r = %.04f' % r.x
print '(x, y) = (%.04f, %.04f)' % (x.x, y.x)

The model output shows our center and its maximum inscribed radius.

r = 1.7466
(x, y) = (3.2370, 2.7466)

Question for the reader: in certain circumstances, such as rectangles, the Chebyshev center is ambiguous. How might one get around this ambiguity?

Monday, July 29, 2013

network splitting

Last week, Paul Rubin wrote an excellent post on Extracting a Connected Graph from an existing graph. Lately I've been performing related functions on data from OpenStreetMap, though without access to a solver. In my case I'm taking in arbitrary network data and splitting it into disconnected subnetworks. I thought it might be a good case study to show an algorithmic way of doing this and some of the performance issues I ran into.

A small example can be seen below. This shows a road network around the Las Vegas strip. There is one main (weakly) connected network in black. The roads highlighted in red are disconnected from the main network. We want code that will split these into connected subnetworks.

Say we have data that looks like the following. The quoted numbers represent edges rather than nodes; think of them as streets. Each edge ID maps to the list of edge IDs it connects to.

{
    "0": [1, 2, 3],
    "1": [9248, 9249, 9250],
    "2": [589, 9665, 9667],
    "3": [0, 5, 6],
    "4": [0, 5, 6],
    "5": [588],
    "6": [4, 8, 9],
    ...
}

Our basic strategy is the following:

  1. Start with every edge alone in its own subnetwork.
  2. For each connection, merge the networks of the source and destination edges.

#!/usr/bin/env python
import json
import sys
import time

class hset(set):
    '''A hashable set. Note that it only hashes by the pointer, and not by the elements.'''
    def __hash__(self):
        return hash(id(self))

    def __cmp__(self, other):
        return cmp(id(self), id(other))

if __name__ == '__main__':
    try:
        inputfile = sys.argv[1]
    except:
        print 'usage: %s network.json' % sys.argv[0]
        sys.exit()

    print time.asctime(), 'parsing json input'
    connections = json.load(open(inputfile))

    edge_to_net = {} # Edge ID -> set([edges that are in the same network])
    nets = set()     # Set of known networks

    print time.asctime(), 'detecting disconnected subgraphs'
    for i, (from_edge, to_set) in enumerate(connections.iteritems()):
        from_edge = int(from_edge)

        try:
            from_net = edge_to_net[from_edge]
        except KeyError:
            from_net = edge_to_net[from_edge] = hset([from_edge])
            nets.add(from_net)

        if not (i+1) % (25 * 1000):
            print time.asctime(), '%d edges processed / %d current subnets' % (i+1, len(nets))
        
        for to in to_set:
            try:
                to_net = edge_to_net[to]

                # If we get here, merge the to_net into the from_net.
                if to_net is not from_net:
                    to_net.update(from_net)
                    for e in from_net:
                        edge_to_net[e] = to_net
                    nets.remove(from_net)
                    from_net = to_net

            except KeyError:
                from_net.add(to)
                edge_to_net[to] = from_net

    print time.asctime(), len(nets), 'subnets found'

We run this against the network pictured above and it works reasonably quickly, finishing in about 7 seconds:

Mon Jul 29 12:22:38 2013 parsing json input
Mon Jul 29 12:22:38 2013 detecting disconnected subgraphs
Mon Jul 29 12:22:38 2013 25000 edges processed / 1970 current subnets
Mon Jul 29 12:22:44 2013 50000 edges processed / 124 current subnets
Mon Jul 29 12:22:45 2013 60 subnets found    

However, when run against a road network for an entire city, the process continues for several hours. What is the issue?

The inefficiency is in the merge step: whenever two networks are combined, we re-point the edge_to_net reference for every element of from_net, so we are frequently updating references for every element of a large set. It would be better to update as few references as possible. Therefore, instead of always merging from_net into to_net, we determine which of the two networks is smaller and merge that one into the larger one. Note that this does not necessarily change the worst-case time complexity of the algorithm, but it should make the code fast enough to be useful. The new version appears below.

#!/usr/bin/env python
import json
import sys
import time

class hset(set):
    '''A hashable set. Note that it only hashes by the pointer, and not by the elements.'''
    def __hash__(self):
        return hash(id(self))

    def __cmp__(self, other):
        return cmp(id(self), id(other))

if __name__ == '__main__':
    try:
        inputfile = sys.argv[1]
    except:
        print 'usage: %s network.json' % sys.argv[0]
        sys.exit()

    print time.asctime(), 'parsing json input'
    connections = json.load(open(inputfile))

    edge_to_net = {} # Edge ID -> set([edges that are in the same network])
    nets = set()     # Set of known networks

    print time.asctime(), 'detecting disconnected subgraphs'
    for i, (from_edge, to_set) in enumerate(connections.iteritems()):
        from_edge = int(from_edge)

        try:
            from_net = edge_to_net[from_edge]
        except KeyError:
            from_net = edge_to_net[from_edge] = hset([from_edge])
            nets.add(from_net)

        if not (i+1) % (25 * 1000):
            print time.asctime(), '%d edges processed / %d current subnets' % (i+1, len(nets))
        
        for to in to_set:
            try:
                to_net = edge_to_net[to]

                # If we get here, merge the to_net into the from_net.
                if to_net is not from_net:
                    # Update references to and remove the smaller set for speed.
                    if len(to_net) < len(from_net):
                        smaller, larger = to_net, from_net
                    else:
                        smaller, larger = from_net, to_net

                    larger.update(smaller)
                    for e in smaller:
                        edge_to_net[e] = larger
                    nets.remove(smaller)
                    edge_to_net[to] = larger
                    from_net = larger

            except KeyError:
                from_net.add(to)
                edge_to_net[to] = from_net

    print time.asctime(), len(nets), 'subnets found'

Indeed, this is significantly faster. And on very large networks it runs in minutes instead of hours or days. On the small test case used for this post, it runs in under a second. While this could probably be done faster, that's actually good enough for right now.

Mon Jul 29 12:39:55 2013 parsing json input
Mon Jul 29 12:39:55 2013 detecting disconnected subgraphs
Mon Jul 29 12:39:55 2013 25000 edges processed / 1970 current subnets
Mon Jul 29 12:39:55 2013 50000 edges processed / 124 current subnets
Mon Jul 29 12:39:55 2013 60 subnets found

Wednesday, July 17, 2013

stack in go

I've been using an increasing amount of Go lately. One of my favorite features is its ability to add methods to any existing type. This is possible for primitive types by defining a named type based on them, similar to using typedef in C.

Recently, I found I needed a stack container. Nothing complicated, merely an implementation of push, pop, and empty methods. While this is easy enough to do using slices, wrapping the required code makes for a much nicer interface.

package main

import "fmt"

type Element float64
type Stack []Element

func (s *Stack) Push(e Element) {
    *s = append(*s, e)
}

func (s *Stack) Pop() Element {
    e := (*s)[len(*s)-1]
    *s = (*s)[:len(*s)-1]
    return e
}

func (s *Stack) Empty() bool {
    return len(*s) < 1
}

func main() {
    s := Stack([]Element{})

    for i := 0; i < 10; i++ {
        s.Push(Element(i))
    }

    for !s.Empty() {
        fmt.Println("pop:", s.Pop())
    }
}

Since there aren't real classes in Go, or inheritance, this doesn't do anything scary like monkey patching. It's just a bit of syntactic sugar to make our lives better. For instance, the Push method could just as easily look like this to the compiler:

func Push(s *Stack, e Element) {
    *s = append(*s, e)
}

Monday, October 8, 2012

web slides with python & math using deck.js, syntaxhighlighter, and mathjax

In putting together some slides for class that include both Python code and math models, I realized that it was high time to update the technology stack I'm using. If anyone else finds this useful, please feel free to copy my starting template and use or improve upon it.

I'm hosting my slides directly out of Dropbox since I can do so without uploading them to a web server. It also means I can edit and view them locally without having to go through any sort of manual syncing process. This is possibly the simplest way in existence to share a page on the internet.

Steps:

  1. Create the directories /public/, /public/www/, /public/www/slides/, and /public/www/slides/boilerplate/ in your Dropbox folder.
  2. Download the latest version of deck.js and decompress it into /public/www/deck.js/.
  3. Download SyntaxHighlighter and decompress it into /public/www/syntaxhighlighter/.
  4. Download MathJax and decompress it into /public/www/mathjax/.
  5. Save this template into /public/www/slides/boilerplate/boilerplate.html.

Now to create a new slide deck, copy the boilerplate directory to something new in /public/www/slides/. You should be able to begin with something that looks like this. Search through its source for "slides start" to see how to create slides that show beautifully rendered Python and LaTeX.

Use the left and right arrow keys or the icons that appear when you hover over the left and right of the slides to navigate.

Monday, September 24, 2012

roll your own gurobi shell

Note: These instructions have been updated to use the setup.py from the Gurobi distribution.

I've recently started using Gurobi for a class. While I'm pleased with the level of Python support they provide, I'm not a fan of the Gurobi shell they ship it with.

  • The shell is Python, which is good. However, instead of bundling the Gurobi libraries as a Python module that I can install alongside my system's native Python, they ship their own Python binaries. If I want to use any third-party modules I have to make them available to Gurobi somehow.
  • They appear to have removed distutils from the Python they ship.
  • The Gurobi shell doesn't give me module docs, tab completion, or anything I'd get from either ipython or bpython. It's nothing more than the rudimentary cPython shell.

They do, however, provide a setup.py script, so we don't have to use gurobi.sh or the Python that's in the distribution. Here are simple instructions for using bpython as your shell. You should have Gurobi 5.0.1 installed and the GUROBI_HOME environment variable set. (I assume you are installing it into your home directory.)

$ mkdir -p $HOME/lib/python2.7/site-packages/
$ cd $GUROBI_HOME
$ PYTHONPATH=$HOME/lib/python2.7/site-packages/ python setup.py install --prefix=$HOME

Update your environment's PYTHONPATH and LD_LIBRARY_PATH:

$ echo "export PYTHONPATH=\$PYTHONPATH:\$HOME/lib/python2.7/site-packages/" >> $HOME/.bashrc
$ echo "export LD_LIBRARY_PATH=\$LD_LIBRARY_PATH:\$GUROBI_HOME/lib/" >> $HOME/.bashrc
$ source $HOME/.bashrc

Install bpython:

$ sudo apt-get install bpython

Now type bpython and you should have the lovely interface you see below, complete with tab completion and module docs. Isn't that better?

Saturday, September 22, 2012

lagrangian relaxation with gurobipy

Note: this is still a new technique to me, so please point out any errors and I will correct them in the post.

Correction: $b_{ij}$ is now $b_j$ in the second set of constraints.

We've been studying Lagrangian Relaxation (LR) in the combinatorial optimization course I'm taking this term, and I had some difficulty finding a simple example covering its application. In case anyone else finds it useful, I'm posting a Python version for solving the Generalized Assignment Problem (GAP). This won't discuss the theory of LR at all, just give example code using Gurobi 5.0.1.

The GAP as defined by Wolsey consists of a maximization problem subject to a set of set packing constraints followed by a set of knapsack constraints:

max $\sum_i \sum_j c_{ij} x_{ij}$
subject to $\sum_j x_{ij} \leq 1 \forall i$
$\sum_i a_{ij} x_{ij} \leq b_j \forall j$
$x_{ij} \in \{0, 1\}$

An unadulterated version of this BIP using gurobipy might look like the following:

#!/usr/bin/env python

# This is the GAP per Wolsey, pg 182

from gurobipy import GRB, Model

model = Model('GAP per Wolsey')
model.modelSense = GRB.MAXIMIZE
model.setParam('OutputFlag', False) # turns off solver chatter

b = [15, 15, 15]
c = [
    [ 6, 10,  1],
    [12, 12,  5],
    [15,  4,  3],
    [10,  3,  9],
    [ 8,  9,  5]
]
a = [
    [ 5,  7,  2],
    [14,  8,  7],
    [10,  6, 12],
    [ 8,  4, 15],
    [ 6, 12,  5]
]

# x[i][j] = 1 if i is assigned to j
x = []
for i in range(len(c)):
    x_i = []
    for j in c[i]:
        x_i.append(model.addVar(vtype=GRB.BINARY))
    x.append(x_i)

# We have to update the model so it knows about new variables
model.update()

# sum j: x_ij <= 1 for all i
for x_i in x:
    model.addConstr(sum(x_i) <= 1)

# sum i: a_ij * x_ij <= b[j] for all j
for j in range(len(b)):
    model.addConstr(sum(a[i][j] * x[i][j] for i in range(len(x))) <= b[j])

# max sum i,j: c_ij * x_ij
model.setObjective(
    sum(
        sum(c_ij * x_ij for c_ij, x_ij in zip(c_i, x_i))
        for c_i, x_i in zip(c, x)
    )
)
model.optimize()

# Pull objective and variable values out of model
print 'objective =', model.objVal
print 'x = ['
for x_i in x:
    print '   ', [1 if x_ij.x >= 0.5 else 0 for x_ij in x_i]
print ']'

The solver quickly finds the following optimal solution of this toy problem:

objective = 46.0
x = [
    [0, 1, 0]
    [0, 1, 0]
    [1, 0, 0]
    [0, 0, 1]
    [0, 0, 0]
]

There are two sets of constraints we can dualize. It can be beneficial to apply Lagrangian Relaxation against problems composed of knapsack constraints, so we will dualize the set packing ones. The first thing we do is replace the set packing constraints ($\sum_j x_{ij} \leq 1$) with a new set of variables, penalties, which take the values of the slacks on those constraints. We then modify the objective function, adding Lagrangian multipliers times these penalties.
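
For a fixed vector of multipliers $u \ge 0$, each iteration therefore solves a relaxed problem of the form

max $\sum_i \sum_j c_{ij} x_{ij} + \sum_i u_i \left(1 - \sum_j x_{ij}\right)$
subject to $\sum_i a_{ij} x_{ij} \leq b_j \forall j$
$x_{ij} \in \{0, 1\}$

where each slack term $1 - \sum_j x_{ij}$ is held in a penalty variable so that its value is easy to read back from the model.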

Instead of optimizing once, we do so iteratively. An important consideration is we may get nothing more than a dual bound from this process. Any integer solution is not guaranteed to be primal feasible unless it satisfies complementary slackness conditions -- for each dualized constraint either its multiplier or penalty must be zero.

Setting the initial multiplier values to 2, we use subgradient optimization with a step size of 1 / (iteration #) to adjust them.
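
Spelled out, the multiplier update used in the code is

$u_i \leftarrow \max\left(u_i - \frac{1}{k} p_i,\ 0\right)$

where $p_i$ is the value the $i$-th penalty variable takes at iteration $k$. With this update in place, we arrive at the following model: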

#!/usr/bin/env python

# This is the GAP per Wolsey, pg 182, using Lagrangian Relaxation

from gurobipy import GRB, Model

model = Model('GAP per Wolsey with Lagrangian Relaxation')
model.modelSense = GRB.MAXIMIZE
model.setParam('OutputFlag', False) # turns off solver chatter

b = [15, 15, 15]
c = [
    [ 6, 10,  1],
    [12, 12,  5],
    [15,  4,  3],
    [10,  3,  9],
    [ 8,  9,  5]
]
a = [
    [ 5,  7,  2],
    [14,  8,  7],
    [10,  6, 12],
    [ 8,  4, 15],
    [ 6, 12,  5]
]

# x[i][j] = 1 if i is assigned to j
x = []
for i in range(len(c)):
    x_i = []
    for j in c[i]:
        x_i.append(model.addVar(vtype=GRB.BINARY))
    x.append(x_i)

# As stated, the GAP includes the set packing constraints above. We dualize these into
# penalties instead, using variables so we can easily extract their values.
penalties = [model.addVar() for _ in x]
model.update()

# Dualized constraints: sum j: x_ij <= 1 for all i
for p, x_i in zip(penalties, x):
    model.addConstr(p == 1 - sum(x_i))

# sum i: a_ij * x_ij <= b[j] for all j
for j in range(len(b)):
    model.addConstr(sum(a[i][j] * x[i][j] for i in range(len(x))) <= b[j])

# u[i] = Lagrangian Multiplier for the set packing constraint i
u = [2.0] * len(x)

# Re-optimize until either we have run a certain number of iterations
# or complementary slackness conditions apply.
for k in range(1, 101):
    # max sum i,j: c_ij * x_ij
    model.setObjective(
        sum(
            # Original objective function
            sum(c_ij * x_ij for c_ij, x_ij in zip(c_i, x_i))
            for c_i, x_i in zip(c, x)
        ) + sum (
            # Penalties for dualized constraints
            u_j * p_j for u_j, p_j in zip(u, penalties)
        )
    )
    model.optimize()

    print 'iteration', k, 'obj =', model.objVal, \
        'u =', u, 'penalties =', [p.x for p in penalties]

    # Test for complementary slackness
    stop = True
    eps = 10e-6
    for u_i, p_i in zip(u, penalties):
        if abs(u_i) > eps and abs(p_i.x) > eps:
            stop = False
            break

    if stop:
        print 'primal feasible & optimal'
        break

    else:
        s = 1.0 / k
        for i in range(len(x)):
            u[i] = max(u[i] - s*(penalties[i].x), 0.0)

# Pull objective and variable values out of model
print 'objective =', model.objVal
print 'x = ['
for x_i in x:
    print '   ', [1 if x_ij.x >= 0.5 else 0 for x_ij in x_i]
print ']'

On this problem LR converges very quickly to an optimal solution, as we can see here:

iteration 1 obj = 48.0 u = [2.0, 2.0, 2.0, 2.0, 2.0] penalties = [0.0, 0.0, 0.0, 0.0, 1.0]
iteration 2 obj = 47.0 u = [2.0, 2.0, 2.0, 2.0, 1.0] penalties = [0.0, 0.0, 0.0, 0.0, 1.0]
iteration 3 obj = 46.5 u = [2.0, 2.0, 2.0, 2.0, 0.5] penalties = [0.0, 0.0, 0.0, 0.0, 1.0]
iteration 4 obj = 46.1666666667 u = [2.0, 2.0, 2.0, 2.0, 0.16666666666666669] penalties = [0.0, 0.0, 0.0, 0.0, 1.0]
iteration 5 obj = 46.0 u = [2.0, 2.0, 2.0, 2.0, 0.0] penalties = [0.0, 0.0, 0.0, 0.0, 1.0]
primal feasible & optimal
objective = 46.0
x = [
    [0, 1, 0]
    [0, 1, 0]
    [1, 0, 0]
    [0, 0, 1]
    [0, 0, 0]
]

Exercise for the reader: change the script to dualize the knapsack constraints instead of the set packing constraints. What is the result of this change in terms of convergence?