
Analysing the Elections

7 PM Saturday night rolled around and we tuned the television into the live election broadcasts. The polls were closed and everyone was expecting a close-fought battle between a hopeful Labour party and a beleaguered National party. Negative publicity around the Dotcom scandal and the shadow of alleged dirty politics made it seem like if National were to cross the line, it would do so with a severe limp. Nobody predicted National’s landslide win, which could see them in power without having to bend the knee to the smaller coalition parties. Along with the triumph came the drama of a defeated Labour, a less-than-popular leader in Cunliffe, the fall of “brand Dotcom”, a politically impotent Winston Peters… All in all, it has to be one of the more dynamic and interesting elections New Zealand has experienced. But despite all the hoo-ha, I didn’t wait up for the live results. Instead, I called it an early night and woke up early to start digging into the voting data!

8 AM this morning, I turned on the computer and logged onto The New Zealand Herald’s website to see a victorious John Key on every headline, with stories of one of the most comprehensive election victories in decades. I poked around the news sites for an hour or so while having breakfast, and then got comfortable for the main show… the voting statistics! With coffee, snacks and no real plans, I headed over to the New Zealand Electoral Commission’s website, cracked my knuckles, stretched my back, and dove head first into their databases. I couldn’t believe it! A mere 10 hours after the results were announced, the entire dataset was published online, freely and publicly available. Three keystrokes and two excited seconds later, RStudio was alive on my desktop, waiting patiently to extract everything the elections had to offer…

The first job was to grab an index of all the election results; I decided to focus on the results by Electorate. The good news is that the Electoral Commission had already done the hard work for us, so I read in their index of results (by Electorate) and gathered the Electorate names, IDs and the URLs to their data. The code for this is below:

library(XML)
library(plyr)

#  Extract an Index of Electorates, including:
#        - Electorate name
#        - Electorate ID
#        - URL to the summary voting data
#
# Result: 2 data frames:
#        electURLs  <- (ID, URL)
#        electNames <- (ID, Electorate)

electURL <- "http://www.electionresults.govt.nz/electionresults_2014/electorateindex.html"
electDOC <- htmlTreeParse(electURL, useInternalNodes = TRUE)

# Pull every hyperlink from the index page, then keep only the links
# that point to individual electorate results pages
rawLinks <- xpathSApply(electDOC, '//a/@href')
electLinks <- rawLinks[grep("electorate-", rawLinks)]

electURLs <- paste("http://www.electionresults.govt.nz/electionresults_2014/", electLinks, sep='')
electURLs <- data.frame(ID=c(1:length(electURLs)), url=electURLs)

electNames <- xpathSApply(electDOC, "//a[contains(@href,'electorate-')]", xmlValue)
electNames <- data.frame(ID = c(1:length(electNames)), Electorate=electNames)

# clean up
rm(electURL)
rm(electDOC)
rm(electLinks)
rm(rawLinks)

With an index of the raw data in hand, the next step was to extract the raw information from each URL. The beauty of R is that it is strongly rooted in the declarative paradigm, which means it lends itself very well to functional styles of programming. The end result is a method of programming that is highly generalisable: a problem is approached from a very high level of abstraction and is then generalised to inputs of the same type, but not necessarily the same length or size. This is particularly useful for our election data, where each electorate contains common elements (party names, vote counts, candidate names and candidate votes, total votes, leading members…) in a common format (HTML), but the tables vary in length and size, and not all parties are represented in every electorate. To sum up: each electorate shares a common structure (type), but the details vary significantly.
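
As a toy illustration of the difference (the vote numbers below are made up purely for the example), compare a procedural loop with its functional equivalent for totalling vote tallies of unequal lengths:

# Hypothetical vote tallies for three electorates, each with a
# different number of parties
votes <- list(c(120, 340, 95), c(88, 410), c(230, 150, 60, 12))

# Procedural style: pre-allocate and index explicitly
totals <- numeric(length(votes))
for (i in seq_along(votes)) {
    totals[i] <- sum(votes[[i]])
}

# Functional style: a single expression that generalises to any number
# of electorates of any size
totals <- sapply(votes, sum)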


There is no way that procedural subsetting (subsetting of arrays by position) would be suitable for this problem – well, not unless you enjoy nesting loops… But thankfully, HTML is also a declarative language, and one that interfaces really well with functional styles. Looking into the HTML, we can see that most of the data we need is held in columns within a table. To extract the table, we are going to take advantage of the fact that each row is highlighted with one of two styles (heavy or light). We assume the table is always 6 columns wide but of variable length (k rows). Finally, with the table data in hand, we will put it into a 6-by-k matrix and transpose it into a k-by-6 data frame. I have wrapped this into a function called extract():


extract <- function(x) {

    #     takes a one-row data frame (ID, url) as input and returns a
    #     data frame of all the table variables for that electorate

    #     read in the HTML and extract the table data (td elements), then
    #     coerce the data into a k x 6 data frame with columns:
    #        ("PartyName", "PartyVotes", "Blank", "CandidateName", "CandidateParty", "CandidateVotes")
    #     (ddply prepends the electorate ID later)

    doc <- htmlTreeParse(as.character(x$url), useInternalNodes=TRUE)

    #     each data row is styled as either 'hhevy' or 'hlite', so grabbing
    #     both classes captures every row; the row order gets shuffled, but
    #     each row is self-contained so this does not matter
    rawData <- c(   xpathSApply(doc, "//table/tr[contains(@class, 'hhevy')]/td", xmlValue)
                  , xpathSApply(doc, "//table/tr[contains(@class, 'hlite')]/td", xmlValue) )

    #     fill a 6-row matrix column by column (one column per table row),
    #     then transpose so each table row becomes a data frame row
    lclData <- as.data.frame( t( matrix(   rawData
                                         , nrow=6
                                         , ncol=length(rawData)/6 )))

    rm(doc)
    rm(rawData)

    return (lclData)
}

We now have everything we need to extract all the Electorate data published by the New Zealand Electoral Commission. All that is left is to call extract() on each URL gathered previously and clean up the data. Let’s do this now:

# Gather the voting data for all 71 electorates into a single data frame

votingData <- ddply(electURLs, "ID", extract)
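
Before cleaning up, it is worth a quick sanity check that the scrape actually worked (this snippet is purely for inspection and is not part of the pipeline). At this stage the columns still carry the generic V1–V6 names that ddply assigned:

# Expect 7 columns (the ID plus the six table columns, still named V1..V6)
# and one block of rows per electorate
str(votingData)
length(unique(votingData$ID))   # should be 71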

To clean up the data, we first remove the blank 4th column and then merge the voting data with the Electorate names to yield a complete data set. Finally, the clean data set is written to file:

# clean up the data
#   1. Remove the 4th column, which is blank, and provide names for the columns
#   2. Merge the Electoral Names data with the voting data
#   3. Write out the clean data

votingData <- votingData[c(-4)]

names(votingData) <- c("ID", "Party", "PartyVotes", "MPName", "MPParty", "MPVotes")

cleanData <- merge(electNames, votingData)

# Finally, I need to tweak the vote counts to remove the commas and ensure they can
# be read in as numeric data
cleanData$PartyVotes <- gsub(",", "", cleanData$PartyVotes)
cleanData$MPVotes <- gsub(",", "", cleanData$MPVotes)

write.csv(cleanData, file="clean_electorate_data.csv", row.names=FALSE)

rm(votingData)
rm(cleanData)
rm(electNames)
rm(electURLs)
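
As a final check that the file round-trips cleanly, we can read it straight back in. The party-vote tally below is purely illustrative and not part of the original pipeline:

# Read the clean data back; with the commas stripped, read.csv parses the
# vote columns as numeric (blank cells become NA)
checkData <- read.csv("clean_electorate_data.csv", stringsAsFactors=FALSE)

# Illustrative summary: total party votes across all electorates,
# largest first (the formula interface drops NA rows by default)
partyTotals <- aggregate(PartyVotes ~ Party, data=checkData, FUN=sum)
head(partyTotals[order(-partyTotals$PartyVotes), ])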

So while the rest of the country celebrated (or commiserated over) the election result, I spent the morning blissfully awash in the resulting voting data. In future posts I will explore this data and see what correlations and insights might be gleaned from the nation’s votes.
