Data Analysis

Aggregating Data in R

Like any multi-paradigm language, R offers a number of options for looping. Its declarative constructs provide a convenient mechanism for aggregating data. This Experimental-RLab post aggregates the built-in airquality data by month before calculating the correlation between two variables, Ozone and Wind. In SQL, this would look something like:

SELECT Month,
       fn_correlate(Ozone, Wind) AS Correlation
FROM airquality
GROUP BY Month;

Method 1: Imperative Loops
aggregateC <- function (x) {

    results <- c()    # initialise an empty vector

    for (lclMonth in unique(x$Month)) {

        # clean the data, and extract the current month
        lclData <- x[complete.cases(x), ]
        lclData <- lclData[lclData$Month == lclMonth, ]
    
        # calculate the correlation, and extend `results`
        results <- c(results, cor(lclData$Ozone, lclData$Wind))

    }
    return ( Filter(Negate(is.na), results) )
}
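
As a quick check, the function can be run against the built-in airquality data set (the one referenced in the SQL above), which covers May through September:

# airquality ships with base R; Month runs from 5 (May) to 9 (September)
data(airquality)

# returns one Ozone/Wind correlation per month present in the data
aggregateC(airquality)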

Things I like about the imperative approach:

  • it is simple, and the code is clear
  • it is easy to grasp what this function is doing
  • as a “practical” method, imperative loops scale well with increasingly large datasets (on a single machine with ample memory)

Things I’m not crazy about:

  • in general, I am not crazy about explicit loops
  • explicit looping requires you to think about individual elements, as opposed to the dataset as a whole
  • imperative concepts do not scale in my head: I find it more difficult to grasp a problem this way as it gets bigger
  • practically, the imperative paradigm does not distribute easily, which is important when you are considering very large problems

Method 2: Declarative Loops
aggregateC <- function(x) {

    # clean data
    lclData <- x[complete.cases(x), ]

    # split the cleaned data into one subset per month
    aggregateData <- split(lclData, lclData$Month)

    # apply cor() to each monthly subset
    results <- sapply(aggregateData, function(z) cor(z$Ozone, z$Wind))

    return ( Filter(Negate(is.na), results) )
}
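
One small difference worth noting: split() names each subset after its grouping value, so the declarative version returns a vector named by month, which makes the output a little easier to read:

data(airquality)

# a named vector, with one correlation per month ("5" through "9" here)
aggregateC(airquality)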

Things I like about declarative loops:

  • conceptually, the code addresses the entire dataset (not individual elements)
  • conceptually, a declarative approach scales better in my head (simplifies the way I think about a problem)
  • more likely to scale gracefully to large distributed applications

Things I’m not crazy about:

  • requires more effort to understand the code (you need to know, or look up, functions like split() and sapply())
  • however, this effort is mitigated by the use of standard R functions, which are themselves well understood

Aggregating data is essential for nearly any interesting data science question. For example, if you had a database full of rugby results, you are more likely to be interested in the winning stats of a given team than in the winning stats of a contiguous series of matches. I have to admit I like R’s split() function for aggregating data. Although I love Python, and hear a lot about Python’s popularity in data-intensive fields, I am finding R incredibly powerful and enjoyable.
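
As a minimal sketch of that idea, assuming a hypothetical data frame of match results with Team and Won columns, the same split()/sapply() pattern gives a winning percentage per team:

# hypothetical results table: one row per match played by a team
rugby <- data.frame(
    Team = c("Leinster", "Leinster", "Munster", "Munster", "Ulster"),
    Won  = c(TRUE, FALSE, TRUE, TRUE, FALSE)
)

# split by team, then take the proportion of wins in each group
sapply(split(rugby, rugby$Team), function(z) mean(z$Won))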
