Data Analysis

Sorting Data in R

Experimental RLab #2

Sorting is one of the fundamental techniques of data analysis. In R, sorting is done in memory with functions such as `sort` and `order`. SQL-like queries are also possible through the `sqldf` library.
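As a quick illustration of how the two functions relate, here is a minimal sketch on a small made-up vector (`v` and the other variable names exist only for this example): `sort` returns the values themselves, while `order` returns the positions that would put them in order.

```r
# A small fixed vector so the results are easy to follow
v <- c(30, 10, 20)

sorted_values <- sort(v)    # the values themselves, ascending
sort_index    <- order(v)   # positions of v in ascending order

# Indexing v with the permutation reproduces the sorted values
stopifnot(identical(v[sort_index], sorted_values))
```

Because `order` gives back positions rather than values, the same index can be reused to rearrange the rows of a data frame, which is what the practice section relies on.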

# SORT() and ORDER()

## SORT()
# a random, normal vector
x <- rnorm(10)

# sort x using sort()
sort(x)

# retrieve the sorted order of the values in x, to apply to x
# (index.return=TRUE returns a list; the index vector is its $ix element)
idx_x <- sort(x, index.return=TRUE)$ix

## ORDER()
# like sort(x, index.return=TRUE)$ix, this returns the index
# of the ordered elements
idx_x2 <- order(x)
identical(x[idx_x], x[idx_x2])

# PRACTICE: the mtcars dataset

# sort mtcars by horsepower (highest first), returning the top 5 cars and the horsepower
idx_hp <- order(mtcars$hp, decreasing=TRUE, na.last=NA)
head(mtcars[idx_hp, "hp", drop=FALSE], 5)

# deterministic sort, breaking 'ties' where cars have equal horsepower
# sort on hp, then on mpg (pass plain vectors to order(), not one-column data frames)
idx_hp_mpg <- order(mtcars$hp, mtcars$mpg, na.last=NA)
mtcars[idx_hp_mpg, ]

# let's make the sort a little clearer, and take only the subset we are interested in
selected <- subset(mtcars[idx_hp_mpg, ], select=c(mpg, hp))


## the sqldf library provides SQL-like queries over data frames
# so we can do the same:
library(sqldf)

sqlResults <- sqldf("SELECT row_names, mpg, hp
                     FROM mtcars
                     ORDER BY hp, mpg", row.names=TRUE)

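# note: the two Mazda RX4 rows tie on both hp and mpg, and SQL does not
# guarantee the order of tied rows, so this check can depend on tie order
identical(selected, sqlResults)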
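The multi-key sort above is ascending on every key. As a hedged sketch of mixing directions in base R, the example below sorts by horsepower descending and breaks ties by mpg ascending; negating a numeric key is one common way to reverse it inside `order`. The names `idx_desc` and `top5` are made up for this illustration.

```r
# Sort by horsepower descending, breaking ties by mpg ascending.
# Negating a numeric key reverses its direction inside order().
idx_desc <- order(-mtcars$hp, mtcars$mpg)

# Keep the five most powerful cars and only the columns of interest
top5 <- head(mtcars[idx_desc, c("hp", "mpg")], 5)
print(top5)
```

Negation only works for numeric keys; for character keys, `order` also accepts a vector-valued `decreasing` argument when `method = "radix"` is used.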

The official CRAN package site for sqldf is:
Google Code Page:
Google Group for sqldf:!forum/sqldf
GitHub repo:

