Standard evaluation in dplyr

Many of my data analyses over the past few months have used R to work with large data frames, so I was excited when Hadley and Romain released dplyr, their high-performance package for data frame analysis and manipulation. Although the performance is awesome, I think dplyr also excels at making R code easier to write, letting you compose useful functions from a small set of simple functions they call “verbs”.

Despite having used R a fair bit, I almost always need to Google or search Stack Overflow for simple things related to the grammar of R. Subsetting, renaming, etc. are not at all intuitive, and I think this is part of the reason why Python/pandas, with its rather straightforward syntax, is easier to pick up. So dplyr’s straightforward syntax, a small number of well-defined “verbs” combined with the piping operator %>%, makes it an excellent choice for data analysis. However, one thing was a bit problematic when I was using dplyr: although running a dplyr-based analysis pipeline interactively was straightforward, wrapping that pipeline in a function that takes the column name as an argument was impossible... until now.

In dplyr 0.3, Hadley introduced a way to program with dplyr. That’s fantastic news, and I’ll give a short example below where we write a simple function that uses dplyr to take a dataset, group it by the values in a specified column, count the number of rows in each group, and return the counts.

 library(dplyr)

 # Basic grouping functionality I want to achieve:
 # take the iris dataset, group by Species, count the rows per group, and return
 group_iris <- function(df = iris) {
   df2 <- df %>%
     group_by(Species) %>%
     summarise(count = n())  # n() counts rows; sum() fails on a factor
   return(df2)
 }

 # This version does not work as intended: group_by() takes the name
 # columnname literally instead of using the column whose name it holds
 group_iris <- function(columnname, df = iris) {
   df2 <- df %>%
     group_by(columnname) %>%
     summarise(count = n())
   return(df2)
 }

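 # Solution: use the standard-evaluation variants (suffix "_"), which accept
 # the column name as a string, so it can come from a function argument
 group_iris <- function(columnname, df = iris) {
   df2 <- df %>%
     group_by_(columnname) %>%   # columnname holds a string such as "Species"
     summarise(count = n())      # n() counts the rows in each group
   return(df2)
 }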
 df <- group_iris("Species")
 df

Note that there are a couple of differences:

  1. the use of group_by_ instead of group_by, and
  2. the need to pass the column as a quoted value, either a string such as "Species" or a formula such as ~Species, instead of a bare column name.
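
To make the quoting point concrete, here is a small sketch of the ways I understand a column can be handed to group_by_: as a string, as a one-sided formula, or via quote(). It assumes a dplyr version that still provides the underscore verbs (later releases deprecated them), and the variable names by_string, by_formula and by_quote are just mine for illustration.

```r
library(dplyr)

# Three ways to quote the grouping column for group_by_();
# each pipeline should produce the same per-species row counts.
by_string  <- iris %>% group_by_("Species")      %>% summarise(count = n())
by_formula <- iris %>% group_by_(~Species)       %>% summarise(count = n())
by_quote   <- iris %>% group_by_(quote(Species)) %>% summarise(count = n())
```

The string form is the one that matters inside a function, because a string is what a caller can pass in as an ordinary argument.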

These modifications are due to Hadley’s lazyeval package, which handles non-standard evaluation. The benefits may not be immediately obvious, but in my latest data analysis I was writing a lot of functions that all did the same thing. Now I can replace each set of such functions with a single function and reduce the total code base by roughly half. This is a really great addition that means I will probably use dplyr more than base R for most processing.
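
As a rough sketch of what that consolidation can look like, the function below takes both the data frame and the column name as arguments, so one definition covers any dataset. The name count_by is hypothetical (it is not part of dplyr), and the code again assumes a dplyr version that still provides group_by_.

```r
library(dplyr)

# Hypothetical helper (not part of dplyr): count the rows in each group
# of any data frame, with the grouping column given as a string.
count_by <- function(df, columnname) {
  df %>%
    group_by_(columnname) %>%   # standard-evaluation verb: accepts a string
    summarise(count = n())      # one row per group with its row count
}

# The same function now works across datasets and columns
species_counts <- count_by(iris, "Species")
cyl_counts     <- count_by(mtcars, "cyl")
```

Instead of one hand-written function per dataset or column, the column name becomes ordinary data that the caller supplies.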