Standard evaluation in dplyr
Many of my data analyses over the past few months have used R to analyze large data frames. I was excited when Hadley and Romain released dplyr, their high-performance package for data frame analysis and manipulation. Although the performance is awesome, I think dplyr also excels at making it much easier to write R code and to compose useful functions from a set of simple functions they call “verbs”.
Despite having used R a fair bit, I almost always need to google or search Stack Overflow for simple things related to the grammar of R. Subsetting, renaming, etc. are not at all intuitive, and I think this is part of the reason why python/pandas, with its rather straightforward syntax, is easier to pick up. So dplyr’s straightforward syntax, built on a small number of well-defined “verbs”, and its use of the piping operator %>% make it an excellent choice for data analysis. However, one thing was a bit problematic when I was using dplyr: although running a dplyr-based data analysis pipeline was straightforward, writing a function that wraps that analysis was impossible... until now.
In dplyr 0.3, Hadley has introduced a way to program with dplyr. That’s fantastic news, and I’ll give a short example below in which we write a simple function that uses dplyr to take a dataset, group by the values in a specified column, count the number of members in each group, and return the counts.
library(dplyr)
# Basic grouping functionality I want to achieve:
# take the iris dataset, group, summarise, and return
group_iris <- function(df = iris) {
  df2 <- df %>%
    group_by(Species) %>%
    summarise(count = n())  # n() counts the rows in each group
  return(df2)
}
# This version does not work as intended: the regular dplyr verbs capture the
# expression `columnname` itself rather than the column named by the parameter
group_iris <- function(columnname, df = iris) {
  df2 <- df %>%
    group_by(columnname) %>%
    summarise(count = n())
  return(df2)
}
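# Solution: use the alternative dplyr functions with the "_" suffix,
# which accept quoted input and so can be used inside functions
group_iris <- function(columnname, df = iris) {
  df2 <- df %>%
    group_by_(columnname) %>%   # columnname holds a string such as "Species"
    summarise_(count = ~n())    # the formula ~n() quotes the expression
  return(df2)
}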
df <- group_iris("Species")
df
Note that there are a few differences:
- the use of group_by_ and summarise_ instead of group_by and summarise, and
- the need to quote the input, either as a string such as "Species" or as a formula such as ~n(), instead of writing a bare expression.
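To see why quoting is needed, here is a minimal base R sketch. The helper names nse_name and se_name are hypothetical and not part of dplyr, and the sketch assumes the regular dplyr verbs capture their arguments in roughly the way substitute() does, which is the usual mechanism for non-standard evaluation in R:

```r
# Hypothetical helper mimicking non-standard evaluation (NSE):
# it captures the expression the caller typed, not its value.
nse_name <- function(x) deparse(substitute(x))

# Hypothetical helper using standard evaluation (SE):
# it receives the value of its argument, like any ordinary R function.
se_name <- function(x) x

columnname <- "Species"

nse_name(Species)      # "Species": fine when the column name is typed directly
nse_name(columnname)   # "columnname": sees the parameter's name, not "Species"
se_name(columnname)    # "Species": the value is passed through
```

The second call shows the problem inside a function: an NSE verb sees the word columnname, not the column it stands for, so the value has to be handed over explicitly as a string or formula.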
These changes (the underscore suffix and the quoting) come from Hadley’s non-standard evaluation package, lazyeval. The benefits may not be immediately obvious, but in my latest data analysis I was writing a lot of functions that did the same thing. Now I can replace each group of such functions with a single function and cut the total code base roughly in half. This is a really great addition, and it means I will probably use dplyr rather than base R for most data processing.
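As a rough sketch of that kind of consolidation, a single helper can count the groups for any column of any data frame. The name count_by is hypothetical, and the code assumes a dplyr version in which group_by_ and summarise_ are available (0.3 or later; newer releases mark them as deprecated, so expect warnings there):

```r
library(dplyr)

# Hypothetical helper: count the rows in each group of `columnname`.
# `columnname` is a string, so it can be passed straight to group_by_().
count_by <- function(df, columnname) {
  df %>%
    group_by_(columnname) %>%
    summarise_(count = ~n())
}

count_by(iris, "Species")   # one row per species
count_by(mtcars, "cyl")     # one row per cylinder count
```

Because the column is an ordinary argument, the same function covers what would otherwise be one hand-written function per column.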