I process text data frequently, typically using regular expressions to tailor the changes that get made. However, I haven't quite settled on which R package I prefer to use. I've been switching between the functions of the base package (such as grep and gsub) and those of regexPipes, stringi, and stringr. I'm particularly interested in how long these functions take to run, since I want to minimize the wait for the data to be transformed and available for use. To better understand which of these packages accomplishes the task fastest, I wrote a user-defined function (UDF) for each package in question and compared the time each took to format the same data in the same way. While I was at it, I also timed these UDFs when they were written (where possible) with two different structures (repeatedly overwriting a single local variable with function results vs. pipelining the results using the magrittr forward pipe operator), as well as executing the UDFs with and without parallel processing. Ultimately I found that the stringi functions executed the fastest, followed by those of the base package. Further, I found that piping increased the processing time and, unsurprisingly, that parallelization decreased it.
I tested the functions by running them over the text in the 'review' field of the movie review data set, available both on Bo Pang's page at Cornell and in the text2vec package. The data consists of instance ids, the sentiment labels, and the actual text of the reviews.
df_test <- read.csv("movie_review.csv", header = T, stringsAsFactors = F)
colnames(df_test)
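If the CSV isn't handy, the same data ships with the text2vec package; the following assumes text2vec is installed and still bundles the data set under the name movie_review.
library(text2vec)
data("movie_review")        # bundled copy of the same data set
df_test <- movie_review
colnames(df_test)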
The values of the 'review' field consist of the full, unprocessed text of the reviews.
df_test[2,]$review
Altogether, the text of the 'review' field for the entire data frame constitutes just over 7 megabytes of data.
object.size(df_test$review)
For me, processing text almost invariably involves, at a minimum, reducing the text to alphanumeric characters and spaces, collapsing repeated spaces, trimming leading and trailing spaces, and replacing select string patterns using regular expressions. As such, the UDFs I wrote include all of these operations. In particular, and in the following order, each UDF: (1) replaces everything except alphanumeric characters and spaces with a space, (2) converts the text to lowercase, (3) collapses repeated whitespace and trims leading and trailing spaces, and (4) replaces selected string patterns, normalizing a few expressions (e.g. "thank you" to "thanks", contractions split by the cleanup) and mapping profanity, politeness markers, and a few topic words to standardized tokens (PROFANITY, POLITE, VIDEO, MONEY).
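As a quick illustration of these steps in order, here they are applied to a made-up string (not taken from the data set):
s <- "  Thank you!!  This movie wasn't worth the cash...  "
s <- gsub("[^A-Za-z0-9 ]", " ", s)        # 1. alphanumeric characters and spaces only
s <- tolower(s)                           # 2. lowercase
s <- gsub("\\s+", " ", s)                 # 3. collapse repeated whitespace...
s <- gsub("^ ", "", s)                    #    ...and trim leading
s <- gsub(" $", "", s)                    #    and trailing spaces
s <- gsub("thank y*o*u", "thanks", s)     # 4. replace selected patterns
s <- gsub("\\b(\\w+)n t\\b", "\\1nt", s)  #    (contractions split by the cleanup)
s <- gsub("\\bcash\\b", "MONEY", s)
s
# [1] "thanks this movie wasnt worth the MONEY"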
Further, I wanted to be absolutely sure that the time I was seeing for each UDF represented the time it took for each to transform the input into exactly the same output. To reassure myself that the function outputs were the same without having to look at the text itself, I took the results of the first text processing function I tested to be the standard output (saved in the data frame column "review_base") and compared them to the results of all subsequent functions (each saved in the column "review_2") using R's identical function.
df_test$review_base <- df_test$review
df_test$review_2 <- df_test$review
identical(df_test$review_base, df_test$review_2)
As can be seen from the following example, this should reliably indicate whether the function outputs are precisely the same, since changing a single character in a single element of one of the two columns is enough to make identical return FALSE.
df_test[1,]$review_2 <- sub("[aeiou]", "b", df_test[1,]$review_2)
identical(df_test$review_base, df_test$review_2)
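If identical() ever comes back FALSE during the tests, a quick way to locate the offending rows before digging into the text is a simple element-wise comparison (a hypothetical diagnostic; I did not need it for the runs below):
# which rows differ between the standard output and the latest test output?
which(df_test$review_base != df_test$review_2)
# here it flags only row 1, the element altered above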
Lastly, I worked through the packages by starting with the base package functions and then proceeding alphabetically through the others: regexPipes, then stringi, and finally stringr. For each package I first tested what I felt was the most ordinary UDF structure: a series of function calls, each overwriting a single local variable with its result. This structure was tested first without parallelization and then with it. Then, for the same package, the UDF with the piping structure was tested, again first without and then with parallelization. (The exception is regexPipes, which exists specifically to support piping, so only its piped form was tested.)
To record the execution times for all of the UDF tests in one place, I created a data frame with the package names, boolean values indicating whether or not the test utilized piping or parallelization, and a column for the execution time (initially recorded as 0 for all tests).
test_results <- data.frame(package = c(rep('base', 4), rep('regexPipes', 2),
                                       rep('stringi', 4), rep('stringr', 4)),
                           stringsAsFactors = FALSE)
test_results$piping <- c(0,0,1,1,1,1,0,0,1,1,0,0,1,1)
test_results$parallel <- rep(c(0,1), 7)
test_results$time <- 0
test_results
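The bookkeeping used below to store each result (subsetting test_results by package, piping, and parallel) gets repetitive; it could be wrapped in a small helper along the lines of the hypothetical record_time() sketched here. I keep the explicit subsetting in what follows so that each step stands on its own.
record_time <- function(results, pkg, piping, parallel, elapsed) {
  # write an elapsed time into the row matching the package/piping/parallel combination
  idx <- results$package == pkg &
    results$piping == piping &
    results$parallel == parallel
  results[idx, "time"] <- as.numeric(elapsed)
  results
}
# e.g. test_results <- record_time(test_results, "base", 0, 0, time)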
I began with what I took to be the most ordinary combination of package and UDF structure: the base package functions, no piping, and no parallelization. The specific base package functions used were gsub and tolower.
process_text <- function(x){
x <- gsub("[^A-Za-z0-9 ]", " ", x) # alphanumeric and spaces only
x <- tolower(x) # to lowercase
x <- gsub("\\s+", " ", x) # simplify and trim spaces
x <- gsub("^ ", "", x)
x <- gsub(" $", "", x)
x <- gsub("thank y*o*u","thanks", x) # replace select expressions
x <- gsub("\\b(\\w+)n t\\b","\\1nt", x)
x <- gsub("\\bfu*ck.*\\b", "PROFANITY", x)
x <- gsub("\\bshi*t+y*\\b", "PROFANITY", x)
x <- gsub("\\bcrap+.*\\b", "PROFANITY", x)
x <- gsub("\\bass\\b", "PROFANITY", x)
x <- gsub("\\basses\\b", "PROFANITY", x)
x <- gsub("\\bda+mn[a-z]*\\b", "PROFANITY", x)
x <- gsub("\\bple+a+s+e+\\b", "POLITE", x)
x <- gsub("\\bthanks+\\b", "POLITE", x)
x <- gsub("\\bcould\\b", "POLITE", x)
x <- gsub("\\bwould\\b", "POLITE", x)
x <- gsub("\\bvideo\\w*\\b", "VIDEO", x)
x <- gsub("\\bvid[sz]*\\b", "VIDEO", x)
x <- gsub("\\bmoney\\w*\\b", "MONEY", x)
x <- gsub("\\bdolla\\w*\\b", "MONEY", x)
x <- gsub("\\bcash\\b", "MONEY", x)
}
t1 = Sys.time()
df_test$review_base <- sapply(df_test$review, process_text)
time <- round(difftime(Sys.time(), t1, units = 'sec'), 2)
test_results[test_results$package == 'base' &
test_results$piping == 0 &
test_results$parallel == 0,]$time <- as.numeric(time)
time
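A single Sys.time() difference is a fairly coarse measure of a single run; if more stable numbers were needed, each variant could be repeated with the microbenchmark package (a sketch, assuming the package is installed). I stick with single runs here since the differences turn out to be large.
# Sketch: time the base-package UDF over a few repetitions instead of one
library(microbenchmark)
microbenchmark(
  base_no_pipe = sapply(df_test$review, process_text),
  times = 3
)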
Unsurprisingly, parallelization of the same code greatly speeds up the processing.
require(parallel)
n_cores <- detectCores() - 1
cl <- makeCluster(n_cores)
t1 = Sys.time()
df_test$review_2 <- parSapply(cl, df_test$review, process_text)
stopCluster(cl)
time <- round(difftime(Sys.time(), t1, units = 'sec'), 2)
test_results[test_results$package == 'base' &
test_results$piping == 0 &
test_results$parallel == 1,]$time <- as.numeric(time)
time
cat('Output identical to standard: ', identical(df_test$review_base, df_test$review_2))
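As an aside, on Linux or macOS the fork-based parallel::mclapply() avoids the explicit cluster setup and teardown (forking is not available on Windows, which is why I stick with the cluster approach throughout). A sketch; note that sapply()/parSapply() keep the input strings as element names while mclapply() does not, so a direct identical() comparison against the standard column would need unname().
if (.Platform$OS.type == "unix") {
  # fork-based alternative; result is an unnamed character vector
  review_fork <- unlist(mclapply(df_test$review, process_text, mc.cores = n_cores))
}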
Compared to parallelization, piping with the magrittr forward pipe operator has the opposite effect on processing time, increasing it significantly. Note also that I had trouble fitting some of the base package calls into the pipe (see the unpiped lines in the UDF), so piping is used here only where it worked.
process_text <- function(x){
require(magrittr)
x <- gsub("[^A-Za-z0-9 ]", " ", x) # alphanumeric and spaces only
x <- tolower(x) # to lowercase
x <- x %>% gsub("\\s+", " ", .) %>% # simplify and trim spaces
gsub("^ ", "", .) %>%
gsub(" $", "", .) %>%
gsub("thank y*o*u","thanks", .) # reformat select expressions
x <- gsub("\\b(\\w+)n t\\b","\\1nt", x)
x %>% gsub("\\bfu*ck.*\\b", "PROFANITY", .) %>%
#gsub("\\bfu*ck.*\\b", "PROFANITY", .) %>%
gsub("\\bshi*t+y*\\b", "PROFANITY", .) %>%
gsub("\\bcrap+.*\\b", "PROFANITY", .) %>%
gsub("\\bass\\b", "PROFANITY", .) %>%
gsub("\\basses\\b", "PROFANITY", .) %>%
gsub("\\bda+mn[a-z]*\\b", "PROFANITY", .) %>%
gsub("\\bple+a+s+e+\\b", "POLITE", .) %>%
gsub("\\bthanks+\\b", "POLITE", .) %>%
gsub("\\bcould\\b", "POLITE", .) %>%
gsub("\\bwould\\b", "POLITE", .) %>%
gsub("\\bvideo\\w*\\b", "VIDEO", .) %>%
gsub("\\bvid[sz]*\\b", "VIDEO", .) %>%
gsub("\\bmoney\\w*\\b", "MONEY", .) %>%
gsub("\\bdolla\\w*\\b", "MONEY", .) %>%
gsub("\\bcash\\b", "MONEY", .)
}
t1 = Sys.time()
df_test$review_2 <- sapply(df_test$review, process_text)
time <- round(difftime(Sys.time(), t1, units = 'sec'), 2)
test_results[test_results$package == 'base' &
test_results$piping == 1 &
test_results$parallel == 0,]$time <- as.numeric(time)
time
cat('Output identical to standard: ', identical(df_test$review_base, df_test$review_2))
Combining piping and parallelization, we see that parallelization greatly speeds up the piped construction as well, so much so that the piped UDF now runs in roughly half the time it took without parallelization.
n_cores <- detectCores() - 1
cl <- makeCluster(n_cores)
t1 = Sys.time()
df_test$review_2 <- parSapply(cl, df_test$review, process_text)
stopCluster(cl)
time <- round(difftime(Sys.time(), t1, units = 'sec'), 2)
test_results[test_results$package == 'base' &
test_results$piping == 1 &
test_results$parallel == 1,]$time <- as.numeric(time)
time
cat('Output identical to standard: ', identical(df_test$review_base, df_test$review_2))
The issues with back references that arise when using the base package gsub function with magrittr pipes can be alleviated by using the regexPipes package. The result is only slightly slower than the base package functions with pipes (and the workarounds where they were needed).
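Judging from the calls in the UDF below, the regexPipes wrappers take the character vector as their first argument (unlike base gsub, where the pattern comes first), so the piped value lands in the right slot without the dot placeholder. A toy one-liner:
require(magrittr)
"Thank you!!" %>% regexPipes::gsub("[^A-Za-z ]", " ") %>% tolower()
# [1] "thank you  "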
process_text <- function(x){
require(magrittr)
require(regexPipes)
x %>% gsub("[^A-Za-z0-9 ]", " ") %>% # alphanumeric and spaces only; uses regexPipes::gsub
tolower %>%
gsub("\\s+", " ") %>% # simplify and trim spaces
gsub("^ ", "") %>%
gsub(" $", "") %>%
gsub("thank y*o*u","thanks") %>% # reformat select expressions
gsub("\\b(\\w+)n t\\b", "\\1nt") %>%
gsub("\\bfu*ck.*\\b", "PROFANITY") %>%
gsub("\\bshi*t+y*\\b", "PROFANITY") %>%
gsub("\\bcrap+.*\\b", "PROFANITY") %>%
gsub("\\bass\\b", "PROFANITY") %>%
gsub("\\basses\\b", "PROFANITY") %>%
gsub("\\bda+mn[a-z]*\\b", "PROFANITY") %>%
gsub("\\bple+a+s+e+\\b", "POLITE") %>%
gsub("\\bthanks+\\b", "POLITE") %>%
gsub("\\bcould\\b", "POLITE") %>%
gsub("\\bwould\\b", "POLITE") %>%
gsub("\\bvideo\\w*\\b", "VIDEO") %>%
gsub("\\bvid[sz]*\\b", "VIDEO") %>%
gsub("\\bmoney\\w*\\b", "MONEY") %>%
gsub("\\bdolla\\w*\\b", "MONEY") %>%
gsub("\\bcash\\b", "MONEY")
}
t1 = Sys.time()
df_test$review_2 <- sapply(df_test$review, process_text)
time <- round(difftime(Sys.time(), t1, units = 'sec'), 2)
test_results[test_results$package == 'regexPipes' &
test_results$piping == 1 &
test_results$parallel == 0,]$time <- as.numeric(time)
time
cat('Output identical to standard: ', identical(df_test$review_base, df_test$review_2))
n_cores <- detectCores() - 1
cl <- makeCluster(n_cores)
t1 = Sys.time()
df_test$review_2 <- parSapply(cl, df_test$review, process_text)
stopCluster(cl)
time <- round(difftime(Sys.time(), t1, units = 'sec'), 2)
test_results[test_results$package == 'regexPipes' &
test_results$piping == 1 &
test_results$parallel == 1,]$time <- as.numeric(time)
time
cat('Output identical to standard: ', identical(df_test$review_base, df_test$review_2))
Another option is to use the stri_replace_all_regex and stri_trans_tolower functions from the stringi package, which uses the ICU regex engine (note the different syntax for the back reference: "\1" in the replacement becomes "$1" under ICU).
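A quick illustration of this difference on a toy string:
require(stringi)
# base gsub (default regex engine) uses \1 in the replacement ...
gsub("\\b(\\w+)n t\\b", "\\1nt", "it wasn t bad")
# ... while stringi's ICU engine expects $1
stri_replace_all_regex("it wasn t bad", "\\b(\\w+)n t\\b", "$1nt")
# both return "it wasnt bad"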
process_text <- function(x){
require(stringi)
x <- stri_replace_all_regex(x, "[^A-Za-z0-9 ]", " ") # alphanumeric and spaces only
x <- stri_trans_tolower(x) # to lowercase
x <- stri_replace_all_regex(x, "\\s+", " ") # simplify and trim spaces
x <- stri_replace_all_regex(x, "^ ", "")
x <- stri_replace_all_regex(x, " $", "")
x <- stri_replace_all_regex(x, "thank y*o*u","thanks") # reformat select expressions
x <- stri_replace_all_regex(x, "\\b(\\w+)n t\\b","$1nt")
x <- stri_replace_all_regex(x, "\\bfu*ck.*\\b", "PROFANITY")
x <- stri_replace_all_regex(x, "\\bshi*t+y*\\b", "PROFANITY")
x <- stri_replace_all_regex(x, "\\bcrap+.*\\b", "PROFANITY")
x <- stri_replace_all_regex(x, "\\bass\\b", "PROFANITY")
x <- stri_replace_all_regex(x, "\\basses\\b", "PROFANITY")
x <- stri_replace_all_regex(x, "\\bda+mn[a-z]*\\b", "PROFANITY")
x <- stri_replace_all_regex(x, "\\bple+a+s+e+\\b", "POLITE")
x <- stri_replace_all_regex(x, "\\bthanks+\\b", "POLITE")
x <- stri_replace_all_regex(x, "\\bcould\\b", "POLITE")
x <- stri_replace_all_regex(x, "\\bwould\\b", "POLITE")
x <- stri_replace_all_regex(x, "\\bvideo\\w*\\b", "VIDEO")
x <- stri_replace_all_regex(x, "\\bvid[sz]*\\b", "VIDEO")
x <- stri_replace_all_regex(x, "\\bmoney\\w*\\b", "MONEY")
x <- stri_replace_all_regex(x, "\\bdolla\\w*\\b", "MONEY")
x <- stri_replace_all_regex(x, "\\bcash\\b", "MONEY")
}
t1 = Sys.time()
df_test$review_2 <- sapply(df_test$review, process_text)
time <- round(difftime(Sys.time(), t1, units = 'sec'), 2)
test_results[test_results$package == 'stringi' &
test_results$piping == 0 &
test_results$parallel == 0,]$time <- as.numeric(time)
time
cat('Output identical to standard: ', identical(df_test$review_base, df_test$review_2))
require(parallel)
n_cores <- detectCores() - 1
cl <- makeCluster(n_cores)
t1 = Sys.time()
df_test$review_2 <- parSapply(cl, df_test$review, process_text)
stopCluster(cl)
time <- round(difftime(Sys.time(), t1, units = 'sec'), 2)
test_results[test_results$package == 'stringi' &
test_results$piping == 0 &
test_results$parallel == 1,]$time <- as.numeric(time)
time
cat('Output identical to standard: ', identical(df_test$review_base, df_test$review_2))
The stringi functions were also tested with the magrittr forward pipe operator. Once again, piping increased the processing time relative to the multiple-assignment version.
process_text <- function(x){
require(stringi)
require(magrittr)
x %>% stri_replace_all_regex("[^A-Za-z0-9 ]", " ") %>% # alphanumeric and spaces only
stri_trans_tolower() %>% # to lowercase
stri_replace_all_regex("\\s+", " ") %>% # simplify and trim spaces
stri_replace_all_regex("^ ", "") %>%
stri_replace_all_regex(" $", "") %>%
stri_replace_all_regex("thank y*o*u","thanks") %>% # reformat select expressions
stri_replace_all_regex("\\b(\\w+)n t\\b", "$1nt") %>%
stri_replace_all_regex("\\bfu*ck.*\\b", "PROFANITY") %>%
stri_replace_all_regex("\\bshi*t+y*\\b", "PROFANITY") %>%
stri_replace_all_regex("\\bcrap+.*\\b", "PROFANITY") %>%
stri_replace_all_regex("\\bass\\b", "PROFANITY") %>%
stri_replace_all_regex("\\basses\\b", "PROFANITY") %>%
stri_replace_all_regex("\\bda+mn[a-z]*\\b", "PROFANITY") %>%
stri_replace_all_regex("\\bple+a+s+e+\\b", "POLITE") %>%
stri_replace_all_regex("\\bthanks+\\b", "POLITE") %>%
stri_replace_all_regex("\\bcould\\b", "POLITE") %>%
stri_replace_all_regex("\\bwould\\b", "POLITE") %>%
stri_replace_all_regex("\\bvideo\\w*\\b", "VIDEO") %>%
stri_replace_all_regex("\\bvid[sz]*\\b", "VIDEO") %>%
stri_replace_all_regex("\\bmoney\\w*\\b", "MONEY") %>%
stri_replace_all_regex("\\bdolla\\w*\\b", "MONEY") %>%
stri_replace_all_regex("\\bcash\\b", "MONEY")
}
t1 = Sys.time()
df_test$review_2 <- sapply(df_test$review, process_text)
time <- round(difftime(Sys.time(), t1, units = 'sec'), 2)
test_results[test_results$package == 'stringi' &
test_results$piping == 1 &
test_results$parallel == 0,]$time <- as.numeric(time)
time
cat('Output identical to standard: ', identical(df_test$review_base, df_test$review_2))
cl <- makeCluster(n_cores)
t1 = Sys.time()
df_test$review_2 <- parSapply(cl, df_test$review, process_text)
stopCluster(cl)
time <- round(difftime(Sys.time(), t1, units = 'sec'), 2)
test_results[test_results$package == 'stringi' &
test_results$piping == 1 &
test_results$parallel == 1,]$time <- as.numeric(time)
time
cat('Output identical to standard: ', identical(df_test$review_base, df_test$review_2))
The tidyverse package stringr provides wrappers for the stringi regex functions; in this case I used str_replace_all and str_to_lower. Even with the multiple-assignment structure and no piping, the stringr functions take significantly longer to run than either the base package functions or those of stringi itself.
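One stringr convenience worth noting: str_replace_all() also accepts a named character vector of pattern/replacement pairs, so in principle many of the single-pattern calls below could be collapsed into one call (a sketch, not what I benchmark here):
require(stringr)
replacements <- c("\\bcash\\b"      = "MONEY",
                  "\\bdolla\\w*\\b" = "MONEY",
                  "\\bcould\\b"     = "POLITE",
                  "\\bwould\\b"     = "POLITE")
str_replace_all("i could use the cash", replacements)
# [1] "i POLITE use the MONEY"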
process_text <- function(x){
require(stringr)
x <- str_replace_all(x, "[^A-Za-z0-9 ]", " ") # alphanumeric and spaces only
x <- str_to_lower(x) # to lowercase
x <- str_replace_all(x, "\\s+", " ") # simplify and trim spaces
x <- str_replace_all(x, "^ ", "")
x <- str_replace_all(x, " $", "")
x <- str_replace_all(x, "thank y*o*u","thanks") # reformat select expressions
x <- str_replace_all(x, "\\b(\\w+)n t\\b","\\1nt")
x <- str_replace_all(x, "\\bfu*ck.*\\b", "PROFANITY")
x <- str_replace_all(x, "\\bshi*t+y*\\b", "PROFANITY")
x <- str_replace_all(x, "\\bcrap+.*\\b", "PROFANITY")
x <- str_replace_all(x, "\\bass\\b", "PROFANITY")
x <- str_replace_all(x, "\\basses\\b", "PROFANITY")
x <- str_replace_all(x, "\\bda+mn[a-z]*\\b", "PROFANITY")
x <- str_replace_all(x, "\\bple+a+s+e+\\b", "POLITE")
x <- str_replace_all(x, "\\bthanks+\\b", "POLITE")
x <- str_replace_all(x, "\\bcould\\b", "POLITE")
x <- str_replace_all(x, "\\bwould\\b", "POLITE")
x <- str_replace_all(x, "\\bvideo\\w*\\b", "VIDEO")
x <- str_replace_all(x, "\\bvid[sz]*\\b", "VIDEO")
x <- str_replace_all(x, "\\bmoney\\w*\\b", "MONEY")
x <- str_replace_all(x, "\\bdolla\\w*\\b", "MONEY")
x <- str_replace_all(x, "\\bcash\\b", "MONEY")
}
t1 = Sys.time()
df_test$review_2 <- sapply(df_test$review, process_text)
time <- round(difftime(Sys.time(), t1, units = 'sec'), 2)
test_results[test_results$package == 'stringr' &
test_results$piping == 0 &
test_results$parallel == 0,]$time <- as.numeric(time)
time
cat('Output identical to standard: ', identical(df_test$review_base, df_test$review_2))
cl <- makeCluster(n_cores)
t1 = Sys.time()
df_test$review_2 <- parSapply(cl, df_test$review, process_text)
stopCluster(cl)
time <- round(difftime(Sys.time(), t1, units = 'sec'), 2)
test_results[test_results$package == 'stringr' &
test_results$piping == 0 &
test_results$parallel == 1,]$time <- as.numeric(time)
time
cat('Output identical to standard: ', identical(df_test$review_base, df_test$review_2))
The longest processing time results when stringr is used with piping. As with the stringi functions, the stringr calls drop straight into a magrittr pipeline (the same forward pipe operator used throughout dplyr), which has the advantage of being faster to write, but the piped version once again runs more slowly than the same function written with multiple assignments.
process_text <- function(x){
require(stringr)
require(magrittr)
x %>% str_replace_all("[^A-Za-z0-9 ]", " ") %>% # alphanumeric and spaces only
str_to_lower() %>% # to lowercase
str_replace_all("\\s+", " ") %>% # simplify and trim spaces
str_replace_all("^ ", "") %>%
str_replace_all(" $", "") %>%
str_replace_all("thank y*o*u","thanks") %>% # reformat select expressions
str_replace_all("\\b(\\w+)n t\\b","\\1nt") %>%
str_replace_all("\\bfu*ck.*\\b", "PROFANITY") %>%
str_replace_all("\\bshi*t+y*\\b", "PROFANITY") %>%
str_replace_all("\\bcrap+.*\\b", "PROFANITY") %>%
str_replace_all("\\bass\\b", "PROFANITY") %>%
str_replace_all("\\basses\\b", "PROFANITY") %>%
str_replace_all("\\bda+mn[a-z]*\\b", "PROFANITY") %>%
str_replace_all("\\bple+a+s+e+\\b", "POLITE") %>%
str_replace_all("\\bthanks+\\b", "POLITE") %>%
str_replace_all("\\bcould\\b", "POLITE") %>%
str_replace_all("\\bwould\\b", "POLITE") %>%
str_replace_all("\\bvideo\\w*\\b", "VIDEO") %>%
str_replace_all("\\bvid[sz]*\\b", "VIDEO") %>%
str_replace_all("\\bmoney\\w*\\b", "MONEY") %>%
str_replace_all("\\bdolla\\w*\\b", "MONEY") %>%
str_replace_all("\\bcash\\b", "MONEY")
}
t1 = Sys.time()
df_test$review_2 <- sapply(df_test$review, process_text)
time <- round(difftime(Sys.time(), t1, units = 'sec'), 2)
test_results[test_results$package == 'stringr' &
test_results$piping == 1 &
test_results$parallel == 0,]$time <- as.numeric(time)
time
cat('Output identical to standard: ', identical(df_test$review_base, df_test$review_2))
cl <- makeCluster(n_cores)
t1 = Sys.time()
df_test$review_2 <- parSapply(cl, df_test$review, process_text)
stopCluster(cl)
time <- round(difftime(Sys.time(), t1, units = 'sec'), 2)
test_results[test_results$package == 'stringr' &
test_results$piping == 1 &
test_results$parallel == 1,]$time <- as.numeric(time)
time
cat('Output identical to standard: ', identical(df_test$review_base, df_test$review_2))
The stringi functions with a series of assignments are the package-and-structure combination that runs the fastest, though the base package functions in the same structure are nearly as fast (and have much shorter function names, which are faster to type). Further, since the multiple-assignment structure can be produced quickly with find-and-replace using a regex (simply insert 'x <- ' at the beginning of each line), the time saved by typing less when piping seems to be entirely negated by the substantially longer run times. For me, then, since I'm partial to the syntax of the base functions, the base functions with multiple assignments look like they will be my default package and strategy.
library(ggplot2)
library(repr)
options(repr.plot.width=7, repr.plot.height=3.5)
ggplot(test_results, aes(as.factor(piping), time,
shape = as.factor(parallel),
color = as.factor(parallel),
label = time)) +
geom_point(size = 2.5) + facet_grid(.~package) + ylim(0,75) +
xlab("Piping") + ylab("Time (s)") +
ggtitle("Execution Time by Package") +
scale_color_manual(values=c("red","blue"), name="Parallel", labels = c("0","1")) +
scale_shape_manual(values = c(16, 17), name="Parallel", labels= c("0","1")) +
geom_text(aes(label=time),size = 3, hjust=-0.1, vjust=-0.5, show.legend = F) +
theme(strip.text = element_text(face = "bold"), plot.title = element_text(face = "bold"))
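For a quick tabular view to go with the plot, the recorded times can simply be sorted fastest-first:
# recorded execution times, fastest first
test_results[order(test_results$time), ]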