I process text data frequently, typically using regular expressions to tailor the changes that get made. However, I haven't quite settled on which R package I prefer to use. I've been switching between the functions of the base package (such as grep and gsub) and those of regexPipes, stringi, and stringr. I'm particularly interested in how long these functions take to run, since I want to minimize the wait for the data to be transformed and available for use. To better understand which of these packages accomplishes the task fastest, I wrote a user-defined function (UDF) for each package in question and compared the time each took to format the same data in the same way. While I was at it, I also timed these UDFs when they were written (where possible) with two different structures (repeatedly overwriting a single local variable with function results vs. pipelining the results using the magrittr forward pipe operator), as well as executing the UDFs with and without parallel processing. Ultimately I found that the stringi functions executed the fastest, followed by those of the base package. Further, I found that piping increased the processing time and, unsurprisingly, that parallelization decreased it.
I tested the functions by running them over the text in the 'review' field of the movie review data set, available both on Bo Pang's page at Cornell and in the text2vec package. The data consists of instance ids, the sentiment labels, and the actual text of the reviews.
df_test <- read.csv("movie_review.csv", header = T, stringsAsFactors = F)
colnames(df_test)
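If the CSV isn't handy, the same data ships with the text2vec package; the following assumes text2vec is installed and still bundles the data set under the name movie_review.
library(text2vec)
data("movie_review")        # bundled copy of the same data set
df_test <- movie_review
colnames(df_test)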
The values of the 'review' field consist of the full, unprocessed text of the reviews.
df_test[2,]$review
Altogether, the text of the 'review' field for the entire data frame constitutes just over 7 megabytes of data.
object.size(df_test$review)
For me, processing text almost invariably involves, at a minimum, reducing the text to alphanumeric characters and spaces, collapsing repeated spaces, trimming leading and trailing spaces, and replacing select string patterns using regular expressions. As such, the UDFs I wrote include all of these operations. In particular, and in the following order, each UDF: (1) replaces everything except alphanumeric characters and spaces with a space, (2) converts the text to lowercase, (3) collapses repeated whitespace and trims leading and trailing spaces, and (4) replaces selected string patterns, normalizing a few expressions (e.g. "thank you" to "thanks", contractions split by the cleanup) and mapping profanity, politeness markers, and a few topic words to standardized tokens (PROFANITY, POLITE, VIDEO, MONEY).
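As a quick illustration of these steps in order, here they are applied to a made-up string (not taken from the data set):
s <- "  Thank you!!  This movie wasn't worth the cash...  "
s <- gsub("[^A-Za-z0-9 ]", " ", s)        # 1. alphanumeric characters and spaces only
s <- tolower(s)                           # 2. lowercase
s <- gsub("\\s+", " ", s)                 # 3. collapse repeated whitespace...
s <- gsub("^ ", "", s)                    #    ...and trim leading
s <- gsub(" $", "", s)                    #    and trailing spaces
s <- gsub("thank y*o*u", "thanks", s)     # 4. replace selected patterns
s <- gsub("\\b(\\w+)n t\\b", "\\1nt", s)  #    (contractions split by the cleanup)
s <- gsub("\\bcash\\b", "MONEY", s)
s
# [1] "thanks this movie wasnt worth the MONEY"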
Further, I wanted to be absolutely sure that the time I was seeing for each UDF represented the time it took for each to transform the input into exactly the same output. To reassure myself that the function outputs were the same without having to look at the text itself, I took the results of the first text processing function I tested to be the standard output (saved in the data frame column "review_base") and compared them to the results of all subsequent functions (each saved in the column "review_2") using R's identical function.
df_test$review_base <- df_test$review
df_test$review_2 <- df_test$review
identical(df_test$review_base, df_test$review_2)
As can be seen from the following example, this should reliably indicate whether the function outputs are precisely the same, since changing a single character in a single element of one of the two columns is enough to make identical return FALSE.
df_test[1,]$review_2 <- sub("[aeiou]", "b", df_test[1,]$review_2)
identical(df_test$review_base, df_test$review_2)
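If identical() ever comes back FALSE during the tests, a quick way to locate the offending rows before digging into the text is a simple element-wise comparison (a hypothetical diagnostic; I did not need it for the runs below):
# which rows differ between the standard output and the latest test output?
which(df_test$review_base != df_test$review_2)
# here it flags only row 1, the element altered above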
Lastly, I worked through the packages by starting with the base package functions and then proceeding alphabetically through the others: regexPipes, then stringi, and finally stringr. For each package I first tested what I felt was the most ordinary UDF structure: a series of function calls, each overwriting a single local variable with its result. This structure was tested first without parallelization and then with it. Then, for the same package, the UDF with the piping structure was tested, again first without and then with parallelization. (The exception is regexPipes, which exists specifically to support piping, so only its piped form was tested.)
To record the execution times for all of the UDF tests in one place, I created a data frame with the package names, boolean values indicating whether or not the test utilized piping or parallelization, and a column for the execution time (initially recorded as 0 for all tests).
test_results <- data.frame(package = c(rep('base', 4), rep('regexPipes', 2),
                                       rep('stringi', 4), rep('stringr', 4)),
                           stringsAsFactors = FALSE)
test_results$piping <- c(0,0,1,1,1,1,0,0,1,1,0,0,1,1)
test_results$parallel <- rep(c(0,1), 7)
test_results$time <- 0
test_results
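The bookkeeping used below to store each result (subsetting test_results by package, piping, and parallel) gets repetitive; it could be wrapped in a small helper along the lines of the hypothetical record_time() sketched here. I keep the explicit subsetting in what follows so that each step stands on its own.
record_time <- function(results, pkg, piping, parallel, elapsed) {
  # write an elapsed time into the row matching the package/piping/parallel combination
  idx <- results$package == pkg &
    results$piping == piping &
    results$parallel == parallel
  results[idx, "time"] <- as.numeric(elapsed)
  results
}
# e.g. test_results <- record_time(test_results, "base", 0, 0, time)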
I began with what I took to be the most ordinary combination of package and UDF structure: the base package functions, no piping, and no parallelization. The specific base package functions used were gsub and tolower.
process_text <- function(x){
x <- gsub("[^A-Za-z0-9 ]", " ", x) # alphanumeric and spaces only
x <- tolower(x) # to lowercase
x <- gsub("\\s+", " ", x) # simplify and trim spaces
x <- gsub("^ ", "", x)
x <- gsub(" $", "", x)
x <- gsub("thank y*o*u","thanks", x) # replace select expressions
x <- gsub("\\b(\\w+)n t\\b","\\1nt", x)
x <- gsub("\\bfu*ck.*\\b", "PROFANITY", x)
x <- gsub("\\bshi*t+y*\\b", "PROFANITY", x)
x <- gsub("\\bcrap+.*\\b", "PROFANITY", x)
x <- gsub("\\bass\\b", "PROFANITY", x)
x <- gsub("\\basses\\b", "PROFANITY", x)
x <- gsub("\\bda+mn[a-z]*\\b", "PROFANITY", x)
x <- gsub("\\bple+a+s+e+\\b", "POLITE", x)
x <- gsub("\\bthanks+\\b", "POLITE", x)
x <- gsub("\\bcould\\b", "POLITE", x)
x <- gsub("\\bwould\\b", "POLITE", x)
x <- gsub("\\bvideo\\w*\\b", "VIDEO", x)
x <- gsub("\\bvid[sz]*\\b", "VIDEO", x)
x <- gsub("\\bmoney\\w*\\b", "MONEY", x)
x <- gsub("\\bdolla\\w*\\b", "MONEY", x)
x <- gsub("\\bcash\\b", "MONEY", x)
}
t1 = Sys.time()
df_test$review_base <- sapply(df_test$review, process_text)
time <- round(difftime(Sys.time(), t1, units = 'sec'), 2)
test_results[test_results$package == 'base' &
test_results$piping == 0 &
test_results$parallel == 0,]$time <- as.numeric(time)
time
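A single Sys.time() difference is a fairly coarse measure of a single run; if more stable numbers were needed, each variant could be repeated with the microbenchmark package (a sketch, assuming the package is installed). I stick with single runs here since the differences turn out to be large.
# Sketch: time the base-package UDF over a few repetitions instead of one
library(microbenchmark)
microbenchmark(
  base_no_pipe = sapply(df_test$review, process_text),
  times = 3
)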
Unsurprisingly, parallelization of the same code greatly speeds up the processing.
require(parallel)
n_cores <- detectCores() - 1
cl <- makeCluster(n_cores)
t1 = Sys.time()
df_test$review_2 <- parSapply(cl, df_test$review, process_text)
stopCluster(cl)
time <- round(difftime(Sys.time(), t1, units = 'sec'), 2)
test_results[test_results$package == 'base' &
test_results$piping == 0 &
test_results$parallel == 1,]$time <- as.numeric(time)
time
cat('Output identical to standard: ', identical(df_test$review_base, df_test$review_2))
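As an aside, on Linux or macOS the fork-based parallel::mclapply() avoids the explicit cluster setup and teardown (forking is not available on Windows, which is why I stick with the cluster approach throughout). A sketch; note that sapply()/parSapply() keep the input strings as element names while mclapply() does not, so a direct identical() comparison against the standard column would need unname().
if (.Platform$OS.type == "unix") {
  # fork-based alternative; result is an unnamed character vector
  review_fork <- unlist(mclapply(df_test$review, process_text, mc.cores = n_cores))
}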
Compared to parallelization, piping with the magrittr forward pipe operator has the opposite effect on processing time, increasing it significantly. Note also that I had trouble fitting some of the base package calls into the pipe (see the unpiped lines in the UDF), so piping is used here only where it worked.
process_text <- function(x){
require(magrittr)
x <- gsub("[^A-Za-z0-9 ]", " ", x) # alphanumeric and spaces only
x <- tolower(x) # to lowercase
x <- x %>% gsub("\\s+", " ", .) %>% # simplify and trim spaces
gsub("^ ", "", .) %>%
gsub(" $", "", .) %>%
gsub("thank y*o*u","thanks", .) # reformat select expressions
x <- gsub("\\b(\\w+)n t\\b","\\1nt", x)
x %>% gsub("\\bfu*ck.*\\b", "PROFANITY", .) %>%
#gsub("\\bfu*ck.*\\b", "PROFANITY", .) %>%
gsub("\\bshi*t+y*\\b", "PROFANITY", .) %>%
gsub("\\bcrap+.*\\b", "PROFANITY", .) %>%
gsub("\\bass\\b", "PROFANITY", .) %>%
gsub("\\basses\\b", "PROFANITY", .) %>%
gsub("\\bda+mn[a-z]*\\b", "PROFANITY", .) %>%
gsub("\\bple+a+s+e+\\b", "POLITE", .) %>%
gsub("\\bthanks+\\b", "POLITE", .) %>%
gsub("\\bcould\\b", "POLITE", .) %>%
gsub("\\bwould\\b", "POLITE", .) %>%
gsub("\\bvideo\\w*\\b", "VIDEO", .) %>%
gsub("\\bvid[sz]*\\b", "VIDEO", .) %>%
gsub("\\bmoney\\w*\\b", "MONEY", .) %>%
gsub("\\bdolla\\w*\\b", "MONEY", .) %>%
gsub("\\bcash\\b", "MONEY", .)
}
t1 = Sys.time()
df_test$review_2 <- sapply(df_test$review, process_text)
time <- round(difftime(Sys.time(), t1, units = 'sec'), 2)
test_results[test_results$package == 'base' &
test_results$piping == 1 &
test_results$parallel == 0,]$time <- as.numeric(time)
time
cat('Output identical to standard: ', identical(df_test$review_base, df_test$review_2))
Combining piping and parallelization, we see that parallelization greatly speeds up the piped construction as well, so much so that the piped UDF now runs in roughly half the time it took without parallelization.
n_cores <- detectCores() - 1
cl <- makeCluster(n_cores)
t1 = Sys.time()
df_test$review_2 <- parSapply(cl, df_test$review, process_text)
stopCluster(cl)
time <- round(difftime(Sys.time(), t1, units = 'sec'), 2)
test_results[test_results$package == 'base' &
test_results$piping == 1 &
test_results$parallel == 1,]$time <- as.numeric(time)
time
cat('Output identical to standard: ', identical(df_test$review_base, df_test$review_2))
The issues with back references that arise when using the base package gsub function with magrittr pipes can be alleviated by using the regexPipes package. The result is only slightly slower than the base package functions with pipes (and the workarounds where they were needed).
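Judging from the calls in the UDF below, the regexPipes wrappers take the character vector as their first argument (unlike base gsub, where the pattern comes first), so the piped value lands in the right slot without the dot placeholder. A toy one-liner:
require(magrittr)
"Thank you!!" %>% regexPipes::gsub("[^A-Za-z ]", " ") %>% tolower()
# [1] "thank you  "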
process_text <- function(x){
require(magrittr)
require(regexPipes)
x %>% gsub("[^A-Za-z0-9 ]", " ") %>% # alphanumeric and spaces only; uses regexPipes::gsub
tolower %>%
gsub("\\s+", " ") %>% # simplify and trim spaces
gsub("^ ", "") %>%
gsub(" $", "") %>%
gsub("thank y*o*u","thanks") %>% # reformat select expressions
gsub("\\b(\\w+)n t\\b", "\\1nt") %>%
gsub("\\bfu*ck.*\\b", "PROFANITY") %>%
gsub("\\bshi*t+y*\\b", "PROFANITY") %>%
gsub("\\bcrap+.*\\b", "PROFANITY") %>%
gsub("\\bass\\b", "PROFANITY") %>%
gsub("\\basses\\b", "PROFANITY") %>%
gsub("\\bda+mn[a-z]*\\b", "PROFANITY") %>%
gsub("\\bple+a+s+e+\\b", "POLITE") %>%
gsub("\\bthanks+\\b", "POLITE") %>%
gsub("\\bcould\\b", "POLITE") %>%
gsub("\\bwould\\b", "POLITE") %>%
gsub("\\bvideo\\w*\\b", "VIDEO") %>%
gsub("\\bvid[sz]*\\b", "VIDEO") %>%
gsub("\\bmoney\\w*\\b", "MONEY") %>%
gsub("\\bdolla\\w*\\b", "MONEY") %>%
gsub("\\bcash\\b", "MONEY")
}
t1 = Sys.time()
df_test$review_2 <- sapply(df_test$review, process_text)
time <- round(difftime(Sys.time(), t1, units = 'sec'), 2)
test_results[test_results$package == 'regexPipes' &
test_results$piping == 1 &
test_results$parallel == 0,]$time <- as.numeric(time)
time
cat('Output identical to standard: ', identical(df_test$review_base, df_test$review_2))
n_cores <- detectCores() - 1
cl <- makeCluster(n_cores)
t1 = Sys.time()
df_test$review_2 <- parSapply(cl, df_test$review, process_text)
stopCluster(cl)
time <- round(difftime(Sys.time(), t1, units = 'sec'), 2)
test_results[test_results$package == 'regexPipes' &
test_results$piping == 1 &
test_results$parallel == 1,]$time <- as.numeric(time)
time
cat('Output identical to standard: ', identical(df_test$review_base, df_test$review_2))
Another option is to use the stri_replace_all_regex and stri_trans_tolower functions from the stringi package, which uses the ICU regex engine (note the different syntax for the back reference: "\1" in the replacement becomes "$1" under ICU).
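A quick illustration of this difference on a toy string:
require(stringi)
# base gsub (default regex engine) uses \1 in the replacement ...
gsub("\\b(\\w+)n t\\b", "\\1nt", "it wasn t bad")
# ... while stringi's ICU engine expects $1
stri_replace_all_regex("it wasn t bad", "\\b(\\w+)n t\\b", "$1nt")
# both return "it wasnt bad"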
process_text <- function(x){
require(stringi)
x <- stri_replace_all_regex(x, "[^A-Za-z0-9 ]", " ") # alphanumeric and spaces only
x <- stri_trans_tolower(x) # to lowercase
x <- stri_replace_all_regex(x, "\\s+", " ") # simplify and trim spaces
x <- stri_replace_all_regex(x, "^ ", "")
x <- stri_replace_all_regex(x, " $", "")
x <- stri_replace_all_regex(x, "thank y*o*u","thanks") # reformat select expressions
x <- stri_replace_all_regex(x, "\\b(\\w+)n t\\b","$1nt")
x <- stri_replace_all_regex(x, "\\bfu*ck.*\\b", "PROFANITY")
x <- stri_replace_all_regex(x, "\\bshi*t+y*\\b", "PROFANITY")
x <- stri_replace_all_regex(x, "\\bcrap+.*\\b", "PROFANITY")
x <- stri_replace_all_regex(x, "\\bass\\b", "PROFANITY")
x <- stri_replace_all_regex(x, "\\basses\\b", "PROFANITY")
x <- stri_replace_all_regex(x, "\\bda+mn[a-z]*\\b", "PROFANITY")
x <- stri_replace_all_regex(x, "\\bple+a+s+e+\\b", "POLITE")
x <- stri_replace_all_regex(x, "\\bthanks+\\b", "POLITE")
x <- stri_replace_all_regex(x, "\\bcould\\b", "POLITE")
x <- stri_replace_all_regex(x, "\\bwould\\b", "POLITE")
x <- stri_replace_all_regex(x, "\\bvideo\\w*\\b", "VIDEO")
x <- stri_replace_all_regex(x, "\\bvid[sz]*\\b", "VIDEO")
x <- stri_replace_all_regex(x, "\\bmoney\\w*\\b", "MONEY")
x <- stri_replace_all_regex(x, "\\bdolla\\w*\\b", "MONEY")
x <- stri_replace_all_regex(x, "\\bcash\\b", "MONEY")
}
t1 = Sys.time()
df_test$review_2 <- sapply(df_test$review, process_text)
time <- round(difftime(Sys.time(), t1, units = 'sec'), 2)
test_results[test_results$package == 'stringi' &
test_results$piping == 0 &
test_results$parallel == 0,]$time <- as.numeric(time)
time
cat('Output identical to standard: ', identical(df_test$review_base, df_test$review_2))
require(parallel)
n_cores <- detectCores() - 1
cl <- makeCluster(n_cores)
t1 = Sys.time()
df_test$review_2 <- parSapply(cl, df_test$review, process_text)
stopCluster(cl)
time <- round(difftime(Sys.time(), t1, units = 'sec'), 2)
test_results[test_results$package == 'stringi' &
test_results$piping == 0 &
test_results$parallel == 1,]$time <- as.numeric(time)
time
cat('Output identical to standard: ', identical(df_test$review_base, df_test$review_2))
The stringi functions were also tested with the magrittr forward pipe operator. Once again, piping increased the processing time relative to the multiple-assignment version.
process_text <- function(x){
require(stringi)
require(magrittr)
x %>% stri_replace_all_regex("[^A-Za-z0-9 ]", " ") %>% # alphanumeric and spaces only
stri_trans_tolower() %>% # to lowercase
stri_replace_all_regex("\\s+", " ") %>% # simplify and trim spaces
stri_replace_all_regex("^ ", "") %>%
stri_replace_all_regex(" $", "") %>%
stri_replace_all_regex("thank y*o*u","thanks") %>% # reformat select expressions
stri_replace_all_regex("\\b(\\w+)n t\\b", "$1nt") %>%
stri_replace_all_regex("\\bfu*ck.*\\b", "PROFANITY") %>%
stri_replace_all_regex("\\bshi*t+y*\\b", "PROFANITY") %>%
stri_replace_all_regex("\\bcrap+.*\\b", "PROFANITY") %>%
stri_replace_all_regex("\\bass\\b", "PROFANITY") %>%
stri_replace_all_regex("\\basses\\b", "PROFANITY") %>%
stri_replace_all_regex("\\bda+mn[a-z]*\\b", "PROFANITY") %>%
stri_replace_all_regex("\\bple+a+s+e+\\b", "POLITE") %>%
stri_replace_all_regex("\\bthanks+\\b", "POLITE") %>%
stri_replace_all_regex("\\bcould\\b", "POLITE") %>%
stri_replace_all_regex("\\bwould\\b", "POLITE") %>%
stri_replace_all_regex("\\bvideo\\w*\\b", "VIDEO") %>%
stri_replace_all_regex("\\bvid[sz]*\\b", "VIDEO") %>%
stri_replace_all_regex("\\bmoney\\w*\\b", "MONEY") %>%
stri_replace_all_regex("\\bdolla\\w*\\b", "MONEY") %>%
stri_replace_all_regex("\\bcash\\b", "MONEY")
}
t1 = Sys.time()
df_test$review_2 <- sapply(df_test$review, process_text)
time <- round(difftime(Sys.time(), t1, units = 'sec'), 2)
test_results[test_results$package == 'stringi' &
test_results$piping == 1 &
test_results$parallel == 0,]$time <- as.numeric(time)
time
cat('Output identical to standard: ', identical(df_test$review_base, df_test$review_2))
cl <- makeCluster(n_cores)
t1 = Sys.time()
df_test$review_2 <- parSapply(cl, df_test$review, process_text)
stopCluster(cl)
time <- round(difftime(Sys.time(), t1, units = 'sec'), 2)
test_results[test_results$package == 'stringi' &
test_results$piping == 1 &
test_results$parallel == 1,]$time <- as.numeric(time)
time
cat('Output identical to standard: ', identical(df_test$review_base, df_test$review_2))
The tidyverse package stringr provides wrappers for the stringi regex functions; in this case I used str_replace_all and str_to_lower. Even with the multiple-assignment structure and no piping, the stringr functions take significantly longer to run than either the base package functions or those of stringi itself.
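One stringr convenience worth noting: str_replace_all() also accepts a named character vector of pattern/replacement pairs, so in principle many of the single-pattern calls below could be collapsed into one call (a sketch, not what I benchmark here):
require(stringr)
replacements <- c("\\bcash\\b"      = "MONEY",
                  "\\bdolla\\w*\\b" = "MONEY",
                  "\\bcould\\b"     = "POLITE",
                  "\\bwould\\b"     = "POLITE")
str_replace_all("i could use the cash", replacements)
# [1] "i POLITE use the MONEY"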
process_text <- function(x){
require(stringr)
x <- str_replace_all(x, "[^A-Za-z0-9 ]", " ") # alphanumeric and spaces only
x <- str_to_lower(x) # to lowercase
x <- str_replace_all(x, "\\s+", " ") # simplify and trim spaces
x <- str_replace_all(x, "^ ", "")
x <- str_replace_all(x, " $", "")
x <- str_replace_all(x, "thank y*o*u","thanks") # reformat select expressions
x <- str_replace_all(x, "\\b(\\w+)n t\\b","\\1nt")
x <- str_replace_all(x, "\\bfu*ck.*\\b", "PROFANITY")
x <- str_replace_all(x, "\\bshi*t+y*\\b", "PROFANITY")
x <- str_replace_all(x, "\\bcrap+.*\\b", "PROFANITY")
x <- str_replace_all(x, "\\bass\\b", "PROFANITY")
x <- str_replace_all(x, "\\basses\\b", "PROFANITY")
x <- str_replace_all(x, "\\bda+mn[a-z]*\\b", "PROFANITY")
x <- str_replace_all(x, "\\bple+a+s+e+\\b", "POLITE")
x <- str_replace_all(x, "\\bthanks+\\b", "POLITE")
x <- str_replace_all(x, "\\bcould\\b", "POLITE")
x <- str_replace_all(x, "\\bwould\\b", "POLITE")
x <- str_replace_all(x, "\\bvideo\\w*\\b", "VIDEO")
x <- str_replace_all(x, "\\bvid[sz]*\\b", "VIDEO")
x <- str_replace_all(x, "\\bmoney\\w*\\b", "MONEY")
x <- str_replace_all(x, "\\bdolla\\w*\\b", "MONEY")
x <- str_replace_all(x, "\\bcash\\b", "MONEY")
}
t1 = Sys.time()
df_test$review_2 <- sapply(df_test$review, process_text)
time <- round(difftime(Sys.time(), t1, units = 'sec'), 2)
test_results[test_results$package == 'stringr' &
test_results$piping == 0 &
test_results$parallel == 0,]$time <- as.numeric(time)
time
cat('Output identical to standard: ', identical(df_test$review_base, df_test$review_2))
cl <- makeCluster(n_cores)
t1 = Sys.time()
df_test$review_2 <- parSapply(cl, df_test$review, process_text)
stopCluster(cl)
time <- round(difftime(Sys.time(), t1, units = 'sec'), 2)
test_results[test_results$package == 'stringr' &
test_results$piping == 0 &
test_results$parallel == 1,]$time <- as.numeric(time)
time
cat('Output identical to standard: ', identical(df_test$review_base, df_test$review_2))
The longest processing time results when stringr is used with piping. As with the stringi functions, the stringr calls drop straight into a magrittr pipeline (the same forward pipe operator used throughout dplyr), which has the advantage of being faster to write, but the piped version once again runs more slowly than the same function written with multiple assignments.
process_text <- function(x){
require(stringr)
require(magrittr)
x %>% str_replace_all("[^A-Za-z0-9 ]", " ") %>% # alphanumeric and spaces only
str_to_lower() %>% # to lowercase
str_replace_all("\\s+", " ") %>% # simplify and trim spaces
str_replace_all("^ ", "") %>%
str_replace_all(" $", "") %>%
str_replace_all("thank y*o*u","thanks") %>% # reformat select expressions
str_replace_all("\\b(\\w+)n t\\b","\\1nt") %>%
str_replace_all("\\bfu*ck.*\\b", "PROFANITY") %>%
str_replace_all("\\bshi*t+y*\\b", "PROFANITY") %>%
str_replace_all("\\bcrap+.*\\b", "PROFANITY") %>%
str_replace_all("\\bass\\b", "PROFANITY") %>%
str_replace_all("\\basses\\b", "PROFANITY") %>%
str_replace_all("\\bda+mn[a-z]*\\b", "PROFANITY") %>%
str_replace_all("\\bple+a+s+e+\\b", "POLITE") %>%
str_replace_all("\\bthanks+\\b", "POLITE") %>%
str_replace_all("\\bcould\\b", "POLITE") %>%
str_replace_all("\\bwould\\b", "POLITE") %>%
str_replace_all("\\bvideo\\w*\\b", "VIDEO") %>%
str_replace_all("\\bvid[sz]*\\b", "VIDEO") %>%
str_replace_all("\\bmoney\\w*\\b", "MONEY") %>%
str_replace_all("\\bdolla\\w*\\b", "MONEY") %>%
str_replace_all("\\bcash\\b", "MONEY")
}
t1 = Sys.time()
df_test$review_2 <- sapply(df_test$review, process_text)
time <- round(difftime(Sys.time(), t1, units = 'sec'), 2)
test_results[test_results$package == 'stringr' &
test_results$piping == 1 &
test_results$parallel == 0,]$time <- as.numeric(time)
time
cat('Output identical to standard: ', identical(df_test$review_base, df_test$review_2))
cl <- makeCluster(n_cores)
t1 = Sys.time()
df_test$review_2 <- parSapply(cl, df_test$review, process_text)
stopCluster(cl)
time <- round(difftime(Sys.time(), t1, units = 'sec'), 2)
test_results[test_results$package == 'stringr' &
test_results$piping == 1 &
test_results$parallel == 1,]$time <- as.numeric(time)
time
cat('Output identical to standard: ', identical(df_test$review_base, df_test$review_2))
The stringi functions with a series of assignments are the package-and-structure combination that runs the fastest, though the base package functions in the same structure are nearly as fast (and have much shorter function names, which are faster to type). Further, since the multiple-assignment structure can be produced quickly with find-and-replace using a regex (simply insert 'x <- ' at the beginning of each line), the time saved by typing less when piping seems to be entirely negated by the substantially longer run times. For me, then, since I'm partial to the syntax of the base functions, the base functions with multiple assignments look like they will be my default package and strategy.
library(ggplot2)
library(repr)
options(repr.plot.width=7, repr.plot.height=3.5)
ggplot(test_results, aes(as.factor(piping), time,
shape = as.factor(parallel),
color = as.factor(parallel),
label = time)) +
geom_point(size = 2.5) + facet_grid(.~package) + ylim(0,75) +
xlab("Piping") + ylab("Time (s)") +
ggtitle("Execution Time by Package") +
scale_color_manual(values=c("red","blue"), name="Parallel", labels = c("0","1")) +
scale_shape_manual(values = c(16, 17), name="Parallel", labels= c("0","1")) +
geom_text(aes(label=time),size = 3, hjust=-0.1, vjust=-0.5, show.legend = F) +
theme(strip.text = element_text(face = "bold"), plot.title = element_text(face = "bold"))
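For a quick tabular view to go with the plot, the recorded times can simply be sorted fastest-first:
# recorded execution times, fastest first
test_results[order(test_results$time), ]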