Given my educational background in cognitive linguistics, I’m convinced of the value of information beyond simple words and their stems. For one, human communication is highly context-dependent: humans are agile communicators because they have knowledge about the situation and the entities involved, allowing for clear and quick understanding of the meanings of words. Human communicators are also sensitive to word choice. For example, English speakers will typically understand that contractions like “wasn’t” are less formal than the fully expanded equivalent “was not”, and that text combining letters, numbers, and symbols to represent words instead of spelling them out, as in “gr8” for “great”, is more informal still. Given that human language is so sensitive to nuance, I suspect that the more formulaic nature of spam (here, YouTube comments posted automatically by bots) means that the differences between the language used in spam comments and in ham comments go beyond simple word choice, even if I’m unsure which category, spam or non-spam (ham), will exhibit which particular feature. In particular, I suspect that there are noticeable differences in the rate at which symbols or numeric characters are used, in the length of the comments themselves, and, more obviously, in the presence or absence of hyperlinks to other pages and sites. Further, I suspect that spam and ham will relate differently to contextual information, such as the performer of the video to which the comments are posted. Perhaps videos from one performer or another attract a particular kind of spam, or particularly lewd comments.
Given these subtleties in language use, then, is it possible to build an adequate model for spam identification from a moderate number of hand-selected parameters: simple measurements of the language used (such as word count or word length), contextual information (like the names of the videos’ performers), and a few particularly salient lexical terms? To answer this question I used a set of 1,956 YouTube comments already marked as spam or ham, created features based on my hypotheses about the potential patterns in the language used, and developed a predictive model using logistic regression with elastic net regularization. To test the power of such context-based features, I not only judged the adequacy of the final model by its performance in predicting the comment class of the test set, I also built a second model with an approximately equal number of features, based entirely on highly frequent but informative word stems, to use as a benchmark for model performance. Ultimately I found that both models fit the data well, but the model based on context-informed features outperformed the one based on frequent stems alone.
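For orientation, the modeling step described above amounts to something like the following sketch with the glmnet package. This is only an illustration, assuming a “training” data frame of the hand-built features plus the IS_SPAM label (the construction of those features occupies the rest of this write-up); it is not the actual model-fitting code.
require(glmnet)
# sketch only: 'training' is assumed to hold the finished features and the IS_SPAM label
x <- model.matrix(IS_SPAM ~ ., data = training)[, -1]   # numeric model matrix, intercept dropped
y <- training$IS_SPAM
# alpha between 0 and 1 sets the elastic net mix of ridge and lasso penalties;
# cv.glmnet chooses the penalty strength lambda by cross-validation
fit <- cv.glmnet(x, y, family = "binomial", alpha = 0.5, type.measure = "class")
head(predict(fit, newx = x, s = "lambda.min", type = "class"))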
For the data, I used the “YouTube Spam Collection”, available both from the UC Irvine Machine Learning Repository and from the website of the data set’s contributors, Tiago A. Almeida and Tulio C. Alberto. The set contains 1,956 public comments posted to five music videos that were highly viewed during the same time period: “Gangnam Style” by Psy, “Roar” by Katy Perry, “Party Rock Anthem” by LMFAO, “Love The Way You Lie” by Eminem, and “Waka Waka (This Time for Africa)” by Shakira. In addition to the comments themselves, the set contains for each comment the comment ID, the author, the date and time of posting, and the categorization as spam or non-spam (‘ham’) made by the set’s authors.
Since I was working directly with text, I first checked the document encoding using the stringi package and confirmed that the files were UTF-8 encoded.
require(readr)
require(xml2)
require(dplyr)
require(sqldf)
require(stringi)
require(stringr)
require(tm)
require(wordcloud)
require(ggplot2)
source("Multiplot.R")
stri_enc_detect(stri_read_raw("YouTube_spam/Youtube01-Psy.csv"))
## [[1]]
## [[1]]$Encoding
## [1] "UTF-8" "ISO-8859-1" "ISO-8859-2" "UTF-16BE" "UTF-16LE"
## [6] "Shift_JIS" "ISO-8859-9" "IBM424_rtl"
##
## [[1]]$Language
## [1] "" "en" "ro" "" "" "ja" "tr" "he"
##
## [[1]]$Confidence
## [1] 1.00 0.33 0.11 0.10 0.10 0.10 0.06 0.01
Because I was using a Windows machine and could potentially have problems when reading UTF-8 encoded data with the base function, I used the read_csv function from the readr package. As can be seen from the comment text read in using the base function (first sample, below), Unicode characters aren’t converted helpfully: for example, the Unicode byte order mark (BOM), U+FEFF, appears as the mojibake ‘ï»¿’. In the second sample (from data read in using readr), the BOMs remain as Unicode code points.
psy_readr <- read_csv("YouTube_spam/Youtube01-Psy.csv", locale = locale(encoding = "UTF-8"))
psy_base <- read.csv(file="YouTube_spam/Youtube01-Psy.csv", header = T,stringsAsFactors = F)
psy_base[225:230, ]$CONTENT
## [1] "prehistoric song..has been"
## [2] "You think you're smart? Headbutt your face."
## [3] "DISLIKE.. Now one knows REAL music - ex. Enimen "
## [4] "Loool nice song funny how no one understands (me) and we love it"
## [5] "Like if you came here too see how many views this song has."
## [6] "We pray for you Little Psy â<U+0099>¡ï»¿"
psy_readr[225:230, ]$CONTENT
## [1] "prehistoric song..has been<U+FEFF>"
## [2] "You think you're smart? Headbutt your face.<U+FEFF>"
## [3] "DISLIKE.. Now one knows REAL music - ex. Enimen <U+FEFF>"
## [4] "Loool nice song funny how no one understands (me) and we love it<U+FEFF>"
## [5] "Like if you came here too see how many views this song has.<U+FEFF>"
## [6] "We pray for you Little Psy <U+2661><U+FEFF>"
Further, even when UTF-8 encoding is successfully achieved with readr::read_csv, the BOMs (UTF-8: U+FEFF) embedded within the comments aren’t removed, even though the function silently strips a BOM at the start of a file. After finding that opening and re-saving the files as UTF-8 without signature in Emacs also failed to remove these BOMs, the only solution I found was to open the files “literally” in Emacs using the command find-file-literally and then use search-and-replace to remove all of the BOMs in each of the csv files. We can see from the comment text of the final imported version that Unicode characters are preserved and that the BOMs have been removed.
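As an aside, the stray BOMs could likely also be stripped in R itself after import; a minimal sketch (untested against these files), assuming the readr-imported data frame shown above:
# strip any embedded byte order marks (U+FEFF) from the comment text
psy_readr$CONTENT <- stringr::str_replace_all(psy_readr$CONTENT, "\uFEFF", "")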
psy <- read_csv("YouTube_spam/Youtube01-Psy_emacs.csv", locale = locale(encoding = "UTF-8"))
katy <- read_csv("YouTube_spam/Youtube02-KatyPerry_emacs.csv", locale = locale(encoding = "UTF-8"))
lmfao <- read_csv(file="YouTube_spam/Youtube03-LMFAO_emacs.csv", locale = locale(encoding = "UTF-8"))
eminem <- read_csv(file="YouTube_spam/Youtube04-Eminem_emacs.csv", locale = locale(encoding = "UTF-8"))
shakira <- read_csv(file="YouTube_spam/Youtube05-Shakira_emacs.csv", locale = locale(encoding = "UTF-8"))
psy[230, ]$CONTENT
## [1] "We pray for you Little Psy <U+2661>"
Before creating a master data frame by combining the five data sets, I created several new columns that would have the same value for every observation from a given video: the video the comment was posted to (labeled with the artist’s name) and the gender of the performer (determined from basic Wikipedia searches). This additional information was to be tested as a feature of the predictive model, since particular videos may attract more spam, or more of a certain kind of spam, than others.
psy$VIDEO <- rep('psy')
katy$VIDEO <- rep('katyperry')
lmfao$VIDEO <- rep('lmfao')
eminem$VIDEO <- rep('eminem')
shakira$VIDEO <- rep('shakira')
psy$A_ISMALE <- rep(1)
eminem$A_ISMALE <- rep(1)
katy$A_ISMALE <- rep(0)
shakira$A_ISMALE <- rep(0)
lmfao$A_ISMALE <- rep(1)
I then created a master data frame by combining those for the individual videos, renamed the ‘CLASS’ column to “IS_SPAM” to make the meaning of its binary values more transparent, and finally converted the features “IS_SPAM” and “VIDEO” to factors.
nc <- ncol(psy)
df <- rbind(psy[,2:nc],katy[,2:nc], lmfao[,2:nc], eminem[,2:nc], shakira[,2:nc])
cnames <- colnames(df)
cnames[4] <- "IS_SPAM"
colnames(df) <- cnames
df$IS_SPAM <- as.factor(df$IS_SPAM)
df$VIDEO <- as.factor(df$VIDEO)
One last step of text clean-up was necessary: converting HTML character entities (like ‘&#…’) to Unicode text. Looking at the unique entities matched, we can see that there were several that needed to be removed or replaced.
html_chars <- str_extract_all(df$CONTENT, "&.{1,10}?;")
unique(unlist(html_chars))
## [1] "&" ">" "<" "'" """
The conversion of these HTML entities was achieved using the xml2 package and a function suggested on Stack Overflow. Following the conversion, none of these entities were found when the search was re-executed.
unescape_html <- function(str){xml2::xml_text(xml2::read_html(paste0("<x>", str, "</x>")))}
df$CONTENT <- sapply(df$CONTENT, unescape_html)
html_chars <- str_extract_all(df$CONTENT, "&.{1,10};")
unique(unlist(html_chars))
## character(0)
After the format of the data had been stabilized, I created a number of numeric features designed to measure the length of the comments in several different ways. The first was the simple character length of the comments. I also wanted to know whether spam would differ from ham in the proportion of alphabetic characters, numeric characters, or symbols, so I also calculated the lengths of the comments when reduced to each of these character classes. From those lengths I then calculated the percentage of the total character length that each character class made up.
df$LENGTH <- nchar(df$CONTENT)
df$AN_LENGTH <- nchar(gsub("[^a-zA-Z0-9]","", df$CONTENT))
df$A_LENGTH <- nchar(gsub("[^a-zA-Z]","", df$CONTENT))
df$N_LENGTH <- nchar(gsub("\\D","", df$CONTENT))
df$PERCENT_A <- df$A_LENGTH/df$LENGTH
df$PERCENT_N <- df$N_LENGTH/df$LENGTH
df$PERCENT_NONAN <- (df$LENGTH - df$AN_LENGTH)/df$LENGTH
In addition to measuring the length of the comments by the number of characters they contained, I also wanted to measure their lengths by token counts. However, given that these comments weren’t formal academic or professional text in which words were likely to be spelled out in purely alphabetic characters, I measured token counts at several levels of text normalization. First, since some concepts could be expressed by single or combined symbols or numbers, or by combining non-alphabetic characters with alphabetic ones, I calculated “raw” word counts over the raw comment text. Before tokenizing, I collapsed all runs of whitespace to a single space and trimmed the leading and trailing spaces from the comment strings. Then, I used the tm package’s tokenizer and the base length function to count the words in each comment.
df$CONTENT <- sapply(df$CONTENT, function(x) stringr::str_replace_all(x,"[\\s]+", " "))
df$CONTENT <- sapply(df$CONTENT, function(x) stringi::stri_trim_both(x, pattern="\\P{Wspace}"))
df$RAW_WORDS <- sapply(df$CONTENT, tm::scan_tokenizer)
df$RAW_WC <- sapply(df$RAW_WORDS, length)
I also calculated word counts for slightly more normalized text that included alphanumeric characters but not punctuation or other symbols. For this token count I first converted the strings to lower case, replaced punctuation and other non-alphanumeric characters with single spaces, and once again condensed and trimmed the whitespace before tokenizing.
df$NORM_CHAR <- sapply(df$CONTENT, stringr::str_to_lower)
df$NORMALIZED <- sapply(df$NORM_CHAR, function(x) stringr::str_replace_all(x,"[^a-z0-9 ]", " "))
df$NORMALIZED <- sapply(df$NORMALIZED, function(x) stringr::str_replace_all(x,"[\\s]+", " "))
df$NORMALIZED <- sapply(df$NORMALIZED, function(x) stringi::stri_trim_both(x, pattern="\\P{Wspace}"))
df$NORM_WORDS <- sapply(df$NORMALIZED, tm::scan_tokenizer)
Then, using the text that still included numbers as ‘words’ (though not punctuation or symbols), I calculated the mean, minimum, and maximum word lengths for each comment, believing that either spam or ham might tend towards longer, more complex words, or else towards more abbreviations and shorthand, and would therefore exhibit some difference in word length.
word_lengths <- function(text_list, type = "mean"){
require(stringr)
temp <- sapply(text_list, function(x) stringr::str_replace_all(x,"[^A-Za-z0-9]",""))
if (length(temp) == 0){return(0)}
lengths <- sapply(temp, nchar)
if (type == "mean"){output <- mean(lengths)}
if (type == "max"){output <- max(lengths)}
if (type == "min"){output <- min(lengths)}
return(output)
}
df$ALPHANUM_WC <- sapply(df$NORM_WORDS, length)
df$MAX_WLEN <- sapply(df$NORM_WORDS, function(x) word_lengths(x, "max"))
df$MIN_WLEN <- sapply(df$NORM_WORDS, function(x) word_lengths(x, "min"))
df$MEAN_WLEN <- sapply(df$NORM_WORDS, function(x) word_lengths(x, "mean"))
Finally, I finished normalizing the comment text and calculated the third and final word count. To complete the normalization I removed numbers as well as the strings “http(s)”, “www”, and “com”, since I wanted the entirety of a link to count as a single “word”, and since the presence of links would be encoded in a separate feature. Having completed the text normalization, word counts were again calculated using the tokenization and length functions.
df$FULL_NORM <- sapply(df$NORMALIZED, tm::removeNumbers)
df$FULL_NORM <- sapply(df$FULL_NORM, function(x) stringr::str_replace_all(x,"https*"," "))
df$FULL_NORM <- sapply(df$FULL_NORM, function(x) stringr::str_replace_all(x,"www"," "))
df$FULL_NORM <- sapply(df$FULL_NORM, function(x) stringr::str_replace_all(x,"com"," "))
df$FULL_NORM <- sapply(df$FULL_NORM, function(x) stringr::str_replace_all(x," +"," "))
df$FULL_NORM <- sapply(df$FULL_NORM, function(x) stringi::stri_trim_both(x, pattern="\\P{Wspace}"))
df$NORM_WORDS <- sapply(df$FULL_NORM, tm::scan_tokenizer)
df$NORM_WC <- sapply(df$NORM_WORDS, length)
I wanted not only to test the developed model on data separate from that used for model fitting, but also to build the word-, theme-, and grammar-based features from lexical frequencies computed on the training set alone. Because of this need for completely separate training and testing data before calculating lexical frequencies, I split the data into two distinct sets before conducting the text exploration. Of note, although I used only the training data for the text exploration, the features based on my observations had to be created for all of the data, training and testing, so I created the features in the main data frame and split the data once again at the time of model fitting and testing.
df$assignment <- 0
df$assignment <- runif(nrow(df), 0, 1)
df$assignment <- ifelse(df$assignment <= .75, 0, 1)
dataSplit <- function(){
# use global assignment so the split data frames are available to later chunks
training <<- df[df$assignment == 0, ]
testing <<- df[df$assignment == 1, ]
}
dataSplit()
In addition to varying in length by character and word counts, I suspected that spam and ham comments would also diverge in the topics they discussed and the words and grammatical patterns they used to do so. First and foremost, I suspected that spam was probably advertisement-like and therefore more likely to contain hyperlinks meant to direct the reader to a different page or site. Given this, I added a binary feature indicating the inclusion of a hyperlink in the comment text, detected roughly by the presence of ‘http:’, ‘https:’, or ‘www.’. Since these particular regular expressions look for punctuation, they were run over the raw comment text rather than either of the normalized versions. Also, since some comments would contain more than one of these constructions, the final binary value for each comment was calculated by summing all the matches and converting any value over 0 to 1. Finally, the column was converted to a factor.
df$link1 <- stringr::str_detect(df$CONTENT, "https?:")
df$link2 <- stringr::str_detect(df$CONTENT, "www\\.")
df$link1 <- ifelse(df$link1 == TRUE, 1, 0)
df$link2 <- ifelse(df$link2 == TRUE, 1, 0)
df$HAS_LINK <- rowSums(df[,c("link1","link2")])
df$HAS_LINK <- ifelse(df$HAS_LINK > 0, 1, 0)
df$HAS_LINK <- as.factor(df$HAS_LINK)
I wanted to create lexicon-based features, but first wanted to explore the vocabulary of the comments to see which words occurred more frequently in spam than in ham (and vice versa), as those words seemed the best candidates for association with one comment class or the other. Before creating corpora and document-term matrices for the data split by comment class, however, I created an abbreviated stopword list to be applied to the corpora. While I wanted to remove the least informative high-frequency words (e.g. articles like “the”), I didn’t want to remove all of the frequent words that a more aggressive stopword list would eliminate, thinking that some very common words, such as the modal verbs ‘could’ and ‘would’, or certain persons, numbers, and genders of personal pronouns like ‘he’, ‘she’, and ‘we’, might be used differently between spam and ham often enough to be noticeable.
After creating my abbreviated stopword list, I created two corpora from the fully normalized text data, stripping the whitespace and removing the stopwords. I then created the document-term matrices for each comment class, and reduced their sparsity by eliminating the most infrequent words from them.
exclusions <- c("the", "and", "or", "to", "for", "from", "s", "at", "with", "this", "that",
"it", "its", "is", "are", "am", "was", "be", "will", "but")
create_dtms <- function(){
# assign the document-term matrices to the global environment so later chunks can use them
corp_spam <- Corpus(VectorSource(training[training$IS_SPAM == 1,]$FULL_NORM))
corp_spam <- tm_map(corp_spam, stripWhitespace)
corp_spam <- tm_map(corp_spam, removeWords, exclusions)
dtm_spam <- DocumentTermMatrix(corp_spam)
dtm_spam <<- removeSparseTerms(dtm_spam, 0.995)
corp_ham <- Corpus(VectorSource(training[training$IS_SPAM == 0,]$FULL_NORM))
corp_ham <- tm_map(corp_ham, stripWhitespace)
corp_ham <- tm_map(corp_ham, removeWords, exclusions)
dtm_ham <- DocumentTermMatrix(corp_ham)
dtm_ham <<- removeSparseTerms(dtm_ham, 0.995)
}
create_dtms()
Next, I created a separate data frame from the document-term matrices, using two intermediate data frames combined via a SQL query with the sqldf package. In addition to the words themselves, I selected for each word: its frequency in the spam corpus, its frequency in the ham corpus, how many more times it occurred in spam than in ham, how many more times it occurred in ham than in spam, and its total frequency in both corpora combined.
create_freqComp <- function(){
f_spam <- data.frame(sort(colSums(as.matrix(dtm_spam)), decreasing=TRUE))
f_spam <- tibble::rownames_to_column(f_spam, var="word")
f_ham <- data.frame(sort(colSums(as.matrix(dtm_ham)), decreasing=TRUE))
f_ham <- tibble::rownames_to_column(f_ham, var="word")
cnames <- c('word','frequency')
colnames(f_ham) <- cnames
colnames(f_spam) <- cnames
freq_compare <- sqldf("select word, SF, HF, (SF-HF) as DIFF_MORE_S, (HF-SF) as DIFF_MORE_H,
(SF+HF) as TOTAL from
(select word, coalesce(SF,0) as SF, coalesce(HF,0) as HF from
(select f_spam.word, f_spam.frequency as SF, f_ham.frequency as HF
from f_spam left outer join f_ham
on f_spam.word = f_ham.word
union
select f_ham.word, f_spam.frequency as SF, f_ham.frequency as HF
from f_ham left outer join f_spam
on f_ham.word = f_spam.word) x ) y;")
freq_compare[freq_compare$DIFF_MORE_S < 0,]$DIFF_MORE_S <- 0
freq_compare[freq_compare$DIFF_MORE_H < 0,]$DIFF_MORE_H <- 0
# make the finished comparison table available to later chunks
freq_compare <<- freq_compare
}
create_freqComp()
The resulting data frame looked like this:
head(freq_compare)
## word SF HF DIFF_MORE_S DIFF_MORE_H TOTAL
## 1 about 19 11 8 0 30
## 2 actually 4 4 0 0 8
## 3 adam 4 0 4 0 4
## 4 add 5 0 5 0 5
## 5 advertise 10 0 10 0 10
## 6 advertisements 6 0 6 0 6
From this it was easy to see how many words occur more often in each of the corpora, as well as how many occur equally often in both. We can see that a wider variety of words (at least words in the traditional sense, spelled out using alphabetic characters) occurred in the spam comments.
total <- nrow(freq_compare)
more_spam <- nrow(freq_compare[freq_compare$DIFF_MORE_S >0, ])
more_ham <- nrow(freq_compare[freq_compare$DIFF_MORE_H >0, ])
same_spam_ham <- nrow(freq_compare[freq_compare$DIFF_MORE_S == 0 & freq_compare$DIFF_MORE_H == 0,])
cat(paste("Total Words in Corpus: ", total,
"\nWords Occuring More Frequently in Spam: ", more_spam,
"\nWords Occuring More Frequently in Ham: ", more_ham,
"\nWords Occuring Same Number of Times in Both Spam and Ham: ", same_spam_ham, "\n"))
## Total Words in Corpus: 500
## Words Occuring More Frequently in Spam: 379
## Words Occuring More Frequently in Ham: 113
## Words Occuring Same Number of Times in Both Spam and Ham: 8
As I was looking for a small number of very salient features indicating a comment’s class as ham or spam rather than looking for the presence of all 500 words, I concentrated my efforts on the words that occur most often in one of the two corpora but not the other. Looking at these words, I then made generalizations about the types or meanings of the words, or any themes that emerged. I started this process with word clouds showing the top 45 words that occurred more frequently in the spam corpus than in the ham (left) and the top 45 that occurred more frequently in the ham corpus (right).
par(mfrow=c(1,2))
wordcloud::wordcloud(freq_compare$word, freq_compare$DIFF_MORE_S, max.words=45, colors=brewer.pal(1, "Dark2"))
wordcloud::wordcloud(freq_compare$word, freq_compare$DIFF_MORE_H, max.words=45, colors=brewer.pal(1, "Dark2"))
The initial glance at these word clouds, though, suggested two edits to make to the text of the corpora before drawing generalizations. The first is collapsing “thanks” and “thank you” into a single term, since they mean essentially the same thing and are used in the same way. Because both “thanks” and “thank” can be seen in the spam cloud (above left), as can “you”, collapsing “thanks” and “thank you” into the single term “thanks” gives a better picture of the frequency of “thanking” expressions and also reduces the count of “you” (which looms large in the word cloud because of its high frequency), giving a more accurate view of its use outside of “thanking” expressions.
The high frequency of “you” also suggests another edit to the words of the corpora: collapsing all versions of “youtube”. Since some comments may have spelled “youtube” as “you tube”, and since “youtube” as one word is itself highly frequent, collapsing all of the variations of “youtube” into one word gives a clearer picture of the frequency of the intended word “youtube”, in addition to potentially improving our understanding of the frequency of “you” used outside of “you tube”.
To collapse these spelling variations, I first checked the matches of a regular expression built to catch all of the variations of “youtube”. Looking at the following variations, I was comfortable assuming that, for the purposes of my feature building, they are essentially the same word referring to the same entity (YouTube). Even though some of these words (particularly “youtuber(s)”) are technically different, the most important aspect, the reference to YouTube, is still the same.
youtubes <- str_extract_all(df$FULL_NORM, "\\byou *tu\\w*\\b")
unique(unlist(youtubes))
## [1] "you tube" "youtube" "youtu" "youtuber" "youtubyou" "youtubese"
## [7] "youtubesi" "youtubers"
Unlike my confidence in generalizing over all of the possible variations of “youtube”, however, I wouldn’t make the same assumption for all of the “thank(-something)” expressions matched by the following regex.
thanks <- str_extract_all(df$FULL_NORM, "\\bthank\\w* *w*\\b")
unique(unlist(thanks))
## [1] "thanks" "thanks " "thank " "thankful " "thankfully "
## [6] "thankss"
Though ‘would be thankful’ essentially signals a polite plea to the reader just as “thanks” and “thank you” do, its use is more like that of “thankfully”, which is grammatically more complex than expressions like “thanks” and “thank you”. Further, since there is only one instance each of “thankful” and “thankfully” in the entire data set (shown below), finding and changing them wouldn’t have a meaningful impact on the distribution of “thanks” in the data, and that small effect wouldn’t be proportional to the time and effort spent writing and running the expressions. As such, I tailored my regular expression to change only “thank you” and “thankss” into “thanks”.
df[str_detect(df$FULL_NORM,"thankful"),]$CONTENT
## [1] "Katy perry is and inspirational singer her voice is awesome and I loved her so much . She achieved music history and I couldn't believe that . Guys if you could take 1min to go to my channel and watch my first video I would be greatly thankful for this :) ty guys N katy is still awesome xoxo"
## [2] "3 yrs ago I had a health scare but thankfully I<U+0092>m okay. I realized I wasn<U+0092>t living life to the fullest. Now I<U+0092>m on a mission to do EVERYTHING I<U+0092>ve always wanted to do. If you found out you were going to die tomorrow would you be happy with what you<U+0092>ve accomplished or would you regret not doing certain things? Sorry for spamming I<U+0092>m just trying to motivate people to do the things they<U+0092>ve always wanted to. If you<U+0092>re bored come see what I<U+0092>ve done so far! Almost 1000 subscribers and I just started!"
Lastly, I noted that “don” shows up in the spam word cloud (left). I was confident that this was simply “don’t” without the “t”, which is separated by a space given the pre-processing of the text in which punctuation and symbols were replaced with a space. Looking at the instances of words that end with “n” and are followed by a space and a lone “t” in the data, it looks like all except “n t” can be confidently collapsed with their alternative spellings which lack a space, as in “dont” and “wont”.
nt <- str_extract_all(df$FULL_NORM, "\\b\\w*n t\\b")
unique(unlist(nt))
## [1] "isn t" "don t" "won t" "wasn t" "doesn t" "can t"
## [7] "didn t" "couldn t" "wouldn t" "aren t" "weren t" "haven t"
## [13] "ain t" "n t"
nt <- str_extract_all(df$FULL_NORM, "\\bdont\\b|\\bwont\\b")
unique(unlist(nt))
## [1] "dont" "wont"
After pinpointing which words I wanted to collapse, I updated the fully normalized comments in the original dataframe itself with the following regular expressions.
df$FULL_NORM <- sapply(df$FULL_NORM, function(x) stringr::str_replace_all(x,"\\byou *tu\\w*\\b","youtube"))
df$FULL_NORM <- sapply(df$FULL_NORM, function(x) stringr::str_replace_all(x,"thank y*o*u","thanks"))
df$FULL_NORM <- sapply(df$FULL_NORM, function(x) stringr::str_replace_all(x,"\\b(\\w+)n t\\b","\\1nt"))
After recreating the corpora and the data frame I used to compare frequencies of words between the spam and ham comments, I recreated the word clouds and finally looked for generalizations in the words, grammar, and themes that I could model and include as features.
dataSplit()
create_dtms()
create_freqComp()
par(mfrow=c(1,1))
wordcloud(freq_compare$word, freq_compare$DIFF_MORE_S, max.words=45, colors=brewer.pal(1, "Dark2"))
Looking at the top words that occurred more often in spam than in ham in the word cloud above, I noticed several patterns. The first is that spam comments seem very polite, given the presence of polite words like “thanks” and, with even greater frequency (as indicated by the orange color and larger size of the text), “please”. As such, I included a “polite word” feature indicating whether one or both of these polite words occurs in a comment. Further, since polite speech often includes modal verbs like “would” and “could”, as in “could you pass the salt”, and since both of these words appear in the spam word cloud, I included a feature based on the presence of these two modal verbs in particular.
Additional generalizations can be made about the spam comments from the word cloud. The fact that “money” occurs much more frequently in spam than in ham suggests the need for a “money” feature targeting “money” and its (near-)synonyms like “cash” and “dollars”. The presence of “home”, “free”, “chance”, and “dont” in addition to “money” makes me suspect that the spam contains advertisements for free stuff, with the exhortation that viewers shouldn’t miss their chance at getting something, or else advertisements for work-from-home jobs. Given this, I included features for the specific words “free”, “chance”, and “dont”, as well as a feature for “home” combined with “homes”, “house”, and “houses”. It also appears that spam tends to mention other websites generally or by name (e.g. “website”, “facebook”), so features targeting “website” and its synonyms and “Facebook” and its alternate spellings were built as well.
Of particular note, spam appears to discuss YouTube channels and the things one does with them (“youtube”, “video”, “channel”, “subscribe”, “watch”) more frequently than ham does. As these words seem to come from the same situation or scene in the world (what, in cognitive linguistics, we’d call a semantic frame), I built features not only for the individual words, but also for this broader scene, combining terms like “youtube”, “subscribe”, and “channel” so I could compare the performance of a broad-themed feature to that of its component parts.
Similarly, I created a broad verb-type feature based on the verbs seen in the spam word cloud. Interestingly, not only are there many verbs, most of them are in the uninflected form (“check”, “watch”, “make”, “share”, “give”, and “take”). Though the uninflected form is used in a variety of ways in English, I believe, based on the other words present in the word cloud, that these uninflected verbs are imperatives, used to direct the reader to perform a particular action. Given this, I felt it was important to create features for these verbs as they occur in this particular form, rather than accommodating all the possible inflections, such as the third person singular present with a final ‘-s’. In addition to creating features for each of these verbs themselves, I created a “take action!” verbs feature that combined these terms so that I could once again compare a broad-themed feature to its component parts.
The contents of the ham word cloud, too, lend themselves to some generalizations. Whereas the spam comments appear to be rather politely worded, the ham word cloud below suggests that ham comments are instead quite rude, given the sheer number of profanities present. Accordingly, I included a “profanity” feature that indicates whether or not a comment contains a common profanity.
wordcloud(freq_compare$word, freq_compare$DIFF_MORE_H, max.words=45, colors=brewer.pal(1, "Dark2"))
Just as the spam demonstrated an affinity for certain categories of words that could be leveraged as model features, the ham comments did as well. For example, the ham comments mention large quantities (“million”, “billion”), and use question particles/subordinating conjunctions (“when”, “why”) and adverbs (“still”, “almost”, “ever”, “never”). Particularly prolific are adjectives expressing judgements of goodness (“good”, “bad”, “best”, “perfect”, “awesome”) and of beauty (“beautiful”, and potentially “looks”). There are also numerous exclamations (“wow”, “omg”, “lol”).
Another salient pattern in the ham word cloud is that the names of the performers and the titles of the songs of these particular videos are mentioned. As such, I created a feature that indicates whether a comment mentions the performer or the song of the video to which it is posted.
Lastly, there are similar and even related words split between the two word clouds. For example, “song” and “love” are seen more often in ham, while “music” and “like” are seen more often in spam. As such, these words were included individually as features rather than being collapsed into categories of related words. More generally, though, the fact that these similar words show different associations with the comment classes highlights that blind, automatic lumping of similar words together is potentially detrimental in trying to find features for text models.
Creating the “profanity” feature was the biggest undertaking of the feature building. First, I wanted an idea of the variety of profane words occurring in the data, so I created a filter to extract basic profanities. The filter used a number of regular expressions that accommodated some widespread and predictable spelling variations (such as “f ck”, which would be the result of our preprocessing if the word had been spelled “f*ck”), as well as different tenses and parts of speech (such as “f*cking” and “f*cker” in addition to “f*ck” itself). Also, since I wanted to know the variety of profanities occurring in the entire comment population, not just the spam, I created a full-population corpus to search within.
extract_words <- function(text, word){
words <- list()
words[length(words) + 1] <- stringr::str_extract(text, paste("\\b",word,"\\b", sep=""))
words}
corp <- Corpus(VectorSource(df[df$AN_LENGTH > 0,]$FULL_NORM))
corp <- tm_map(corp, stripWhitespace)
profanity <- c("fu*ck", "fu*cked", "fu*cking", "fu*ck[ie]n", "fu*cks", "fu*cker+", "fu*cker+s+",
"f ck", "f cked", "f cking", "f ck[ei]n", "f cks", "f cker", "f ckers",
"shi*t+", "sha*t+", "shi*t+y", "shi*t+ier", "shi*t+iest", "shi*t+ing", "shi*t+[ie]n",
"shi*t+s", "shi*t+ed", "shi*t+er", "shi*t+ers",
"sh t", "sh t+y", "sh t+ier", "sh t+iest", "sh t+ing", "sh t+[ei]n", "sh t+s",
"sh t+ed", "sh t+er", "sh t+ers",
"crap+", "crap+y", "crap+ier", "crap+iest", "crap+ed", "crap+ing",
"crap+[ei]n", "crap+s", "crap+er", "crap+ers",
"cr p", "cr p+y", "cr p+ier", "cr p+iest", "cr p+ed", "cr p+ing",
"cr p+[ei]n", "cr p+s", "cr p+er", "cr p+ers",
"ass", "asses", "assholes*", "asshats*", "assclowns*",
"da+mn", "damn[a-z]+")
fourletters <- list()
for (word in profanity){
fourletters[length(fourletters) + 1] <- extract_words(corp, word)
}
fourletters <- lapply(fourletters, function(x) x[! is.na(x)])
unlist(unique(fourletters))
## [1] "fuck" "fucked" "fucking" "fucken" "shit"
## [6] "shitty" "crap" "ass" "damn" "damnnnnnnnn"
Of note, this list, like that of the variations on ‘thanks’, is not comprehensive. For example, “fuq” of “why dafuq” in the example below could have been included. However, since I wanted to catch the most prolific examples of the most basic profane words, targeting the basic spelling and morphological variations, which are much more likely to occur than oddball variations, is not only sufficient, it is probably the most time- and resource-efficient approach.
fuq <- extract_words(corp, "\\b\\w*fuq\\w*\\b")
cat(paste(unlist(unique(fuq)), '\n\n',
df[stringr::str_detect(df$FULL_NORM, "\\b\\w*fuq\\w*\\b"), ]$CONTENT))
## dafuq
##
## Why dafuq is a Korean song so big in the USA. Does that mean we support Koreans? Last time I checked they wanted to bomb us.
Using the profanities found, a single binary ‘contains profanity’ feature was then created. To achieve this, all detected versions of profanity were searched for in the fully normalized comments. The list of regular expressions was limited to those attested in the data rather than all of the possible variations modeled in the filter, in order to speed processing, though the entire list could have been applied, with the non-attested regular expressions matching vacuously.
To create the feature, the results of the individual regex searches were saved in a data frame, binarized from logical “TRUE”/“FALSE”, and then summed by comment. Sums greater than 0 were converted to 1, and the results were finally added to the main data frame as the feature “PROFANITY”.
prep_detected <- function(lone_col){
lone_col <- as.data.frame(lone_col)
colnames(lone_col) <- "FULL_NORM"
lone_col$FULL_NORM <- as.character(lone_col$FULL_NORM)
lone_col}
detect <- function(strdata, words2find){
col_start = ncol(strdata) + 1
col = col_start
for (word in words2find){
strdata[,col] <- sapply(strdata[,1],
function(x) stringr::str_detect(x, paste("\\b",word,"\\b", sep="")))
col = col + 1}
max_col = col
col = col_start
while (col < max_col){
strdata[,col] <- ifelse(strdata[,col] == TRUE, 1, 0)
col = col + 1}
if (length(words2find) > 1){strdata$sum <- rowSums(strdata[,col_start:(max_col - 1)])}
else{strdata$sum <- strdata[,col_start]}
strdata$sum <- ifelse(strdata$sum > 0, 1, 0)
strdata$sum
}
profanity <- c("fu*ck", "fu*cked", "fu*cking", "fu*ck[ie]n",
"shi*t+", "shi*t+y",
"crap",
"ass",
"da+mn", "damn[a-z]+")
detected <- df$FULL_NORM
detected <- prep_detected(detected)
df$PROFANITY <- detect(detected, profanity)
The next most involved feature creation method was used for the features that indicate whether a comment mentions the name of the performer or the title of the song of the particular video it was posted to. Since I only wanted to count name and title mentions that were contextually appropriate, the regular expressions for the performers’ names and the song titles were searched over the corresponding subset of the data. For example, the name “psy” (plus the spelling variation where a final “s” is tacked on without an apostrophe) and the title words “gangnam” and “style” (plus the spelling variation with no spaces between the title’s words) were only searched for in comments made on Psy’s video.
df$ARTIST <- 0
df$TITLE <- 0
# psy
detected <- df[df$VIDEO == "psy",]$FULL_NORM
detected <- prep_detected(detected)
df[df$VIDEO == "psy",]$ARTIST <- detect(detected, c("psy", "psys+"))
df[df$VIDEO == "psy",]$TITLE <- detect(detected, c("gangnam", "gangnamstyle\\w*", "style"))
#katyperry
detected <- df[df$VIDEO == "katyperry",]$FULL_NORM
detected <- prep_detected(detected)
df[df$VIDEO == "katyperry",]$ARTIST <- detect(detected, c("katys*", "katie*s*", "katyper+ys*", "per+ys*"))
df[df$VIDEO == "katyperry",]$TITLE <- detect(detected, c("roar", "roar\\w*"))
#lmfao
detected <- df[df$VIDEO == "lmfao",]$FULL_NORM
detected <- prep_detected(detected)
df[df$VIDEO == "lmfao",]$ARTIST <- detect(detected, c("lmfao", "lmfaos+"))
df[df$VIDEO == "lmfao",]$TITLE <- detect(detected, c("party", "rock", "anthem", "partyrock\\w*", "rockanthem\\w*"))
#eminem
detected <- df[df$VIDEO == "eminem",]$FULL_NORM
detected <- prep_detected(detected)
df[df$VIDEO == "eminem",]$ARTIST <- detect(detected, c("eminem", "eminems+"))
df[df$VIDEO == "eminem",]$TITLE <- detect(detected, c("love", "u lie", "you lie", "lovetheway\\w*","lovethewayyoulie\\w*", "lovethewayulie\\w*"))
#shakira
detected <- df[df$VIDEO == "shakira",]$FULL_NORM
detected <- prep_detected(detected)
df[df$VIDEO == "shakira",]$ARTIST <- detect(detected, c("shakira", "shakiras+"))
df[df$VIDEO == "shakira",]$TITLE <- detect(detected, c("waka", "wakawaka\\w*", "this time", "time for africa", "thistimefor\\w*", "timeforafrica\\s*"))
The remaining features were created more simply. Based on my observations made earlier and discussed immediately beneath the word clouds, above, I wrote regular expressions to create my planned features, and then ran them over the fully normalized comments for the entire data set.
detected <- df$FULL_NORM
detected <- prep_detected(detected)
df$POLITE <- detect(detected, c("ple+a+s+e+", "thanks+"))
df$MODAL <- detect(detected, c("could", "would"))
df$SONG <- detect(detected, c("song", "songs+"))
df$MUSIC <- detect(detected, "music")
df$VID <- detect(detected, c("video\\w*", "vids", "vidz"))
df$CHECK <- detect(detected, "check")
df$MONEY <- detect(detected, c("money\\w*", "dollar\\w*", "cash"))
df$FREE <- detect(detected, "free")
df$HOME <- detect(detected, c("homes*", "houses*"))
df$CHANCE <- detect(detected, "chances*")
df$DONT <- detect(detected, "dont")
df$SITES <- detect(detected, c("websites*", "sites*", "pages*", "webpages*"))
df$FB <- detect(detected, c("facebooks*", "fb"))
df$VIEW <- detect(detected, "view")
df$VIEWS <- detect(detected, "views+")
df$WATCH <- detect(detected, "watch")
df$VISIT <- detect(detected, "visit")
df$PAN_ACTION <- detect(detected, c("visit", "watch", "view", "check"))
df$SUBSCIBE <- detect(detected, c("subsciptions*", "subscrib\\w*"))
df$CHANNEL <- detect(detected, "chan+els*")
df$YOUTUBE <- detect(detected, "youtu\\w*")
df$PAN_YOUTUBE <- detect(detected, c("subsciptions*", "subscrib\\w*", "chan+els*", "youtu\\w*"))
df$QUANTS <- detect(detected, c("\\w*hundred\\w*", "\\w*thousand\\w*", "\\w*million\\w*", "\\w*billion\\w*"))
df$EXCLAMATION <- detect(detected, c("wow", "lol", "ha", "haha\\w*", "hehe\\w*", "whoa", "omg"))
df$LIKE <- detect(detected, "like")
df$LOVE <- detect(detected, "love")
df$LOOKS <- detect(detected, c("beautiful", "pretty", "gorgeous", "ugly"))
df$POLARITY <- detect(detected, c("good", "bad", "best", "worst", "awesome", "horrible", "awful", "cool"))
df$QUESTIONS <- detect(detected, c("why", "who", "how", "what", "when", "where"))
df$TEMPORAL <- detect(detected, c("ago", "while", "until", "ago", "still", "ever", "never"))
df$THIRD <- detect(detected, c("he", "hes", "hed", "his", "she", "shes", "shed", "her", "hers"))
df$FIRSTPL <- detect(detected, c("we", "wed", "weve", "our", "ours"))
df$SECOND <- detect(detected, c("you", "youre*","yours", "youve"))
The last step of data preparation and feature building was selecting the label and the features to be used in the regression, and formatting them appropriately as either numeric values or factors. I did not include any version of the comment text or its words in this final data frame, as I had already extracted the desired information about the text through my feature building. However, as I would eventually use the text itself to develop the benchmark model based on token frequencies alone, I first created separate “text.train” and “text.test” data frames with the fully normalized comments before eliminating the text columns from the original data frame.
Further, some of the calculated columns were essentially redundant and so were also omitted from the final data frame. For example, “PERCENT_N” conveys the relationship between “N_LENGTH” and “LENGTH” for a comment; instead of retaining all three columns, I kept only “LENGTH” and “PERCENT_N” for modeling. Lastly, though figuring out which authors are associated with spam (essentially developing a blacklist) could be quite effective in determining which comments are spam, I wanted to test the ability of only context- and text-informed features to model spam, so the “AUTHOR” field was omitted from the final data frame.
text.train <- df[df$assignment == 0,] %>% select(IS_SPAM, FULL_NORM)
text.test <- df[df$assignment == 1,] %>% select(IS_SPAM, FULL_NORM)
df <- df %>% select(-AUTHOR, -DATE, -CONTENT, -A_LENGTH, -N_LENGTH, -AN_LENGTH, -link1, -link2, -RAW_WORDS, -NORMALIZED, -NORM_CHAR, -NORM_WORDS, -FULL_NORM)
df$A_ISMALE <- as.factor(df$A_ISMALE)
df[,15:50] <- lapply(df[,15:50] , factor) # profanity to end
After creating the final data frame, I examined plots and statistics for my proposed features and judged whether or not my hypotheses about the differences between spam and ham comments were borne out by the training data itself. Generally, I found that the features I created were helpful and identified parameters along which the spam and ham data differed to a statistically significant degree. For example, looking at the distributions of raw comment lengths measured in characters for all spam versus all ham (left plot, below), we can see that the distributions are not equal. The longer and thicker tail of the violin plot for spam suggests that spam comments vary more in length than ham comments do, and that they tend to be longer. The wider, flared base of the violin plot for ham, on the other hand, indicates that ham comments much more consistently tend to be shorter in raw character length. Further, we see from the right plot, below, that this difference in character length between spam and ham is even more stark when the comparison is made between spam and ham comments posted to the same video.
dataSplit()
training <- training %>% select(-assignment)
testing <- testing %>% select(-assignment)
a <- ggplot(training, aes(IS_SPAM, LENGTH)) +
geom_violin() +
ggtitle("Character Length of Comment\nby Class") +
xlab("Comment Class") + ylab("Comment Length in Characters") +
scale_x_discrete(breaks=c("0", "1"), labels=c("Ham", "Spam"))
b <- ggplot(training, aes(VIDEO, LENGTH, fill = IS_SPAM)) +
geom_violin() +
ggtitle("Character Length of Comment\nby Video and Class") +
xlab("Video") +
scale_fill_discrete(labels = c("Ham", "Spam")) +
guides(fill=guide_legend(title=NULL)) +
theme(axis.title.y = element_blank()) +
scale_x_discrete(breaks=c("eminem", "katyperry", "lmfao", "psy", "shakira"),
labels=c("Eminem", "Katy\nPerry", "LMFAO", "Psy", "Shakira"))
multiplot(a, b, cols = 2)
The differences between the distributions of character lengths for spam and ham were shown to be statistically significant with several tests. First, a t-test was applied. Though the distributions were not normal, given their very long one-sided tails, I still applied the t-test since the number of observations was large for both groups. As we can see from the results of the Welch two-sample t-test (p < 0.001), the two distributions have a statistically significant difference in means. Further, the results of a two-sample Wilcoxon rank sum (or “Mann-Whitney”) test (p < 0.001) also indicated a statistically significant difference in location, and those of a two-sample Kolmogorov-Smirnov test (p < 0.001) indicated that the distributions differ in shape in addition to location.
t.test(training[training$IS_SPAM== 1,]$LENGTH, training[training$IS_SPAM== 0,]$LENGTH, alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: training[training$IS_SPAM == 1, ]$LENGTH and training[training$IS_SPAM == 0, ]$LENGTH
## t = 13.873, df = 919.88, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 75.23811 100.03317
## sample estimates:
## mean of x mean of y
## 135.64666 48.01102
wilcox.test(training[training$IS_SPAM== 1,]$LENGTH, training[training$IS_SPAM== 0,]$LENGTH, correct=FALSE, conf.int=TRUE)
##
## Wilcoxon rank sum test
##
## data: training[training$IS_SPAM == 1, ]$LENGTH and training[training$IS_SPAM == 0, ]$LENGTH
## W = 397360, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
## 26.99996 36.00001
## sample estimates:
## difference in location
## 30.99999
ks.test(training[training$IS_SPAM== 1,]$LENGTH, training[training$IS_SPAM== 0,]$LENGTH, alternative = "two.sided")
##
## Two-sample Kolmogorov-Smirnov test
##
## data: training[training$IS_SPAM == 1, ]$LENGTH and training[training$IS_SPAM == 0, ]$LENGTH
## D = 0.39772, p-value < 2.2e-16
## alternative hypothesis: two-sided
Not all of the numeric features showed more variety or dynamism in the spam comments than in the ham. For example, the percentage of a comment’s characters that are alphabetic (that is, ‘a-z’ or ‘A-Z’) shows a wider range of values for ham than for spam, and has a very different distribution. Looking at the left plot, below, we can see that ham comments are much more likely to contain only alphabetic characters, or none at all. Further, though the full range of percentages appears to be approximately the same for spam and ham in the general population, when plotted by video we can see that spam for a given video tends to have a noticeably smaller range of percentages of alphabetic characters than does ham for the same video.
The results of the statistical tests for this feature were interesting. While the two distributions have very different shapes, they have very similar means and medians. As such, neither the t-test nor the Wilcoxon rank sum test indicated a statistically significant difference between them.
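The calls that produced the output below aren’t echoed here; they presumably mirror the tests applied to LENGTH above. A minimal sketch, assuming the same training data frame and column names:
t.test(training[training$IS_SPAM== 1,]$PERCENT_A, training[training$IS_SPAM== 0,]$PERCENT_A, alternative = "two.sided")
wilcox.test(training[training$IS_SPAM== 1,]$PERCENT_A, training[training$IS_SPAM== 0,]$PERCENT_A, correct=FALSE, conf.int=TRUE)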
##
## Welch Two Sample t-test
##
## data: training[training$IS_SPAM == 1, ]$PERCENT_A and training[training$IS_SPAM == 0, ]$PERCENT_A
## t = 1.663, df = 1240.4, p-value = 0.09657
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.001928211 0.023384289
## sample estimates:
## mean of x mean of y
## 0.764023 0.753295
##
## Wilcoxon rank sum test
##
## data: training[training$IS_SPAM == 1, ]$PERCENT_A and training[training$IS_SPAM == 0, ]$PERCENT_A
## W = 271750, p-value = 0.4812
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
## -0.004616832 0.009703949
## sample estimates:
## difference in location
## 0.002603246
However, the distributions are different to a statistically significant degree, confirmed by the significant p-value of the KS test (p < 0.001), which is sensitive to both distribution shape as well as location.
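Again, the test call itself isn’t echoed; the output below was presumably produced by something like:
ks.test(training[training$IS_SPAM== 1,]$PERCENT_A, training[training$IS_SPAM== 0,]$PERCENT_A, alternative = "two.sided")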
##
## Two-sample Kolmogorov-Smirnov test
##
## data: training[training$IS_SPAM == 1, ]$PERCENT_A and training[training$IS_SPAM == 0, ]$PERCENT_A
## D = 0.11295, p-value = 0.0001816
## alternative hypothesis: two-sided
Word counts, too (such as the alphanumeric word count, below), show statistically significant differences between the ham and spam populations overall.
##
## Welch Two Sample t-test
##
## data: training[training$IS_SPAM == 1, ]$ALPHANUM_WC and training[training$IS_SPAM == 0, ]$ALPHANUM_WC
## t = 13.198, df = 941.46, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 13.00528 17.54857
## sample estimates:
## mean of x mean of y
## 24.421555 9.144628
##
## Wilcoxon rank sum test
##
## data: training[training$IS_SPAM == 1, ]$ALPHANUM_WC and training[training$IS_SPAM == 0, ]$ALPHANUM_WC
## W = 385750, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
## 4.000067 5.999944
## sample estimates:
## difference in location
## 5.000021
##
## Two-sample Kolmogorov-Smirnov test
##
## data: training[training$IS_SPAM == 1, ]$ALPHANUM_WC and training[training$IS_SPAM == 0, ]$ALPHANUM_WC
## D = 0.3482, p-value < 2.2e-16
## alternative hypothesis: two-sided
In addition to being sensitive to the video a comment was posted to, many of the features are also sensitive to the gender of the video’s performer; the alphanumeric word count, for example, differs by performer gender.
##
## Welch Two Sample t-test
##
## data: training[training$A_ISMALE == 1, ]$ALPHANUM_WC and training[training$A_ISMALE == 0, ]$ALPHANUM_WC
## t = -3.5847, df = 955.97, p-value = 0.0003545
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -7.415756 -2.168750
## sample estimates:
## mean of x mean of y
## 15.01320 19.80545
##
## Two-sample Kolmogorov-Smirnov test
##
## data: training[training$A_ISMALE == 1, ]$ALPHANUM_WC and training[training$A_ISMALE == 0, ]$ALPHANUM_WC
## D = 0.10353, p-value = 0.00129
## alternative hypothesis: two-sided
a <- ggplot(training, aes(VIDEO, ALPHANUM_WC, fill = IS_SPAM)) +
geom_violin() +
ggtitle("AlphaNumeric Word Count\nby Video") +
xlab("Video") + ylab("AlphaNumeric Word Count") +
scale_fill_discrete(labels = c("Ham", "Spam")) +
theme(legend.position="none") +
scale_x_discrete(breaks=c("eminem", "katyperry", "lmfao", "psy", "shakira"),
labels=c("Eminem", "Katy\nPerry", "LMFAO", "Psy", "Shakira"))
b <- ggplot(training, aes(A_ISMALE, ALPHANUM_WC, fill = IS_SPAM)) +
geom_violin() +
ggtitle("AlphaNumeric Word Count\nby Performer Gender") +
xlab("Performer Gender") +
scale_fill_discrete(labels = c("Ham", "Spam")) +
guides(fill=guide_legend(title=NULL)) +
theme(axis.title.y = element_blank()) +
scale_x_discrete(breaks=c("0","1"), labels=c("Female", "Male"))
multiplot(a, b, cols = 2)
In addition to comparing the distributions of the numeric features for spam and ham, I also used plots and statistical tests to evaluate the binary context-, word-, and word class-based features and their associations with spam or ham. Since evaluating these binary features involves looking at their co-occurrence with spam or ham (itself a binary outcome), I used a chi-squared test to determine whether the difference in the proportions of ham and spam exhibiting a given feature was statistically significant.
For example, we can see from the following chi-squared test that the difference between the proportions of ham and spam comments that include profanity is statistically significant, with a very low p-value (p = 0.004) and a 95% confidence interval that doesn’t contain 0.
prop.test(table(training$PROFANITY, training$IS_SPAM), correct=FALSE)
##
## 2-sample test for equality of proportions without continuity
## correction
##
## data: table(training$PROFANITY, training$IS_SPAM)
## X-squared = 8.177, df = 1, p-value = 0.004242
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.3300386 -0.0737666
## sample estimates:
## prop 1 prop 2
## 0.4904051 0.6923077
However, for some features, even when there is a statistically significant difference between the proportions of spam and ham in the general population, the pattern isn’t necessarily found in all subsets of the population, such as within the comments for a single video. For example, the test above demonstrated that the difference in the proportions of ham and spam that contain profanity is statistically significant overall, yet when the counts are calculated by video we see that the comments on Shakira’s and Eminem’s videos clearly don’t exhibit the same relationship to profanity. Even so, given the generally indicative nature of the presence of profanity, this was a good feature to include in the model.
ggplot(training[training$PROFANITY==1,], aes(VIDEO, ..count..)) +
geom_bar(aes(fill=IS_SPAM), color = "black", position = "dodge") +
ggtitle("Contains Profanity, by Video and Class") +
xlab("Video") + ylab("Comment Count") +
scale_fill_discrete(labels = c("Ham", "Spam")) +
guides(fill=guide_legend(title=NULL)) +
scale_x_discrete(breaks=c("eminem", "katyperry", "lmfao", "psy", "shakira"),
labels=c("Eminem", "Katy\nPerry", "LMFAO", "Psy", "Shakira"))
Just as the presence of profanity appears to be a decent indicator of a ham comment, we can see that both of the features that indicate politeness (the ‘polite words’ and the modal verbs) are also significant features, though for spam. This is indicated by p < 0.001 for both features, as determined by chi-squared tests without continuity correction.
##
## 2-sample test for equality of proportions without continuity
## correction
##
## data: table(training$POLITE, training$IS_SPAM)
## X-squared = 192.79, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## 0.5056834 0.5768396
## sample estimates:
## prop 1 prop 2
## 0.56771654 0.02645503
##
## 2-sample test for equality of proportions without continuity
## correction
##
## data: table(training$MODAL, training$IS_SPAM)
## X-squared = 27.478, df = 1, p-value = 1.589e-07
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## 0.2402342 0.4349148
## sample estimates:
## prop 1 prop 2
## 0.5121777 0.1746032
Unlike the presence of profanity, however, we can see from the following plots that the presence of the politeness markers is very consistently associated with a single comment class (here, spam) across all of the videos.
a <- ggplot(training[training$POLITE==1,], aes(VIDEO, ..count..)) +
geom_bar(aes(fill = IS_SPAM), color = "black", position = "dodge") +
ggtitle("Uses 'Please' and 'Thank You'") +
xlab("Video") + ylab("Comment Count") +
theme(legend.position="none") +
scale_x_discrete(breaks=c("eminem", "katyperry", "lmfao", "psy", "shakira"),
labels=c("Eminem", "Katy\nPerry", "LMFAO", "Psy", "Shakira"))
b <- ggplot(training[training$MODAL ==1,], aes(VIDEO, ..count..)) +
geom_bar(aes(fill = IS_SPAM),color = "black", position = "dodge") +
ggtitle("Uses 'Could' or 'Would'") +
xlab("Video") +
scale_fill_discrete(labels = c("Ham", "Spam")) +
guides(fill=guide_legend(title=NULL)) +
theme(axis.title.y = element_blank()) +
scale_x_discrete(breaks=c("eminem", "katyperry", "lmfao", "psy", "shakira"),
labels=c("Eminem", "Katy\nPerry", "LMFAO", "Psy", "Shakira"))
multiplot(a, b, cols = 2)
Some of the features (in particular mentioning a website or Facebook, containing a link, or specifically using the word “visit”) are associated entirely or almost entirely with spam, as we can see from spam’s dominance in the following plots.
Many of the features that tend to be associated with ham, on the other hand, aren’t as exclusively tied to that class as the features above are to spam. As we can see from the following plots, comments that mention the performer’s name or the song title, use an exclamation, or mention a large quantity are more often ham than spam, but these word categories are not exclusive, or even nearly exclusive, to ham.
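Plots of this kind follow the same pattern as the profanity plot above; as a sketch, the one for the performer-name feature (the ARTIST column used in the tests below) could be built like this, with the title chosen only for illustration:
ggplot(training[training$ARTIST==1,], aes(VIDEO, ..count..)) +
  geom_bar(aes(fill=IS_SPAM), color = "black", position = "dodge") +
  ggtitle("Mentions the Performer, by Video and Class") +
  xlab("Video") + ylab("Comment Count") +
  scale_fill_discrete(labels = c("Ham", "Spam")) +
  guides(fill=guide_legend(title=NULL)) +
  scale_x_discrete(breaks=c("eminem", "katyperry", "lmfao", "psy", "shakira"),
                   labels=c("Eminem", "Katy\nPerry", "LMFAO", "Psy", "Shakira"))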
However, though these particular features occur in both ham and spam comments, their presence is still a good indicator of ham, as shown by the statistically significant difference between the proportions of spam and ham comments that exhibit them (p < 0.001 for all four features).
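The four tests reported below follow the same prop.test pattern as before; a compact sketch that loops over the feature columns:
for (f in c("ARTIST", "TITLE", "EXCLAMATION", "QUANTS")){
  print(prop.test(table(training[[f]], training$IS_SPAM), correct = FALSE))
}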
##
## 2-sample test for equality of proportions without continuity
## correction
##
## data: table(training$ARTIST, training$IS_SPAM)
## X-squared = 41.371, df = 1, p-value = 1.259e-10
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.3428255 -0.1960698
## sample estimates:
## prop 1 prop 2
## 0.4680523 0.7375000
##
## 2-sample test for equality of proportions without continuity
## correction
##
## data: table(training$TITLE, training$IS_SPAM)
## X-squared = 25.52, df = 1, p-value = 4.378e-07
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.3403034 -0.1648505
## sample estimates:
## prop 1 prop 2
## 0.4789045 0.7314815
##
## 2-sample test for equality of proportions without continuity
## correction
##
## data: table(training$EXCLAMATION, training$IS_SPAM)
## X-squared = 18.468, df = 1, p-value = 1.728e-05
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.4075896 -0.1831122
## sample estimates:
## prop 1 prop 2
## 0.4864672 0.7818182
##
## 2-sample test for equality of proportions without continuity
## correction
##
## data: table(training$QUANTS, training$IS_SPAM)
## X-squared = 30.137, df = 1, p-value = 4.025e-08
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.4514741 -0.2610204
## sample estimates:
## prop 1 prop 2
## 0.4824624 0.8387097
We can also compare the statistical significance of my broader theme features (like “Pan-YouTube”) to that of the individual features used to construct them. Looking at the results of the chi-squared tests, we see that while all four of these features are statistically significant, the broader theme feature has a more extreme (further from 0) 95% confidence interval, suggesting that creating (sensible) theme-based features from terms related by semantic frames effectively magnifies the effect that the component words would have had on their own. (Compare the 95% confidence intervals of 0.52-0.59, 0.50-0.57, and 0.38-0.49 for the individual features to 0.57-0.65 for the broad theme feature that combines all of these terms.)
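The outputs below come from the same kind of proportion test, applied to the three component features and to the combined theme feature; the theme-level call, for example, is simply:
prop.test(table(training$PAN_YOUTUBE, training$IS_SPAM), correct = FALSE)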
##
## 2-sample test for equality of proportions without continuity
## correction
##
## data: table(training$SUBSCIBE, training$IS_SPAM)
## X-squared = 200.26, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## 0.5202956 0.5854764
## sample estimates:
## prop 1 prop 2
## 0.56884343 0.01595745
##
## 2-sample test for equality of proportions without continuity
## correction
##
## data: table(training$CHANNEL, training$IS_SPAM)
## X-squared = 147.11, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## 0.5025190 0.5687805
## sample estimates:
## prop 1 prop 2
## 0.54973424 0.01408451
##
## 2-sample test for equality of proportions without continuity
## correction
##
## data: table(training$YOUTUBE, training$IS_SPAM)
## X-squared = 126.29, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## 0.3786685 0.4859371
## sample estimates:
## prop 1 prop 2
## 0.5553797 0.1230769
##
## 2-sample test for equality of proportions without continuity
## correction
##
## data: table(training$PAN_YOUTUBE, training$IS_SPAM)
## X-squared = 445.29, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## 0.5712936 0.6454232
## sample estimates:
## prop 1 prop 2
## 0.67439614 0.06603774
After exploring the data and confirming that the features I created were promising, I fit a logistic regression model to the data using elastic net regularization. Logistic regression was chosen because the labels are binary (essentially “is spam” or “isn’t spam”). Elastic net regularization was chosen because the data exploration demonstrated numerous interactions between features: rather than eliminating a feature entirely (a danger of stepwise methods), this method simply shrinks its weight, so a feature that contributes to an interaction effect I didn’t notice, but has little effect on its own, is not discarded. Elastic net was also attractive because I had intentionally created some features that were not independent of each other, and its combination of ridge and lasso penalties handles correlated predictors rather than assuming the features are independent.
Before developing the model, I scaled the numeric columns of both the training and testing data using Z-score normalization.
numerics <- c("LENGTH", "PERCENT_A", "PERCENT_N", "PERCENT_NONAN", "RAW_WC", "ALPHANUM_WC", "MAX_WLEN", "MIN_WLEN", "MEAN_WLEN", "NORM_WC")
for (x in numerics){
training[,x] <- scale(training[,x])
testing[,x] <- scale(testing[,x])}
Then I created a model matrix from the training data, including all of the features except the ‘intercept’ column. I then computed the singular value decomposition of this model matrix and used a histogram of the singular values to check for any outrageously large values.
require(glmnet)
mod.mx <- model.matrix(IS_SPAM ~ ., data = training)
mod.mx <- mod.mx[,2:ncol(mod.mx)]
#mod.mx[1:2,]
mod.svd <- svd(mod.mx)
hist(mod.svd$d)
In order to use the model for predictions on the “new” testing data, I also created a model matrix from the testing data, once again dropping the ‘intercept’ column.
test.mod.mx <- model.matrix(IS_SPAM ~ ., data = testing)
test.mod.mx <- test.mod.mx[,2:ncol(test.mod.mx)]
Then, using elastic net regression (a combination of ridge and lasso methods, weighting both methods equally using alpha = 0.50) and the training data model matrix, I calculated the model coefficients for 20 different values of the shrinkage/tuning parameter lambda.
mod.elnet = glmnet(mod.mx, training$IS_SPAM, family = 'binomial', nlambda = 20, alpha = 0.5)
par(mfrow = c(1,2))
plot(mod.elnet, xvar = 'lambda', label = TRUE)
plot(mod.elnet, xvar = 'dev', label = TRUE)
Finally, I created a function to display key metrics in order to evaluate and compare the performance of the models at different lambda values, and subsequently to compare these models to those based on stemmed tokens alone. The metrics included were Recall (the True Positive Rate), Specificity (the True Negative Rate), Fallout (the False Positive Rate), Miss Rate (the False Negative Rate), Precision (or Positive Predictive Value), Accuracy (the percentage of comments correctly classified), and F1 (the harmonic mean of precision and recall).
The performance of the model at the 5th, 10th, 15th, 18th, and 20th lambda values along the regularization path was then assessed:
getMetrics <- function(dataFrame, modOut, modIn, l, src = "Enriched"){
  # Predict the class of each comment using the l-th lambda value of the fitted path
  dataFrame[,"PREDICTION"] <- predict(modOut, newx = modIn, type="class")[,l]
  # Confusion-matrix counts, treating spam ("1") as the positive class
  TP <- nrow(dataFrame[dataFrame[,"IS_SPAM"] == "1" & dataFrame[,"PREDICTION"] == "1",])
  FP <- nrow(dataFrame[dataFrame[,"IS_SPAM"] == "0" & dataFrame[,"PREDICTION"] == "1",])
  TN <- nrow(dataFrame[dataFrame[,"IS_SPAM"] == "0" & dataFrame[,"PREDICTION"] == "0",])
  FN <- nrow(dataFrame[dataFrame[,"IS_SPAM"] == "1" & dataFrame[,"PREDICTION"] == "0",])
  TPR <- TP/(TP+FN)              # recall / sensitivity
  TNR <- TN/(TN+FP)              # specificity
  FNR <- 1-TPR                   # miss rate
  FPR <- 1-TNR                   # fallout
  PPV <- TP/(TP+FP)              # precision
  acc <- (TP+TN)/(TP+TN+FP+FN)   # accuracy
  f1 <- (2*TP)/((2*TP)+FP+FN)    # F1: harmonic mean of precision and recall
  cat("Performance for Model from", src, "Features, lambda =", l,
      ", Degrees of Freedom =", modOut$df[l], "\n",
      "TPR:", round(TPR,3), "\tTNR:", round(TNR,3),
      "\tFNR:", round(FNR,3), "\tFPR:", round(FPR,3),
      "\nPrecision:", round(PPV,3), "\tRecall:", round(TPR,3),
      "\tAccuracy:", round(acc,3), "\tF1:", round(f1,3), "\n\n")}
getMetrics(testing, mod.elnet, test.mod.mx, l = 5)
## Performance for Model from Enriched Features, lambda = 5 , Degrees of Freedom = 19
## TPR: 0.886 TNR: 0.978 FNR: 0.114 FPR: 0.022
## Precision: 0.98 Recall: 0.886 Accuracy: 0.928 F1: 0.931
##
##
getMetrics(testing, mod.elnet, test.mod.mx, l = 10)
## Performance for Model from Enriched Features, lambda = 10 , Degrees of Freedom = 45
## TPR: 0.919 TNR: 0.978 FNR: 0.081 FPR: 0.022
## Precision: 0.98 Recall: 0.919 Accuracy: 0.946 F1: 0.949
##
##
getMetrics(testing, mod.elnet, test.mod.mx, l = 15)
## Performance for Model from Enriched Features, lambda = 15 , Degrees of Freedom = 48
## TPR: 0.923 TNR: 0.973 FNR: 0.077 FPR: 0.027
## Precision: 0.977 Recall: 0.923 Accuracy: 0.946 F1: 0.949
##
##
getMetrics(testing, mod.elnet, test.mod.mx, l = 18)
## Performance for Model from Enriched Features, lambda = 18 , Degrees of Freedom = 52
## TPR: 0.923 TNR: 0.969 FNR: 0.077 FPR: 0.031
## Precision: 0.973 Recall: 0.923 Accuracy: 0.944 F1: 0.947
##
##
getMetrics(testing, mod.elnet, test.mod.mx, l = 20)
## Performance for Model from Enriched Features, lambda = 20 , Degrees of Freedom = 52
## TPR: 0.923 TNR: 0.969 FNR: 0.077 FPR: 0.031
## Precision: 0.973 Recall: 0.923 Accuracy: 0.944 F1: 0.947
##
##
Ultimately the model with the best performance was the one at the 15th lambda value (the model with 48 degrees of freedom). Its strength can be seen across its metrics: although it has the same recall as the models with 52 degrees of freedom (92.3%), it has the highest accuracy and F1 values of the models shown (94.6% and 94.9%, respectively), as well as the highest precision (97.7%). Looking more closely at the rates of true and false positives and negatives, we can see that while the model at 48 degrees of freedom does a good job of classifying actual spam as spam (TPR of 92.3%, FNR of 7.7%), it does even better at not classifying ham as spam (TNR of 97.3%, FPR of 2.7%).
To compare the context-based features model, above, to the performance of a token-only model, I created another logistic regression model, this time using as features the presence or absence of the most frequent terms in the training data. To do this, I used the fully normalized comment text that had been saved in the “text.train” data frame. For this model, unlike the one with context-based features, I did want to aggressively remove stopwords, since I wanted to limit the features to the most frequent but still informative words. I removed stopwords using the “SMART” stopword list from the package tm, then collapsed redundant whitespace and trimmed leading and trailing spaces. Finally, a new character column (‘CORP’) was added to the data frame and populated with the stemmed versions of the pre-processed comments. The entire process was then repeated for the data frame of testing data. A sample of the fully normalized and the final stemmed comments is shown below.
text.train$FULL_NORM <- tm::removeWords(x = text.train$FULL_NORM, stopwords(kind = "SMART"))
text.train$FULL_NORM <- sapply(text.train$FULL_NORM, function(x) str_replace_all(x," +"," "))
text.train$FULL_NORM <- sapply(text.train$FULL_NORM, function(x) stri_trim_both(x, pattern="\\P{Wspace}"))
text.train$CORP <- ""
i = 1
while (i <= nrow(text.train)){  # <= so the final comment is also stemmed
  temp <- unlist(strsplit(text.train[i,]$FULL_NORM, split = ' '))
  temp <- stemDocument(temp)
  temp.str <- str_c(temp, collapse=" ")
  text.train[i,]$CORP <- temp.str
  i = i + 1
}
text.test$FULL_NORM <- tm::removeWords(x = text.test$FULL_NORM, stopwords(kind = "SMART"))
text.test$FULL_NORM <- sapply(text.test$FULL_NORM, function(x) str_replace_all(x," +"," "))
text.test$FULL_NORM <- sapply(text.test$FULL_NORM, function(x) stri_trim_both(x, pattern="\\P{Wspace}"))
text.test$CORP <- ""
i = 1
while (i <= nrow(text.test)){  # <= so the final comment is also stemmed
  temp <- unlist(strsplit(text.test[i,]$FULL_NORM, split = ' '))
  temp <- stemDocument(temp)
  temp.str <- str_c(temp, collapse=" ")
  text.test[i,]$CORP <- temp.str
  i = i + 1
}
text.train[3:6,c("FULL_NORM","CORP")]
## # A tibble: 4 x 2
## FULL_NORM
## <chr>
## 1 hey check website site kids stuff kidsmediausa
## 2 turned mute wanted check views
## 3 check channel
## 4 hey subscribe
## # ... with 1 more variables: CORP <chr>
Since I was simply selecting the most prevalent (but informative) terms for this model, I did not create word clouds. Instead, I relied solely on stem frequency and found the most prevalent stems directly in the content of the ‘CORP’ column itself, without creating a document-term matrix. This was achieved using the package qdap, whose freq_terms function returns both the words and their frequencies; a sample of this output for the top 5 stems is printed below. To select the specific words to use as model features, I took just the words themselves from the freq_terms output and assigned them to a character vector, limiting the results to the top 50 stems, since I wanted a model with approximately the same number of features as the original context-based features model.
require(qdap)
freq_terms(text.var = text.train$CORP, top = 5)
## WORD FREQ
## 1 check 401
## 2 video 286
## 3 song 256
## 4 subscrib 207
## 5 youtub 206
fqterms <- freq_terms(text.var = text.train$CORP, top = 50)
fqterms <- fqterms$WORD
Then, using the same prep_detected and detect functions I used for the original model, I created features for these top 50 stems by searching for them within the stemmed version of the comments. This process was then repeated for the testing data.
for (x in fqterms){text.train[,x] <- 0}
text.train$CORP <- prep_detected(text.train$CORP)
for (x in fqterms){text.train[,x] <- detect(text.train$CORP, x)}
text.train <- text.train %>% select(-FULL_NORM, -CORP)
text.train[,1:ncol(text.train)] <- lapply(text.train[,1:ncol(text.train)], factor)
for (x in fqterms){text.test[,x] <- 0}
text.test$CORP <- prep_detected(text.test$CORP)
for (x in fqterms){text.test[,x] <- detect(text.test$CORP, x)}
text.test <- text.test %>% select(-FULL_NORM, -CORP)
text.test[,1:ncol(text.test)] <- lapply(text.test[,1:ncol(text.test)], factor)
Two model matrices were then created using “IS_SPAM” as the label and all other columns as features: one for the training data, and another for the testing data.
mod.mx2 <- model.matrix(IS_SPAM ~ ., data = text.train)
mod.mx2 <- mod.mx2[,2:ncol(mod.mx2)] # remove 'intercept' column
test.mod.mx2 <- model.matrix(IS_SPAM ~ ., data = text.test)
test.mod.mx2 <- test.mod.mx2[,2:ncol(test.mod.mx2)]
Once again, a logistic regression model was fit to the training data using elastic net regularization with the same alpha (0.5), and its performance was evaluated at the same positions along the lambda path (5, 10, 15, 18, and 20) as the previous model.
mod.elnet2 = glmnet(mod.mx2, text.train$IS_SPAM, family = 'binomial', nlambda = 20, alpha = 0.5)
par(mfrow=c(1,2))
plot(mod.elnet2, xvar = 'lambda', label = TRUE)
plot(mod.elnet2, xvar = 'dev', label = TRUE)
Finally, the performance metrics were calculated for this model at several different lambda values. The results were compared to each other, and to those of the best fit of the original model.
getMetrics(text.test, mod.elnet2, test.mod.mx2, l = 5, src = "Token-Only")
## Performance for Model from Token-Only Features, lambda = 5 , Degrees of Freedom = 19
## TPR: 0.761 TNR: 0.987 FNR: 0.239 FPR: 0.013
## Precision: 0.986 Recall: 0.761 Accuracy: 0.863 F1: 0.859
##
##
getMetrics(text.test, mod.elnet2, test.mod.mx2, l = 10, src = "Token-Only")
## Performance for Model from Token-Only Features, lambda = 10 , Degrees of Freedom = 45
## TPR: 0.827 TNR: 0.956 FNR: 0.173 FPR: 0.044
## Precision: 0.957 Recall: 0.827 Accuracy: 0.885 F1: 0.888
##
##
getMetrics(text.test, mod.elnet2, test.mod.mx2, l = 15, src = "Token-Only")
## Performance for Model from Token-Only Features, lambda = 15 , Degrees of Freedom = 48
## TPR: 0.831 TNR: 0.956 FNR: 0.169 FPR: 0.044
## Precision: 0.958 Recall: 0.831 Accuracy: 0.887 F1: 0.89
##
##
getMetrics(text.test, mod.elnet2, test.mod.mx2, l = 18, src = "Token-Only")
## Performance for Model from Token-Only Features, lambda = 18 , Degrees of Freedom = 52
## TPR: 0.831 TNR: 0.956 FNR: 0.169 FPR: 0.044
## Precision: 0.958 Recall: 0.831 Accuracy: 0.887 F1: 0.89
##
##
getMetrics(text.test, mod.elnet2, test.mod.mx2, l = 20, src = "Token-Only")
## Performance for Model from Token-Only Features, lambda = 20 , Degrees of Freedom = 52
## TPR: 0.831 TNR: 0.956 FNR: 0.169 FPR: 0.044
## Precision: 0.958 Recall: 0.831 Accuracy: 0.887 F1: 0.89
##
##
Looking at the results for the token-only model at different lambda values, we see that the fits at the 15th, 18th, and 20th lambda values (48, 52, and 52 degrees of freedom, respectively) perform identically. This performance is superior to that of the fits at the 5th and 10th lambda values (19 and 45 degrees of freedom), as we can see from the greater values for recall (83.1 > 82.7 > 76.1%), accuracy (88.7 > 88.5 > 86.3%), and F1 (89.0 > 88.8 > 85.9%), even though the fit at the 5th lambda value has the best precision (98.6 vs. 95.7 and 95.8%). Of these fits for the token-only model, then, we can take the coefficients calculated at the 15th lambda value (48 degrees of freedom) to be the best option.
However, comparing this best fit of the token-only model to the best fit of the context-based features model, we can see that the latter performs better across the board: its precision (97.7 vs. 95.8%), recall (92.3 vs. 83.1%), accuracy (94.6 vs. 88.7%), and F1 (94.9 vs. 89.0%) are all clearly higher. Further, looking at the TPRs and FNRs we can see that the context-based features model classifies more actual spam as spam (compare TPRs of 92.3 and 83.1%), and from the TNRs and FPRs we can see that it also misclassifies less ham as spam (compare TNRs of 97.3 and 95.6%). Given this all-around better performance, the context-based features model is the clear choice.
getMetrics(text.test, mod.elnet2, test.mod.mx2, l = 15, src = "Token-Only")
## Performance for Model from Token-Only Features, lambda = 15 , Degrees of Freedom = 48
## TPR: 0.831 TNR: 0.956 FNR: 0.169 FPR: 0.044
## Precision: 0.958 Recall: 0.831 Accuracy: 0.887 F1: 0.89
##
##
getMetrics(testing, mod.elnet, test.mod.mx, l = 15)
## Performance for Model from Enriched Features, lambda = 15 , Degrees of Freedom = 48
## TPR: 0.923 TNR: 0.973 FNR: 0.077 FPR: 0.027
## Precision: 0.977 Recall: 0.923 Accuracy: 0.946 F1: 0.949
##
##