Optimizing YouTube Spam Detection with Machine Learning¶

Anna E. Jurgensen

In my project for Data Science 350b I demonstrated that a predictive model built using logistic regression with elastic net (L1 + L2) regularization and based on contextually-informed features performed better at classifying spam in YouTube comments than did a model based on an equivalent number of frequent non-stopword tokens. As an extension of that project, I subsequently applied several different machine learning algorithms to the contextually-informed feature set, using the same training and testing data, to identify the best performance that could be achieved with the features developed. The machine learning algorithms used for the comparison were support vector machines with linear, polynomial, and radial kernels, a classification tree, bagging, random forest, and boosting. Ultimately, the performance of the original regression model proved to be on par with that of the machine learning models, and the best-performing models, depending on whether precision or recall was prioritized, were those built with the support vector machine with a radial kernel and with bagging.

Data Used¶

For this project I used Tiago A. Almeida and Tulio C. Alberto's “YouTube Spam Collection” data set, available both from the UC Irvine Machine Learning Repository and from Almeida and Alberto's website. The set contains 1,956 public comments posted to five popular music videos from the same general time period: “Gangnam Style” by Psy, “Roar” by Katy Perry, “Party Rock Anthem” by LMFAO, “Love The Way You Lie” by Eminem, and “Waka Waka (This Time for Africa)” by Shakira. In addition to the comments themselves, the set contains the comment IDs, the comments' authors, the date and time of posting, and the categorization of each comment as spam or non-spam (‘ham’) as made by Almeida and Alberto.

The set of 1,956 observations was split into a training and testing set by randomly assigning 75% of the observations to training (1,468 observations) and the remaining 25% to testing (488 observations).
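The splitting code itself is not reproduced in this notebook; a minimal sketch of one way to perform such a split, assuming the full data set has been loaded into a data frame named comments (a hypothetical name), follows. The actual project may have used a different sampling call and seed.

# Sketch only: randomly assign roughly 75% of rows to training, the rest to testing.
# `comments` and the seed are assumptions, not the project's actual code.
set.seed(42)
in.train <- runif(nrow(comments)) < 0.75
training <- comments[in.train, ]
testing  <- comments[!in.train, ]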

In [2]:
nrow(training)
nrow(testing)
1468
488

Features Created¶

The features I selected for modeling were based on a number of patterns in the use of individual words, as well as classes of words, that I had observed in my exploration of the comment data. For example, in comparing the words that occurred more frequently in one class of comments than the other, I found that profanity ("damn", "fuck", etc.) occurred much more frequently in non-spam, while polite words ("thanks", "thank you", and "please") occurred much more frequently in spam. Additionally, I had hypothesized that there would be salient differences in the typographic style of spam and non-spam comments, measured by the percentage of the comment made up of alphabetic characters, numeric characters, or non-alphanumeric characters, as well as by comment length in characters and word counts at various stages of text processing. For a more complete explanation of the features I selected, my rationale for selecting them, and the regular expressions I used to compute the features from the comment text, see the "Feature Building" section of my original project report.
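To make these feature definitions concrete, the following is a rough sketch of how a few such features could be derived from the raw comment text with base R string functions. The regular expressions are simplified approximations, the data frame comments and its CONTENT column are assumed names, and the feature names PCT_ALPHA and PCT_NUM are illustrative stand-ins rather than the exact names used in the project; the numeric features were also rescaled before modeling, which is not shown here.

# Sketch only: derive a few binary and typographic features from the raw comment text.
comments$HAS_LINK  <- as.integer(grepl("http:|https:|www\\.", comments$CONTENT, ignore.case = TRUE))
comments$POLITE    <- as.integer(grepl("thank you|thanks|please", comments$CONTENT, ignore.case = TRUE))
comments$LENGTH    <- nchar(comments$CONTENT)                          # comment length in characters
n.alpha            <- nchar(gsub("[^A-Za-z]", "", comments$CONTENT))   # count of alphabetic characters
n.num              <- nchar(gsub("[^0-9]", "", comments$CONTENT))      # count of numeric characters
comments$PCT_ALPHA <- n.alpha / comments$LENGTH                        # share of the comment that is alphabetic
comments$PCT_NUM   <- n.num / comments$LENGTH                          # share of the comment that is numeric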

Logistic Regression Model¶

The original model developed was a logistic regression model with elastic net regularization, built using the glmnet package. The model with 49 degrees of freedom (DoF) produces the preferred results. Though models with fewer degrees of freedom have slightly higher precision (97.0-97.4% vs. 96.2%), the model with 49 DoF has the highest recall of any of the models tested (90.8% vs. 90.4% or less). This is a worthwhile trade-off, given that precision is already very high across these models while recall is the comparatively weaker metric.

In [3]:
require(glmnet)
mod.mx <- model.matrix(IS_SPAM ~ ., data = training)
mod.mx <- mod.mx[,2:ncol(mod.mx)] 

test.mod.mx <- model.matrix(IS_SPAM ~ ., data = testing)
test.mod.mx <- test.mod.mx[,2:ncol(test.mod.mx)] 

mod.elnet = glmnet(mod.mx, training$IS_SPAM, family = 'binomial', nlambda = 20, 
                   alpha = 0.5)

getMetrics <- function(dataFrame, modOut, modIn, l, src = "Enriched"){
  # Predict class labels at the l-th lambda value, then tally the confusion matrix
  dataFrame[,"PREDICTION"] <- predict(modOut, newx = modIn, type="class")[,l]
  TP <- nrow(dataFrame[dataFrame[,"IS_SPAM"] == "1" & dataFrame[,"PREDICTION"] == "1",])
  FP <- nrow(dataFrame[dataFrame[,"IS_SPAM"] == "0" & dataFrame[,"PREDICTION"] == "1",])
  TN <- nrow(dataFrame[dataFrame[,"IS_SPAM"] == "0" & dataFrame[,"PREDICTION"] == "0",])
  FN <- nrow(dataFrame[dataFrame[,"IS_SPAM"] == "1" & dataFrame[,"PREDICTION"] == "0",])
  TPR <- TP/(TP+FN)             # recall (sensitivity)
  TNR <- TN/(TN+FP)             # specificity
  FNR <- 1-TPR
  FPR <- 1-TNR
  PPV  <- TP/(TP+FP)            # precision
  acc <- (TP+TN)/(TP+TN+FP+FN)  # accuracy
  f1 <- (2*TP)/((2*TP)+FP+FN)   # F1 score
  cat("Performance for Model from", src, "Features, lambda =", l, 
      ", Degrees of Freedom =", mod.elnet$df[l],  
      "\nPrecision:", round(PPV,3), "\tRecall:", round(TPR,3), "\tAccuracy:", 
      round(acc,3), "\tF1:", round(f1,3), "\n\n", collapse = " ")}

getMetrics(testing, mod.elnet, test.mod.mx, l = 5)
getMetrics(testing, mod.elnet, test.mod.mx, l = 10)
getMetrics(testing, mod.elnet, test.mod.mx, l = 13)
getMetrics(testing, mod.elnet, test.mod.mx, l = 15) # use this 
getMetrics(testing, mod.elnet, test.mod.mx, l = 18)
getMetrics(testing, mod.elnet, test.mod.mx, l = 20)
Loading required package: glmnet
Loading required package: Matrix
Loading required package: foreach
Loaded glmnet 2.0-5

Performance for Model from Enriched Features, lambda = 5 , Degrees of Freedom = 18 
Precision: 0.952 	Recall: 0.873 	Accuracy: 0.912 	F1: 0.911 

  Performance for Model from Enriched Features, lambda = 10 , Degrees of Freedom = 41 
Precision: 0.974 	Recall: 0.904 	Accuracy: 0.939 	F1: 0.938 

  Performance for Model from Enriched Features, lambda = 13 , Degrees of Freedom = 47 
Precision: 0.97 	Recall: 0.904 	Accuracy: 0.936 	F1: 0.936 

  Performance for Model from Enriched Features, lambda = 15 , Degrees of Freedom = 49 
Precision: 0.962 	Recall: 0.908 	Accuracy: 0.934 	F1: 0.934 

  Performance for Model from Enriched Features, lambda = 18 , Degrees of Freedom = 51 
Precision: 0.962 	Recall: 0.908 	Accuracy: 0.934 	F1: 0.934 

  Performance for Model from Enriched Features, lambda = 20 , Degrees of Freedom = 52 
Precision: 0.962 	Recall: 0.908 	Accuracy: 0.934 	F1: 0.934 

  

Support Vector Machines¶

SVM with Linear Kernel¶

Using the e1071 package and its tune function, which performs 10-fold cross-validation for model tuning, I then trained support vector machines with a linear kernel on the same training data, developing and comparing models at seven different values of the cost parameter. Of the cost values tested (.001, .01, .1, 1, 5, 10, 100), the value 5 resulted in the smallest cross-validation error.

In [4]:
require(e1071)
tune.out <- tune(svm, IS_SPAM ~., data = training, kernel = "linear", 
                 ranges = list(cost=c(.001, .01, .1, 1, 5, 10, 100)))
summary(tune.out)
Loading required package: e1071
Parameter tuning of 'svm':

- sampling method: 10-fold cross validation 

- best parameters:
 cost
    5

- best performance: 0.0510903 

- Detailed performance results:
   cost      error dispersion
1 1e-03 0.19411518 0.03973172
2 1e-02 0.07697791 0.02923630
3 1e-01 0.06539931 0.02273429
4 1e+00 0.05994316 0.01970865
5 5e+00 0.05109030 0.01546464
6 1e+01 0.05177057 0.01762162
7 1e+02 0.05177057 0.01544458

The performance of the SVM with linear kernel is practically the same as that of the regularized logistic regression: it has only slightly higher precision (97.0% vs. 96.2%) and accuracy (93.9% vs. 93.4%), and precisely the same recall (90.8%).

In [5]:
best.mod <- tune.out$best.model

results <- testing %>% dplyr::select(IS_SPAM)
get_results <- function(dataFrame){
  # Confusion-matrix metrics for predictions already stored in dataFrame$PREDICTION
  TP <- nrow(dataFrame[dataFrame[,"IS_SPAM"] == "1" & dataFrame[,"PREDICTION"] == "1",])
  FP <- nrow(dataFrame[dataFrame[,"IS_SPAM"] == "0" & dataFrame[,"PREDICTION"] == "1",])
  TN <- nrow(dataFrame[dataFrame[,"IS_SPAM"] == "0" & dataFrame[,"PREDICTION"] == "0",])
  FN <- nrow(dataFrame[dataFrame[,"IS_SPAM"] == "1" & dataFrame[,"PREDICTION"] == "0",])
  TPR <- TP/(TP+FN)
  TNR <- TN/(TN+FP)
  FNR <- 1-TPR
  FPR <- 1-TNR
  PPV  <- TP/(TP+FP)
  acc <- (TP+TN)/(TP+TN+FP+FN)
  f1 <- (2*TP)/((2*TP)+FP+FN)
  cat("\nPrecision:", round(PPV,3), "\tRecall:", round(TPR,3), 
      "\tAccuracy:", round(acc,3), "\tF1:", round(f1,3), "\n\n", collapse = " ")}

results$PREDICTION <- predict(best.mod, testing[,2:length(testing)])
get_results(results)
Precision: 0.97 	Recall: 0.908 	Accuracy: 0.939 	F1: 0.938 

  

SVM with Polynomial Kernel¶

Applying an SVM with a polynomial kernel to the data results in a model with poorer overall performance than either the regularized logistic regression or the SVM with a linear kernel. The polynomial-kernel model does classify the spam in the test data with much greater precision than the previous models (98.6% vs. 96.2-97.0%), but it under-classifies spam, as indicated by its much lower recall (81.7% vs. 90.8%).

In [ ]:
svm.poly <- tune(svm, IS_SPAM ~., data = training, kernel = "polynomial", 
                 ranges = list(cost=c(.001, .01, .1, 1, 5, 10, 100)))
#summary(svm.poly)  
best.mod <- svm.poly$best.model
results$PREDICTION <- predict(best.mod, testing[,2:length(testing)])
get_results(results)
Precision: 0.986 	Recall: 0.817 	Accuracy: 0.9 	F1: 0.893 

  

SVM with Radial Kernel¶

The model developed with an SVM with a radial kernel, on the other hand, performs better than those from the regularized logistic regression and the SVM with a linear kernel. Not only does the radial-kernel model match the high precision of those models (97.0%), it also provides a slight boost in recall (91.6% vs. 90.8%). Given that recall generally lags behind precision for these models, this is a definite, if slight, advantage.

In [7]:
svm.radial <- tune(svm, IS_SPAM ~., data = training, kernel = "radial", 
                   ranges = list(cost=c(.001, .01, .1, 1, 5, 10)))
#summary(svm.radial)  
best.mod <- svm.radial$best.model
results$PREDICTION <- predict(best.mod, testing[,2:length(testing)])
get_results(results)
Precision: 0.97 	Recall: 0.916 	Accuracy: 0.943 	F1: 0.943 

  

Decision Trees¶

In addition to support vector machines, I applied several decision tree-based algorithms for model development: classification tree, bagging, random forest, and boosting.

Classification Tree¶

Applied to the training data, the tree algorithm from package tree ultimately constructed a model using ten of the fifty features:

  • 'PAN_ACTION'- binary feature indicating the presence of "visit", "watch", "view", or "check" (with no morphological variation) in the comment text
  • 'MAX_WLEN'- (regularized) numeric feature indicating the character count of the longest space-separated string of alphanumeric characters (here, considered a "word") in the comment
  • 'HAS_LINK'- binary feature indicating the presence of "http:","https:", or "www."
  • 'POLITE'- binary feature indicating the presence of "thank you", "thanks", or "please"
  • 'CHANNEL'- binary feature indicating the presence of "channel", "channels", or spelling variation with a single "n"
  • 'LIKE'- binary feature indicating the presence of "like" with no variation
  • 'SUBSCRIBE'- binary feature indicating the presence of "subscription(s)", or "subscrib-"
  • 'LENGTH'- (regularized) numeric feature indicating the character count of the entire comment before any text processing or simplification
  • 'TITLE'- binary feature indicating the presence in the comment text of the song title for the video to which the comment is posted (e.g. "Gangnam Style", or just "Gangnam" or "Style", mentioned in a comment made to Psy's "Gangnam Style" video)
  • 'VIEWS'- binary feature indicating the presence of "views"
In [8]:
require(tree)
treeSpam = tree(IS_SPAM ~., data = training)
summary(treeSpam)
Loading required package: tree
Classification tree:
tree(formula = IS_SPAM ~ ., data = training)
Variables actually used in tree construction:
 [1] "PAN_ACTION" "MAX_WLEN"   "HAS_LINK"   "POLITE"     "CHANNEL"   
 [6] "LIKE"       "SUBSCRIBE"  "LENGTH"     "TITLE"      "VIEWS"     
Number of terminal nodes:  13 
Residual mean deviance:  0.3611 = 525.4 / 1455 
Misclassification error rate: 0.0545 = 80 / 1468 
In [131]:
library(repr)
options(repr.plot.width = 8.5, repr.plot.height = 4)
plot(treeSpam, type="uniform")
text(treeSpam, pretty=0)

When the decision tree constructed, above, classifies a comment as spam, it is very likely to truly be spam, as indicated by the model precision, below (96.9%, on par with 97.0% for the SVMs with linear and radial kernels). However, there are clearly still spam comments of a pattern that isn't accounted for in the decision tree model given that it has lower recall than other models (87.6%, vs. 91.6% for the SVM with radial kernel).

In [46]:
results$PREDICTION = predict(treeSpam, testing[,2:length(testing)], type="class")
get_results(results)
Precision: 0.969 	Recall: 0.876 	Accuracy: 0.922 	F1: 0.921 

  

Decision Trees with Bagging¶

Using the randomForest package, bagging (the use of multiple bootstrapped training sets, with all of the features of the data available to each tree) was applied by calling the randomForest function and specifying that all 50 feature variables be considered for each split of each tree trained. The model resulting from training 1,000 trees clearly benefited from the bootstrapped data and shows less bias than the simple decision tree model: precision and recall here are roughly even (93.3% and 93.6%, respectively), whereas precision was very high and recall relatively much lower for the simple decision tree model (96.9% and 87.6%, respectively). Of particular note is the high recall of this model when applied to the test data, 93.6%: it is higher than the model's precision, and the highest recall rate of any of the models tested.

In [96]:
require(randomForest)
bagSpam = randomForest(IS_SPAM ~., data = training, mtry=50, ntree = 1000)
bagSpam
results$PREDICTION = predict(bagSpam, testing[,2:length(testing)], type="class")
get_results(results)
Call:
 randomForest(formula = IS_SPAM ~ ., data = training, mtry = 50,      ntree = 1000) 
               Type of random forest: classification
                     Number of trees: 1000
No. of variables tried at each split: 50

        OOB estimate of  error rate: 4.7%
Confusion matrix:
    0   1 class.error
0 684  30  0.04201681
1  39 715  0.05172414
Precision: 0.933 	Recall: 0.936 	Accuracy: 0.932 	F1: 0.934 

  

Random Forest¶

For a random forest approach, bootstrapped resampling is once again applied, but the trees are grown considering only a random subset of the features at each split (here, seven features, roughly the square root of the 50 available). After training 1,000 trees, the resulting model has good, generally balanced performance (both precision and recall are in the low 90s: 93.5% and 92.4%, respectively).

An interesting observation can also be made about this model's out-of-bag (OOB) error rate, the model's error rate when each tree is validated on the training observations left out of its bootstrap sample. Despite this rate being slightly lower than that of the Bagging model (4.5% vs. 4.7%), the model has slightly lower accuracy when validated on the testing data (92.8% vs. 93.2% for Bagging). This, together with the higher precision and lower recall of the Random Forest model compared to the Bagging model (precision 93.5% and recall 92.4% vs. 93.3% and 93.6% for Bagging), suggests that the Random Forest model is subtly more biased than the Bagging model.

In [ ]:
ranForest = randomForest(IS_SPAM ~., data = training, mtry = round(sqrt(50), 0), ntree = 1000, importance = T)
results$PREDICTION = predict(ranForest, testing[,2:length(testing)], type="class")
ranForest
get_results(results)
Call:
 randomForest(formula = IS_SPAM ~ ., data = training, mtry = round(sqrt(50),      0), ntree = 1000, importance = T) 
               Type of random forest: classification
                     Number of trees: 1000
No. of variables tried at each split: 7

        OOB estimate of  error rate: 4.5%
Confusion matrix:
    0   1 class.error
0 689  25  0.03501401
1  41 713  0.05437666
Precision: 0.935 	Recall: 0.924 	Accuracy: 0.928 	F1: 0.93 

  

The random forest method, however, has the advantage of providing variable importance measurements for the data's features. Dot plots with ranked variable importance can be created with the function varImpPlot. The left plot, below, lists the 30 features that produce the greatest mean decrease in model accuracy when their values are permuted in the out-of-bag samples. Among the 30 features listed, we can see that all ten of those that had been selected by the simple decision tree algorithm ("HAS_LINK", "POLITE", "PAN_ACTION", "LENGTH", "MAX_WLEN", "SUBSCRIBE", "CHANNEL", "LIKE", "TITLE", and "VIEWS") appear in this top-30 list. However, they do not constitute the top 10 features of the ranking, and they are accompanied by other features, such as "PAN_YOUTUBE" (a binary feature indicating the presence of "subscription(s)", "subscrib-", "chan(n)el(s)", or "youtu-") and "ALPHANUM_WC" (the regularized numeric feature indicating the number of space-separated alphanumeric character strings).

The second plot (right) ranks the features by a different measure of importance: the mean decrease in the Gini index (i.e. the decrease in node impurity) when splitting occurs on the feature in question. From this ranking we can see that several features in particular are strong predictors for separating the training data into classes: "PAN_YOUTUBE", "PAN_ACTION", "MAX_WLEN", "LENGTH", "HAS_LINK", and "CHECK" (the binary feature indicating the presence of "check" with no morphological variation).

In [30]:
#importance(ranForest)
options(repr.plot.width = 7, repr.plot.height = 6.25)
varImpPlot(ranForest)

Decision Trees with Boosting¶

The last ML strategy I applied to the data was growing decision trees with boosting (i.e. the additive growth of trees) using the generalized boosted models package gbm. I trained the model by fitting 2,500 trees, with a minimum of 20 observations in terminal nodes and a shrinkage parameter of .01.

Of note, I first tried applying the generalized boosted modeling (gbm) function with the Bernoulli distribution, which is the appropriate choice for a dichotomous outcome variable. However, I experienced some difficulty in getting predicted values when using the Bernoulli distribution option (a problem apparently experienced by some others, as well). Instead, I simply used the Gaussian distribution when training the model, and subsequently selected an appropriate cut-off value to convert the model's numeric output to 0 (non-spam) or 1 (spam).
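For reference, a commonly suggested workaround, assuming the difficulty stems from gbm's "bernoulli" distribution expecting a numeric 0/1 response rather than a factor, is sketched below; this is an illustration, not the code used to produce the results reported here.

# Sketch only: convert the factor outcome to numeric 0/1 before fitting with
# distribution = "bernoulli". Object names here are illustrative.
training.num <- training
training.num$IS_SPAM <- as.numeric(as.character(training.num$IS_SPAM))
boostSpamB <- gbm(IS_SPAM ~ ., distribution = "bernoulli", data = training.num,
                  n.trees = 2500, shrinkage = .01, n.minobsinnode = 20)
# predict() with type = "response" then returns spam probabilities in [0, 1]:
# p.spam <- predict(boostSpamB, testing, n.trees = 2500, type = "response")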

In [114]:
require(gbm)
#boostSpam <- gbm(IS_SPAM ~., distribution = "bernoulli",
##               data = training,
#               n.trees = 2500,
#               shrinkage = .01)
#boostSpam  # distribution = "bernoulli" not working

boostSpam <- gbm(formula = IS_SPAM ~ .,
               distribution = "gaussian",
               data = training,
               n.trees = 2500,
               shrinkage = .01,
               n.minobsinnode = 20) 
boostSpam
gbm(formula = IS_SPAM ~ ., distribution = "gaussian", data = training, 
    n.trees = 2500, n.minobsinnode = 20, shrinkage = 0.01)
A gradient boosted model with gaussian loss function.
2500 iterations were performed.
There were 50 predictors of which 45 had non-zero influence.

To find the most promising cut-off values I then plotted a Receiver Operating Characteristic (ROC) curve and a precision/recall curve using the ROCR package. Since, ideally, I wanted a model with precision and recall both over 90%, I looked for a cutoff value that fell between 0.9 and 1.0 on the average true positive rate dimension of the ROC curve and between 0.9 and 1.0 on the precision/recall curve. A cutoff value at or just below 1.4 appears to provide the desired precision and recall.

In [129]:
library(ROCR)
results$PREDICTION <- predict(boostSpam, testing[,2:length(testing)], n.trees=2500)

options(repr.plot.width = 8, repr.plot.height = 4.5)
par(mfrow=c(1,2))
pred <- prediction(results$PREDICTION, results$IS_SPAM)

perf <- performance(pred, "tpr", "fpr") #ROC
plot(perf, colorize=T, print.cutoffs.at=seq(1,2,by=0.1), text.adj=c(-.25, .9), avg="threshold", lwd=3)

perf <- performance(pred, "prec", "rec") #precision/recall curve
plot(perf, colorize=T, print.cutoffs.at=seq(1,2,by=0.1), text.adj=c(1.2,1.2), avg="threshold", lwd=3)

To make the final decision on a cutoff value I then tested cutoffs from 1.40 down to 1.35 in steps of .01. Ultimately I chose the cutoff of 1.38: it tied for the highest accuracy and F1 of the cutoffs tested, and among those top-scoring cutoffs it offered the highest recall while keeping precision above 90%.

In [130]:
results$PREDICTION_0 <- results$PREDICTION

results$PREDICTION <- ifelse(results$PREDICTION_0 > 1.4, 1, 0)
get_results(results)

results$PREDICTION <- ifelse(results$PREDICTION_0 > 1.39, 1, 0)
get_results(results)

results$PREDICTION <- ifelse(results$PREDICTION_0 > 1.38, 1, 0)
get_results(results)

results$PREDICTION <- ifelse(results$PREDICTION_0 > 1.37, 1, 0)
get_results(results)

results$PREDICTION <- ifelse(results$PREDICTION_0 > 1.36, 1, 0)
get_results(results)

results$PREDICTION <- ifelse(results$PREDICTION_0 > 1.35, 1, 0)
get_results(results)
Precision: 0.93 	Recall: 0.904 	Accuracy: 0.916 	F1: 0.917 

  
Precision: 0.923 	Recall: 0.912 	Accuracy: 0.916 	F1: 0.918 

  
Precision: 0.92 	Recall: 0.916 	Accuracy: 0.916 	F1: 0.918 

  
Precision: 0.913 	Recall: 0.916 	Accuracy: 0.912 	F1: 0.915 

  
Precision: 0.899 	Recall: 0.92 	Accuracy: 0.906 	F1: 0.909 

  
Precision: 0.893 	Recall: 0.928 	Accuracy: 0.906 	F1: 0.91 

  

Conclusion¶

All of the models tested had roughly comparable performance in terms of accuracy and F1 scores: every model had an accuracy between 90.0% and 94.3%, and all but the SVM with a polynomial kernel also had an F1 score in that range. However, there was considerable variation in the models' precision and recall. Precision ranged from 92.0% to 98.6%, and recall varied even more, from 81.7% for the SVM with a polynomial kernel to 93.6% for the Bagging model. For my purposes (a hypothetical spam filter), I would want a model that captures the widest variety of spam (and therefore exhibits high recall) but that does not overly misclassify non-spam (so I would want precision of 90% or greater). Luckily, the model with the highest recall, the Bagging model, fits the bill: while it does not have the highest accuracy or F1 score, it has the highest recall at 93.6% and still has solid precision at 93.3%.

Model                   Precision   Recall   Accuracy   F1
Logistic Regression     0.962       0.908    0.934      0.934
SVM, linear             0.970       0.908    0.939      0.938
SVM, polynomial         0.986       0.817    0.900      0.893
SVM, radial             0.970       0.916    0.943      0.943
Classification Tree     0.969       0.876    0.922      0.921
Bagging                 0.933       0.936    0.932      0.934
Random Forest           0.935       0.924    0.928      0.930
Boosting, 1.38 cutoff   0.920       0.916    0.916      0.918