In my project for Data Science 350b I demonstrated that a predictive model built using logistic regression with elastic net (L1 + L2) regularization and based on contextually informed features classified spam in YouTube comments better than a model based on an equivalent number of frequent non-stopword tokens. As an extension of that project, I subsequently applied several machine learning algorithms to the same contextually informed feature set, using the same training and testing data, to identify the best performance achievable with the features I had developed. The algorithms compared were support vector machines with linear, polynomial, and radial kernels, a classification tree (with and without pruning), random forest, bagging, and boosting. Ultimately, the original regression model performed on par with the machine learning models, and the best-performing models, depending on whether precision or recall was the priority, were those developed with the SVM with a radial kernel and with bagging.
For this project I used Tiago A. Almeida and Tulio C. Alberto's “YouTube Spam Collection” data set, available both from the UC Irvine Machine Learning Repository and from Almeida and Alberto's website. The set contains 1,956 public comments posted to five popular music videos from the same general time period: “Gangnam Style” by Psy, “Roar” by Katy Perry, “Party Rock Anthem” by LMFAO, “Love The Way You Lie” by Eminem, and “Waka Waka (This Time for Africa)” by Shakira. In addition to the comments themselves, the set contains the comment IDs, the comments' authors, the dates and times of posting, and the categorization of each comment as spam or non-spam (‘ham’) as made by Almeida and Alberto.
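For reference, the five per-video CSV files can be read and combined into a single data frame along the following lines. This is only a minimal sketch: the file names assume the naming used in the UCI distribution, and paths may differ in practice.
# Minimal sketch of assembling the data set; file names are assumptions based
# on the UCI distribution of the YouTube Spam Collection.
csv.files <- c("Youtube01-Psy.csv", "Youtube02-KatyPerry.csv", "Youtube03-LMFAO.csv",
               "Youtube04-Eminem.csv", "Youtube05-Shakira.csv")
comments <- do.call(rbind, lapply(csv.files, read.csv, stringsAsFactors = FALSE))
nrow(comments)  # should report 1,956 comments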
The set of 1,956 observations was split into training and testing sets by randomly assigning 75% of the observations to training (1,468 observations) and the remaining 25% to testing (488 observations).
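A hedged sketch of how such a split might be produced is shown below; the seed and rounding are arbitrary, so the exact counts may differ by a row from the 1,468/488 split reported here, and comments is assumed to be the data frame holding the labeled observations and engineered features.
# Sketch of a 75/25 random split (assumed approach; seed is arbitrary).
set.seed(350)
train.idx <- sample(nrow(comments), size = round(0.75 * nrow(comments)))
training <- comments[train.idx, ]
testing  <- comments[-train.idx, ]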
nrow(training)
nrow(testing)
The features I selected for modeling were based on a number of patterns in the use of individual words, as well as classes of words, that I had observed in my exploration of the comment data. For example, in comparing the words that occurred more frequently in one class of comments than the other, I found that profanity ("damn", "fuck", etc.) occurred much more frequently in non-spam, while polite words ("thanks", "thank you", and "please") occurred much more frequently in spam. Additionally, I had hypothesized that there would be salient differences in the typographic style of spam and non-spam comments, measured by the percentage of the comment made up of alphabetic characters, numeric characters, or non-alphanumeric characters, as well as by comment length in characters and by word counts at various stages of text processing. For a more complete explanation of the features I selected, my rationale for selecting them, and the regular expressions I used to compute the features from the comment text, see the "Feature Building" section of my original project report.
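To give a flavor of how such features can be derived from the comment text, below is an illustrative sketch of a few of them. The feature names follow those used later in this report, but the regular expressions here are approximations of the ones documented in the original report; PCT_ALPHA is a hypothetical stand-in for the typographic percentage features, and the raw comment text is assumed to be in a column named CONTENT.
# Illustrative feature sketches (approximate regexes; see the original report
# for the exact patterns used).
comments$LENGTH    <- nchar(comments$CONTENT)   # comment length in characters
comments$HAS_LINK  <- as.integer(grepl("https?://|www\\.", comments$CONTENT, ignore.case = TRUE))
comments$POLITE    <- as.integer(grepl("thanks|thank you|please", comments$CONTENT, ignore.case = TRUE))
comments$PCT_ALPHA <- nchar(gsub("[^A-Za-z]", "", comments$CONTENT)) / pmax(nchar(comments$CONTENT), 1)  # hypothetical: share of alphabetic characters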
The original model developed was a logistic regression model with elastic net regularization using the glmnet package. The model with 49 degrees of freedom (DoF) produces the preferred results: though models with fewer degrees of freedom have slightly higher precision (97.0-97.4% vs. 96.2%), the model with 49 DoF has the highest recall of any of the models tested (90.8% vs. 90.4% or less). This is a worthwhile trade-off, given that precision is already very high and recall is the comparatively weaker metric.
require(glmnet)
mod.mx <- model.matrix(IS_SPAM ~ ., data = training)
mod.mx <- mod.mx[,2:ncol(mod.mx)]
test.mod.mx <- model.matrix(IS_SPAM ~ ., data = testing)
test.mod.mx <- test.mod.mx[,2:ncol(test.mod.mx)]
mod.elnet = glmnet(mod.mx, training$IS_SPAM, family = 'binomial', nlambda = 20,
                   alpha = 0.5)
getMetrics <- function(dataFrame, modOut, modIn, l, src = "Enriched"){
  dataFrame[,"PREDICTION"] <- predict(modOut, newx = modIn, type="class")[,l]
  TP <- nrow(dataFrame[dataFrame[,"IS_SPAM"] == "1" & dataFrame[,"PREDICTION"] == "1",])
  FP <- nrow(dataFrame[dataFrame[,"IS_SPAM"] == "0" & dataFrame[,"PREDICTION"] == "1",])
  TN <- nrow(dataFrame[dataFrame[,"IS_SPAM"] == "0" & dataFrame[,"PREDICTION"] == "0",])
  FN <- nrow(dataFrame[dataFrame[,"IS_SPAM"] == "1" & dataFrame[,"PREDICTION"] == "0",])
  TPR <- TP/(TP+FN)
  TNR <- TN/(TN+FP)
  FNR <- 1-TPR
  FPR <- 1-TNR
  PPV <- TP/(TP+FP)
  acc <- (TP+TN)/(TP+TN+FP+FN)
  f1 <- (2*TP)/((2*TP)+FP+FN)
  cat("Performance for Model from", src, "Features, lambda =", l,
      ", Degrees of Freedom =", mod.elnet$df[l],
      "\nPrecision:", round(PPV,3), "\tRecall:", round(TPR,3), "\tAccuracy:",
      round(acc,3), "\tF1:", round(f1,3), "\n\n", collapse = " ")
}
getMetrics(testing, mod.elnet, test.mod.mx, l = 5)
getMetrics(testing, mod.elnet, test.mod.mx, l = 10)
getMetrics(testing, mod.elnet, test.mod.mx, l = 13)
getMetrics(testing, mod.elnet, test.mod.mx, l = 15) # use this
getMetrics(testing, mod.elnet, test.mod.mx, l = 18)
getMetrics(testing, mod.elnet, test.mod.mx, l = 20)
Using the e1071 package and its tune function, which performs 10-fold cross-validation for model tuning, I then fit support vector machines with a linear kernel to the same training data to develop and compare models at seven different values of the cost parameter. Of the cost values tested (.001, .01, .1, 1, 5, 10, 100), a value of 5 resulted in the smallest cross-validation error.
require(e1071)
tune.out <- tune(svm, IS_SPAM ~., data = training, kernel = "linear",
                 ranges = list(cost=c(.001, .01, .1, 1, 5, 10, 100)))
summary(tune.out)
The performance of the SVM with a linear kernel is practically the same as that of the logistic regression with regularization: it has only slightly higher precision (97.0% vs. 96.2%) and accuracy (93.9% vs. 93.4%), and precisely the same recall (90.8%).
best.mod <- tune.out$best.model
require(dplyr)
results <- testing %>% dplyr::select(IS_SPAM)
get_results <- function(dataFrame){
  TP <- nrow(dataFrame[dataFrame[,"IS_SPAM"] == "1" & dataFrame[,"PREDICTION"] == "1",])
  FP <- nrow(dataFrame[dataFrame[,"IS_SPAM"] == "0" & dataFrame[,"PREDICTION"] == "1",])
  TN <- nrow(dataFrame[dataFrame[,"IS_SPAM"] == "0" & dataFrame[,"PREDICTION"] == "0",])
  FN <- nrow(dataFrame[dataFrame[,"IS_SPAM"] == "1" & dataFrame[,"PREDICTION"] == "0",])
  TPR <- TP/(TP+FN)
  TNR <- TN/(TN+FP)
  FNR <- 1-TPR
  FPR <- 1-TNR
  PPV <- TP/(TP+FP)
  acc <- (TP+TN)/(TP+TN+FP+FN)
  f1 <- (2*TP)/((2*TP)+FP+FN)
  cat("\nPrecision:", round(PPV,3), "\tRecall:", round(TPR,3),
      "\tAccuracy:", round(acc,3), "\tF1:", round(f1,3), "\n\n", collapse = " ")
}
results$PREDICTION <- predict(best.mod, testing[,2:length(testing)])
get_results(results)
Applying an SVM with a polynomial kernel to the data results in a model with poorer overall performance than either the logistic regression or the SVM with a linear kernel. The polynomial-kernel SVM does classify the spam in the test data with much greater precision than the previous models (98.6% vs. 96.2-97.0%), but it under-classifies the spam, as indicated by its much lower recall (81.7% vs. 90.8%).
svm.poly <- tune(svm, IS_SPAM ~., data = training, kernel = "polynomial",
                 ranges = list(cost=c(.001, .01, .1, 1, 5, 10, 100)))
#summary(svm.poly)
best.mod <- svm.poly$best.model
results$PREDICTION <- predict(best.mod, testing[,2:length(testing)])
get_results(results)
The model developed with an SVM with a radial kernel, on the other hand, performs better than those from the logistic regression and the SVM with a linear kernel. Not only does the radial-kernel SVM match the high precision of those previous models (97.0%), it also provides a slight boost in recall (91.6% vs. 90.8%). Given that recall generally lags behind precision for these models, this model offers a definite, if slight, advantage.
svm.radial <- tune(svm, IS_SPAM ~., data = training, kernel = "radial",
                   ranges = list(cost=c(.001, .01, .1, 1, 5, 10)))
#summary(svm.radial)
best.mod <- svm.radial$best.model
results$PREDICTION <- predict(best.mod, testing[,2:length(testing)])
get_results(results)
In addition to support vector machines, I applied several decision tree-based algorithms for model development: classification tree, bagging, random forest, and boosting.
Applied to the training data, the tree algorithm from the tree package ultimately constructed a model using ten of the fifty features:
require(tree)
treeSpam = tree(IS_SPAM ~., data = training)
summary(treeSpam)
library(repr)
options(repr.plot.width = 8.5, repr.plot.height = 4)
plot(treeSpam, type="uniform")
text(treeSpam, pretty=0)
When the decision tree constructed above classifies a comment as spam, the comment is very likely to truly be spam, as indicated by the model's precision below (96.9%, on par with 97.0% for the SVMs with linear and radial kernels). However, there are clearly still spam comments following patterns that the decision tree model does not account for, given its lower recall relative to the other models (87.6% vs. 91.6% for the SVM with a radial kernel).
results$PREDICTION = predict(treeSpam, testing[,2:length(testing)], type="class")
get_results(results)
Using the randomForest package, bagging (training on multiple bootstrapped training sets with every feature available to each tree) was applied by calling the randomForest function and specifying that all 50 of the feature variables be included in each tree drawn for training. The model resulting from training 1,000 trees clearly benefited from the bootstrapped data and shows less bias than the simple decision tree model: precision and recall are roughly even here (93.3% and 93.6%, respectively), whereas precision was very high and recall much lower for the simple decision tree model (96.9% and 87.6%, respectively). Of particular note is this model's recall on the test data, 93.6%: it is higher than the model's precision, and the highest recall of any of the models tested.
require(randomForest)
bagSpam = randomForest(IS_SPAM ~., data = training, mtry=50, ntree = 1000)
bagSpam
results$PREDICTION = predict(bagSpam, testing[,2:length(testing)], type="class")
get_results(results)
For a random forest approach, bootstrapped resampling is once again applied, but each tree is constructed using only a subset of the data's features (here, seven features per tree). After training 1,000 trees, the resulting model has good, generally balanced performance (precision and recall are both in the low 90s: 93.5% and 92.4%, respectively).
An interesting observation can also be made about this model's out-of-bag (OOB) error rate, the model's error rate when validated on the training-internal held-out samples. Despite the OOB error being slightly lower than that of the Bagging model (4.5% vs. 4.7%), the model has slightly lower accuracy when validated on the testing data (92.8% vs. 93.2% for Bagging). This, as well as the Random Forest model's higher precision and lower recall compared to the Bagging model (precision 93.5% and recall 92.4%, vs. 93.3% and 93.6% for Bagging), indicates that the Random Forest model is subtly more biased than the Bagging model.
ranForest = randomForest(IS_SPAM ~., data = training, mtry = round(sqrt(50), 0), ntree = 1000, importance = T)
results$PREDICTION = predict(ranForest, testing[,2:length(testing)], type="class")
ranForest
get_results(results)
The random forest method, however, has the advantage of providing variable importance measurements for the data's features. Dot plots with ranked variable importance can be created with the function varImpPlot. The left plot, below, lists the 30 features that produce the greatest drops in model accuracy when their values are permuted in the out-of-bag (training-internal validation) samples. All ten of the features that had been selected by the simple decision tree algorithm ("HAS_LINK", "POLITE", "PAN_ACTION", "LENGTH", "MAX_WLEN", "SUBSCRIBE", "CHANNEL", "LIKE", "TITLE", and "VIEWS") appear in this top-30 list. However, they do not constitute the top 10 features of the ranking, and they are accompanied by other features, such as "PAN_YOUTUBE" (a binary feature indicating the presence of "subscription(s)", "subscrib-", "chan(n)el(s)", or "youtu-") and "ALPHANUM_WC" (the regularized numeric feature indicating the number of space-separated alphanumeric character strings).
The second plot (right) ranks the features by a different measure of importance: the mean decrease in the Gini index (i.e. the decrease in node impurity) from splits on the feature in question. From this ranking we can see that several features in particular are strong predictors for separating the training data into classes: "PAN_YOUTUBE", "PAN_ACTION", "MAX_WLEN", "LENGTH", "HAS_LINK", and "CHECK" (the binary feature indicating the presence of "check" with no morphological variation).
#importance(ranForest)
options(repr.plot.width = 7, repr.plot.height = 6.25)
varImpPlot(ranForest)
The last ML strategy I applied to the data was growing decision trees with boosting (i.e. additive growth of the trees) using the generalized boosted models package gbm. I trained the model by fitting 2,500 trees, using a minimum of 20 observations in terminal nodes and a shrinkage parameter of .01.
Of note, I first tried applying the generalized boosting modelling (gbm) function with the Bernoulli distribution, which would have been appropriate given the dichotomous outcome variable. However, I had difficulty getting predicted values when using the Bernoulli distribution option (a problem apparently experienced by others as well). Instead, I simply used the Gaussian distribution when training the model and subsequently selected an appropriate cut-off value to convert the numeric output into 0 (non-spam) or 1 (spam).
require(gbm)
#boostSpam <- gbm(IS_SPAM ~., distribution = "bernoulli",
#                 data = training,
#                 n.trees = 2500,
#                 shrinkage = .01)
#boostSpam # distribution = "bernoulli" not working
boostSpam <- gbm(formula = IS_SPAM ~ .,
                 distribution = "gaussian",
                 data = training,
                 n.trees = 2500,
                 shrinkage = .01,
                 n.minobsinnode = 20)
boostSpam
To find the most promising cut-off values I then plotted a Receiver Operating Characteristic (ROC) curve and a precision/recall curve using the ROCR package. Since I ideally wanted a model with precision and recall both over 90%, I looked for a cutoff value that fell between 0.9 and 1.0 on the average true positive rate axis of the ROC curve and between 0.9 and 1.0 on the precision/recall curve. A cutoff at or just below 1.4 appears to provide the desired precision and recall.
library(ROCR)
results$PREDICTION <- predict(boostSpam, testing[,2:length(testing)], n.trees=2500)
options(repr.plot.width = 8, repr.plot.height = 4.5)
par(mfrow=c(1,2))
pred <- prediction(results$PREDICTION, results$IS_SPAM)
perf <- performance(pred, "tpr", "fpr") #ROC
plot(perf, colorize=T, print.cutoffs.at=seq(1,2,by=0.1), text.adj=c(-.25, .9), avg="threshold", lwd=3)
perf <- performance(pred, "prec", "rec") #precision/recall curve
plot(perf, colorize=T, print.cutoffs.at=seq(1,2,by=0.1), text.adj=c(1.2,1.2), avg="threshold", lwd=3)
To make the final decision on a cutoff value I then tested cutoffs from 1.40 down to 1.35 in decrements of .01. Ultimately I chose the cutoff of 1.38, as it offered the highest accuracy and F1 of the cutoffs tested and had the highest recall with precision still above 90%.
results$PREDICTION_0 <- results$PREDICTION
results$PREDICTION <- ifelse(results$PREDICTION_0 > 1.4, 1, 0)
get_results(results)
results$PREDICTION <- ifelse(results$PREDICTION_0 > 1.39, 1, 0)
get_results(results)
results$PREDICTION <- ifelse(results$PREDICTION_0 > 1.38, 1, 0)
get_results(results)
results$PREDICTION <- ifelse(results$PREDICTION_0 > 1.37, 1, 0)
get_results(results)
results$PREDICTION <- ifelse(results$PREDICTION_0 > 1.36, 1, 0)
get_results(results)
results$PREDICTION <- ifelse(results$PREDICTION_0 > 1.35, 1, 0)
get_results(results)
All of the models tested had roughly comparable performance in terms of accuracy and F1 scores: every model had an accuracy between 90.0% and 94.5%, and all but the SVM with a polynomial kernel also had an F1 score in that range. There was much greater variation in precision and recall, however. Precision ranged from 92.0% to 98.6%, and recall varied even more, ranging from 81.7% for the SVM with a polynomial kernel to 93.6% for the Bagging model. For my purposes (a hypothetical spam filter), I would want a model that captures the widest variety of spam (and therefore exhibits high recall) but that does not overly misclassify non-spam (so I would want precision of 90% or greater). Conveniently, the model with the highest recall, the one from bagging, fits the bill: while it does not have the highest accuracy or F1 score, it has the highest recall at 93.6% and still has solid precision at 93.3%.
Model | Precision | Recall | Accuracy | F1 |
---|---|---|---|---|
Logistic Regression | 0.962 | 0.908 | 0.934 | 0.934 |
SVM, linear | 0.970 | 0.908 | 0.939 | 0.938 |
SVM, polynomial | 0.986 | 0.817 | 0.900 | 0.893 |
SVM, radial | 0.970 | 0.916 | 0.943 | 0.943 |
Classification Tree | 0.969 | 0.876 | 0.922 | 0.921 |
Bagging | 0.933 | 0.936 | 0.932 | 0.934 |
Random Forest | 0.935 | 0.924 | 0.928 | 0.930 |
Boosting, 1.38 cutoff | 0.920 | 0.916 | 0.916 | 0.918 |