Projects
- 12/2017 - Predicting Taxi Trip Distance: Zoning, Census, and Geographic Data Boosts Model Performance
- My project for Data Science 450. Using 2013 NYC taxi trip and fare data as well as American Community Survey data and several map overlays, I created and optimized a model for predicting taxi trip distance given information a driver would have at the outset of a trip.
- 10/2017 - Optimizing YouTube Spam Detection with Machine Learning
- As an extension of my project for Data Science 350, Methods for Data Analysis, I used the same set of contextually-informed features to build predictive models using several SVM and decision tree-based methods, and compared their results to the original logistic regression model with elastic net regularization.
- 9/2017 - Package Comparisons for Text Processing in R
- Looking for a go-to package for basic text processing functions in R, I compared the execution times for R's base package functions to those of regexPipes, stringi, and stringr, with and without piping, and with and without parallel processing.
- 8/2017 - Contextual Features vs. High-Frequency Tokens: Spam Detection in YouTube Comments
- My project for Data Science 350, Methods for Data Analysis. Using a publicly available data set of YouTube comments, I used statistical analysis to examine the data, and then built two competing models for predicting a comment's status as spam or non-spam using logistic regression with elastic net regularization. The model with context-informed features, or those based on the comments' typographic and broad lexical patterns as well as aspects of the videos to which the comments were posted, was shown to perform better than a model based on high-frequency non-stopword tokens alone.
- For more information on the linguistic science that inspired my choice of contextually-informed features, check out Steven Pinker's TED talk on language habits, Susan Herring's article on Grammar and Electronic Communication, or even the Wikipedia entry for Internet Linguistics.