# Predicting Social Media Engagement

At the link below you will find an RPubs presentation of a project that predicts clicks on a Facebook post. After comparing multiple machine learning algorithms, we find that eXtreme Gradient Boosting performs best, predicting the number of clicks with an RMSE of ~0.54.

https://rpubs.com/njrod8/stat510

The code which generated the presentation is as follows:

```Predicting Social Media Engagement
========================================================
author: Nathan Rodriguez
date: April 23, 2018
autosize: true

<style>
.small-code pre code {
font-size: 1em;
}
</style>

Study/Objective
========================================================

* Data from a cosmetic company's Facebook campaigns
+ 10 independent variables
+ 1,632 campaigns
* Goal: achieve a low RMSE when predicting Facebook post clicks

Methods/Tools
========================================================
class: small-code
* Methods
+ Boruta - Feature Selection
+ Linear regression
+ Random Forest
+ RPART
+ XG Boost
* Tools
+ R
```{r}
library(tidyverse)
library(Boruta)
library(caret)
library(DescTools)
library(ggplot2)
```
```{r echo=FALSE}
# function to combine multiple plots
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
library(grid)

# Make a list from the ... arguments and plotlist
plots <- c(list(...), plotlist)

numPlots = length(plots)

# If layout is NULL, then use 'cols' to determine layout
if (is.null(layout)) {
# Make the panel
# ncol: Number of columns of plots
# nrow: Number of rows needed, calculated from # of cols
layout<- matrix(seq(1, cols * ceiling(numPlots/cols)),
ncol = cols, nrow = ceiling(numPlots/cols))
}

if (numPlots==1) {
print(plots[[1]])

} else {
# Set up the page
grid.newpage()
pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

# Make each plot, in the correct location
for (i in 1:numPlots) {
# Get the i,j matrix positions of the regions that contain this subplot
matchidx<- as.data.frame(which(layout == i, arr.ind = TRUE))

print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
layout.pos.col = matchidx$col))
}
}
}

```

Data
========================================================
class: small-code
```{r}
```
```{r echo=FALSE}
data %>% mutate_if(is.character, as.factor) -> data
levels(data$Image_URL)[levels(data$Image_URL)=="NULL"] <- "None"

```

Exploratory Data Analysis
========================================================
class: small-code
```{r fig1, fig.width = 18, fig.align = "center", echo=FALSE}
g<- ggplot(data, aes(Clicks)) + geom_density() +theme_minimal()
g1<-ggplot(data, aes(Ad_Type)) + geom_bar(stat = "count") + theme_minimal()
g2<-ggplot(data, aes(Keyword_Category)) + geom_bar(stat = "count") + theme_minimal()
multiplot(g, g1, g2,  cols=3)
```

Exploratory Data Analysis
========================================================
class: small-code
```{r fig2, fig.width = 18, fig.height=6, fig.align = "center", echo=FALSE}
g3<-ggplot(data, aes(Image_URL)) + geom_bar(stat = "count") + theme_minimal() +
theme(axis.text.x=element_text(angle=70,hjust=1))
g4<-ggplot(data, aes(Body)) + geom_bar(stat = "count") + theme_minimal() +
theme(axis.text.x=element_text(angle=70,hjust=1))
g5<-ggplot(data, aes(Keywords)) + geom_bar(stat = "count") + theme_minimal() +
theme(axis.text.x=element_text(angle=70,hjust=1))
g6<-ggplot(data, aes(Link_URL)) + geom_bar(stat = "count") + theme_minimal() +
theme(axis.text.x=element_text(angle=70,hjust=1))
multiplot(g3, g4, g5, g6, cols=4)
```

Exploratory Data Analysis
========================================================
class: small-code
```{r fig3, fig.width = 18, fig.align = "center", echo=FALSE}
g9<-ggplot(data, aes(y=Clicks, x=Age_Min)) + geom_bar(stat="identity")
g10<-ggplot(data, aes(y=Spent, x=Age_Min)) + geom_bar(stat="identity")
multiplot(g9, g10, cols = 2)
```

Feature Selection
========================================================
* Boruta: A wrapper built around the random forest algorithm. It tries to capture all the important, interesting features you might have in your dataset with respect to an outcome variable.

Boruta
========================================================
+ Create duplicate copies of all independent variables.
+ Shuffle the values of duplicated copies. Shadow variables.
+ Append shadow variables to original data.
+ Run random forest regression on the combined dataset. Performs a variable importance measure.
+ Compute each variable's Z score; the maximum Z score among the shadow attributes is the MZSA.
+ Tag the variables as 'unimportant' when significantly lower than MZSA.
+ Remove them from the process.
+ Tag variable as 'important' when significantly higher than MZSA.
+ Repeat for predefined number of iterations or until all attributes are either tagged 'unimportant' or 'important', whichever comes first.
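
The shadow-variable steps above can be sketched in a few lines of base R. The toy data frame `X` below is made up purely for illustration; the Boruta package carries out the full iterative procedure with a real random forest.
```{r}
# Toy sketch of Boruta's shadow features
set.seed(1)
X <- data.frame(a = rnorm(10), b = runif(10))

# Shuffle each duplicated column to destroy any relation to the outcome
shadow <- as.data.frame(lapply(X, sample))
names(shadow) <- paste0("shadow_", names(X))

# Append the shadow variables; a random forest is then run on this widened
# set, and a real variable is kept only if its importance beats the MZSA
combined <- cbind(X, shadow)
ncol(combined)  # 4: twice the original number of columns
```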

Feature Selection
========================================================
class: small-code
```{r}
set.seed(111)
boruta_train <- Boruta(Clicks~., data = data, doTrace = 2)
```
```{r fig4, fig.width = 14, fig.align = "center", echo=FALSE}
data <- data %>%
select(Ad_Type, Body, Image_URL, Keywords, Age_Min, Spent, Clicks)
plot(boruta_train, cex.axis=.7, las=2, xlab="", main="Variable Importance")
```

Split Data
========================================================
class: small-code

* Usually we are interested in how well the machine learning algorithm performs on data that it has not seen before, since this determines how well it will work when deployed in the real world. We therefore evaluate these performance measures using a test set of data that is separate from the data used for training the machine learning system.
```{r}
trainIndex <- createDataPartition(data$Clicks, p=0.7, list=FALSE, times=1)
train<- data[trainIndex,]
test<- data[-trainIndex,]
```

Control Parameters
========================================================
class: small-code
```{r}
train_control<- trainControl(method="repeatedcv", number=5, repeats=3, savePredictions = TRUE)
```
* 5-fold cross-validation
* repeated 3 times

Linear Model
========================================================
class: small-code
* Objective: find parameter estimates which minimize the sum of squared errors (SSE).
+ relies on linearity assumptions.
```{r}
lm_model<- train(Clicks ~ Spent+Image_URL+Keywords+Age_Min,
data=train, trControl=train_control, method="lm")
lm_model$results
RMSE(predict(lm_model,test), test$Clicks)
```
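
As a quick check of that objective (with made-up data, not the campaign set), the SSE-minimizing estimates are the closed-form normal-equation solution, which agrees with what `lm()` returns:
```{r}
# beta_hat = (X'X)^{-1} X'y minimizes the sum of squared errors
set.seed(2)
x <- rnorm(20)
y <- 3 + 2 * x + rnorm(20, sd = 0.1)
X <- cbind(1, x)                      # design matrix with an intercept column
beta_hat <- solve(t(X) %*% X, t(X) %*% y)
all.equal(as.numeric(beta_hat), as.numeric(coef(lm(y ~ x))))  # TRUE
```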

Recursive PARTitioning
========================================================
class: small-code
* The rpart program builds regression models of a very general structure
using a two-stage procedure:
+ first, the single variable is found which best splits the data into two groups, and the data is separated.
+ then this process is applied separately to each sub-group, and so on recursively, until the
subgroups either reach a minimum size (5 for this data) or no further improvement can be made.
```{r}
# the source chunk began mid-call; the full formula is not shown, Clicks ~ . assumed
rpart_model <- train(Clicks ~ .,
data = data, method = 'rpart', trControl = train_control)
rpart_model$results
```
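
The first stage above — finding the single best split — can be sketched in base R on toy vectors (the campaign data is not used here):
```{r}
# Stage one of rpart: choose the cut point that most reduces the
# sum of squared errors around each group's mean
set.seed(3)
x <- runif(30)
y <- ifelse(x > 0.5, 10, 0) + rnorm(30)

sse <- function(v) sum((v - mean(v))^2)
cuts <- sort(unique(x))[-1]                  # candidate cut points
split_sse <- sapply(cuts, function(cut) sse(y[x < cut]) + sse(y[x >= cut]))
best_cut <- cuts[which.min(split_sse)]
best_cut  # lands near the true change point at 0.5
```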

Recursive PARTitioning
========================================================
class: small-code
```{r}
RMSE(predict(rpart_model,test), test$Clicks)
```
```{r, fig7, fig.align="center", echo=FALSE}
library(rpart.plot)
rpart.plot(rpart_model$finalModel)
```

Random Forest
========================================================
class: small-code
*  An ensemble learning method that constructs a multitude of decision trees at training time and outputs the mean prediction of the individual trees.

```{r}
rf_model<- train(Clicks ~ Ad_Type + Spent + Body + Image_URL + Keywords + Age_Min,
data = data, method="rf", trControl=train_control)
rf_model$results
```
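
The averaging idea can be sketched with bagging in miniature: fit a one-split "stump" on each bootstrap sample and average the stumps' predictions. This is toy base-R code, not the project's model; a real random forest grows full trees and also subsamples features at each split.
```{r}
set.seed(4)
x <- runif(50)
y <- ifelse(x > 0.5, 10, 0) + rnorm(50)

# One-split regression stump: pick the cut minimizing within-group SSE
fit_stump <- function(x, y) {
  cuts <- sort(unique(x))[-1]
  sse <- sapply(cuts, function(cut)
    sum((y[x < cut] - mean(y[x < cut]))^2) +
    sum((y[x >= cut] - mean(y[x >= cut]))^2))
  cut <- cuts[which.min(sse)]
  list(cut = cut, left = mean(y[x < cut]), right = mean(y[x >= cut]))
}
predict_stump <- function(s, x) ifelse(x < s$cut, s$left, s$right)

# Fit one stump per bootstrap sample, then average their predictions
preds <- replicate(25, {
  i <- sample(length(x), replace = TRUE)
  predict_stump(fit_stump(x[i], y[i]), x)
})
ensemble <- rowMeans(preds)   # the forest's output: mean of the trees
```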

Random Forest
========================================================
class: small-code
```{r fig5, fig.align="center"}
RMSE(predict(rf_model,test), test$Clicks)
```
```{r echo=FALSE}
plot(rf_model)
```

eXtreme Gradient Boosting
========================================================
* Produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.
+ xgboost : shallow trees, high bias, low variance.
+ Random forest: fully grown decision trees, low bias, high variance
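
That contrast can be made concrete with boosting in miniature: each round fits a shallow learner (here the simplest possible one, a one-split stump) to the current residuals and adds a damped copy of its prediction. This is a toy base-R sketch on made-up data; real xgboost adds regularization, deeper trees, and much more.
```{r}
set.seed(5)
x <- runif(50)
y <- sin(2 * pi * x) + rnorm(50, sd = 0.2)

pred <- rep(mean(y), length(y))     # start from the constant model
eta <- 0.3                          # learning rate (shrinkage)
for (round in 1:50) {
  r <- y - pred                     # residuals so far
  cuts <- sort(unique(x))[-1]
  sse <- sapply(cuts, function(cut)
    sum((r[x < cut] - mean(r[x < cut]))^2) +
    sum((r[x >= cut] - mean(r[x >= cut]))^2))
  cut <- cuts[which.min(sse)]
  step <- ifelse(x < cut, mean(r[x < cut]), mean(r[x >= cut]))
  pred <- pred + eta * step         # damped additive update
}
sqrt(mean((y - pred)^2))  # training RMSE ends up well below sd(y)
```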

eXtreme Gradient Boosting
========================================================
class: small-code
```{r}
xgbLinear_model <- train(Clicks ~ Ad_Type*Spent + Body + Image_URL + Keywords + Age_Min,
data=data, method = 'xgbLinear', trControl = train_control)
RMSE(predict(xgbLinear_model,test), test\$Clicks)

```

Summary
=======================================================

* Objective: achieve a low RMSE when predicting clicks for a Facebook campaign.
* Best performing algorithm: eXtreme Gradient Boosting, RMSE = 0.5386003
+ Model: Clicks ~ Ad_Type * Spent + Body + Image_URL + Keywords + Age_Min

Thank You
=======================================================

![Alt Text](https://uproxx.files.wordpress.com/2011/07/chang-bestexam.gif)

References
=======================================================
* Applied Predictive Modeling, Max Kuhn
* Deep Learning, Ian Goodfellow
```