This post goes through a binary classification problem with Python's machine learning library scikit-learn.

## Aim

Create a model that predicts who is going to leave the organisation next. Commonly known as churn modelling.

To follow along, I breakdown each piece of the coding journey in this post. Alternatively, you can find a complete copy of the code on github. You'll need the following packages loaded:

```
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from random import sample
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.cross_validation import cross_val_score
from sklearn import metrics
from IPython.display import Image
from pydotplus import graph_from_dot_data
```

The dataset for this exercise was found on kaggle. Once unzipped, I read in the data:

```
# Set working directory
path = os.path.expanduser('~/Projects/kaggle_HR/')
os.chdir(path)
# Read in the data
df = pd.read_csv('HR_comma_sep.csv')
```

It contains data of 14,999 employees who are either in the organisation or have left, and 10 columns.

```
>>> df.shape
(14999, 10)
>>> df.columns.tolist()
['satisfaction_level',
'last_evaluation',
'number_project',
'average_montly_hours',
'time_spend_company',
'Work_accident',
'left',
'promotion_last_5years',
'sales',
'salary']
```

We need to get some sense of how balanced our dataset is...

```
# Some basic stats on the target variable
print '# people left = {}'.format(len(df[df['left'] == 1]))
# people left = 3571
print '# people stayed = {}'.format(len(df[df['left'] == 0]))
# people stayed = 11428
print '% people left = {}%'.format(round(float(len(df[df['left'] == 1])) / len(df) * 100), 3)
# % people left = 24.0%
```

Knowing that 76% of the workforce will stay helps us not be excited by machine learning diagnostics that aren't much better than 76%. If I predicted that **all** staff would remain in the organisation, I would be 76% accurate!

I next check for missing values.

```
>>> df.apply(lambda x: sum(x.isnull()), axis=0)
satisfaction_level 0
last_evaluation 0
number_project 0
average_montly_hours 0
time_spend_company 0
Work_accident 0
left 0
promotion_last_5years 0
sales 0
salary 0
```

No missing values, that's great. Next I look for correlations in the data.

```
# Correlation heatmap
correlation = df.corr()
plt.figure(figsize=(10, 10))
sns.heatmap(correlation, vmax=1, square=True, annot=True, cmap='cubehelix')
```

Unsurprisingly, `satisfaction_level`

has the largest correlation with the decision to stay or leave. Next up, pairwise plots provide a lot information for one diagram.

```
# take a 5% sample as this is computationally expensive
df_sample = df.sample(frac=0.05)
# Pairwise plots
pplot = sns.pairplot(df_sample, hue="left")
```

On the diagonal we can see the distribution of each variable, including the target variable of who has left (this visualises the balance). There are some interesting groupings of those who leave (Green dots) when looking at the plots of `satisfaction_level`

by `last_evaluation`

and by `average_montly_hours`

- they look to be the same sub-populations. E.g. Low satisfaction but high evaluation and monthly hours, there's also a grouping of a high number of projects completed with low satisfaction levels by those who leave. Perhaps these are *high performers* that easily leave when they are not satisfied by their job. There are other interesting groupings so it's worth spending some looking at this output examining these relationships. It's great to have these groupings, as our classifier should be able to differentiate between those who leave and those that remain. If there were no relationships revealed in this plot, we would face a very difficult problem and perhaps have to go back to examine what data we could collect to help predict the outcome.

Two variables were left out of the above plots, `Salary`

and `Sales`

. I need to deal with these separately prior to modelling, as our machine learning classifier will expect numbers to crunch rather than text.

```
>>> df.salary.unique()
array(['low', 'medium', 'high'], dtype=object)
>>> df.sales.unique()
array(['sales', 'accounting', 'hr', 'technical', 'support', 'management', 'IT', 'product_mng', 'marketing', 'RandD'], dtype=object)
```

`Salary`

has 3 values *low*, *medium* and *high*. These factor variables have an order. I'm going to go use dummy encoding as a simple and common method to handle factor variables. The downsides to this is I am not handling the *ordering* and will introduce extra dimensions to the data (see Curse of dimensionality for why this is bad). Helmert, Forward/Backward differencing or Orthogonal encoding would *probably* be better.

#### As an aside...

`R`

makes factors really easy to handle. For example, let's say we have data with an unordered factor variable `df$unordered.variable`

:

```
# Order a factor variable in R
df$ordered.var <- factor(df$unordered.var
, levels = c('low', 'medium', 'high')
```

## Dummy Encoding

Dummy encoding, or *one hot encoding*, transforms categorical variables into a series of binary columns. Before I one hot encode the `sales`

and `salary`

I prepend the column names to the categories, that way I know later which column each new column came from. This helps particularly in cases where the columns use the same category names e.g. 'high' could apply to `sales`

and `salary`

.

```
# Prepend string prior to encoding
df['salary'] = '$_' + df['salary'].astype(str)
df['sales'] = 'sales_' + df['sales'].astype(str)
# Create 'salary' dummies and join
one_hot_salary = pd.get_dummies(df['salary'])
df = df.join(one_hot_salary)
# Create 'sales' dummies and join
one_hot_sales = pd.get_dummies(df['sales'])
df = df.join(one_hot_sales)
```

If we proceeded to modelling without dropping columns, we would run into the *dummy variable trap*. To illustrate what the dummy variable trap is, let's say that our data has a column for `Gender`

in the form `['male','female','male','female'...]`

. If we use `pd.get_dummies()`

```
# Generate 'Gender' variable
Gender = np.random.choice(['male','female'], 4)
# Create DataFrame
df1 = pd.DataFrame(Gender, columns=['Gender'])
# One hot encode Gender
dummies = pd.get_dummies(df1['Gender'])
# Join dummies to 'df1'
df1 = df1.join(dummies)
print df1
Gender female male
0 male 0 1
1 female 1 0
2 male 0 1
3 female 1 0
```

We can see that the dummy columns of `female`

and `male`

reflect the `Gender`

column, a row with column `Gender='male'`

has `female=0`

and `male=1`

. In modelling, we are looking for *independent* variables to predict a *dependent* outcome. One of our dummy variables is completely *dependent* on the other (see Multicollinearity). We can naturally drop one of the dummy variables to avoid this trap whilst not losing any valuable information.

To avoid the dummy variable trap in the Kaggle HR dataset, I'm dropping the columns, `sales_IT`

and low salary `$_low`

.

```
# Drop unnecessary columns to avoid the dummy variable trap
df = df.drop(['salary', 'sales', '$_low', 'sales_IT'], axis=1)
```

## Split Into Test and Training Sets

Now to split the dataset up into our training, testing and *optionally* our validation sets. This is best done randomly to avoid having one subgroup of the data overrepresented in either our training or testing datasets, hence `df.sample()`

.

```
# Randomly, split the data into test/training/validation sets
train, test, validate = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])
print train.shape, test.shape, validate.shape
# (8999, 20) (3000, 20) (3000, 20)
# Separate target and predictors
y_train = train['left']
x_train = train.drop(['left'], axis=1)
y_test = test['left']
x_test = test.drop(['left'], axis=1)
y_validate = validate['left']
x_validate = validate.drop(['left'], axis=1)
# Check the balance of the splits on y_
>>> y_test.mean()
0.23933333333333334
>>> y_train.mean()
0.24091565729525502
```

Feature importance can be ascertained from a random forest, alternatively Principle Components Analysis (PCA) can be used.

```
# Variable importance
rf = RandomForestClassifier()
rf.fit(x_train, y_train)
print "Features sorted by their score:"
print sorted(zip(map(lambda x: round(x, 4), rf.feature_importances_), x_train), reverse=True)
#[(0.3523, 'satisfaction_level'), (0.1738, 'time_spend_company'), (0.1705, 'number_project'), (0.158, 'average_montly_hours'), (0.1043, 'last_evaluation'), (0.0077, '$_high'), (0.0065, 'Work_accident'), (0.0059, '$_medium'), (0.0042, 'sales_technical'), (0.0034, 'sales_sales'), (0.0028, 'sales_support'), (0.0024, 'sales_hr'), (0.0023, 'sales_accounting'), (0.0019, 'sales_management'), (0.0014, 'sales_RandD'), (0.001, 'sales_product_mng'), (0.001, 'sales_marketing'), (0.0006, 'promotion_last_5years')]
```

A very rough plot of `importances`

reveals that past the first 5 columns, we really don't get that much more variance explained by the additional columns.

From hereon in I'll only use the first 5 variables.

```
# Create variable lists and drop
all_vars = x_train.columns.tolist()
top_5_vars = ['satisfaction_level', 'number_project', 'time_spend_company', 'average_montly_hours', 'last_evaluation']
bottom_vars = [cols for cols in all_vars if cols not in top_5_vars]
# Drop less important variables leaving the top_5
x_train = x_train.drop(bottom_vars, axis=1)
x_test = x_test.drop(bottom_vars, axis=1)
x_validate = x_validate.drop(bottom_vars, axis=1)
```

## Model Evaluation Basics

The formula for **accuracy** is:

**TP**is the number of true positives*Predicted as leaving and they do***TN**is the number of true negatives*Predicted as remaining and they do***FP**is the number of false positives*Predicted as leaving, but they aren't - whoops!***FN**is the number of false negatives*Predicted as remaining and they don't - eek!*

As mentioned before, if we were aiming to predict those who remain (76%) then if we predicted 100%, we would have `(0 + 11428) / 14999`

or 76% accurracy. As the Accuracy Paradox wiki states precision and recall are probably more appropriate evaluation metrics. Here's a really quick definition of sensitivity (AKA recall), specificity and precision:

**Sensitivity/Recall**-`TP / (TP + FN)`

how well the model*recalls*/identifies those that will leave. AKA the true positive rate.**Specificity**-`TN / (TN + FP)`

how well the model identifies those that will stay.**Precision**`TP / (TP + FP)`

how believable is the model? A low precision model will alarm you to those who are leaving that are actually staying.**F1 score**`2 * (precision * recall)/(precision + recall)`

is the harmonic mean betwen precision and recall or the*balance*.

For this problem, we are perhaps most interested in knowing who is going to leave next. That way an organisation can respond with workforce planning and recruitment activities. Therefore, precision and recall will be the metrics we focus on.

Intuitively, precision is the ability of the classifier not to label as positive a sample that is negative, and recall is the ability of the classifier to find all the positive samples.

Source: scikit-learn

## Modelling

### 1. Logistic Regression

Starting off with a very simple model, logistic regression.

```
# Instantiate
logit_model = LogisticRegression()
# Fit
logit_model = logit_model.fit(x_train, y_train)
# How accurate?
logit_model.score(x_train, y_train)
#0.7874
# How does it perform on the test dataset?
# Predictions on the test dataset
predicted = pd.DataFrame(logit_model.predict(x_test))
# Probabilities on the test dataset
probs = pd.DataFrame(logit_model.predict_proba(x_test))
print metrics.accuracy_score(y_test, predicted)
0.785
```

Now let's look at the confusion matrix and the metrics we care about.

```
>>> print metrics.confusion_matrix(y_test, predicted)
[[2100 182]
[ 463 255]]
>>> print metrics.classification_report(y_test, predicted)
precision recall f1-score support
0 0.82 0.92 0.87 2282
1 0.58 0.36 0.44 718
avg / total 0.76 0.79 0.77 3000
```

Let's double check the `classification_report`

. Focusing on those who will leave.

Precision = TP/(TP+FP) = 255/(255+182) = 0.58

Recall = TP/(TP+FN) = 255/(255+463) = 0.36

Perhaps we could optimise this model, however there are a range of models to test that may perform better out-of-the-box. I'll quickly run through a range of models...

### 2. Decision Tree

Decision trees are highly interpretable and tend to perform well on classification problems.

```
# Instantiate with a max depth of 3
tree_model = tree.DecisionTreeClassifier(max_depth=3)
# Fit a decision tree
tree_model = tree_model.fit(x_train, y_train)
# Training accuracy
tree_model.score(x_train, y_train)
# Predictions/probs on the test dataset
predicted = pd.DataFrame(tree_model.predict(x_test))
probs = pd.DataFrame(tree_model.predict_proba(x_test))
# Store metrics
tree_accuracy = metrics.accuracy_score(y_test, predicted)
tree_roc_auc = metrics.roc_auc_score(y_test, probs[1])
tree_confus_matrix = metrics.confusion_matrix(y_test, predicted)
tree_classification_report = metrics.classification_report(y_test, predicted)
tree_precision = metrics.precision_score(y_test, predicted, pos_label=1)
tree_recall = metrics.recall_score(y_test, predicted, pos_label=1)
tree_f1 = metrics.f1_score(y_test, predicted, pos_label=1)
# evaluate the model using 10-fold cross-validation
tree_cv_scores = cross_val_score(tree.DecisionTreeClassifier(max_depth=3), x_test, y_test, scoring='precision', cv=10)
# output decision plot
dot_data = tree.export_graphviz(tree_model, out_file=None,
feature_names=x_test.columns.tolist(),
class_names=['remain', 'left'],
filled=True, rounded=True,
special_characters=True)
graph = graph_from_dot_data(dot_data)
graph.write_png("images/decision_tree.png")
```

### 3. Random Forest

```
# Instantiate
rf = RandomForestClassifier()
# Fit
rf_model = rf.fit(x_train, y_train)
# training accuracy 99.74%
rf_model.score(x_train, y_train)
# Predictions/probs on the test dataset
predicted = pd.DataFrame(rf_model.predict(x_test))
probs = pd.DataFrame(rf_model.predict_proba(x_test))
# Store metrics
rf_accuracy = metrics.accuracy_score(y_test, predicted)
rf_roc_auc = metrics.roc_auc_score(y_test, probs[1])
rf_confus_matrix = metrics.confusion_matrix(y_test, predicted)
rf_classification_report = metrics.classification_report(y_test, predicted)
rf_precision = metrics.precision_score(y_test, predicted, pos_label=1)
rf_recall = metrics.recall_score(y_test, predicted, pos_label=1)
rf_f1 = metrics.f1_score(y_test, predicted, pos_label=1)
# Evaluate the model using 10-fold cross-validation
rf_cv_scores = cross_val_score(RandomForestClassifier(), x_test, y_test, scoring='precision', cv=10)
rf_cv_mean = np.mean(rf_cv_scores)
```

This has performed possibly a little too well!

### 4. SUPPORT VECTOR MACHINE

```
# Instantiate
svm_model = SVC(probability=True)
# Fit
svm_model = svm_model.fit(x_train, y_train)
# Accuracy
svm_model.score(x_train, y_train)
# Predictions/probs on the test dataset
predicted = pd.DataFrame(svm_model.predict(x_test))
probs = pd.DataFrame(svm_model.predict_proba(x_test))
# Store metrics
svm_accuracy = metrics.accuracy_score(y_test, predicted)
svm_roc_auc = metrics.roc_auc_score(y_test, probs[1])
svm_confus_matrix = metrics.confusion_matrix(y_test, predicted)
svm_classification_report = metrics.classification_report(y_test, predicted)
svm_precision = metrics.precision_score(y_test, predicted, pos_label=1)
svm_recall = metrics.recall_score(y_test, predicted, pos_label=1)
svm_f1 = metrics.f1_score(y_test, predicted, pos_label=1)
# Evaluate the model using 10-fold cross-validation
svm_cv_scores = cross_val_score(SVC(probability=True), x_test, y_test, scoring='precision', cv=10)
svm_cv_mean = np.mean(svm_cv_scores)
```

### 5. KNN

```
# instantiate learning model (k = 3)
knn_model = KNeighborsClassifier(n_neighbors=3)
# fit the model
knn_model.fit(x_train, y_train)
# Accuracy
knn_model.score(x_train, y_train)
# Predictions/probs on the test dataset
predicted = pd.DataFrame(knn_model.predict(x_test))
probs = pd.DataFrame(knn_model.predict_proba(x_test))
# Store metrics
knn_accuracy = metrics.accuracy_score(y_test, predicted)
knn_roc_auc = metrics.roc_auc_score(y_test, probs[1])
knn_confus_matrix = metrics.confusion_matrix(y_test, predicted)
knn_classification_report = metrics.classification_report(y_test, predicted)
knn_precision = metrics.precision_score(y_test, predicted, pos_label=1)
knn_recall = metrics.recall_score(y_test, predicted, pos_label=1)
knn_f1 = metrics.f1_score(y_test, predicted, pos_label=1)
# Evaluate the model using 10-fold cross-validation
knn_cv_scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), x_test, y_test, scoring='precision', cv=10)
knn_cv_mean = np.mean(knn_cv_scores)
```

### 6. TWO CLASS BAYES

```
# Instantiate
bayes_model = GaussianNB()
# Fit the model
bayes_model.fit(x_train, y_train)
# Accuracy
bayes_model.score(x_train, y_train)
# Predictions/probs on the test dataset
predicted = pd.DataFrame(bayes_model.predict(x_test))
probs = pd.DataFrame(bayes_model.predict_proba(x_test))
# Store metrics
bayes_accuracy = metrics.accuracy_score(y_test, predicted)
bayes_roc_auc = metrics.roc_auc_score(y_test, probs[1])
bayes_confus_matrix = metrics.confusion_matrix(y_test, predicted)
bayes_classification_report = metrics.classification_report(y_test, predicted)
bayes_precision = metrics.precision_score(y_test, predicted, pos_label=1)
bayes_recall = metrics.recall_score(y_test, predicted, pos_label=1)
bayes_f1 = metrics.f1_score(y_test, predicted, pos_label=1)
# Evaluate the model using 10-fold cross-validation
bayes_cv_scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), x_test, y_test, scoring='precision', cv=10)
bayes_cv_mean = np.mean(bayes_cv_scores)
```

## Results

```
# Model comparison
models = pd.DataFrame({
'Model': ['Logistic', 'd.Tree', 'r.f.', 'SVM', 'kNN', 'Bayes'],
'Accuracy' : [logit_accuracy, tree_accuracy, rf_accuracy, svm_accuracy, knn_accuracy, bayes_accuracy],
'Precision': [logit_precision, tree_precision, rf_precision, svm_precision, knn_precision, bayes_precision],
'recall' : [logit_recall, tree_recall, rf_recall, svm_recall, knn_recall, bayes_recall],
'F1' : [logit_f1, tree_f1, rf_f1, svm_f1, knn_f1, bayes_f1],
'cv_precision' : [logit_cv_mean, tree_cv_mean, rf_cv_mean, svm_cv_mean, knn_cv_mean, bayes_cv_mean]
})
# Print table and sort by test precision
models.sort_values(by='Precision', ascending=False)
Model F1 Accuracy Precision cv_precision recall
r.f. 0.976812 0.989333 0.994100 0.988294 0.960114
d.Tree 0.915088 0.959667 0.901798 0.899632 0.928775
SVM 0.902098 0.953333 0.885989 0.863300 0.918803
kNN 0.902959 0.953000 0.873502 0.831590 0.934473
Bayes 0.530179 0.808000 0.620229 0.831590 0.462963
Logistic 0.327684 0.762000 0.483333 0.526624 0.247863
```

From the above results table, it is clear the Random Forest is the best model based on Accuracy, Precision and Recall metrics. The performance is so good, that I would be concerned about overfitting and a lack of generalisation to future periods of data. However, the classifier was evaluated with K-folds cross-validation (K=10) and also evaluated on the test dataset.

To use the trained model later, it's first best practice to re-train on the entire dataset to get the best trained model. Then save the pickle i.e. serialised model.

```
# Create x and y from all data
y = df['left']
x = df.drop(['left'], axis=1)
# Re-train model on all data
rf_model = rf.fit(x, y)
# Save model
import cPickle
with open('churn_classifier.pkl', 'wb') as fid:
cPickle.dump(rf_model, fid)
```

The unabridged version of the above code can be found here on github.