Performance

Overview

Teaching: 20 min
Exercises: 10 min

Questions

How well do our predictive models perform?

Objectives

Evaluate the performance of our different models.

Comparing model performance

We’ve now learned the basics of the various tree methods and have visualized most of them. Let’s finish by comparing the performance of our models on our held-out test data. Our goal, remember, is to predict whether or not a patient will survive their hospital stay using the patient’s age and acute physiology score computed on the first day of their ICU stay.

from sklearn import metrics

clf = dict()
clf['Decision Tree'] = tree.DecisionTreeClassifier(criterion='entropy', splitter='best').fit(x_train.values, y_train.values)
clf['Gradient Boosting'] = ensemble.GradientBoostingClassifier(n_estimators=10).fit(x_train.values, y_train.values)
clf['Random Forest'] = ensemble.RandomForestClassifier(n_estimators=10).fit(x_train.values, y_train.values)
clf['Bagging'] =  ensemble.BaggingClassifier(n_estimators=10).fit(x_train.values, y_train.values)
clf['AdaBoost'] =  ensemble.AdaBoostClassifier(n_estimators=10).fit(x_train.values, y_train.values)

fig = plt.figure(figsize=[10,10])

print('AUROC\tModel')
for i, curr_mdl in enumerate(clf):    
    yhat = clf[curr_mdl].predict_proba(x_test.values)[:,1]
    score = metrics.roc_auc_score(y_test, yhat)
    print('{:0.3f}\t{}'.format(score, curr_mdl))
    ax = fig.add_subplot(3,2,i+1)
    glowyr. plot_model_pred_2d(clf[curr_mdl], x_test, y_test, title=curr_mdl)

Here we can see that quantitatively, gradient boosting has produced the highest discrimination among all the models (~0.91). You’ll see that some of the models appear to have simpler decision surfaces, which tends to result in improved generalization on a held-out test set (though not always!).

To make appropriate comparisons, we should calculate 95% confidence intervals on these performance estimates. This can be done a number of ways. A simple but effective approach is to use bootstrapping, a resampling technique. In bootstrapping, we generate multiple datasets from the test set (allowing the same data point to be sampled multiple times). Using these datasets, we can then estimate the confidence intervals.

Key Points

There is a large performance gap between different types of tree.

Boosted models typically perform strongly.

previous episode

Introduction to Tree Models in Python

lesson home

Performance

Overview

Comparing model performance

Key Points

previous episode

lesson home