This lesson is being piloted (Beta version)
If you teach this lesson, please tell the authors and provide feedback by opening an issue in the source repository

# Introduction

## Overview

Teaching: 20 min

Exercises: 10 min

Questions

• What steps are needed to prepare data for analysis?

• How do I create training and test sets?

Objectives

• Explore summary characteristics of the data.

• Prepare the data for analysis.

## Predicting the outcome of critical care patients

We would like to develop an algorithm that can be used to predict the outcome of patients who are admitted to intensive care units using observations available on the day of admission.

Our analysis focuses on ~1000 patients admitted to critical care units in the continental United States. Data is provided by the Philips eICU Research Institute, a critical care telehealth program.

We will use decision trees for this task. Decision trees are a family of intuitive “machine learning” algorithms that often perform well at prediction and classification.

We will begin by loading a set of observations from our critical care dataset. The data includes variables collected on Day 1 of the stay, along with outcomes such as length of stay and in-hospital mortality.

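```python
# import libraries
import os
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

# Display the first 5 rows of the data
# (assumes the dataset has already been loaded into a dataframe named `cohort`;
# the loading step is not shown here)
cohort.head()
```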

The data has been assigned to a dataframe called `cohort`. Let’s take a look at the first few lines:

| index | gender | age | admissionweight | unabridgedhosplos | acutephysiologyscore | apachescore | actualhospitalmortality | heartrate | meanbp | creatinine | temperature | respiratoryrate | wbc | admissionheight |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 48 | 86.4 | 27.5583 | 44 | 49 | ALIVE | 102.0 | 54.0 | 1.16 | 36.9 | 39.0 | 6.1 | 177.8 |
| 1 | Female | 59 | 66.6 | 15.0778 | 56 | 61 | ALIVE | 134.0 | 172.0 | 1.03 | 34.8 | 32.0 | 25.5 | 170.2 |
| 2 | Male | 31 | 66.8 | 2.7326 | 45 | 45 | ALIVE | 138.0 | 71.0 | 2.35 | 37.2 | 34.0 | 21.4 | 188.0 |
| 3 | Female | 51 | 77.1 | 0.1986 | 19 | 24 | ALIVE | 122.0 | 73.0 | -1.0 | 36.8 | 26.0 | -1.0 | 160.0 |
| 4 | Female | 48 | 63.4 | 1.7285 | 25 | 30 | ALIVE | 130.0 | 68.0 | 1.1 | -1.0 | 29.0 | 7.6 | 172.7 |

## Preparing the data for analysis

We first need to do some basic data preparation.

```python
# Encode the categorical data
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
cohort['actualhospitalmortality_enc'] = encoder.fit_transform(cohort['actualhospitalmortality'])
```
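To see which integer each outcome receives, it can help to inspect the fitted encoder. The sketch below uses a small made-up series rather than the real `cohort` dataframe. `LabelEncoder` sorts the class labels, so `ALIVE` should map to 0 and `EXPIRED` to 1, which matches the encoded column in the summary table later in this lesson.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy outcome column (illustrative values, not taken from the eICU data)
outcomes = pd.Series(['ALIVE', 'EXPIRED', 'ALIVE', 'ALIVE'])

encoder = LabelEncoder()
encoded = encoder.fit_transform(outcomes)

# classes_ holds the original labels in sorted order;
# the position of each label is its encoded value
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(encoded.tolist())
```

The same `encoder.classes_` attribute is available on the encoder fitted to `cohort`, and `encoder.inverse_transform` converts encoded values back to the original labels.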

In the eICU Research Database, ages over 89 years are recorded as “>89” to comply with US data privacy laws. For simplicity, we will assign an age of 91.5 years to these patients (this is the approximate average age of patients over 89 in the dataset).

```python
# Handle the deidentified ages
cohort['age'] = pd.to_numeric(cohort['age'], downcast='integer', errors='coerce')
cohort['age'] = cohort['age'].fillna(value=91.5)
```
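The two steps above rely on `errors='coerce'` turning any value that cannot be parsed as a number into `NaN`, which `fillna` then replaces. A minimal sketch on a toy series shows the intermediate result. The ages here are made up, and the marker string `'>89'` is assumed to appear in the raw column exactly as described in the text.

```python
import pandas as pd

# Toy age column mixing numeric strings with the deidentified marker
# (illustrative values; the marker text is an assumption based on the lesson)
ages = pd.Series(['48', '59', '>89', '31'])

# errors='coerce' converts anything that cannot be parsed into NaN
numeric_ages = pd.to_numeric(ages, errors='coerce')
print(numeric_ages.isna().sum())  # number of values that failed to parse

# Replace the NaN placeholders with the assumed age of 91.5
filled_ages = numeric_ages.fillna(91.5)
print(filled_ages.tolist())
```

Note that this approach treats every unparseable or missing age the same way, so it is worth checking how many values were coerced before filling them.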

Now let’s use the tableone package to review our dataset.

```python
!pip install tableone

from tableone import tableone

t1 = tableone(cohort, groupby='actualhospitalmortality')
print(t1.tabulate(tablefmt="github"))
```

The table below shows summary characteristics of our dataset:

|  |  | Missing | Overall | ALIVE | EXPIRED |
|---|---|---|---|---|---|
| n |  |  | 536 | 488 | 48 |
| gender, n (%) | Female | 0 | 305 (56.9) | 281 (57.6) | 24 (50.0) |
|  | Male |  | 230 (42.9) | 207 (42.4) | 23 (47.9) |
|  | Unknown |  | 1 (0.2) |  | 1 (2.1) |
| age, mean (SD) |  | 0 | 63.4 (17.4) | 62.2 (17.4) | 75.2 (12.6) |
| admissionweight, mean (SD) |  | 16 | 81.8 (25.0) | 82.3 (25.1) | 77.0 (23.3) |
| unabridgedhosplos, mean (SD) |  | 0 | 5.6 (6.8) | 5.7 (6.7) | 4.3 (7.8) |
| acutephysiologyscore, mean (SD) |  | 0 | 41.7 (22.7) | 38.5 (18.8) | 74.3 (31.7) |
| apachescore, mean (SD) |  | 0 | 53.6 (25.1) | 49.9 (21.1) | 91.8 (30.5) |
| heartrate, mean (SD) |  | 0 | 101.5 (32.9) | 100.3 (31.9) | 113.9 (40.0) |
| meanbp, mean (SD) |  | 0 | 89.6 (41.5) | 90.7 (40.7) | 78.8 (47.6) |
| creatinine, mean (SD) |  | 0 | 0.8 (2.0) | 0.8 (2.0) | 1.4 (1.8) |
| temperature, mean (SD) |  | 0 | 35.6 (5.6) | 35.9 (4.8) | 32.9 (10.4) |
| respiratoryrate, mean (SD) |  | 0 | 27.4 (15.5) | 26.8 (15.4) | 33.9 (15.2) |
| wbc, mean (SD) |  | 0 | 6.5 (7.6) | 6.2 (7.1) | 9.9 (11.2) |
| admissionheight, mean (SD) |  | 8 | 168.4 (14.5) | 168.2 (13.6) | 170.3 (21.5) |
| actualhospitalmortality_enc, n (%) | 0 | 0 | 488 (91.0) | 488 (100.0) |  |
|  | 1 |  | 48 (9.0) |  | 48 (100.0) |

## Question

a) What proportion of patients survived their hospital stay?
b) What is the “apachescore” variable? Hint: see the Wikipedia entry for the APACHE score.
c) What is the average age of patients?

a) 91% of patients (488 of 536) survived their stay, so in-hospital mortality is 9%.
b) APACHE (“Acute Physiology and Chronic Health Evaluation”) is a severity-of-disease classification system. It is applied within 24 hours of a patient's admission to an intensive care unit. Higher scores correspond to more severe disease and a higher risk of death.
c) The median age is 64 years. Ages above 89 years are not recorded exactly (we assigned 91.5 years to these patients), so the mean of 63.4 years shown in the table depends on an assumed value. The median is therefore a better measure of central tendency. It can be calculated with `cohort['age'].median()`.
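To illustrate why the median is the safer summary here, the sketch below changes the value assigned to the deidentified ages and compares the results. The ages are made up, and `impute` is a hypothetical helper written for this example only.

```python
from statistics import mean, median

# Toy ages; None marks patients recorded as ">89" (illustrative values)
recorded = [48, 59, 31, 51, 72, None, 66, None, 80]

def impute(ages, value):
    """Replace deidentified ages with a single assumed value."""
    return [value if age is None else age for age in ages]

# Compare two different assumptions for the deidentified ages
for assumed in (91.5, 95.0):
    ages = impute(recorded, assumed)
    print(assumed, round(mean(ages), 2), median(ages))
```

The mean shifts with the assumed value, while the median stays the same. This holds as long as the deidentified patients make up less than half of the sample, because the middle value is then always an exactly recorded age.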

## Creating train and test sets

We will focus on only two variables for our analysis: age and acute physiology score. Limiting ourselves to two variables (or “features”) will make it easier to visualize our models.

```python
from sklearn.model_selection import train_test_split

features = ['age', 'acutephysiologyscore']
outcome = 'actualhospitalmortality_enc'

x = cohort[features]
y = cohort[outcome]

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=42)
```
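A quick sanity check after splitting is to look at the size of each partition and at how the outcome is distributed between them. The sketch below uses a synthetic stand-in with the same number of rows as the summary table (536). The generated values are random and purely illustrative, not the eICU data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cohort: 536 rows, roughly 9% positive outcomes
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'age': rng.integers(18, 90, size=536),
    'acutephysiologyscore': rng.integers(0, 150, size=536),
    'actualhospitalmortality_enc': (rng.random(536) < 0.09).astype(int),
})

x = toy[['age', 'acutephysiologyscore']]
y = toy['actualhospitalmortality_enc']
x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=0.7, random_state=42)

# Row counts of each partition, and the share of positive outcomes in each
print(x_train.shape, x_test.shape)
print(round(y_train.mean(), 3), round(y_test.mean(), 3))
```

With an outcome as rare as in-hospital mortality, the share of positive cases can differ between the two partitions. `train_test_split` also accepts a `stratify` argument if matching proportions are wanted.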

## Question

a) Why did we split our data into training and test sets?
b) What is the effect of setting a random state in the splitting algorithm?

a) We want to be able to evaluate our model on data that it has not seen before. If we evaluate our model on the data that it was trained on, we will overestimate its performance.
b) Setting the random state means that the split will be deterministic (i.e. we will all see the same “random” split). This helps to ensure our analysis is reproducible.
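The effect of the random state can be checked directly. This minimal sketch splits a small array of integers (a stand-in for row indices, not the cohort data) several times and compares the resulting partitions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 20 integers playing the role of row indices
data = np.arange(20)

# Same random_state: the partitions are identical on every call
train_a, test_a = train_test_split(data, train_size=0.7, random_state=42)
train_b, test_b = train_test_split(data, train_size=0.7, random_state=42)
print(np.array_equal(test_a, test_b))

# A different seed will usually produce a different partition
train_c, test_c = train_test_split(data, train_size=0.7, random_state=0)
print(np.array_equal(test_a, test_c))
```

Leaving `random_state` unset makes each call draw a fresh shuffle, so results would vary from run to run.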

## Key Points

• Understanding your data is key.

• Data is typically partitioned into training and test sets.

• Setting random states helps to promote reproducibility.