Introduction to Tree Models in Python: Glossary

Key Points

Introduction
  • Understanding your data is key.

  • Data is typically partitioned into training and test sets.

  • Setting random states helps to promote reproducibility.
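
As a minimal sketch of the two points above, here is a train/test split with a fixed random state, assuming scikit-learn (the data below are made up for illustration):

```python
# Partition data into training and test sets, reproducibly.
from sklearn.model_selection import train_test_split

X = [[i] for i in range(10)]          # 10 samples, 1 feature
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]   # binary labels

# random_state fixes the shuffling, so the same split is produced every run
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(len(X_train), len(X_test))  # 7 3
```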

Decision trees
  • Decision trees are intuitive models that can be used for prediction and regression.

  • Gini impurity measures how mixed the classes are at a node: the higher the value, the greater the mix of classes. A 50/50 split of two classes gives a Gini impurity of 0.5.

  • Greedy algorithms make the locally optimal decision at each step, without considering the larger problem as a whole.
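
The Gini impurity values above can be checked with a small helper written directly from the definition (this function is illustrative, not part of any library):

```python
# Gini impurity: 1 minus the sum of squared class proportions at a node.
def gini_impurity(labels):
    n = len(labels)
    proportions = [labels.count(c) / n for c in set(labels)]
    return 1 - sum(p ** 2 for p in proportions)

print(gini_impurity([0, 0, 1, 1]))  # 0.5 -> maximal mix of two classes
print(gini_impurity([0, 0, 0, 0]))  # 0.0 -> a "pure" node
```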

Variance
  • Overfitting is a problem that occurs when our algorithm fits the training data too closely, capturing noise as well as signal.

  • Models that are overfitted may not generalise well to “unseen” data.

  • Pruning is one approach for helping to prevent overfitting.

  • By combining many instances of “high variance” classifiers, we can end up with a single classifier with low variance.
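
One way to see overfitting and pruning in practice, assuming scikit-learn (the synthetic dataset here is only for illustration): an unconstrained tree memorises the training data, while capping `max_depth` “pre-prunes” it:

```python
# Limiting tree depth is a simple form of pruning that reduces variance.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree typically fits the training data perfectly
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Capping max_depth stops the tree growing before it overfits
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

print(deep.score(X_train, y_train))  # usually 1.0: a sign of overfitting
print(shallow.get_depth())           # no greater than 3
```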

Boosting
  • An algorithm that performs somewhat poorly at a task - such as a simple decision tree - is sometimes referred to as a “weak learner”.

  • With boosting, we create a combination of many weak learners to form a single “strong” learner.
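
A sketch of this idea using AdaBoost, assuming scikit-learn (whose default weak learner for `AdaBoostClassifier` is a depth-1 decision “stump”); the dataset is synthetic:

```python
# Boosting: combine many weak learners into one strong classifier.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Each round fits a shallow tree, focusing on previously misclassified points
boosted = AdaBoostClassifier(n_estimators=50, random_state=0)
boosted.fit(X, y)

print(len(boosted.estimators_))  # the fitted weak learners (up to 50)
```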

Bagging
  • “Bagging” is short for bootstrap aggregation.

  • Bootstrapping is a data resampling technique.

  • Bagging is another method for combining multiple weak learners to create a strong learner.
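
A minimal bagging sketch, assuming scikit-learn's `BaggingClassifier` (whose default base learner is a decision tree); the data are synthetic:

```python
# Bagging: train each tree on a bootstrap resample, then aggregate votes.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=200, random_state=0)

bag = BaggingClassifier(
    n_estimators=20,   # number of trees, each seeing a resampled dataset
    bootstrap=True,    # sample the training data with replacement
    random_state=0,
)
bag.fit(X, y)

print(len(bag.estimators_))  # 20
```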

Random forest
  • With Random Forest models, we resample the data and use random subsets of the features at each split.

  • Random Forests are powerful predictive models.
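
The combination of row resampling and feature sub-sampling can be sketched with scikit-learn (assumed here), again on synthetic data:

```python
# Random Forest: bagged trees plus a random subset of features per split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,
    max_features="sqrt",  # consider only sqrt(n_features) candidates per split
    random_state=0,
)
forest.fit(X, y)

print(forest.score(X, y))
```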

Gradient boosting
  • As a “boosting” method, gradient boosting involves iteratively building trees, aiming to improve upon misclassifications of the previous tree.

  • Gradient boosting also borrows the concept of sub-sampling the variables (just like Random Forests), which can help to prevent overfitting.

  • The performance gains come at the cost of interpretability.
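
These points can be sketched with scikit-learn's `GradientBoostingClassifier` (assumed here; the lesson may use a different implementation), on synthetic data:

```python
# Gradient boosting: trees are added one at a time, each fitted to the
# errors of the ensemble built so far.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=200, random_state=0)

gb = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,  # shrinks each new tree's contribution
    max_features=0.5,   # sub-sample the features, as in Random Forests
    random_state=0,
)
gb.fit(X, y)

print(gb.score(X, y))
```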

Performance
  • There can be a large performance gap between different types of tree model.

  • Boosted models typically perform strongly.

Glossary

FIXME