Introduction to Tree Models in Python: Glossary

Key Points

Introduction
  • Understanding your data is key.

  • Data is typically partitioned into training and test sets.

  • Setting random states helps to promote reproducibility.
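
As a minimal sketch of the two points above, here is a train/test split with a fixed random state, assuming scikit-learn (the data below are made up for illustration):

```python
# Partition data into training and test sets, reproducibly.
from sklearn.model_selection import train_test_split

X = [[i] for i in range(10)]          # 10 samples, 1 feature
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]   # binary labels

# random_state fixes the shuffling, so the same split is produced every run
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(len(X_train), len(X_test))  # 7 3
```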

Decision trees
  • Decision trees are intuitive models that can be used for prediction and regression.

  • Gini impurity measures how mixed the classes are at a node: the higher the value, the greater the mix of classes. A 50/50 split of two classes gives a Gini impurity of 0.5.

  • Greedy algorithms make the locally optimal decision at each step, without considering the larger problem as a whole.
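
The Gini impurity values above can be checked with a small helper written directly from the definition (this function is illustrative, not part of any library):

```python
# Gini impurity: 1 minus the sum of squared class proportions at a node.
def gini_impurity(labels):
    n = len(labels)
    proportions = [labels.count(c) / n for c in set(labels)]
    return 1 - sum(p ** 2 for p in proportions)

print(gini_impurity([0, 0, 1, 1]))  # 0.5 -> maximal mix of two classes
print(gini_impurity([0, 0, 0, 0]))  # 0.0 -> a "pure" node
```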

Variance
  • Overfitting is a problem that occurs when our algorithm fits the training data too closely, capturing noise as well as signal.

  • Models that are overfitted may not generalise well to “unseen” data.

  • Pruning is one approach for helping to prevent overfitting.

  • By combining many instances of “high variance” classifiers, we can end up with a single classifier with low variance.
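
One way to see overfitting and pruning in practice, assuming scikit-learn (the synthetic dataset here is only for illustration): an unconstrained tree memorises the training data, while capping `max_depth` “pre-prunes” it:

```python
# Limiting tree depth is a simple form of pruning that reduces variance.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree typically fits the training data perfectly
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Capping max_depth stops the tree growing before it overfits
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

print(deep.score(X_train, y_train))  # usually 1.0: a sign of overfitting
print(shallow.get_depth())           # no greater than 3
```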

Boosting
  • An algorithm that performs somewhat poorly at a task - such as a simple decision tree - is sometimes referred to as a “weak learner”.

  • With boosting, we create a combination of many weak learners to form a single “strong” learner.
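
A sketch of this idea using AdaBoost, assuming scikit-learn (whose default weak learner for `AdaBoostClassifier` is a depth-1 decision “stump”); the dataset is synthetic:

```python
# Boosting: combine many weak learners into one strong classifier.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Each round fits a shallow tree, focusing on previously misclassified points
boosted = AdaBoostClassifier(n_estimators=50, random_state=0)
boosted.fit(X, y)

print(len(boosted.estimators_))  # the fitted weak learners (up to 50)
```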

Bagging
  • “Bagging” is short for bootstrap aggregation.

  • Bootstrapping is a data resampling technique.

  • Bagging is another method for combining multiple weak learners to create a strong learner.
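
A minimal bagging sketch, assuming scikit-learn's `BaggingClassifier` (whose default base learner is a decision tree); the data are synthetic:

```python
# Bagging: train each tree on a bootstrap resample, then aggregate votes.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=200, random_state=0)

bag = BaggingClassifier(
    n_estimators=20,   # number of trees, each seeing a resampled dataset
    bootstrap=True,    # sample the training data with replacement
    random_state=0,
)
bag.fit(X, y)

print(len(bag.estimators_))  # 20
```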

Random forest
  • With Random Forest models, we resample the data and use random subsets of the features at each split.

  • Random Forests are powerful predictive models.
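
The combination of row resampling and feature sub-sampling can be sketched with scikit-learn (assumed here), again on synthetic data:

```python
# Random Forest: bagged trees plus a random subset of features per split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,
    max_features="sqrt",  # consider only sqrt(n_features) candidates per split
    random_state=0,
)
forest.fit(X, y)

print(forest.score(X, y))
```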

Gradient boosting
  • As a “boosting” method, gradient boosting involves iteratively building trees, aiming to improve upon misclassifications of the previous tree.

  • Gradient boosting also borrows the concept of sub-sampling the variables (just like Random Forests), which can help to prevent overfitting.

  • The performance gains come at the cost of interpretability.
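
These points can be sketched with scikit-learn's `GradientBoostingClassifier` (assumed here; the lesson may use a different implementation), on synthetic data:

```python
# Gradient boosting: trees are added one at a time, each fitted to the
# errors of the ensemble built so far.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=200, random_state=0)

gb = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,  # shrinks each new tree's contribution
    max_features=0.5,   # sub-sample the features, as in Random Forests
    random_state=0,
)
gb.fit(X, y)

print(gb.score(X, y))
```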

Performance
  • There can be a large performance gap between different types of tree model.

  • Boosted models typically perform strongly.

Glossary

FIXME