
Introduction to Machine Learning in Python: Glossary

Key Points

Introduction
  • Machine learning borrows heavily from fields such as statistics and computer science.

  • In machine learning, models learn rules from data.

  • In supervised learning, the target in our training data is labelled.

  • In popular usage, A.I. has become a synonym for machine learning.

  • A.G.I. (artificial general intelligence) is the loftier goal of achieving human-like intelligence.

Data preparation
  • Data pre-processing is arguably the most important task in machine learning.

  • SQL is a common tool for extracting data from database systems.

  • Data is typically partitioned into training and test sets.

  • Setting random states helps to promote reproducibility.
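A minimal sketch of the partitioning step, using scikit-learn's `train_test_split` on small hypothetical arrays. Fixing `random_state` makes the split reproducible across runs:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features (toy data)
y = np.arange(10)                 # 10 targets

# Hold out 30% of the data as a test set; the fixed random_state
# means the same rows land in the same partition every run.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

print(len(X_train), len(X_test))  # 7 3
```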

Learning
  • Loss functions quantify the difference between predictions and targets, allowing us to define what makes a good model.

  • $y$ is a known target. $\hat{y}$ ("y hat") is a prediction.

  • Mean squared error is an example of a loss function.

  • After defining a loss function, we search for the optimal solution in a process known as ‘training’.

  • Optimisation is at the heart of machine learning.
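The mean squared error mentioned above can be computed directly; a small sketch with made-up targets and predictions:

```python
import numpy as np

y = np.array([3.0, -0.5, 2.0, 7.0])      # known targets
y_hat = np.array([2.5, 0.0, 2.0, 8.0])   # model predictions

# Mean squared error: the average of the squared differences
# between targets and predictions.
mse = np.mean((y - y_hat) ** 2)
print(mse)  # 0.375
```

Training searches for the model parameters that minimise this quantity.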

Modelling
  • Linear regression is a popular model for regression tasks.

  • Logistic regression is a popular model for classification tasks.

  • Logistic regression outputs probabilities that can be mapped to a predicted class.
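A sketch of the probability-to-class mapping with scikit-learn's `LogisticRegression` on a toy, linearly separable dataset (the data here is invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy binary classification data: one feature, two classes
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)

# predict_proba returns a probability per class; thresholding
# the positive-class probability at 0.5 yields a predicted class.
probs = model.predict_proba(X)[:, 1]
preds = (probs >= 0.5).astype(int)
print(preds)
```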

Validation
  • Validation sets are used during model development, allowing models to be tested prior to testing on a held-out set.

  • Cross-validation is a resampling technique that creates multiple validation sets.

  • Cross-validation can help to avoid overfitting.
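A sketch of 5-fold cross-validation with `cross_val_score` on synthetic regression data: the training data is split into 5 folds, and each fold serves once as the validation set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data: a linear relationship plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=50)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv)
print(len(scores))  # one score per validation fold
```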

Evaluation
  • Confusion matrices are the basis for many popular performance metrics.

  • AUROC is the area under the receiver operating characteristic curve. An AUROC of 0.5 indicates performance no better than chance.

  • TP is True Positive, meaning a positive prediction that was correct.
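Both metrics can be sketched with scikit-learn on a tiny invented example. The confusion matrix has rows for the true class and columns for the predicted class, i.e. `[[TN, FP], [FN, TP]]`:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])  # predicted probabilities
y_pred = (y_score >= 0.5).astype(int)      # thresholded predictions

# Rows: true class; columns: predicted class -> [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
print(cm)

# AUROC is computed from the scores, not the thresholded predictions;
# 0.5 means chance-level ranking, 1.0 means perfect ranking.
auroc = roc_auc_score(y_true, y_score)
print(auroc)  # 0.75
```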

Bootstrapping
  • Bootstrapping is a resampling technique, sometimes confused with cross-validation.

  • Bootstrapping allows us to generate a distribution of estimates, rather than a single point estimate.

  • Bootstrapping allows us to estimate uncertainty, allowing computation of confidence intervals.
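A sketch of the bootstrap in plain NumPy, on synthetic data: resampling with replacement gives a distribution of mean estimates, and its percentiles give a confidence interval.

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=5.0, scale=2.0, size=100)  # synthetic sample

# Draw 1000 bootstrap samples (same size as the data, sampled with
# replacement) and record the mean of each one.
boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(1000)
])

# A 95% confidence interval for the mean, from the percentiles
# of the bootstrap distribution.
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(round(lo, 2), round(hi, 2))
```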

Data leakage
  • Leakage occurs when training data is contaminated with information that is not available at prediction time.

  • Leakage leads to over-optimistic expectations of performance.
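A common concrete case is computing pre-processing statistics on the full dataset before splitting, so the test set influences the training data. A minimal sketch with invented numbers:

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 100.0])  # pretend the last value is the test set

# Leaky: standardisation statistics computed on ALL the data,
# including the test value -- information not available at prediction time.
leaky = (data - data.mean()) / data.std()

# Correct: statistics come from the training portion only,
# then are applied to both partitions.
train = data[:3]
correct = (data - train.mean()) / train.std()

print(leaky[:3])    # training values contaminated by the test point
print(correct[:3])  # training values standardised using training stats only
```

The two versions of the "training" values differ, which is exactly the contamination that inflates performance estimates.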

Glossary

FIXME