Scientific validity in the modeling process

Last updated on 2024-06-19

Overview

Questions

  • What impact do overfitting and underfitting have on model performance?
  • What is data leakage?

Objectives

  • Implement at least two types of machine learning models in Python.
  • Describe the risks of overfitting and underfitting, identify when they occur, and understand the steps that mitigate them.
  • Understand why data leakage is harmful to scientific validity and how it can appear in machine learning pipelines.

Key Points

  • Overfitting is characterized by markedly worse performance on the test set than on the training set. It can be fixed by switching to a simpler model architecture or by adding regularization.
  • Underfitting is characterized by poor performance on both the training and test datasets. It can often be fixed by switching to a more complex model architecture, improving feature quality, or, in some cases, collecting more training data.
  • Data leakage occurs when information from the test data is available to the model during training. It results in overconfidence in the model’s performance, because the test score no longer reflects performance on unseen data.
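
The key points above can be illustrated with a small sketch. It is not the lesson's own code: the toy data, the k-nearest-neighbour stand-in for "model complexity", and all names (knn_predict, mse, results, and so on) are assumptions chosen so the example runs with only the Python standard library. It compares training and test error for a model that is too simple, one that is too flexible, and one in between, and then shows preprocessing statistics being computed from the training split only.

```python
import math
import random
import statistics

random.seed(0)

# Hypothetical toy data: y = sin(x) plus Gaussian noise, on an evenly spaced grid.
xs = [i * 0.1 for i in range(80)]
ys = [math.sin(x) + random.gauss(0, 0.3) for x in xs]

# Hold out every other point as the test set.
x_train, y_train = xs[0::2], ys[0::2]
x_test, y_test = xs[1::2], ys[1::2]


def knn_predict(x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(zip(x_train, y_train), key=lambda p: abs(p[0] - x))[:k]
    return statistics.fmean(y for _, y in nearest)


def mse(predict, inputs, targets):
    """Mean squared error of `predict` over paired inputs and targets."""
    return statistics.fmean((predict(x) - y) ** 2 for x, y in zip(inputs, targets))


train_mean = statistics.fmean(y_train)
models = {
    # Ignores x entirely: expected to do poorly on both splits (underfitting).
    "mean only (too simple)": lambda x: train_mean,
    # Memorizes each training point: zero training error, worse test error (overfitting).
    "1-NN (too flexible)": lambda x: knn_predict(x, 1),
    # Averages over neighbours, which smooths out some of the noise.
    "5-NN (in between)": lambda x: knn_predict(x, 5),
}

results = {}
for name, model in models.items():
    results[name] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    print(f"{name:24s} train MSE={results[name][0]:.3f}  test MSE={results[name][1]:.3f}")

# Avoiding one common form of leakage: compute preprocessing statistics
# from the training split only, then reuse them unchanged on the test split.
mu, sigma = statistics.fmean(x_train), statistics.pstdev(x_train)
x_train_scaled = [(x - mu) / sigma for x in x_train]
x_test_scaled = [(x - mu) / sigma for x in x_test]

# Computing the same statistic on train + test together would let
# information about the test set influence training. Shown only for contrast.
leaky_mu = statistics.fmean(xs)
print(f"train-only mean={mu:.3f}  mean including test data={leaky_mu:.3f}")
```

A large gap between training and test error points toward overfitting, while high error on both splits points toward underfitting. In a real pipeline the same train-only rule applies to any fitted preprocessing step, not just scaling.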