This lesson is being piloted (Beta version)

Machine Learning for Biologists: References

Key Points

Introduction
  • Machine learning algorithms recognize patterns from example data

  • Supervised learning involves predicting labels from features

Classifying T-cells
  • The ml4bio software supports interactively exploring different classifiers and hyperparameters on a dataset

  • The machine learning workflow is split into data preprocessing and selection, training and model selection, and evaluation stages

  • Splitting a dataset into training, validation, and testing sets is key to being able to properly evaluate a machine learning method
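
The three-way split described above can be sketched in a few lines of Python. This is a minimal illustration rather than the ml4bio implementation: the function name `split_dataset`, the 60/20/20 proportions, and the integer sample IDs are all assumptions made for the example.

```python
import random

def split_dataset(samples, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle samples and split them into training, validation, and test lists."""
    shuffled = list(samples)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)     # fixed seed makes the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]         # whatever remains is held out for testing
    return train, val, test

# Hypothetical dataset: 100 sample IDs standing in for labeled cells
samples = list(range(100))
train, val, test = split_dataset(samples)
print(len(train), len(val), len(test))
```

The training set is used to fit a model, the validation set to compare classifiers and hyperparameters, and the test set only once at the end, so that the final evaluation uses data that played no part in any earlier decision.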

Evaluating a Model
  • The choice of evaluation metric depends on the relative proportions of the classes in the data and on what we want the model to succeed at.

  • Comparing performance on the validation set with the right metric is an effective way to select a classifier and hyperparameter settings.
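
To see why class proportions matter when choosing a metric, consider a small made-up example with 5 positive and 95 negative samples. The labels, the two sets of predictions, and the helper names below are hypothetical. Returning 0.0 when a ratio is undefined is a convention chosen for this sketch.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn

def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)

def precision(y_true, y_pred):
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0   # convention: 0.0 if nothing predicted positive

def recall(y_true, y_pred):
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0   # convention: 0.0 if there are no positives

# Hypothetical imbalanced data: 5 positives followed by 95 negatives
y_true = [1] * 5 + [0] * 95

# Classifier A always predicts the majority (negative) class
always_negative = [0] * 100

# Classifier B finds 4 of the 5 positives but raises 10 false alarms
detector = [1, 1, 1, 1, 0] + [1] * 10 + [0] * 85

for name, y_pred in [("always negative", always_negative), ("detector", detector)]:
    print(name,
          "accuracy=%.2f" % accuracy(y_true, y_pred),
          "precision=%.2f" % precision(y_true, y_pred),
          "recall=%.2f" % recall(y_true, y_pred))
```

Classifier A has the higher accuracy even though it never identifies a positive sample, while classifier B has lower accuracy but recovers most of the positives. Which one is preferable depends on what the model is meant to achieve.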

Decision Trees, Random Forests, and Overfitting
  • Decision trees require less effort to visualize and interpret than other models

  • Decision trees are prone to overfitting

  • Random forests solve many of the problems of decision trees but are more difficult to interpret

Logistic Regression, Artificial Neural Networks, and Linear Separability
  • Logistic regression is a linear classifier.

  • The output of logistic regression is the predicted probability of a particular class.

  • Artificial neural networks can be viewed as an extension of logistic regression

  • Artificial neural networks can have nonlinear decision boundaries
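
The points above can be illustrated with a short sketch. Logistic regression passes a weighted sum of the features through a sigmoid to produce a probability, and a small neural network stacks several such units. All weights below are hand-picked for illustration, not learned from data, and the function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_unit(features, weights, bias):
    """Logistic regression: sigmoid of a weighted sum of the features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical two-feature model; its decision boundary is the line 2*x1 - x2 = 0
weights, bias = [2.0, -1.0], 0.0
p_on_boundary = logistic_unit([1.0, 2.0], weights, bias)     # weighted sum is 0
p_positive_side = logistic_unit([3.0, 1.0], weights, bias)   # weighted sum is 5
p_negative_side = logistic_unit([0.0, 4.0], weights, bias)   # weighted sum is -4
print(p_on_boundary, p_positive_side, p_negative_side)

def tiny_network(x1, x2):
    """Two hidden logistic units feeding one output logistic unit.

    The hand-set weights make the hidden units behave roughly like OR and
    NAND, and the output unit like AND, which together compute XOR.
    """
    h_or = logistic_unit([x1, x2], [20.0, 20.0], -10.0)
    h_nand = logistic_unit([x1, x2], [-20.0, -20.0], 30.0)
    return logistic_unit([h_or, h_nand], [20.0, 20.0], -30.0)

# XOR is not linearly separable, so no single logistic unit can classify it,
# but the stacked units can.
xor_outputs = {(a, b): tiny_network(a, b) for a in (0, 1) for b in (0, 1)}
for inputs, prob in xor_outputs.items():
    print(inputs, "predicted class:", int(prob > 0.5))
```

A point exactly on the boundary line receives probability 0.5, and points on either side receive probabilities above or below it. The network reuses the same building block, but combining the units yields a decision boundary that is no longer a single straight line.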

Understanding Machine Learning Literature
  • Research workflows for machine learning are often not straightforward

  • Published papers often omit details, which can make it difficult to evaluate their machine learning workflows

  • Machine learning is used in a large variety of ways in biology

Conclusion and next steps
  • You are now prepared to consider how machine learning may benefit your research.

  • There are many excellent introductory and intermediate resources to help you continue to learn about machine learning.

Glossary and other resources

The Google machine learning glossary and ML4Bio guides define common machine learning terms.

The scikit-learn tutorials provide a Python-based introduction to machine learning. There is also a third-party scikit-learn tutorial and a Carpentries lesson.

The book Python Machine Learning has machine learning example code.

The Elements of AI course presents general introductory material on machine learning and related topics.

Galaxy ML provides access to classification and regression workflows through the Galaxy interface.

The workshop organizers track additional resources for beginners and intermediate users.

Training classifiers for a research project typically requires training many models and tuning their hyperparameters on a validation dataset. Writing scripts helps automate this process, document the training and tuning decisions, and improve reproducibility. Software Carpentry introduces strategies for script-driven research. A computing cluster helps train and evaluate many machine learning models in parallel.
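
As a rough sketch of what such a script might look like, the example below trains one model per hyperparameter value, records each validation accuracy, and evaluates only the selected model on the test set. It makes several assumptions: the data are synthetic, the classifier is a simple k-nearest-neighbors rule written with only the Python standard library, and the candidate values of `k` are arbitrary. A real project would more likely rely on a library such as scikit-learn.

```python
import random
from collections import Counter

def knn_predict(train, point, k):
    """Predict the majority label among the k training points nearest to point."""
    def squared_distance(item):
        features, _ = item
        return sum((a - b) ** 2 for a, b in zip(features, point))
    neighbors = sorted(train, key=squared_distance)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

def accuracy(train, eval_set, k):
    """Fraction of eval_set points whose label is predicted correctly."""
    correct = sum(1 for features, label in eval_set
                  if knn_predict(train, features, k) == label)
    return correct / len(eval_set)

# Synthetic two-class data (an assumption for this sketch): two overlapping
# Gaussian clusters with two features each
rng = random.Random(0)

def make_point(label):
    center = 0.0 if label == 0 else 2.0
    return ((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), label)

data = [make_point(i % 2) for i in range(150)]
rng.shuffle(data)
train, val, test = data[:90], data[90:120], data[120:]

# Tune the hyperparameter k on the validation set and log every setting tried
results = {}
for k in [1, 3, 5, 9, 15]:
    results[k] = accuracy(train, val, k)
    print(f"k={k}: validation accuracy {results[k]:.2f}")

# Select the best setting, then touch the test set exactly once
best_k = max(results, key=results.get)
test_accuracy = accuracy(train, test, best_k)
print(f"selected k={best_k}, test accuracy {test_accuracy:.2f}")
```

Because the random seed, the candidate settings, and the selection rule all live in the script, rerunning it reproduces the same tuning decisions, and each hyperparameter setting could be submitted as a separate job on a computing cluster.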

Jupyter notebook example

You can run an example Jupyter notebook in Binder to see how a machine learning workflow looks in Python code using scikit-learn. The notebook will load an executable Python environment in your web browser. After it loads, you can inspect the code and output or rerun it yourself.