This lesson is in the early stages of development (Alpha version)

Getting Started with Conda

Overview

Teaching: 15 min
Exercises: 5 min
Questions
  • What is Conda?

  • Why should I use a package and environment management system as part of my research workflow?

  • Why use Conda (+pip)?

Objectives
  • Understand why you should use a package and environment management system as part of your (data) science workflow.

  • Explain the benefits of using Conda (+pip) as part of your (data) science workflow.

What is Conda?

From the official Conda documentation. Conda is an open source package and environment management system that runs on Windows, Mac OS and Linux.

Conda as a package manager helps you find and install packages. If you need a package that requires a different version of Python, you do not need to switch to a different environment manager, because Conda is also an environment manager. With just a few commands, you can set up a totally separate environment to run that different version of Python, while continuing to run your usual version of Python in your normal environment.

Conda vs. Miniconda vs. Anaconda

Conda vs. Miniconda vs. Anaconda

Users are often confused about the differences between Conda, Miniconda, and Anaconda. Conda is a tool for managing environments and installing packages. Miniconda combines Conda with Python and a small number of core packages; Anaconda includes Miniconda as well as a large number of the most widely used Python packages.

Why should I use a package and environment management system?

Installing software is hard. Installing scientific software is often even more challenging. In order to minimize the burden of installing and updating software (data) scientists often install software packages that they need for their various projects system-wide.

Installing software system-wide has a number of drawbacks:

Put differently, installing software system-wide creates complex dependencies between your reearch projects that shouldn’t really exist!

Rather than installing software system-wide, wouldn’t it be great if we could install software separately for each research project?

Discussion

What are some of the potential benefits from installing software separately for each project? What are some of the potential costs?

Solution

You may notice that many of the potential benefits from installing software separately for each project require the ability to isolate the projects’ software environments from one another (i.e., solve the environment management problem). Once you have figured out how to isolate project-specific software environments, you will still need to have some way to manage software packages appropriately (i.e., solve the package management problem).

What I hope you will have taken away from the discussion exercise is an appreciation for the fact that in order to install project-specific software environments you need to solve two complementary challenges: environment management and package management.

Environment management

An environment management system solves a number of problems commonly encountered by (data) scientists.

An environment management system enables you to set up a new, project specific software environment containing specific Python versions as well as the versions of additional packages and required dependencies that are all mutually compatible.

Package management

A good package management system greatly simplifies the process of installing software by…

  1. identifying and installing compatible versions of software and all required dependencies.
  2. handling the process of updating software as more recent versions become available.

If you use some flavor of Linux, then you are probably familiar with the package manager for your Linux distribution (i.e., apt on Ubuntu, yum on CentOS); if you are a Mac OSX user then you might be familiar with the Home Brew Project which brings a Linux-like package management system to Mac OS; if you are a Windows OS user, then you may not be terribly familiar with package managers as there isn’t really a standard package manager for Windows (although there is the Chocolatey Project).

Operating system package management tools are great but these tools actually solve a more general problem than you often face as a (data) scientist. As a (data) scientist you typically use one or two core scripting languages (i.e., Python, R, SQL). Each scripting language has multiple versions that can potentially be installed and each scripting language will also have a large number of third-party packages that will need to be installed. The exact version of your core scripting language(s) and additional, third-party packages will also probably change from project to project.

Why use Conda (+pip)?

Whilst there are many different package and environment management systems that solve either the package management problem or the environment management problem, Conda solves both of these problems and explicitly targeted at (data) science use cases.

Additionally, Anaconda provides commonly used data science libraries and tools, such as R, NumPy, SciPy and TensorFlow built using optimised, hardware specific libraries (such as Intel’s MKL or NVIDIA’s CUDA), which provides a speedup without having to change any of your code.

Key Points

  • Conda is a platform agnostic, open source package and environment management system.

  • Using a package and environment management tool facilitates portability and reproducibility of (data) science workflows.

  • Conda (+pip) solves both the package and environment managment problems and targets multiple programming languages. Other open source tools solve either one or the other, or target only a particular programming language.