This lesson is in the early stages of development (Alpha version)

Introduction to Conda for (Data) Scientists

Getting Started with Conda

Overview

Teaching: 15 min
Exercises: 5 min
Questions
  • What is Conda?

  • Why should I use a package and environment management system as part of my research workflow?

  • Why use Conda (+pip)?

Objectives
  • Understand why you should use a package and environment management system as part of your (data) science workflow.

  • Explain the benefits of using Conda (+pip) as part of your (data) science workflow.

What is Conda?

From the official Conda documentation: Conda is an open source package and environment management system that runs on Windows, Mac OS, and Linux.

Conda as a package manager helps you find and install packages. If you need a package that requires a different version of Python, you do not need to switch to a different environment manager, because Conda is also an environment manager. With just a few commands, you can set up a totally separate environment to run that different version of Python, while continuing to run your usual version of Python in your normal environment.
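
For example, a minimal sketch of this workflow might look like the following (the environment name and Python version here are purely illustrative).

$ conda create --name py27-env python=2.7   # separate environment with an older Python
$ conda activate py27-env                   # switch to it; conda deactivate switches back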

Conda vs. Miniconda vs. Anaconda

Users are often confused about the differences between Conda, Miniconda, and Anaconda. Conda is a tool for managing environments and installing packages. Miniconda combines Conda with Python and a small number of core packages; Anaconda includes Miniconda as well as a large number of the most widely used Python packages.

Why should I use a package and environment management system?

Installing software is hard. Installing scientific software is often even more challenging. In order to minimize the burden of installing and updating software, (data) scientists often install software packages that they need for their various projects system-wide.

Installing software system-wide has a number of drawbacks:

  • it can be difficult to figure out exactly which software a particular research project depends on;

  • it is often impossible to install different versions of the same software package at the same time;

  • updating the software required for one project can often “break” the software installed for another project.

Put differently, installing software system-wide creates complex dependencies between your research projects that shouldn’t really exist!

Rather than installing software system-wide, wouldn’t it be great if we could install software separately for each research project?

Discussion

What are some of the potential benefits from installing software separately for each project? What are some of the potential costs?

Solution

You may notice that many of the potential benefits from installing software separately for each project require the ability to isolate the projects’ software environments from one another (i.e., solve the environment management problem). Once you have figured out how to isolate project-specific software environments, you will still need to have some way to manage software packages appropriately (i.e., solve the package management problem).

What I hope you will have taken away from the discussion exercise is an appreciation for the fact that in order to install project-specific software environments you need to solve two complementary challenges: environment management and package management.

Environment management

An environment management system solves a number of problems commonly encountered by (data) scientists.

An environment management system enables you to set up a new, project-specific software environment containing a specific Python version as well as the versions of additional packages and required dependencies that are all mutually compatible.

Package management

A good package management system greatly simplifies the process of installing software by…

  1. identifying and installing compatible versions of software and all required dependencies.
  2. handling the process of updating software as more recent versions become available (see the sketch below).
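
As a sketch of the second point, conda can update a single package, or every package, in the active environment with one command (the package name here is illustrative).

$ conda update scipy    # update one package in the active environment
$ conda update --all    # update every package in the active environment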

If you use some flavor of Linux, then you are probably familiar with the package manager for your Linux distribution (e.g., apt on Ubuntu, yum on CentOS); if you are a Mac OS user then you might be familiar with the Homebrew project, which brings a Linux-like package management system to Mac OS; if you are a Windows user, then you may not be terribly familiar with package managers as there isn’t really a standard package manager for Windows (although there is the Chocolatey project).

Operating system package management tools are great, but these tools actually solve a more general problem than you often face as a (data) scientist. As a (data) scientist you typically use one or two core scripting languages (e.g., Python, R, SQL). Each scripting language has multiple versions that can potentially be installed, and each scripting language will also have a large number of third-party packages that will need to be installed. The exact version of your core scripting language(s) and additional, third-party packages will also probably change from project to project.

Why use Conda (+pip)?

Whilst there are many different package and environment management systems that solve either the package management problem or the environment management problem, Conda solves both of these problems and is explicitly targeted at (data) science use cases.

Additionally, Anaconda provides commonly used data science libraries and tools, such as R, NumPy, SciPy, and TensorFlow, built using optimised, hardware-specific libraries (such as Intel’s MKL or NVIDIA’s CUDA), which provide a speedup without having to change any of your code.

Key Points

  • Conda is a platform agnostic, open source package and environment management system.

  • Using a package and environment management tool facilitates portability and reproducibility of (data) science workflows.

  • Conda (+pip) solves both the package and environment management problems and targets multiple programming languages. Other open source tools solve either one or the other, or target only a particular programming language.


Working with Environments

Overview

Teaching: 60 min
Exercises: 15 min
Questions
  • What is a Conda environment?

  • How do I create (delete) an environment?

  • How do I activate (deactivate) an environment?

  • How do I install packages into existing environments using Conda (+pip)?

  • Where should I create my environments?

  • How do I find out what packages have been installed in an environment?

  • How do I find out what environments exist on my machine?

  • How do I delete an environment that I no longer need?

Objectives
  • Understand how Conda environments can improve your research workflow.

  • Create a new environment.

  • Activate (deactivate) a particular environment.

  • Install packages into existing environments using Conda (+pip).

  • Specify the installation location of an environment.

  • List all of the existing environments on your machine.

  • List all of the installed packages within a particular environment.

  • Delete an entire environment.

What is a Conda environment?

A Conda environment is a directory that contains a specific collection of Conda packages that you have installed. For example, you may be working on a research project that requires NumPy 1.18 and its dependencies, while another environment associated with a finished project has NumPy 1.12 (perhaps because version 1.12 was the most current version of NumPy at the time the project finished). If you change one environment, your other environments are not affected. You can easily activate or deactivate environments, which is how you switch between them.
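
As a sketch of this scenario, you might maintain the two projects’ environments side by side; the environment names and version numbers below are illustrative.

$ conda create --name current-project-env numpy=1.18    # active research project
$ conda create --name finished-project-env numpy=1.12   # frozen, finished project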

Avoid installing packages into your base Conda environment

Conda has a default environment called base that includes a Python installation and some core system libraries and dependencies of Conda. It is a “best practice” to avoid installing additional packages into your base software environment. Additional packages needed for a new project should always be installed into a newly created Conda environment.

Creating environments

To create a new environment for Python development using conda you can use the conda create command.

$ conda create --name python3-env python pip

Always install pip in your Python environments

Pip, the default Python package manager, is often already installed on most operating systems (where it is used to manage any packages needed by the OS Python). Pip is also included in the Miniconda installer. Including pip as an explicit dependency in your Conda environment avoids difficult-to-debug issues that can arise when installing packages into environments using some other pip installed outside your environment.

It is a good idea to give your environment a meaningful name in order to help yourself remember the purpose of the environment. While naming things can be difficult, $PROJECT_NAME-env is a good convention to follow.

The command above will create a new Conda environment called “python3-env” and install the most recent versions of Python and pip. If you wish, you can specify particular versions of packages for conda to install when creating the environment.

$ conda create --name python36-env python=3.6 pip=20.0

Always specify a version number for each package you wish to install

In order to make your results more reproducible and to make it easier for research colleagues to recreate your Conda environments on their machines, it is a “best practice” to always explicitly specify the version number for each package that you install into an environment. If you are not sure exactly which version of a package you want to use, then you can use the conda search command to see what versions are available.

$ conda search $PACKAGE_NAME

So, for example, if you wanted to see which versions of Scikit-learn, a popular Python library for machine learning, were available, you would run the following.

$ conda search scikit-learn

As always you can run conda search --help to learn about available options.

You can create a Conda environment and install multiple packages by simply listing the packages that you wish to install.

$ conda create --name basic-scipy-env ipython=7.13 matplotlib=3.1 numpy=1.18 pip=20.0 scipy=1.4

When conda installs a package into an environment it also installs any required dependencies. For example, even though Python is not listed as a package to install into the basic-scipy-env environment above, conda will still install Python into the environment because it is a required dependency of at least one of the listed packages.

Creating a new environment

Create a new environment called “machine-learning-env” with Python and the most current versions of IPython, Matplotlib, Pandas, and Scikit-Learn.

Solution

In order to create a new environment you use the conda create command as follows.

$ conda create --name machine-learning-env \
> ipython \
> matplotlib \
> pandas \
> pip \
> python \
> scikit-learn

Since no version numbers are provided for any of the Python packages, Conda will download the most current, mutually compatible versions of the requested packages. However, since it is best practice to always provide explicit version numbers, you should prefer the following solution.

$ conda create --name machine-learning-env \
> ipython=7.13 \
> matplotlib=3.1 \
> pandas=1.0 \
> pip=20.0 \
> python=3.6 \
> scikit-learn=0.22

Activating an existing environment

Activating environments is essential to making the software in environments work well (or sometimes at all!). Activation of an environment does two things.

  1. Adds entries to PATH for the environment.
  2. Runs any activation scripts that the environment may contain.

Step 2 is particularly important as activation scripts are how packages can set arbitrary environment variables that may be necessary for their operation. You activate the basic-scipy-env environment by name using the conda activate command.

$ conda activate basic-scipy-env

You can see that an environment has been activated because the shell prompt will now include the name of the active environment.

(basic-scipy-env) $
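
If you want to confirm what activation did to your PATH, you can ask the shell which python it now finds; the path below assumes a Miniconda install in your home directory on Mac OS.

(basic-scipy-env) $ which python
/Users/$USERNAME/miniconda3/envs/basic-scipy-env/bin/python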

Deactivate the current environment

To deactivate the currently active environment use the deactivate command as follows.

(basic-scipy-env) $ conda deactivate

You can see that an environment has been deactivated because the shell prompt will no longer include the name of the previously active environment.

$

Returning to the base environment

To simply return to the base Conda environment, it’s better to call conda activate with no environment specified, rather than to use deactivate. If you run conda deactivate from your base environment, you may lose the ability to run conda commands at all. Don’t worry if you encounter this undesirable state! Just start a new shell.

Activate an existing environment by name

Activate the machine-learning-env environment created in the previous challenge by name.

Solution

In order to activate an existing environment by name you use the conda activate command as follows.

$ conda activate machine-learning-env

Deactivate the active environment

Deactivate the machine-learning-env environment that you activated in the previous challenge.

Solution

In order to deactivate the active environment you use the conda deactivate command.

(active-environment-name) $ conda deactivate

Installing a package into an existing environment

You can install a package into an existing environment using the conda install command. This command accepts a list of package specifications (e.g., numpy=1.18) and installs a set of packages consistent with those specifications and compatible with the underlying environment. If full compatibility cannot be assured, an error is reported and the environment is not changed.

By default the conda install command will install packages into the current, active environment. The following would activate the basic-scipy-env we created above and install Numba, an open source JIT compiler that translates a subset of Python and NumPy code into fast machine code, into the active environment.

$ conda activate basic-scipy-env
$ conda install numba

As was the case when listing packages to install when using the conda create command, if version numbers are not explicitly provided, Conda will attempt to install the newest versions of any requested packages. To accomplish this, Conda may need to update some packages that are already installed or install additional packages. It is always a good idea to explicitly provide version numbers when installing packages with the conda install command. For example, the following would install a particular version of Scikit-Learn into the current, active environment.

$ conda install scikit-learn=0.22

Freezing installed packages

To prevent existing packages from being updated when using the conda install command, you can use the --freeze-installed option. This may force Conda to install older versions of the requested packages in order to maintain compatibility with previously installed packages. Using the --freeze-installed option does not prevent additional dependency packages from being installed.
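
For example, the following sketch adds a new package to the active environment without updating anything already installed (the package and version are illustrative).

(basic-scipy-env) $ conda install --freeze-installed sympy=1.5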

Installing a package into a specific environment

Dask provides advanced parallelism for data science workflows, enabling performance at scale for the core Python data science tools such as NumPy, Pandas, and Scikit-Learn. Have a read through the official documentation for the conda install command and see if you can figure out how to install Dask into the machine-learning-env that you created in the previous challenge.

Solution

You can install Dask into machine-learning-env using the conda install command as follows.

$ conda install --name machine-learning-env dask=2.16

You could also install Dask into machine-learning-env by first activating that environment and then using the conda install command.

$ conda activate machine-learning-env
$ conda install dask=2.16

Installing packages into Conda environments using pip

Combo is a comprehensive Python toolbox for combining machine learning models and scores. Model combination can be considered as a subtask of ensemble learning, and has been widely used in real-world tasks and data science competitions like Kaggle.

Activate the machine-learning-env you created in a previous challenge and use pip to install combo.

Solution

The following commands will activate the machine-learning-env and install combo.

$ conda activate machine-learning-env
$ pip install combo==0.1.*

For more details on using pip see the official documentation.

Where do Conda environments live?

Environments created with conda, by default, live in the envs/ folder of your miniconda3 directory, the absolute path to which will look something like the following.

/Users/$USERNAME/miniconda3/envs

Running ls on your ~/miniconda3/envs/ directory will list out the directories containing the existing Conda environments.
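
For example, after creating the environments above you might see something like the following (the exact names will depend on which environments you have created).

$ ls ~/miniconda3/envs/
basic-scipy-env    machine-learning-env    python36-env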

Location of Conda environments on Binder

If you are working through these lessons using a Binder instance, then the default location of the Conda environments is slightly different.

/srv/conda/envs

Running ls /srv/conda/envs/ from a terminal will list out the directories containing any previously installed Conda environments.

How do I specify a location for a Conda environment?

You can control where a Conda environment lives by providing a path to a target directory when creating the environment. For example, the following command will create a new environment in a sub-directory of the current working directory called env.

$ conda create --prefix ./env ipython=7.13 matplotlib=3.1 pandas=1.0 python=3.6

You activate an environment created with a prefix using the same command used to activate environments created by name.

$ conda activate ./env

It is a good idea to always specify a path to a sub-directory of your project directory when creating an environment. Why?

  1. Makes it easy to tell if your project utilizes an isolated environment by including the environment as a sub-directory.
  2. Makes your project more self-contained as everything including the required software is contained in a single project directory.

An additional benefit of creating your project’s environment inside a sub-directory is that you can then use the same name for all your environments; if you keep all of your environments in your ~/miniconda3/envs/ folder, you’ll have to give each of them a different name.

Conda environment sub-directory naming convention

In order to be consistent with the convention used by tools such as venv and Pipenv, I recommend using env as the name of the sub-directory of your project directory that contains your Conda environment. A benefit of maintaining the convention is that your environment sub-directory will be automatically ignored by the default Python .gitignore file used on GitHub.

Whatever naming convention you adopt, it is important to be consistent! Using the same name for all of your Conda environments allows you to use the same activate command as well.

$ cd my-project/
$ conda activate ./env

Creating a new environment as a sub-directory within a project directory

First create a project directory called project-dir using the following command.

$ mkdir project-dir
$ cd project-dir

Next, create a new environment inside the newly created project-dir in a sub-directory called env and install Python 3.6, version 3.1 of Matplotlib, version 2.1 of TensorFlow, and version 20.0 of pip.

Solution

project-dir $ conda create --prefix ./env \
> python=3.6 \
> matplotlib=3.1 \
> tensorflow=2.1 \
> pip=20.0

Placing Conda environments outside of the default ~/miniconda3/envs/ folder comes with a couple of minor drawbacks. First, conda can no longer find your environment with the --name flag; you’ll generally need to pass the --prefix flag along with the environment’s full path to find the environment.
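
For example, to install an additional package into an environment living inside your project directory you would use something like the following, assuming your current working directory is the project directory (the package and version are illustrative).

$ conda install --prefix ./env scipy=1.4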

Second, an annoying side-effect of specifying an install path when creating your Conda environments is that your command prompt is now prefixed with the active environment’s absolute path rather than the environment’s name. After activating an environment using its prefix your prompt will look similar to the following.

(/absolute/path/to/env) $

As you can imagine, this can quickly get out of hand.

(/Users/USER_NAME/research/data-science/PROJECT_NAME/env) $

If (like me!) you find this long prefix to your shell prompt annoying, then there is a quick fix: modify the env_prompt setting in your .condarc file, which you can do with the following command.

$ conda config --set env_prompt '({name})'

This will either edit your ~/.condarc file if you already have one or create a ~/.condarc file if you do not. Now your command prompt will display the active environment’s generic name.

$ cd project-directory
$ conda activate ./env
(env) project-directory $

For more on modifying your .condarc file, see the official Conda docs.

Activate an existing environment by path

Activate the environment created in a previous challenge using the path to the environment directory.

Solution

You can activate an existing environment by providing the path to the environment directory instead of the environment name when using the conda activate command as follows.

$ conda activate ./env

Note that the provided path can either be absolute or relative. If the path is a relative path then it must start with ./ on Unix systems and .\ on Windows.

Conda can create environments for R projects too!

First create a project directory called r-project-dir using the following command.

$ mkdir r-project-dir
$ cd r-project-dir

Next, take a look through the list of R packages available by default for installation using conda. Create a new environment inside the newly created r-project-dir in a sub-directory called env and install r-base (and any other R packages that you think look interesting).

Solution

r-project-dir $ conda create --prefix ./env \
> r-base \
> r-tidyverse \
> r-sparklyr

Listing existing environments

Now that you have created a number of Conda environments on your local machine you have probably forgotten the names of all of the environments and exactly where they live. Fortunately, there is a conda command to list all of your existing environments together with their locations.

$ conda env list
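
The output will look something like the following, with an asterisk marking the currently active environment (the names and paths shown here are illustrative).

# conda environments:
#
base                  *  /Users/$USERNAME/miniconda3
basic-scipy-env          /Users/$USERNAME/miniconda3/envs/basic-scipy-env
machine-learning-env     /Users/$USERNAME/miniconda3/envs/machine-learning-env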

Listing the contents of an environment

In addition to forgetting the names and locations of Conda environments, at some point you will probably forget exactly what has been installed in a particular Conda environment. Again, there is a conda command for listing the contents of an environment. To list the contents of the basic-scipy-env that you created above, run the following command.

$ conda list --name basic-scipy-env

If you created your Conda environment using the --prefix option to install packages into a particular directory, then you will need to use that prefix in order for conda to locate the environment on your machine.

$ conda list --prefix /path/to/conda-env

Listing the contents of a particular environment.

List the packages installed in the machine-learning-env environment that you created in a previous challenge.

Solution

You can list the packages and their versions installed in machine-learning-env using the conda list command as follows.

$ conda list --name machine-learning-env

To list the packages and their versions installed in the active environment simply leave off the --name or --prefix option.

$ conda list

Deleting entire environments

Occasionally, you will want to delete an entire environment. Perhaps you were experimenting with conda commands and you created an environment you have no intention of using; perhaps you no longer need an existing environment and just want to get rid of cruft on your machine. Whatever the reason, the command to delete an environment is the following.

$ conda remove --name my-first-conda-env --all

If you wish to delete an environment that you created with a --prefix option, then you will need to provide the prefix again when removing the environment.

$ conda remove --prefix /path/to/conda-env/ --all

Delete an entire environment

Delete the entire “basic-scipy-env” environment.

Solution

In order to delete an entire environment you use the conda remove command as follows.

$ conda remove --name basic-scipy-env --all --yes

This command will remove all packages from the named environment before removing the environment itself. The use of the --yes flag short-circuits the confirmation prompt (and should be used with caution).

Key Points

  • A Conda environment is a directory that contains a specific collection of Conda packages that you have installed.

  • You create (remove) a new environment using the conda create (conda remove) commands.

  • You activate (deactivate) an environment using the conda activate (conda deactivate) commands.

  • You install packages into environments using conda install; you install packages into an active environment using pip install.

  • You should install each environment as a sub-directory inside its corresponding project directory.

  • Use the conda env list command to list existing environments and their respective locations.

  • Use the conda list command to list all of the packages installed in an environment.


Sharing Environments

Overview

Teaching: 30 min
Exercises: 15 min
Questions
  • Why should I share my Conda environment with others?

  • How do I share my Conda environment with others?

  • How do I create a custom kernel for my Conda environments inside JupyterLab?

Objectives
  • Create an environment from a YAML file that can be read by Windows, Mac OS, or Linux.

  • Create an environment based on exact package versions.

  • Create a custom kernel for a Conda environment for use inside JupyterLab and Jupyter notebooks.

Working with environment files

When working on a collaborative research project it is often the case that your operating system might differ from the operating systems used by your collaborators. Similarly, the operating system used on a remote cluster to which you have access will likely differ from the operating system that you use on your local machine. In these cases it is useful to create an operating system agnostic environment file which you can share with collaborators or use to re-create an environment on a remote cluster.

Creating an environment file

In order to make sure that your environment is truly shareable, you need to make sure that the contents of your environment are described in such a way that the resulting environment file can be used to re-create your environment on Linux, Mac OS, and Windows. Conda uses YAML (“YAML Ain’t Markup Language”) for writing its environment files. YAML is a human-readable data-serialization language that is commonly used for configuration files and that uses Python-style indentation to indicate nesting.

Creating your project’s Conda environment from a single environment file is a Conda “best practice”. Not only do you have a file to share with collaborators, but you also have a file that can be placed under version control, further enhancing the reproducibility of your research project and workflow.

Default environment.yml file

Note that by convention Conda environment files are called environment.yml. As such, if you use the conda env create sub-command without passing the --file option, then conda will expect to find a file called environment.yml in the current working directory and will throw an error if a file with that name cannot be found.

Let’s take a look at a few example environment.yml files to give you an idea of how to write your own environment files.

name: machine-learning-env

dependencies:
  - ipython
  - matplotlib
  - pandas
  - pip
  - python
  - scikit-learn

This environment.yml file would create an environment called machine-learning-env with the most current and mutually compatible versions of the listed packages (including all required dependencies). The newly created environment would be installed inside the ~/miniconda3/envs/ directory. Alternatively, if we intended this environment file to be used to create an environment inside a sub-directory called ./env of the project directory, then we should set the name key to null as follows.

name: null

dependencies:
  - ipython
  - matplotlib
  - pandas
  - pip
  - python
  - scikit-learn

Finally, since explicit version numbers for all packages should be preferred, a better environment file would be the following.

name: null

dependencies:
  - ipython=7.13
  - matplotlib=3.1
  - pandas=1.0
  - pip=20.0
  - python=3.6
  - scikit-learn=0.22

Note that we are only specifying the major and minor version numbers and not the patch or build numbers. Defining the version number by fixing only the major and minor version numbers while allowing the patch version number to vary allows us to use our environment file to update our environment to get any bug fixes whilst still maintaining significant consistency of our Conda environment across updates.

Always version control your environment.yml files!

While you should never version control the contents of your env/ environment sub-directory, you should always version control your environment.yml files. Version controlling your environment.yml files together with your project’s source code means that you always know which versions of which packages were used to generate your results at any particular point in time.
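
A minimal sketch of this workflow, assuming your project directory is already a Git repository, might look as follows.

$ echo "env/" >> .gitignore                       # never commit the environment itself
$ git add environment.yml .gitignore
$ git commit -m "Pin project dependencies in environment.yml"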

Let’s suppose that you want to use the environment.yml file defined above to create a Conda environment in a sub-directory of some project directory. Here is how you would accomplish this task.

$ cd project-dir
$ conda env create --prefix ./env --file environment.yml
$ conda activate ./env

Note that the above sequence of commands assumes that the environment.yml file is stored within your project-dir directory.

Beware the conda env export command

Many other Conda tutorials (including the official documentation) encourage the use of the conda env export command to export an existing environment. For example, to export the packages installed into the previously created machine-learning-env you would run the following command.

$ conda env export --name machine-learning-env --no-builds

When you run this command, you will see the resulting YAML formatted representation of your Conda environment streamed to the terminal. Recall that we only listed five packages when we originally created machine-learning-env yet from the output of the conda env export command we see that these five packages result in an environment with roughly 80 dependencies!
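
If you do want to capture this output you can redirect it to a file (but note the caveats discussed below).

$ conda env export --name machine-learning-env --no-builds > exported-environment.yml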

In practice this command does not consistently produce environments that are reproducible across Mac OS, Windows, and Linux. The issue is that even after removing the build numbers (by passing the --no-builds option), an environment file exported from an environment created on, say Mac OS, will often still contain Mac OS specific packages that will not exist for Windows or Linux.

Create a new environment from a YAML file.

Create a new project directory and then create a new environment.yml file inside your project directory with the following contents.

name: xgboost-env

dependencies:
  - ipython=7.13
  - matplotlib=3.1
  - pandas=1.0
  - pip=20.0
  - python=3.6
  - scikit-learn=0.22
  - xgboost=1.0

Now use this file to create a new Conda environment. Where is this new environment created? Using the same environment.yml file create a Conda environment as a sub-directory called env/ inside a newly created project directory. Compare the contents of the two environments.

Solution

To create a new environment from a YAML file use the conda env create sub-command as follows.

$ mkdir project-dir
$ cd project-dir
$ nano environment.yml
$ conda env create --file environment.yml

The above sequence of commands will create a new Conda environment inside the ~/miniconda3/envs directory. In order to create the Conda environment inside a sub-directory of the project directory you need to pass the --prefix option to the conda env create command as follows.

$ conda env create --file environment.yml --prefix ./env

You can now run the conda env list command and see that these two environments have been created in different locations but contain the same packages.

Updating an environment

You are unlikely to know ahead of time which packages (and version numbers!) you will need to use for your research project. For example, it may be the case that…

  • one of your core dependencies just released a new version (dependency version number update);

  • you need an additional package for data analysis (add a new dependency);

  • you have found a better visualization package and no longer need the old one (add a new dependency and remove an old dependency).

If any of these occurs during the course of your research project, all you need to do is update the contents of your environment.yml file accordingly and then run the following command.

$ conda env update --prefix ./env --file environment.yml  --prune

Note that the --prune option causes Conda to remove any dependencies that are no longer required from the environment.

Rebuilding a Conda environment from scratch

When working with environment.yml files it is often just as easy to rebuild the Conda environment from scratch whenever you need to add or remove dependencies. To rebuild a Conda environment from scratch you simply pass the --force option to the conda env create command, which will remove any existing environment directory before rebuilding it using the provided environment file.

$ conda env create --prefix ./env --file environment.yml --force

Add Dask to the environment to scale up your analytics

Add Dask to the xgboost-env environment file and update the environment. Dask provides advanced parallelism for data science workflows, enabling performance at scale for the core Python data science tools such as NumPy, Pandas, and Scikit-Learn.

Solution

The environment.yml file should now look as follows.

name: xgboost-env

dependencies:
  - dask=2.16
  - dask-ml=1.4
  - dask-xgboost=0.1
  - ipython=7.13
  - matplotlib=3.1
  - pandas=1.0
  - pip=20.0
  - python=3.6
  - scikit-learn=0.22
  - xgboost=1.0

The following command will rebuild the environment with the new Dask dependencies.

$ conda env create --prefix ./env --file environment.yml --force

Making Jupyter aware of your Conda environments

Both JupyterLab and Jupyter Notebooks automatically ensure that the standard IPython kernel is always available by default. However, if you want to use a kernel based on a particular Conda environment from inside Jupyter (and Jupyter is not installed inside your environment) then you will need to create a kernel spec file for your Conda environments manually.

Before you can create a custom kernel for your Conda environment you need to make sure that the ipykernel package is installed in your Conda environment, as you will need to use this package to create the kernel spec file. Here is the updated xgboost-env environment.yml file that includes the ipykernel package.

name: xgboost-env

dependencies:
  - ipykernel=5.3
  - ipython=7.13
  - matplotlib=3.1
  - pandas=1.0
  - pip=20.0
  - python=3.6
  - scikit-learn=0.22
  - xgboost=1.0

Next, rebuild the Conda environment using the following command.

$ conda env create --prefix ./env --file environment.yml --force

Once the Conda environment has been re-built you can activate the environment and then create the custom kernel for the activated environment.

$ conda activate ./env
$ python -m ipykernel install --user --name xgboost-env --display-name "XGBoost"

The last command installs a kernel spec file for the current environment. Kernel spec files are JSON files which can be viewed and changed with a normal text editor. The --name value is used by Jupyter internally; --display-name is what you see in the JupyterLab launcher menu as well as the Jupyter Notebook dropdown kernel menu. This command will overwrite any existing kernel with the same name.
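
You can confirm that the kernel spec file was installed, and see where it lives, using the jupyter kernelspec command; the paths shown below are illustrative and will depend on your operating system and install locations.

$ jupyter kernelspec list
Available kernels:
  python3       /Users/$USERNAME/miniconda3/share/jupyter/kernels/python3
  xgboost-env   /Users/$USERNAME/.local/share/jupyter/kernels/xgboost-env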

Create a kernel for a Conda environment

Create a custom kernel for the machine-learning-env environment created in a previous challenge.

Solution

First activate the environment, then use the ipykernel module to create the kernel spec file as follows.

$ conda activate machine-learning-env
$ python -m ipykernel install --user --name machine-learning-env

Note that by leaving the --display-name unspecified, the display name will match the value provided to --name.

Key Points

  • Sharing Conda environments with other researchers facilitates the reproducibility of your research.

  • Create an environment.yml file that describes your project’s software environment.

  • Creating custom kernels enables you to connect your Conda environments to an existing JupyterLab install.


Using Packages and Channels

Overview

Teaching: 20 min
Exercises: 10 min
Questions
  • What are Conda channels?

  • What are Conda packages?

  • Why should I be explicit about which channels my research project uses?

Objectives
  • Install a package from a specific channel.

What are Conda packages?

A conda package is a compressed tarball file (.tar.bz2) that contains:

  • system-level libraries;

  • Python or other modules;

  • executable programs and other components;

  • metadata about the package.

Conda keeps track of the dependencies between packages and platforms; the conda package format is identical across platforms and operating systems.

Package Structure

All conda packages have a specific sub-directory structure inside the tarball file. There is a bin directory that contains any binaries for the package; a lib directory containing the relevant library files (e.g., the .py files); and an info directory containing package metadata. For more details of the conda package specification, including discussion of the various metadata files, see the docs.

As an example of Conda package structure, consider the Conda package for the Python 3.6 version of PyTorch targeting 64-bit Mac OS, pytorch-1.1.0-py3.6_0.tar.bz2.

.
├── bin
│   └── convert-caffe2-to-onnx
│   └── convert-onnx-to-caffe2
├── info
│   ├── LICENSE.txt
│   ├── about.json
│   ├── files
│   ├── git
│   ├── has_prefix.json
│   ├── hash_input.json
│   ├── index.json
│   ├── paths.json
│   ├── recipe/
│   └── test/
└── lib
    └── python3.6
        └── site-packages
            ├── caffe2/
            ├── torch/
            └── torch-1.1.0-py3.6.egg-info/

A complete listing of available PyTorch packages can be found on Anaconda Cloud.

What are Conda channels?

Again from the Conda documentation: conda packages are downloaded from remote channels, which are URLs to directories containing conda packages. The conda command searches a default set of channels, and packages are automatically downloaded and updated from the Anaconda Cloud channels.

Collectively, the Anaconda managed channels are referred to as the defaults channel because, unless otherwise specified, packages installed using conda will be downloaded from these channels.
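
You can check which channels conda is currently configured to search using the conda config command; on a fresh install the output will look like the following.

$ conda config --show channels
channels:
  - defaults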

The conda-forge channel

In addition to the default channels that are managed by Anaconda Inc., there is another channel called conda-forge that also has a special status. The Conda-Forge project “is a community led collection of recipes, build infrastructure and distributions for the conda package manager.”

There are a few reasons that you may wish to use the conda-forge channel instead of the defaults channel maintained by Anaconda:

  1. Packages on conda-forge may be more up-to-date than those on the defaults channel.
  2. There are packages on the conda-forge channel that aren’t available from defaults.
  3. You may wish to use a dependency such as openblas (from conda-forge) instead of mkl (from defaults).

How do I install a package from a specific channel?

You can install a package from a specific channel into the currently active environment by passing the --channel option to the conda install command as follows.

$ conda install scipy=1.3 --channel conda-forge

You can also install a package from a specific channel into a named environment (using --name) or into an environment installed at a particular prefix (using --prefix). For example, the following command installs the scipy package from the conda-forge channel into the environment called my-first-conda-env, which we created earlier.

$ conda install scipy=1.3 --channel conda-forge --name my-first-conda-env

This command would install the tensorflow package from the conda-forge channel into an environment installed into the env/ sub-directory.

$ conda install tensorflow=1.13 --channel conda-forge --prefix ./env

Here is another example for R users. The following command would install the r-tidyverse package from the r channel into an environment installed into the env/ sub-directory.

$ conda install r-tidyverse=1.2 --channel r --prefix ./env

In this case the --channel option is unnecessary because the r channel is included by default. The following works just as well!

$ conda install r-tidyverse=1.2 --prefix ./env

Channel priority

You may specify multiple channels for installing packages by passing the --channel argument multiple times.

$ conda install scipy=1.3 --channel conda-forge --channel bioconda

Channel priority decreases from left to right: the first argument has higher priority than the second. For reference, bioconda is a channel for the conda package manager specializing in bioinformatics software. For those interested in learning more about the Bioconda project, check out the project’s GitHub page.

My package isn’t available on the defaults channel! What should I do?

It may very well be the case that packages (or often more recent versions of packages!) that you need to install for your project are not available on the defaults channel. In this case you should try the following.

  1. conda-forge: the conda-forge channel contains a large number of community curated conda packages. Typically the most recent versions of packages that are generally available via the defaults channel are available on conda-forge first.
  2. pip: only if a package is not otherwise available via conda-forge (or some domain-specific channel like bioconda) should a package be installed into a conda environment from PyPI using pip.

For example, Kaggle publishes a Python 3 API that can be used to interact with Kaggle datasets, kernels and competition submissions. You can search for the package on the defaults channels but you will not find it!

$ conda search kaggle
Loading channels: done
No match found for: kaggle. Search: *kaggle*

PackagesNotFoundError: The following packages are not available from current channels:

  - kaggle

Current channels:

  - https://repo.anaconda.com/pkgs/main/osx-64
  - https://repo.anaconda.com/pkgs/main/noarch
  - https://repo.anaconda.com/pkgs/free/osx-64
  - https://repo.anaconda.com/pkgs/free/noarch
  - https://repo.anaconda.com/pkgs/r/osx-64
  - https://repo.anaconda.com/pkgs/r/noarch

To search for alternate channels that may provide the conda package you're
looking for, navigate to

    https://anaconda.org

and use the search bar at the top of the page.

The official installation instructions suggest downloading the kaggle package using pip. But since we are using conda, we should check whether the package exists on at least the conda-forge channel before proceeding to use pip.

$ conda search conda-forge::kaggle
Loading channels: done
# Name                       Version           Build  Channel             
kaggle                         1.5.3          py27_1  conda-forge         
kaggle                         1.5.3          py36_1  conda-forge         
kaggle                         1.5.3          py37_1  conda-forge         
kaggle                         1.5.4          py27_0  conda-forge         
kaggle                         1.5.4          py36_0  conda-forge         
kaggle                         1.5.4          py37_0  conda-forge         

Once we know that the kaggle package is available via conda-forge we can go ahead and install it! Note that we are explicitly providing both the channel to use when installing the kaggle package as well as a specific version number.

$ conda install conda-forge::kaggle=1.5.4  --prefix ./env

For the moment, let us suppose that the kaggle package was not available on conda-forge. Here is how we would install the package into our environment using pip.

  1. Use conda to install pip into our environment (if necessary).
  2. Activate the environment (if necessary).
  3. Use pip to install kaggle.

$ conda install pip --prefix ./env
$ conda activate ./env
$ pip install kaggle==1.5.*

Installing via pip in environment.yml files

Since you write environment.yml files for all of your projects, you might be wondering how to specify that packages should be installed using pip in the environment.yml file. Here is an example environment.yml file that uses pip to install the kaggle and yellowbrick packages.

name: null

dependencies:
 - jupyterlab=1.0
 - matplotlib=3.1
 - pandas=0.24
 - scikit-learn=0.21
 - pip=19.1
 - pip:
   - kaggle==1.5
   - yellowbrick==0.9

Note that you should include pip itself as a dependency and then a sub-section denoting those packages to be installed via pip. Also, in case you are wondering, the Yellowbrick package is a suite of visual diagnostic tools called “Visualizers” that extend the Scikit-Learn API to allow human steering of the model selection process. Yellowbrick can also be installed using conda from the districtdatalabs channel.

$ conda install --channel districtdatalabs yellowbrick=0.9 --prefix ./env

What actually happens when I install packages?

During the installation process, files are extracted into the specified environment (defaulting to the current environment if none is specified). Installing the files of a conda package into an environment can be thought of as changing the directory to an environment, and then downloading and extracting the package and its dependencies.

For example, when you conda install a package that exists in a channel and has no dependencies, conda does the following.

  1. looks at your configured channels (in priority)
  2. reaches out to the repodata associated with your channels/platform
  3. parses repodata to search for the package
  4. once the package is found, conda pulls it down and installs it.

The conda documentation has a nice decision tree that describes the package installation process.

(Figure: “Installing with Conda” decision tree)

Specifying channels when installing packages

Like many projects, PyTorch has its own channel on Anaconda Cloud. This channel has several interesting packages, in particular pytorch (PyTorch core) and torchvision (datasets, transforms, and models specific to computer vision).

Create a new directory called my-computer-vision-project and then create a Python 3.6 environment in a sub-directory called env/ with the two packages listed above. Also include the most recent version of jupyterlab in your environment (so you have a nice UI) and matplotlib (so you can make plots).

Solution

In order to create a new environment you use the conda create command as follows.

$ mkdir my-computer-vision-project
$ cd my-computer-vision-project/
$ conda create --prefix ./env --channel pytorch \
> python=3.6 \
> jupyterlab=1.0 \
> pytorch=1.1 \
> torchvision=0.3 \
> matplotlib=3.1

Alternative syntax for installing packages from specific channels

There exists an alternative syntax for installing conda packages from specific channels that more explicitly links the channel being used to install a particular package.

$ conda install conda-forge::tensorflow  --prefix ./env

Repeat the previous exercise using this alternative syntax to install python, jupyterlab, and matplotlib from the conda-forge channel and pytorch and torchvision from the pytorch channel.

Solution

One possibility would be to use the conda create command as follows.

$ mkdir my-computer-vision-project
$ cd my-computer-vision-project/
$ conda create --prefix ./env \
> conda-forge::python=3.6 \
> conda-forge::jupyterlab=1.0 \
> conda-forge::matplotlib=3.1 \
> pytorch::pytorch=1.1 \
> pytorch::torchvision=0.3

Key Points

  • A package is a tarball containing system-level libraries, Python or other modules, executable programs and other components, and associated metadata.

  • A Conda channel is a URL to a directory containing Conda packages.

  • Explicitly including the channels (and their priority!) in a project’s environment file is necessary for another researcher to completely re-create that project’s software environment.


Managing GPU dependencies

Overview

Teaching: 45 min
Exercises: 15 min
Questions
  • Which NVIDIA libraries are available via Conda?

  • What do you do when you need the NVIDIA CUDA Compiler (NVCC) for your project?

Objectives
  • Show how to use Conda to manage key GPU dependencies for your next (data) science project.

  • Show how to identify which versions of CUDA packages are available via Conda.

  • Understand how to write a Conda environment file for a project with GPU dependencies.

  • Understand when you need the NVIDIA CUDA Compiler (NVCC) and how to handle this situation.

Getting familiar with NVIDIA CUDA libraries

Transitioning your (data) science projects from CPU to GPU can seem like a daunting task. In particular, there is quite a bit of unfamiliar additional software, such as the NVIDIA CUDA Toolkit, the NVIDIA Collective Communications Library (NCCL), and the NVIDIA CUDA Deep Neural Network library (cuDNN), to download and install.

If you go to the NVIDIA developer website you will find loads of documentation and instructions for how to install these libraries system-wide. But then what do you do if you need different versions of these libraries for different projects? You could install a bunch of different versions of the NVIDIA CUDA Toolkit, NCCL, and cuDNN system-wide and then use environment variables to control the “active” versions for each project, but this is cumbersome and error-prone. Fortunately there are better ways!

In this episode we are going to see how to manage project-specific versions of the NVIDIA CUDA Toolkit, NCCL, and cuDNN using Conda.

Are NVIDIA libraries available via Conda?

Yep! The most important NVIDIA CUDA library that you will need is the NVIDIA CUDA Toolkit. The NVIDIA CUDA Toolkit provides a development environment for creating high performance GPU-accelerated applications. The toolkit includes GPU-accelerated libraries, debugging and optimization tools and a runtime library. You can use the conda search command to see what versions of the NVIDIA CUDA Toolkit are available from the default channels.

$ conda search cudatoolkit

Loading channels: done

# Name                       Version           Build  Channel
cudatoolkit                      9.0      h13b8566_0  pkgs/main
cudatoolkit                      9.2               0  pkgs/main
cudatoolkit                 10.0.130               0  pkgs/main
cudatoolkit                 10.1.168               0  pkgs/main
cudatoolkit                 10.1.243      h6bb024c_0  pkgs/main
cudatoolkit                  10.2.89      hfd86e86_0  pkgs/main
cudatoolkit                  10.2.89      hfd86e86_1  pkgs/main

NVIDIA actually maintains its own Conda channel, and the versions of the CUDA Toolkit available from the default channels are the same as those you will find on the NVIDIA channel. If you are interested in confirming this, you can run the command

$ conda search --channel nvidia cudatoolkit

and then compare the build numbers with those listed above from the default channels.

The CUDA Toolkit packages available from defaults do not include NVCC

An important limitation of the versions of the NVIDIA CUDA Toolkit that are available from either the defaults or NVIDIA Conda channels is that they do not include the NVIDIA CUDA Compiler (NVCC).

What about cuDNN?

The NVIDIA CUDA Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep neural networks. cuDNN provides highly tuned implementations for standard routines such as forward and backward convolution, pooling, normalization, and activation layers.

If you are interested in deep learning, then you will need to get your hands on cuDNN. Various versions of cuDNN are available from the default channels.

$ conda search cudnn

Loading channels: done

# Name                       Version           Build  Channel
cudnn                          7.0.5       cuda8.0_0  pkgs/main
cudnn                          7.1.2       cuda9.0_0  pkgs/main
cudnn                          7.1.3       cuda8.0_0  pkgs/main
cudnn                          7.2.1       cuda9.2_0  pkgs/main
cudnn                          7.3.1      cuda10.0_0  pkgs/main
cudnn                          7.3.1       cuda9.0_0  pkgs/main
cudnn                          7.3.1       cuda9.2_0  pkgs/main
cudnn                          7.6.0      cuda10.0_0  pkgs/main
cudnn                          7.6.0      cuda10.1_0  pkgs/main
cudnn                          7.6.0       cuda9.0_0  pkgs/main
cudnn                          7.6.0       cuda9.2_0  pkgs/main
cudnn                          7.6.4      cuda10.0_0  pkgs/main
cudnn                          7.6.4      cuda10.1_0  pkgs/main
cudnn                          7.6.4       cuda9.0_0  pkgs/main
cudnn                          7.6.4       cuda9.2_0  pkgs/main
cudnn                          7.6.5      cuda10.0_0  pkgs/main
cudnn                          7.6.5      cuda10.1_0  pkgs/main
cudnn                          7.6.5      cuda10.2_0  pkgs/main
cudnn                          7.6.5       cuda9.0_0  pkgs/main
cudnn                          7.6.5       cuda9.2_0  pkgs/main

What about NCCL?

If you are already accelerating your (data) science workflows with a GPU, then in the near future you will probably be interested in using more than one GPU. The NVIDIA Collective Communications Library (NCCL) implements multi-GPU and multi-node collective communication primitives that are performance optimized for NVIDIA GPUs.

There are some older versions of NCCL available from the default channels but these versions will not be useful (unless, perhaps, you are forced to use very old versions of TensorFlow or similar).

$ conda search nccl

Loading channels: done

# Name                       Version           Build  Channel
nccl                           1.3.5      cuda10.0_0  pkgs/main
nccl                           1.3.5       cuda9.0_0  pkgs/main
nccl                           1.3.5       cuda9.2_0  pkgs/main

Not to worry: Conda Forge to the rescue! Conda Forge is a community-led collection of recipes, build infrastructure and distributions for the Conda package manager. I always check the conda-forge channel when I can’t find something I need available on the default channels.

Which versions of NCCL are available via Conda Forge?

Find out which versions of the NVIDIA Collective Communications Library (NCCL) are available via Conda Forge.

Solution

Use the conda search command with the --channel conda-forge option.

$ conda search --channel conda-forge nccl

Loading channels: done

# Name                       Version           Build  Channel
nccl                           1.3.5      cuda10.0_0  pkgs/main
nccl                           1.3.5       cuda9.0_0  pkgs/main
nccl                           1.3.5       cuda9.2_0  pkgs/main
nccl                         2.4.6.1      h51cf6c1_0  conda-forge
nccl                         2.4.6.1      h7cc98d6_0  conda-forge
nccl                         2.4.6.1      hc6a2c23_0  conda-forge
nccl                         2.4.6.1      hd6f8bf8_0  conda-forge
nccl                         2.4.7.1      h51cf6c1_0  conda-forge
nccl                         2.4.7.1      h7cc98d6_0  conda-forge
nccl                         2.4.7.1      hd6f8bf8_0  conda-forge
nccl                         2.4.8.1      h51cf6c1_0  conda-forge
nccl                         2.4.8.1      h51cf6c1_1  conda-forge
nccl                         2.4.8.1      h7cc98d6_0  conda-forge
nccl                         2.4.8.1      h7cc98d6_1  conda-forge
nccl                         2.4.8.1      hd6f8bf8_0  conda-forge
nccl                         2.4.8.1      hd6f8bf8_1  conda-forge
nccl                         2.5.6.1      h51cf6c1_0  conda-forge
nccl                         2.5.6.1      h7cc98d6_0  conda-forge
nccl                         2.5.6.1      hc6a2c23_0  conda-forge
nccl                         2.5.6.1      hd6f8bf8_0  conda-forge
nccl                         2.5.7.1      h51cf6c1_0  conda-forge
nccl                         2.5.7.1      h7cc98d6_0  conda-forge
nccl                         2.5.7.1      hc6a2c23_0  conda-forge
nccl                         2.5.7.1      hd6f8bf8_0  conda-forge
nccl                         2.6.4.1      h51cf6c1_0  conda-forge
nccl                         2.6.4.1      h7cc98d6_0  conda-forge
nccl                         2.6.4.1      hc6a2c23_0  conda-forge
nccl                         2.6.4.1      hd6f8bf8_0  conda-forge

Some example Conda environment files

Now that you know how to figure out which versions of the various NVIDIA CUDA libraries are available on which channels you are ready to write your environment.yml file. In this section I will provide some example Conda environment files for PyTorch, TensorFlow, and NVIDIA RAPIDS to help get you started on your next GPU data science project.

PyTorch

PyTorch is an open source machine learning library based on the Torch library, used for applications such as computer vision and natural language processing. It is primarily developed by Facebook's AI Research lab. Conda is actually the recommended way to install PyTorch. The official PyTorch binaries ship with NCCL and cuDNN, so it is not necessary to include these libraries in your environment.yml file (unless some other package also needs them!).

name: null

channels:
  - pytorch
  - conda-forge
  - defaults

dependencies:
  - cudatoolkit=10.1
  - pip=20.0
  - python=3.7
  - pytorch=1.5
  - torchvision=0.6

Check your channel priorities!

Also take note of the channel priorities: the official pytorch channel must be given priority over conda-forge in order to ensure that the official PyTorch binaries (the ones that include NCCL and cuDNN) are installed; otherwise you will get an unofficial PyTorch build from conda-forge.
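After creating the environment, it is worth confirming that the official binary was actually installed. A quick check (the output below is illustrative only; the exact build string will differ by platform):

$ conda activate ./env
$ conda list pytorch

# Name                       Version                             Build  Channel
pytorch                        1.5.0   py3.7_cuda10.1.243_cudnn7.6.3_0  pytorch

The Channel column should read pytorch and the build string should mention cuda; if it reads conda-forge instead, re-check your channel priorities.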

TensorFlow

TensorFlow is a free, open-source software library for dataflow and differentiable programming across a range of tasks. It is a symbolic math library, and is also used for machine learning applications such as neural networks. There are lots of versions and builds of TensorFlow available via Conda (the output of conda search tensorflow is too long to share here!).

How do you decide which version is the "correct" version? How do you make sure that you get a build that includes GPU support? At this point you have seen all the Conda "tricks" required to solve this one yourself!
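One trick worth knowing before you attempt the exercise below: conda search accepts a name=version=build specification, and a glob in the build field lets you filter for GPU builds directly (a hedged example; exact build strings vary by platform).

$ conda search 'tensorflow=2.1=gpu*'

Builds whose build strings start with gpu_ were compiled with GPU support, while builds starting with mkl_ or eigen_ are CPU-only.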

Create an environment.yml file for TensorFlow

In this exercise you will create a Conda environment for TensorFlow. Important CUDA dependencies of TensorFlow are the CUDA Toolkit, cuDNN, and CUPTI. Don't forget that if you want to train on more than one GPU, then your environment will also need NCCL and an MPI implementation.

Solution

Use the tensorflow-gpu meta-package to select the appropriate version and build of TensorFlow for your OS; use mpi4py to get a CUDA-aware OpenMPI build.

name: null

channels:
  - conda-forge
  - defaults

dependencies:
  - cudatoolkit=10.1
  - cudnn=7.6
  - cupti=10.1
  - mpi4py=3.0 # installs cuda-aware openmpi
  - nccl=2.4
  - pip=20.0
  - python=3.7
  - tensorflow-gpu=2.1 # installs tensorflow=2.1=gpu_py37h7a4bb67_0
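Once the environment has been created, a quick way to confirm that TensorFlow can actually see your GPU is the following (a minimal check; assumes the environment was created in ./env):

$ conda activate ./env
$ python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

If the printed list is empty, TensorFlow was installed without GPU support or cannot find your CUDA libraries.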

NVIDIA RAPIDS (+BlazingSQL+Datashader)

As a final example let's see how to get started with NVIDIA RAPIDS (and friends!). NVIDIA RAPIDS is a suite of open source software libraries and APIs that gives you the ability to execute end-to-end data science and analytics pipelines entirely on GPUs (think Pandas + Scikit-learn but for GPUs instead of CPUs). In addition to RAPIDS, we have also included BlazingSQL in this example environment file. BlazingSQL is an open-source SQL interface for extract-transform-load (ETL) of massive datasets directly into GPU memory for analysis using NVIDIA RAPIDS. The environment also includes Datashader, a graphics pipeline system for creating meaningful representations of large datasets quickly and flexibly, which can be accelerated with GPUs. If you are going to do all of your data analysis on the GPU, then you might as well do your data visualization on the GPU too!

name: null 

channels:  
  - blazingsql
  - rapidsai
  - nvidia
  - conda-forge
  - defaults

dependencies:  
  - blazingsql=0.13
  - cudatoolkit=10.1
  - datashader=0.10
  - pip=20.0
  - python=3.7
  - rapids=0.13

What exactly is installed when you install NVIDIA RAPIDS?

Create an NVIDIA RAPIDS environment using the Conda environment file above. Use Conda commands to inspect the complete list of packages that have been installed. Do you recognize any of the packages?

Solution

Use the following commands to create the environment.

$ mkdir nvidia-rapids-project
$ cd nvidia-rapids-project/
$ nano environment.yml # copy-paste the environment file contents
$ conda env create --prefix ./env --file environment.yml

Use the following commands to activate the environment and list all the packages installed.

$ conda activate ./env
$ conda list
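The resulting list is long, but conda list also accepts a regular expression, which makes it easy to pick out the core RAPIDS component libraries (a hedged example; the package names below are the main RAPIDS components).

$ conda list "cudf|cuml|cugraph|cuspatial"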

But what if I need the NVIDIA CUDA Compiler?

Way up at the beginning of this episode I mentioned that the versions of the NVIDIA CUDA Toolkit available via the default channels do not include the NVIDIA CUDA Compiler (NVCC). However, if your data science project has dependencies that require compiling custom CUDA extensions, then you will almost surely need NVCC.

How do you know that your project dependencies need NVCC? Most likely you will try the standard approaches above, Conda will fail to create the environment, and you will perhaps see a bunch of compiler errors. Once you know that you need NVCC, there are a couple of ways to get it installed.
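Before reaching for either approach, a quick sanity check confirms whether NVCC really is missing from your activated environment (a minimal check only; compiler errors can have other causes too):

$ conda activate ./env
$ nvcc --version || echo "NVCC not found in this environment"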

First, try the cudatoolkit-dev package…

The cudatoolkit-dev package available from the conda-forge channel includes GPU-accelerated libraries, debugging and optimization tools, a C/C++ compiler, and a runtime library. The package uses a post-install script that downloads and installs the full CUDA Toolkit (NVCC compiler and libraries, with the exception of the CUDA drivers).

While the cudatoolkit-dev packages available from conda-forge do include NVCC, I have had difficulty getting these packages to consistently install properly.

That said, it is always worth trying the cudatoolkit-dev approach first. Maybe it will work for your use case. Here is an example that uses NVCC installed via the cudatoolkit-dev package to compile custom extensions from PyTorch Cluster, a small extension library of highly optimized graph cluster algorithms for use with PyTorch.

name: null

channels:
  - pytorch
  - conda-forge
  - defaults

dependencies:
  - cudatoolkit-dev=10.1
  - cxx-compiler=1.0
  - matplotlib=3.2
  - networkx=2.4
  - numba=0.48
  - pandas=1.0
  - pip=20.0
  - pip:
    - -r file:requirements.txt
  - python=3.7
  - pytorch=1.4
  - scikit-image=0.16
  - scikit-learn=0.22
  - tensorboard=2.1
  - torchvision=0.5

The requirements.txt file referenced above contains PyTorch Cluster and related packages such as PyTorch Geometric. Here is what that file looks like.

torch-scatter==2.0.*
torch-sparse==0.6.*
torch-spline-conv==1.2.*
torch-cluster==1.5.*
torch-geometric==1.4.*

# make sure the following are re-compiled if environment is re-built
--no-binary=torch-scatter
--no-binary=torch-sparse
--no-binary=torch-spline-conv
--no-binary=torch-cluster
--no-binary=torch-geometric

The use of the --no-binary option here ensures that packages with custom extensions will be re-built whenever the environment is re-built, which helps increase the reproducibility of the environment build process when porting from workstations to remote clusters that might run a different OS.
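If you ever need to reproduce this behavior outside of an environment file, pip accepts the same option directly on the command line (a sketch, applied here to a single package):

$ python -m pip install --no-binary=torch-cluster 'torch-cluster==1.5.*'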

…if that doesn’t work, then use the nvcc_linux-64 meta-package

The most robust approach to obtain NVCC while still using Conda to manage all the other dependencies is to install the NVIDIA CUDA Toolkit on your system and then install the nvcc_linux-64 meta-package from conda-forge, which configures your Conda environment to use the NVCC installed on your system together with the other CUDA Toolkit components installed inside the Conda environment. While we have found this approach to be more robust than relying on cudatoolkit-dev, it is more involved, as it requires installing a particular version of the NVIDIA CUDA Toolkit on your system first.

To demonstrate the nvcc_linux-64 meta-package approach, we will show how to build a Conda environment for deep learning projects that use Horovod to enable distributed training across multiple GPUs (either on the same node or spread across multiple nodes).

Horovod is an open-source distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. Originally developed by Uber for internal use, Horovod was subsequently open sourced and is now an official Linux Foundation AI (LFAI) project.

Typical environment.yml file

Let's check out the environment.yml file. You can find this environment.yml file on GitHub as part of a template repository to help you get started with Horovod.

name: null

channels:
  - pytorch
  - conda-forge
  - defaults

dependencies:
  - bokeh=1.4
  - cmake=3.16 # ensures that the Gloo library extensions will be built
  - cudnn=7.6
  - cupti=10.1
  - cxx-compiler=1.0 # ensures C and C++ compilers are available
  - jupyterlab=1.2
  - mpi4py=3.0 # installs cuda-aware openmpi
  - nccl=2.5
  - nodejs=13
  - nvcc_linux-64=10.1 # configures environment to be "cuda-aware"
  - pip=20.0
  - pip:
    - mxnet-cu101mkl==1.6.* # MXNET is installed prior to horovod
    - -r file:requirements.txt
  - python=3.7
  - pytorch=1.4
  - tensorboard=2.1
  - tensorflow-gpu=2.1
  - torchvision=0.5

Take note of the channel priorities: pytorch is given highest priority in order to ensure that the official PyTorch binaries are installed (and not the binaries available on conda-forge). There are also a few things worth noting about the dependencies. Even though you have installed the NVIDIA CUDA Toolkit on your system manually, you can still use Conda to manage the other required CUDA components such as cudnn and nccl (and the optional cupti).

Typical requirements.txt file

The requirements.txt file is where all of the dependencies, including Horovod itself, are listed for installation via pip. In addition to Horovod, it is recommended to use pip to install JupyterLab extensions to enable GPU and CPU resource monitoring via jupyterlab-nvdashboard and TensorBoard support via jupyter-tensorboard. Note the use of the --no-binary option at the end of the file. Including this option ensures that Horovod will be re-built whenever the Conda environment is re-built.

horovod==0.19.*
jupyterlab-nvdashboard==0.2.*
jupyter-tensorboard==0.2.*

# make sure horovod is re-compiled if environment is re-built
--no-binary=horovod

Creating the Conda environment

You can create the Conda environment in a sub-directory env of your project directory by running the following commands. By default Horovod will try to build extensions for all detected frameworks. See the Horovod documentation for details on the additional environment variables that can be set prior to building Horovod.

export ENV_PREFIX=$PWD/env
export HOROVOD_CUDA_HOME=$CUDA_HOME
export HOROVOD_NCCL_HOME=$ENV_PREFIX
export HOROVOD_GPU_OPERATIONS=NCCL
conda env create --prefix $ENV_PREFIX --file environment.yml --force

Wrap complex Conda environment builds in a script!

To enhance the reproducibility of your complex Conda build, I typically wrap these commands in a shell script called create-conda-env.sh. Running the shell script will set the Horovod build variables, create the Conda environment, activate the Conda environment, and build JupyterLab with any additional extensions specified in a postBuild script.

#!/bin/bash --login

set -e

export ENV_PREFIX=$PWD/env
export HOROVOD_CUDA_HOME=$CUDA_HOME
export HOROVOD_NCCL_HOME=$ENV_PREFIX
export HOROVOD_GPU_OPERATIONS=NCCL
conda env create --prefix $ENV_PREFIX --file environment.yml --force
conda activate $ENV_PREFIX
. postBuild
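For completeness, here is a sketch of what the postBuild script might contain for this environment (assumed contents; the exact extension names and install commands depend on your JupyterLab version):

#!/bin/bash --login

set -e

# install the JupyterLab extensions for GPU monitoring and TensorBoard support,
# deferring the (slow) JupyterLab build until all extensions are registered
jupyter labextension install --no-build jupyterlab-nvdashboard
jupyter labextension install --no-build jupyterlab_tensorboard
jupyter lab build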

You can put scripts like this inside a bin directory in your project root directory. The script should be run from the project root directory as follows.

$ ./bin/create-conda-env.sh
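After the build script finishes, Horovod provides a built-in check that reports which frameworks and collective operations were successfully compiled in.

$ conda activate ./env
$ horovodrun --check-build

The output should list TensorFlow, PyTorch, and MXNet among the available frameworks, and NCCL, MPI, and Gloo among the controllers and operations; if a framework is missing, re-check the Horovod build variables and re-create the environment.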

We covered a lot of ground in this episode! We showed you how to use conda search to see which versions of the NVIDIA CUDA Toolkit and related libraries such as NCCL and cuDNN are available via Conda. Then we walked you through example Conda environment files for several popular data science frameworks that can use GPUs. We wrapped up with a discussion of two different approaches for getting NVCC when your project requires compiling custom CUDA extensions. Hopefully these ideas will help you make the jump from CPUs to GPUs on your next data science project!

Key Points

  • Conda can be used to manage your key GPU dependencies.

  • Use conda search to identify which versions of the CUDA libraries are available.

  • For most projects you will not need NVCC and can use the cudatoolkit package from the default channels.

  • If your project does need NVCC, try the cudatoolkit-dev package first or use the nvcc_linux-64 meta-package (which requires a separate system install of the NVIDIA CUDA Toolkit).