This lesson is in the early stages of development (Alpha version)

Using Packages and Channels

Overview

Teaching: 20 min
Exercises: 10 min
Questions
  • What are Conda channels?

  • What are Conda packages?

  • Why should I be explicit about which channels my research project uses?

Objectives
  • Install a package from a specific channel.

What are Conda packages?

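A [conda package][conda-pkg-docs] is a compressed tarball file (.tar.bz2) that contains system-level libraries, Python or other modules, executable programs and other components, and associated metadata.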

Conda keeps track of the dependencies between packages and platforms; the conda package format is identical across platforms and operating systems.

Package Structure

All conda packages have a specific sub-directory structure inside the tarball file. There is a bin directory that contains any binaries for the package; a lib directory containing the relevant library files (for example, the .py files of a Python package); and an info directory containing package metadata. For more details on the conda package specification, including discussions of the various metadata files, see the [docs][conda-pkg-spec-docs].

As an example of Conda package structure, consider the Conda package for the Python 3.6 version of PyTorch targeting 64-bit macOS, pytorch-1.1.0-py3.6_0.tar.bz2.

.
├── bin
│   ├── convert-caffe2-to-onnx
│   └── convert-onnx-to-caffe2
├── info
│   ├── LICENSE.txt
│   ├── about.json
│   ├── files
│   ├── git
│   ├── has_prefix.json
│   ├── hash_input.json
│   ├── index.json
│   ├── paths.json
│   ├── recipe/
│   └── test/
└── lib
    └── python3.6
        └── site-packages
            ├── caffe2/
            ├── torch/
            └── torch-1.1.0-py3.6.egg-info/

A complete listing of available PyTorch packages can be found on Anaconda Cloud.
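
To make the layout above more concrete, here is a minimal Python sketch that builds a tiny tarball with the same bin/, info/ and lib/ structure in memory and then reads its metadata back. The package name demo-pkg, its files, and the helper functions are hypothetical and exist only for illustration; a real package is produced by conda's build tooling, not by hand like this.

```python
import io
import json
import tarfile


def build_demo_package() -> bytes:
    """Build a tiny in-memory .tar.bz2 laid out like a conda package.

    Everything in here (names, versions, file contents) is made up.
    """
    files = {
        "bin/demo-tool": b"#!/bin/sh\necho demo\n",
        "lib/python3.6/site-packages/demo/__init__.py": b"",
        "info/index.json": json.dumps(
            {
                "name": "demo-pkg",
                "version": "0.1.0",
                "build": "py36_0",
                "depends": ["python >=3.6"],
            }
        ).encode(),
    }
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:bz2") as archive:
        for path, content in files.items():
            member = tarfile.TarInfo(name=path)
            member.size = len(content)
            archive.addfile(member, io.BytesIO(content))
    return buffer.getvalue()


def inspect_package(data: bytes):
    """Return the top-level directories and the parsed info/index.json."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:bz2") as archive:
        top_level = sorted({name.split("/")[0] for name in archive.getnames()})
        index = json.load(archive.extractfile("info/index.json"))
    return top_level, index


top_level, index = inspect_package(build_demo_package())
print(top_level)  # ['bin', 'info', 'lib']
print(index["name"], index["version"], index["build"])  # demo-pkg 0.1.0 py36_0
```

The same inspect_package function could be pointed at the bytes of a downloaded .tar.bz2 package to see which metadata it declares.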

What are Conda channels?

Again from the [Conda documentation][conda-channels-docs], conda packages are downloaded from remote channels, which are URLs to directories containing conda packages. The conda command searches a default set of channels, and packages are automatically downloaded and updated from the Anaconda Cloud channels.

Collectively, the Anaconda managed channels are referred to as the defaults channel because, unless otherwise specified, packages installed using conda will be downloaded from these channels.
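
Since a channel is just a URL to a directory of packages, the URLs that conda actually searches can be sketched in a few lines of Python. The example below assumes that each channel is split into one platform-specific sub-directory (osx-64 here) plus a platform-independent noarch sub-directory, which matches the "Current channels" listing printed by conda search later in this lesson. The function name channel_urls is hypothetical.

```python
def channel_urls(base_url: str, platform_subdir: str) -> list:
    """Return the two directories searched for one channel on one platform."""
    base = base_url.rstrip("/")
    return [f"{base}/{platform_subdir}", f"{base}/noarch"]


# The channels that make up "defaults" in the conda search output
# shown later in this lesson.
defaults = [
    "https://repo.anaconda.com/pkgs/main",
    "https://repo.anaconda.com/pkgs/free",
    "https://repo.anaconda.com/pkgs/r",
]

for base in defaults:
    for url in channel_urls(base, "osx-64"):
        print(url)
```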

The conda-forge channel

In addition to the default channels that are managed by Anaconda Inc., there is another channel called conda-forge that also has a special status. The Conda-Forge project “is a community led collection of recipes, build infrastructure and distributions for the conda package manager.”

There are a few reasons that you may wish to use the conda-forge channel instead of the defaults channel maintained by Anaconda:

  1. Packages on conda-forge may be more up-to-date than those on the defaults channel.
  2. There are packages on the conda-forge channel that aren’t available from defaults.
  3. You may wish to use a dependency such as openblas (from conda-forge) instead of mkl (from defaults).

How do I install a package from a specific channel?

You can install a package from a specific channel into the currently active environment by passing the --channel option to the conda install command as follows.

$ conda install scipy=1.3 --channel conda-forge

You can also install a package from a specific channel into a named environment (using --name) or into an environment installed at a particular prefix (using --prefix). For example, the following command installs the scipy package from the conda-forge channel into the environment called my-first-conda-env, which we created earlier.

$ conda install scipy=1.3 --channel conda-forge --name my-first-conda-env

This command would install the tensorflow package from the conda-forge channel into an environment installed into the env/ sub-directory.

$ conda install tensorflow=1.13 --channel conda-forge --prefix ./env

Here is another example for R users. The following command would install the r-tidyverse package from the r channel into an environment installed into the env/ sub-directory.

$ conda install r-tidyverse=1.2 --channel r --prefix ./env

In this case the --channel option is unnecessary because the r channel is included by default. The following works just as well!

$ conda install r-tidyverse=1.2 --prefix ./env

Channel priority

You may specify multiple channels for installing packages by passing the --channel argument multiple times.

$ conda install scipy=1.3 --channel conda-forge --channel bioconda

Channel priority decreases from left to right - the first argument has higher priority than the second. For reference, bioconda is a channel for the conda package manager specializing in bioinformatics software. For those interested in learning more about the Bioconda project, check out the project’s GitHub page.
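
The left-to-right rule can be illustrated with a deliberately simplified Python sketch: walk the channels in priority order and stop at the first one that offers the package. The availability table below is invented for the example, and the real conda solver also weighs versions, builds and dependencies, so treat this as a mental model and not as conda's actual algorithm.

```python
def pick_channel(package, channels_in_priority_order, available):
    """Return the highest-priority channel that offers the package, or None."""
    for channel in channels_in_priority_order:
        if package in available.get(channel, set()):
            return channel
    return None


# Hypothetical availability table, for illustration only.
available = {
    "conda-forge": {"scipy", "kaggle"},
    "bioconda": {"scipy", "samtools"},
}
priority = ["conda-forge", "bioconda"]  # same order as on the command line

print(pick_channel("scipy", priority, available))     # conda-forge
print(pick_channel("samtools", priority, available))  # bioconda
print(pick_channel("missing", priority, available))   # None
```

Reversing the priority list would make the sketch pick bioconda for scipy, which is why the order of the --channel arguments matters.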

My package isn’t available on the defaults channel! What should I do?

It may very well be the case that packages (or often more recent versions of packages!) that you need to install for your project are not available on the defaults channel. In this case you should try the following.

  1. conda-forge: the conda-forge channel contains a large number of community curated conda packages. Typically the most recent versions of packages that are generally available via the defaults channel are available on conda-forge first.
  2. pip: only if a package is not otherwise available via conda-forge (or some domain-specific channel like bioconda) should a package be installed into a conda environment from PyPI using pip.

For example, Kaggle publishes a Python 3 API that can be used to interact with Kaggle datasets, kernels and competition submissions. You can search for the package on the defaults channels but you will not find it!

$ conda search kaggle
Loading channels: done
No match found for: kaggle. Search: *kaggle*

PackagesNotFoundError: The following packages are not available from current channels:

  - kaggle

Current channels:

  - https://repo.anaconda.com/pkgs/main/osx-64
  - https://repo.anaconda.com/pkgs/main/noarch
  - https://repo.anaconda.com/pkgs/free/osx-64
  - https://repo.anaconda.com/pkgs/free/noarch
  - https://repo.anaconda.com/pkgs/r/osx-64
  - https://repo.anaconda.com/pkgs/r/noarch

To search for alternate channels that may provide the conda package you're
looking for, navigate to

    https://anaconda.org

and use the search bar at the top of the page.

The official installation instructions suggest downloading the kaggle package using pip. But since we are using conda, we should check whether the package exists on at least the conda-forge channel before proceeding to use pip.

$ conda search conda-forge::kaggle
Loading channels: done
# Name                       Version           Build  Channel             
kaggle                         1.5.3          py27_1  conda-forge         
kaggle                         1.5.3          py36_1  conda-forge         
kaggle                         1.5.3          py37_1  conda-forge         
kaggle                         1.5.4          py27_0  conda-forge         
kaggle                         1.5.4          py36_0  conda-forge         
kaggle                         1.5.4          py37_0  conda-forge         

Once we know that the kaggle package is available via conda-forge we can go ahead and install it! Note that we are explicitly providing both the channel to use when installing the kaggle package as well as a specific version number.

$ conda install conda-forge::kaggle=1.5.4 --prefix ./env
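
To see how the channel, the package name and the version fit together in an argument such as conda-forge::kaggle=1.5.4, here is a small Python sketch that splits such a string into its parts. It only handles the simple forms used in this lesson (an optional channel:: prefix and an optional =version suffix); conda's real specification syntax is richer, and the names Spec and parse_spec are hypothetical.

```python
from typing import NamedTuple, Optional


class Spec(NamedTuple):
    channel: Optional[str]  # None when no "channel::" prefix is given
    name: str
    version: Optional[str]  # None when no "=version" suffix is given


def parse_spec(text: str) -> Spec:
    """Split 'channel::name=version' into its parts (simple forms only)."""
    channel, separator, rest = text.partition("::")
    if not separator:
        # No "::" present, so the whole string is "name" or "name=version".
        channel, rest = None, text
    name, _, version = rest.partition("=")
    return Spec(channel, name, version or None)


print(parse_spec("conda-forge::kaggle=1.5.4"))
print(parse_spec("scipy=1.3"))
print(parse_spec("pip"))
```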

For the moment let us suppose that the kaggle package was not available on conda-forge. Here is how we would install the package into our environment using pip.

  1. Use conda to install pip into our environment (if necessary).
  2. Activate the environment (if necessary).
  3. Use pip to install kaggle.

$ conda install pip --prefix ./env
$ conda activate ./env
$ pip install kaggle

Installing via pip in environment.yml files

Since you write environment.yml files for all of your projects, you might be wondering how to specify that packages should be installed using pip in the environment.yml file. Here is an example environment.yml file that uses pip to install the kaggle and yellowbrick packages.

name: null

dependencies:
 - jupyterlab=1.0
 - matplotlib=3.1
 - pandas=0.24
 - scikit-learn=0.21
 - pip=19.1
 - pip:
   # pip pins use "==" (conda's single "=" is not valid pip syntax)
   - kaggle==1.5.*
   - yellowbrick==0.9.*

Note that you should include pip itself as a dependency and then a sub-section denoting those packages to be installed via pip. Also, in case you are wondering, the Yellowbrick package is a suite of visual diagnostic tools called “Visualizers” that extend the Scikit-Learn API to allow human steering of the model selection process. Yellowbrick can also be installed using conda from the districtdatalabs channel.

$ conda install --channel districtdatalabs yellowbrick=0.9 --prefix ./env
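
The key points of this lesson recommend recording channels, and their priority, in the environment file itself. As a rough sketch of what such a file could look like, the Python helper below renders an environment.yml text with a channels section listed in priority order, the conda dependencies, and a pip sub-section. The helper name render_environment and the particular channel list are assumptions made for illustration, not part of the original lesson.

```python
def render_environment(channels, conda_deps, pip_deps):
    """Render a minimal environment.yml as text.

    Channels are written in priority order (highest first). Entries in
    pip_deps should use pip's own pin syntax, for example "name==1.0".
    """
    lines = ["name: null", "", "channels:"]
    lines += [f"  - {channel}" for channel in channels]
    lines += ["", "dependencies:"]
    lines += [f"  - {dep}" for dep in conda_deps]
    if pip_deps:
        lines.append("  - pip:")
        lines += [f"    - {dep}" for dep in pip_deps]
    return "\n".join(lines) + "\n"


text = render_environment(
    channels=["conda-forge", "defaults"],  # assumed priority order
    conda_deps=["jupyterlab=1.0", "pandas=0.24", "pip=19.1"],
    pip_deps=["kaggle==1.5.*", "yellowbrick==0.9.*"],
)
print(text)
```

Writing the returned text to a file named environment.yml would give collaborators both the package list and the channels needed to resolve it.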

What actually happens when I install packages?

During the installation process, files are extracted into the specified environment (defaulting to the current environment if none is specified). Installing the files of a conda package into an environment can be thought of as changing the directory to an environment, and then downloading and extracting the package and its dependencies.

For example, when you conda install a package that exists in a channel and has no dependencies, conda does the following.

  1. looks at your configured channels (in priority)
  2. reaches out to the repodata associated with your channels/platform
  3. parses repodata to search for the package
  4. once the package is found, conda pulls it down and installs
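
The four steps above can be sketched in Python as follows. This is a simplified model under stated assumptions: the channel URLs are placeholders, the repodata is a tiny hand-written stand-in for a real repodata.json (which maps package file names to metadata under a "packages" key), nothing is downloaded, and versions are compared as plain dotted integers. The function names are hypothetical.

```python
import json

# Illustrative data only: placeholder URLs and hand-written repodata.
CHANNELS = [
    (
        "https://example.org/channel-a/osx-64",
        json.dumps({"packages": {
            "scipy-1.3.0-py37_0.tar.bz2": {"name": "scipy", "version": "1.3.0"},
        }}),
    ),
    (
        "https://example.org/channel-b/osx-64",
        json.dumps({"packages": {
            "kaggle-1.5.3-py37_1.tar.bz2": {"name": "kaggle", "version": "1.5.3"},
            "kaggle-1.5.4-py37_0.tar.bz2": {"name": "kaggle", "version": "1.5.4"},
        }}),
    ),
]


def version_key(version):
    """Turn '1.5.4' into (1, 5, 4); a simplification of real version ordering."""
    return tuple(int(part) for part in version.split("."))


def locate(name, channels):
    """Return the URL of the file that would be fetched, or None."""
    for channel_url, raw_repodata in channels:          # step 1: priority order
        packages = json.loads(raw_repodata)["packages"]  # steps 2-3: read and parse
        matches = {fn: meta for fn, meta in packages.items() if meta["name"] == name}
        if matches:
            newest = max(matches, key=lambda fn: version_key(matches[fn]["version"]))
            return f"{channel_url}/{newest}"             # step 4: file to pull down
    return None


print(locate("kaggle", CHANNELS))
print(locate("not-a-package", CHANNELS))
```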

The [conda documentation][conda-install-docs] has a nice decision tree that describes the package installation process.

Installing with Conda

Specifying channels when installing packages

Like many projects, PyTorch has its own channel on Anaconda Cloud. This channel has several interesting packages, in particular pytorch (PyTorch core) and torchvision (datasets, transforms, and models specific to computer vision).

Create a new directory called my-computer-vision-project and then create a Python 3.6 environment in a sub-directory called env/ with the two packages listed above. Also include the most recent version of jupyterlab in your environment (so you have a nice UI) and matplotlib (so you can make plots).

Solution

In order to create a new environment you use the conda create command as follows.

$ mkdir my-computer-vision-project
$ cd my-computer-vision-project/
$ conda create --prefix ./env --channel pytorch \
> python=3.6 \
> jupyterlab=1.0 \
> pytorch=1.1 \
> torchvision=0.3 \
> matplotlib=3.1

Alternative syntax for installing packages from specific channels

There exists an alternative syntax for installing conda packages from specific channels that more explicitly links the channel being used to install a particular package.

$ conda install conda-forge::tensorflow --prefix ./env

Repeat the previous exercise using this alternative syntax to install python, jupyterlab, and matplotlib from the conda-forge channel and pytorch and torchvision from the pytorch channel.

Solution

One possibility would be to use the conda create command as follows.

$ mkdir my-computer-vision-project
$ cd my-computer-vision-project/
$ conda create --prefix ./env \
> conda-forge::python=3.6 \
> conda-forge::jupyterlab=1.0 \
> conda-forge::matplotlib=3.1 \
> pytorch::pytorch=1.1 \
> pytorch::torchvision=0.3

Key Points

  • A package is a tarball containing system-level libraries, Python or other modules, executable programs and other components, and associated metadata.

  • A Conda channel is a URL to a directory containing one or more Conda packages.

  • Explicitly including the channels (and their priority!) in a project’s environment file is necessary for another researcher to completely re-create that project’s software environment.