Using Packages and Channels
Overview
Teaching: 20 min
Exercises: 10 min
Questions
What are Conda channels?
What are Conda packages?
Why should I be explicit about which channels my research project uses?
What should I do if a Python package isn’t available via a Conda channel?
Objectives
Install a package from a specific channel.
What are Conda packages?
A conda package is a compressed archive file (.tar.bz2) that contains:
- system-level libraries
- Python or other modules
- executable programs and other components
- metadata under the info/ directory
- a collection of files that are installed directly into an install prefix.
Conda keeps track of the dependencies between packages and platforms; the conda package format is identical across platforms and operating systems.
Package Structure
All conda packages have a specific sub-directory structure inside the tarball file. There is a bin directory that contains any binaries for the package; a lib directory containing the relevant library files (i.e., the .py files); and an info directory containing package metadata.
For more details on the conda package specification, including discussions of the various metadata files, see the [docs][conda-pkg-spec-docs].
As an example of Conda package structure, consider the Conda package for the Python 3.6 version of PyTorch targeting 64-bit macOS, pytorch-1.1.0-py3.6_0.tar.bz2.
.
├── bin
│   ├── convert-caffe2-to-onnx
│   └── convert-onnx-to-caffe2
├── info
│   ├── LICENSE.txt
│   ├── about.json
│   ├── files
│   ├── git
│   ├── has_prefix.json
│   ├── hash_input.json
│   ├── index.json
│   ├── paths.json
│   ├── recipe/
│   └── test/
└── lib
    └── python3.6
        └── site-packages
            ├── caffe2/
            ├── torch/
            └── torch-1.1.0-py3.6.egg-info/
A complete listing of available PyTorch packages can be found on Anaconda Cloud.
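If you want to check this layout yourself, you can list the contents of a downloaded package with tar. The following is a minimal sketch, assuming you already have a .tar.bz2 package file on disk; top_level_entries is a hypothetical helper name used only for illustration and is not part of conda.

```shell
# Hypothetical helper (not part of conda): read an archive listing on stdin
# and print its unique top-level entries.
top_level_entries() {
    # Drop a leading "./" if present, keep the first path component, de-duplicate.
    sed 's|^\./||' | cut -d/ -f1 | sort -u
}

# Usage with a package file that you have downloaded locally:
#   tar -tjf pytorch-1.1.0-py3.6_0.tar.bz2 | top_level_entries
```

For the PyTorch package shown above, you would expect the top-level entries to be bin, info and lib.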
What are Conda channels?
Again from the Conda documentation, conda packages are downloaded from remote channels, which are URLs to directories containing conda packages. The conda command searches a default set of channels, and packages are automatically downloaded and updated from the Anaconda Cloud channels.
- main: The majority of all new Anaconda, Inc. package builds are hosted here. Included in conda’s defaults channel as the top priority channel.
- r: Microsoft R Open conda packages and Anaconda, Inc.’s R conda packages. This channel is included in conda’s defaults channel. When creating new environments, MRO is now chosen as the default R implementation.
Collectively, the Anaconda managed channels are referred to as the defaults channel because, unless otherwise specified, packages installed using conda will be downloaded from these channels.
The conda-forge channel
In addition to the default channels that are managed by Anaconda Inc., there is another channel called conda-forge that also has a special status. The Conda-Forge project “is a community led collection of recipes, build infrastructure and distributions for the conda package manager.”
There are a few reasons that you may wish to use the conda-forge channel instead of the defaults channel maintained by Anaconda:
- Packages on conda-forge may be more up-to-date than those on the defaults channel.
- There are packages on the conda-forge channel that aren’t available from defaults.
- You may wish to use a dependency such as openblas (from conda-forge) instead of mkl (from defaults).
How do I install a package from a specific channel?
You can install a package from a specific channel into the currently active environment by passing the --channel option to the conda install command as follows.
$ conda activate machine-learning-env
$ conda install scipy=1.6 --channel conda-forge
You can also install a package from a specific channel into a named environment (using --name) or into an environment installed at a particular prefix (using --prefix). For example, the following command installs the scipy package from the conda-forge channel into the environment called machine-learning-env which we created earlier.
$ conda install scipy=1.6 --channel conda-forge --name machine-learning-env
This command would install the tensorflow package from the conda-forge channel into an environment installed into the env/ sub-directory.
$ conda install tensorflow=1.14 --channel conda-forge --prefix ./env
Here is another example for R users. The following command would install the r-tidyverse package from the conda-forge channel into an environment installed into the env/ sub-directory.
$ cd ~/Desktop/introduction-to-conda-for-data-scientists
$ conda install r-tidyverse=1.3 --channel conda-forge --prefix ./env
Channel priority
You may specify multiple channels for installing packages by passing the --channel argument multiple times.
$ conda install scipy=1.6 --channel conda-forge --channel bioconda
Channel priority decreases from left to right: the first argument has higher priority than the second. For reference, bioconda is a channel for the conda package manager specializing in bioinformatics software. For those interested in learning more about the Bioconda project, check out the project’s GitHub page.
Please note that in our example, adding the bioconda channel is irrelevant because scipy is no longer available on the bioconda channel.
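Because the order of the --channel options determines priority, it can help to keep your channels in a single ordered list and build the options from it. The following is a small sketch of that idea; channel_args is a hypothetical helper name, not a conda feature.

```shell
# Hypothetical helper (not part of conda): turn channel names, given in
# priority order (highest first), into repeated --channel options.
channel_args() {
    args=""
    for ch in "$@"; do
        args="$args --channel $ch"
    done
    # Strip the leading space before printing.
    echo "${args# }"
}

# Usage (conda-forge takes priority over bioconda):
#   conda install scipy=1.6 $(channel_args conda-forge bioconda)
```

Writing the channels once, in order, makes it harder to swap their priority by accident when you repeat a command.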
My package isn’t available on the defaults channel! What should I do?
It may very well be the case that packages (or often more recent versions of packages!) that you need to install for your project are not available on the defaults channel. In this case you should try the following.
- conda-forge: the conda-forge channel contains a large number of community curated conda packages. Typically the most recent versions of packages that are generally available via the defaults channel are available on conda-forge first.
- bioconda: the bioconda channel also contains a large number of curated bioinformatics conda packages. The bioconda channel is meant to be used with conda-forge, so you should not worry about using the two channels together when installing your preferred packages.
- pip: only if a package is not otherwise available via conda-forge (or some domain-specific channel like bioconda) should a package be installed into a conda environment from PyPI using pip.
For example, Kaggle publishes a Python 3 API that can be used to interact with Kaggle datasets, kernels and competition submissions. You can search for the package on the defaults channels but you will not find it!
$ conda search kaggle
Loading channels: done
No match found for: kaggle. Search: *kaggle*
PackagesNotFoundError: The following packages are not available from current channels:
- kaggle
Current channels:
- https://repo.anaconda.com/pkgs/main/osx-64
- https://repo.anaconda.com/pkgs/main/noarch
- https://repo.anaconda.com/pkgs/free/osx-64
- https://repo.anaconda.com/pkgs/free/noarch
- https://repo.anaconda.com/pkgs/r/osx-64
- https://repo.anaconda.com/pkgs/r/noarch
To search for alternate channels that may provide the conda package you're
looking for, navigate to
https://anaconda.org
and use the search bar at the top of the page.
The official installation instructions suggest downloading the kaggle package using pip. But since we are using conda we should check whether the package exists on at least the conda-forge channel before proceeding to use pip.
$ conda search --channel conda-forge kaggle
Loading channels: done
# Name Version Build Channel
kaggle 1.5.3 py27_1 conda-forge
kaggle 1.5.3 py36_1 conda-forge
kaggle 1.5.3 py37_1 conda-forge
kaggle 1.5.4 py27_0 conda-forge
kaggle 1.5.4 py36_0 conda-forge
kaggle 1.5.4 py37_0 conda-forge
.
.
.
Or you can also check online at https://anaconda.org/conda-forge/kaggle.
Once we know that the kaggle package is available via conda-forge we can go ahead and install it.
$ conda install --channel conda-forge kaggle=1.5.10 --prefix ./env
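The "check conda-forge first, fall back to pip" decision can be sketched as a small shell function. This is only an illustration: suggest_install is a hypothetical name, and the sketch assumes that conda search exits with a non-zero status when no matching package is found. The --override-channels option restricts the search to the channel named on the command line.

```shell
# Hypothetical helper (not part of conda): print the install command to try
# for a package, preferring conda-forge and falling back to pip.
suggest_install() {
    pkg="$1"
    # Assumption: "conda search" returns a non-zero exit status when nothing matches.
    if conda search --override-channels --channel conda-forge "$pkg" >/dev/null 2>&1; then
        echo "conda install --channel conda-forge $pkg"
    else
        echo "python -m pip install $pkg"
    fi
}

# Usage:
#   suggest_install kaggle
```

The function only prints a suggestion rather than installing anything, so you can review the command before running it.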
What actually happens when I install packages?
During the installation process, files are extracted into the specified environment (defaulting to the current environment if none is specified). Installing the files of a conda package into an environment can be thought of as changing the directory to an environment, and then downloading and extracting the package and its dependencies.
For example, when you conda install a package that exists in a channel and has no dependencies, conda does the following.
- looks at your configured channels (in priority order)
- reaches out to the repodata associated with your channels/platform
- parses repodata to search for the package
- once the package is found, conda pulls it down and installs
The [conda documentation][conda-install-docs] has a nice decision tree that describes the package installation process.
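The lookup steps listed above can be mimicked with a toy model in the shell. This is a deliberately simplified sketch and not how conda is implemented: here each "channel" is assumed to be a local directory containing an index.txt file with one package name per line, standing in for the real repodata, and find_channel is a hypothetical function name.

```shell
# Toy model only: a "channel" is a directory with an index.txt file
# (one package name per line) that stands in for the real repodata.
# find_channel PKG CHANNEL_DIR... prints the first (highest priority)
# channel directory that lists PKG.
find_channel() {
    pkg="$1"
    shift
    # Walk the channels in the order given, i.e. in priority order.
    for channel in "$@"; do
        # -F: fixed string, -x: match the whole line, -q: no output.
        if grep -qxF -- "$pkg" "$channel/index.txt" 2>/dev/null; then
            echo "$channel"
            return 0
        fi
    done
    echo "PackagesNotFoundError: $pkg" >&2
    return 1
}

# Usage (with two toy channel directories):
#   find_channel kaggle ./defaults ./conda-forge
```

The first channel that lists the package wins, which is why the order of your channels matters.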
Specifying channels when installing packages
Like many projects, PyTorch has its own channel on Anaconda Cloud. This channel has several interesting packages, in particular pytorch (PyTorch core) and torchvision (datasets, transforms, and models specific to computer vision).
Create a new directory called my-computer-vision-project and then create a Python 3.6 environment in a sub-directory called env/ with the two packages listed above. Also include the most recent version of jupyterlab in your environment (so you have a nice UI) and matplotlib (so you can make plots).
Solution
In order to create a new environment you use the conda create command as follows.
$ mkdir my-computer-vision-project
$ cd my-computer-vision-project/
$ conda create --prefix ./env --channel pytorch \
    python=3.6 \
    jupyterlab=1.0 \
    pytorch=1.1 \
    torchvision=0.3 \
    matplotlib=3.1
Hint: For the lazy typers: the --channel argument can also be shortened to -c; for more abbreviations, see also the Conda command reference.
Alternative syntax for installing packages from specific channels
There exists an alternative syntax for installing conda packages from specific channels that more explicitly links the channel being used to install a particular package.
$ conda install conda-forge::tensorflow --prefix ./env
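When several packages come from the same channel, typing the channel:: prefix for each one gets repetitive. The sketch below shows one way to generate the prefixed specifications; qualify is a hypothetical helper name and not part of conda.

```shell
# Hypothetical helper (not part of conda): prefix each package spec
# with "CHANNEL::" so the channel is tied to that package explicitly.
qualify() {
    channel="$1"
    shift
    out=""
    for spec in "$@"; do
        out="$out ${channel}::${spec}"
    done
    # Strip the leading space before printing.
    echo "${out# }"
}

# Usage:
#   conda install --prefix ./env $(qualify conda-forge tensorflow)
```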
Create a new folder my-final-project in ~/Desktop/introduction-to-conda-for-data-scientists and repeat the previous exercise using this alternative syntax to install python, jupyterlab, and matplotlib from the conda-forge channel and pytorch and torchvision from the pytorch channel.
Solution
One possibility would be to use the conda create command as follows.
$ cd ~/Desktop/introduction-to-conda-for-data-scientists
$ mkdir my-final-project
$ cd my-final-project/
$ conda create --prefix ./env \
    conda-forge::python=3.6 \
    conda-forge::jupyterlab=1.0 \
    conda-forge::matplotlib=3.1 \
    pytorch::pytorch=1.1 \
    pytorch::torchvision=0.3
A Python package isn’t available on any Conda channel! What should I do?
If a Python package that you need isn’t available on any Conda channel, then you can use the default Python package manager Pip to install this package from PyPI. However, there are a few potential issues that you should be aware of when using Pip to install Python packages when using Conda.
First, Pip is sometimes installed by default on operating systems where it is used to manage any Python packages needed by your OS. You do not want to use this pip to install Python packages when using Conda environments.
(base) $ conda deactivate
$ which python
/usr/bin/python
$ which pip # sometimes installed as pip3
/usr/bin/pip
Second, Pip is also included in the Miniconda installer where it is used to install and manage OS specific Python packages required to set up your base Conda environment. You do not want to use this pip to install Python packages when using Conda environments.
$ conda activate
(base) $ which python
~/miniconda3/bin/python
(base) $ which pip
~/miniconda3/bin/pip
Another reason to avoid installing packages into your base Conda environment
If your base Conda environment becomes cluttered with a mix of Pip and Conda installed packages it may no longer function. Creating separate conda environments allows you to delete and recreate environments readily so you don’t have to worry about risking your core Conda functionality when mixing packages installed with Conda and Pip.
If you find yourself needing to install a Python package that is only available via Pip, then you should first install pip into your Conda environment and then use that pip to install the desired package. Using the pip installed in your Conda environment to install Python packages not available via Conda channels will help you avoid difficult-to-debug issues that frequently arise when using Python packages installed via a pip that was not installed inside your Conda environment.
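One way to confirm which pip you are about to use is to compare its location with the prefix of the active environment. The sketch below assumes that conda activate has set the CONDA_PREFIX environment variable to the active environment's directory; pip_is_inside_env is a hypothetical helper name.

```shell
# Hypothetical helper (not part of conda or pip): succeed only if the given
# pip path lives under the given environment prefix.
pip_is_inside_env() {
    pip_path="$1"
    env_prefix="$2"
    # With no active environment there is nothing to compare against.
    [ -n "$env_prefix" ] || return 1
    case "$pip_path" in
        "$env_prefix"/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage (after activating your environment):
#   pip_is_inside_env "$(command -v pip)" "$CONDA_PREFIX" \
#       && echo "pip belongs to the active environment"
```

If the check fails, install pip into the environment with conda before installing anything from PyPI.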
Conda (+Pip): Conda wherever possible; Pip only when necessary
When using Conda to manage environments for your Python project it is a good idea to install packages available via both Conda and Pip using Conda; however there will always be cases where a package is only available via Pip in which case you will need to use Pip. Many of the common pitfalls of using Conda and Pip together can be avoided by adopting the following practices.
- Always explicitly install pip in every Python-based Conda environment.
- Always be sure your desired environment is active before installing anything using pip.
- Prefer python -m pip install over pip install; never use pip with the --user argument.
Installing packages into Conda environments using pip
Combo is a comprehensive Python toolbox for combining machine learning models and scores. Model combination can be considered as a subtask of ensemble learning, and has been widely used in real-world tasks and data science competitions like Kaggle.
Activate the machine-learning-env you created in a previous challenge and use pip to install combo.
Solution
The following commands will activate the machine-learning-env and install combo.
$ conda install --name machine-learning-env pip
$ conda activate machine-learning-env
$ python -m pip install combo==0.1.*
For more details on using pip see the official documentation.
Key Points
A package is a tarball containing system-level libraries, Python or other modules, executable programs and other components, and associated metadata.
A Conda channel is a URL to a directory containing Conda packages.
Explicitly including the channels (and their priority!) in a project’s environment file is necessary for another researcher to completely re-create that project’s software environment.
Understand how to use Conda and Pip together effectively.