Nextflow coding practices
Last updated on 2023-12-08
Overview
Questions
- How do I make my code readable?
- How do I make my code portable?
- How do I make my code maintainable?
Objectives
- Learn how to use whitespace and comments to improve code readability.
- Understand coding pitfalls that reduce portability.
- Understand coding pitfalls that reduce maintainability.
Nextflow coding practices
Nextflow is a powerful, flexible language that can be written in a variety of styles. That flexibility can lead to poor coding practices: a workflow may only work under certain configurations or execution platforms, or the code may become hard for someone else to contribute to, or for you to amend two years later for an article submission. The tips below make a workflow easier to maintain and port.
Use whitespace to improve readability.
Nextflow is generally not sensitive to whitespace in code. This allows you to use indentation, vertical spacing, new-lines, and increased spacing to improve code readability.
GROOVY
#! /usr/bin/env nextflow

// Tip: Allow spaces around assignments ( = )
nextflow.enable.dsl = 2

// Tip: Separate blocks of code into groups with common purpose
//      e.g., parameter blocks, include statements, workflow blocks, process blocks
// Tip: Align assignment operators vertically in a block
params.reads     = ''
params.gene_list = ''
params.gene_db   = 'ftp://path/to/database'

// Tip: Align braces or instruction parts vertically
include { BAR        } from './modules/bar'
include { TAN as BAZ } from './modules/tan'

workflow {

    // Tip: Indent process calls
    // Tip: Use spaces around process/function parameters
    FOO ( Channel.fromPath( params.reads, checkIfExists: true ) )
    BAR ( FOO.out )

    // Tip: Use vertical spacing and indentation for many parameters.
    BAZ (
        Channel.fromPath( params.gene_list, checkIfExists: true ),
        FOO.out,
        BAR.out,
        file( params.gene_db, checkIfExists: true )
    )

}

// Tip: Uppercase process names help readability
process FOO {

    // Tip: Separate process parts into distinct blocks
    input:
    path fastq

    output:
    path "*.fasta"

    script:
    prefix = fastq.baseName
    """
    tofasta $fastq > ${prefix}.fasta
    """
}
Improve the workflow readability
Use whitespace to improve the readability of the following code.
GROOVY
#! /usr/bin/env nextflow
nextflow.enable.dsl=2
params.reads = ''
workflow{
foo(Channel.fromPath(params.reads))
bar(foo.out)
}
process foo{
input:
path fastq
output:
path "*.fasta"
script:
prefix=fastq.baseName
"""
tofasta $fastq > ${prefix}.fasta
"""
}
process bar{
input:
path fasta
script:
"""
fastx_check $fasta
"""
}
One possible solution:
GROOVY
#! /usr/bin/env nextflow

nextflow.enable.dsl = 2

params.reads = ''

workflow {

    FOO ( Channel.fromPath( params.reads ) )
    BAR ( FOO.out )

}

process FOO {

    input:
    path fastq

    output:
    path "*.fasta"

    script:
    prefix = fastq.baseName
    """
    tofasta $fastq > ${prefix}.fasta
    """
}

process BAR {

    input:
    path fasta

    script:
    """
    fastx_check $fasta
    """
}
Use comments
Comments are an important tool to improve readability and maintenance. Use them to:
- Annotate data structures expected in a channel.
- Describe higher level functionality.
- Describe presence/absence of (un)expected code.
- Mark mandatory and optional process inputs.
GROOVY
workflow ALIGN_SEQ {

    take:
    reads       // queue channel; [ sample_id, [ file(read1), file(read2) ] ]
    reference   // file( "path/to/reference" )

    main:
    // Quality Check input reads
    READ_QC ( reads )

    // Align reads to reference
    Channel.empty()
        .set { aligned_reads_ch }
    if( params.aligner == 'hisat2' ){
        ALIGN_HISAT2 ( READ_QC.out.reads, reference )
        aligned_reads_ch.mix( ALIGN_HISAT2.out.bam )
            .set { aligned_reads_ch }
    } else if ( params.aligner == 'star' ) {
        ALIGN_STAR ( READ_QC.out.reads, reference )
        aligned_reads_ch.mix( ALIGN_STAR.out.bam )
            .set { aligned_reads_ch }
    }
    aligned_reads_ch.view()

    emit:
    bam = aligned_reads_ch   // queue channel: [ sample_id, file(bam_file) ]

}

process COUNT_KMERS {

    input:
    // Mandatory
    tuple val(sample), path(reads)  // [ 'sample_id', [ read1, read2 ] ]: Reads in which to count kmers
    // Optional
    path kmer_table                 // 'path/to/kmer_table': Table of k-mers to count

    ...
}
Report tool versions
Software packaging is a hard problem, and it can be difficult for a package to report the versions of all the tools it has. It may also be excessive to report the version of everything included in a package, when only a handful of tools are used. This means that it’s up to us to effectively report the versions of the tools we use to aid reproducibility.
GROOVY
process HISAT2_ALIGN {

    ...

    script:
    def HISAT2_VERSION = '2.2.0' // Version not available using command-line
    """
    hisat2 ... | samtools ...

    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        hisat2: $HISAT2_VERSION
        samtools: \$( samtools --version 2>&1 | sed 's/^.*samtools //; s/Using.*\$//' )
    END_VERSIONS
    """
}
Name output channels
Output channels from processes and workflows can be named using the emit: keyword, which helps readability.
GROOVY
workflow ALIGN_HISAT2 {

    ...

    emit:
    alignment = HISAT2_ALIGN.out.bam
}

process HISAT2_ALIGN {

    ...

    output:
    tuple val(sample), path("*.bam"), emit: bam
    tuple val(sample), path("*.log"), emit: summary
    path "versions.yml"             , emit: versions

    ...
}
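As a minimal sketch of how named outputs read at the call site, the toy script below defines a hypothetical process, SPLIT_WORDS (not part of this lesson), with two named outputs and selects each one by name rather than by position. The process name, file names, and input text are all illustrative assumptions.

```groovy
nextflow.enable.dsl = 2

// Hypothetical toy process, used only to illustrate named outputs
process SPLIT_WORDS {

    input:
    val text

    output:
    path "words.txt", emit: words
    path "count.txt", emit: count

    script:
    """
    printf '%s\\n' $text > words.txt
    wc -l < words.txt > count.txt
    """
}

workflow {

    SPLIT_WORDS ( Channel.of( 'alpha beta gamma' ) )

    // Outputs are selected by name, not by position
    SPLIT_WORDS.out.words.view { f -> "words file: ${f.name}" }
    SPLIT_WORDS.out.count.view { f -> "line count: ${f.text.trim()}" }

}
```

Selecting `SPLIT_WORDS.out.words` instead of `SPLIT_WORDS.out[0]` keeps the call site readable even if the order of the output declarations later changes.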
Use params.parameters in workflow blocks, not in process blocks
The params variables are accessible from anywhere in a workflow. They can be useful to provide a wide variety of properties and decision making options. For example, one could use a params.aligner variable in a workflow to select a particular alignment tool. This in turn could be coded like:
GROOVY
process ALIGN {

    input:
    tuple val(sample), path(reads)
    path index

    ...

    script:
    if ( params.aligner == 'hisat2' ) {
        """
        hisat2 ... | samtools ...
        ...
        """
    } else if ( params.aligner == 'star' ){
        """
        star ...
        ...
        """
    }
}
A better practice is to pass params.aligner to the process as an input value.
GROOVY
process ALIGN {

    input:
    tuple val(sample), path(reads)
    path index
    val aligner   // params.aligner is provided as a third parameter

    ...

    script:
    if ( aligner == 'hisat2' ) {
        """
        hisat2 ... | samtools ...
        ...
        """
    } else if ( aligner == 'star' ){
        """
        star ...
        ...
        """
    }
}
This allows one to see from the workflow block where all parameters are being used, making the workflow easier to maintain. There is also a danger that one could modify params variables during pipeline execution, potentially leading to unreproducible results in more complex workflows.
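The workflow side of this pattern can be sketched as follows. This is a simplified, self-contained stand-in for the ALIGN process above: it takes only a sample name and the aligner choice, and its script merely echoes what it would run. The sample names and the default value of params.aligner are assumptions for illustration.

```groovy
nextflow.enable.dsl = 2

// Assumed default; override with --aligner star on the command line
params.aligner = 'hisat2'

// Simplified stand-in for ALIGN: the aligner arrives as an input value
process ALIGN {

    input:
    val sample
    val aligner

    output:
    stdout

    script:
    if ( aligner == 'hisat2' ) {
        """
        echo "$sample: would run hisat2"
        """
    } else if ( aligner == 'star' ) {
        """
        echo "$sample: would run star"
        """
    } else {
        error "Unknown aligner: $aligner"
    }
}

workflow {

    // The only place params.aligner is read is here, in the workflow block
    ALIGN ( Channel.of( 'sampleA', 'sampleB' ), params.aligner )
    ALIGN.out.view()

}
```

Because the process no longer reads params directly, it can be reused in another workflow, or called with a different aligner, without editing the process itself.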
All input files/directories should be a process input
Depending on the platform on which you execute your workflow, files may be easily accessible over the network, or downloadable from the internet. However, not all execution platforms support this. The example below could work well on your system, but fail on another (for example, on compute nodes without an internet connection).
GROOVY
process READ_CHECK {

    input:
    tuple val(sample), path(reads)

    ...

    script:
    """
    wget ftp://path/to/database
    check_reads $reads /local/copy/database > ${sample}.report
    ...
    """
}
A strength of Nextflow is file staging, i.e., preparing files for use in process tasks. Staging files by providing them as process input has several benefits.
- Files are updated only when necessary.
- A single file/folder can be shared, without downloading multiple times as in the example above.
- Nextflow supports retrieving files from any valid URL, meaning potentially fewer lines of code.
GROOVY
process READ_CHECK {

    input:
    tuple val(sample), path(reads)
    path database

    ...

    script:
    """
    check_reads $reads $database > ${sample}.report
    ...
    """
}
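A minimal sketch of the calling side, assuming hypothetical parameter values: the database location is resolved once in the workflow with file(), and Nextflow stages it into each task. The params.reads pattern and the params.database URL are placeholders, and `ls -l` stands in for the real checking tool.

```groovy
nextflow.enable.dsl = 2

// Hypothetical locations; replace with real paths or URLs
params.reads    = 'data/*_{1,2}.fastq.gz'
params.database = 'https://example.com/database.tar.gz'

process READ_CHECK {

    input:
    tuple val(sample), path(reads)
    path database

    output:
    path "${sample}.report"

    script:
    """
    # Stand-in command: lists the staged files instead of checking reads
    ls -l $database $reads > ${sample}.report
    """
}

workflow {

    reads_ch = Channel.fromFilePairs( params.reads, checkIfExists: true )

    // Resolved once here; the process never downloads anything itself
    database = file( params.database, checkIfExists: true )

    READ_CHECK ( reads_ch, database )

}
```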
It may be that a file is an optional input depending on other parameters. In cases when no file should be provided, one can pass an empty list [] instead.
GROOVY
workflow {

    COUNT_KMERS ( reads, [] )

}

process COUNT_KMERS {

    input:
    // Mandatory
    tuple val(sample), path(reads)  // [ 'sample_id', [ read1, read2 ] ]: Reads in which to count kmers
    // Optional
    path kmer_table                 // 'path/to/kmer_table': Table of k-mers to count

    ...
}
Avoid lots of short running processes
Many execution platforms are inefficient if a workflow tries to execute many short running processes. It can take more time to schedule and request resources for each small instance than bundling the short processes into a larger process task. Nextflow provides convenient channel operators, such as buffer, collate, collect, and collectFile, that help group together inputs into batches which can run for longer with a given requested resource. The short tasks themselves can also be parallelised inside a process script using the command-line tools xargs or parallel.
GROOVY
workflow REFINE_DATA {

    take:
    datapoints

    main:
    BATCH_TASK ( datapoints.collate(100) )
}

process BATCH_TASK {

    input:
    val data   // list of up to 100 values emitted by collate

    script:
    // Join the list so each value becomes a separate shell argument
    """
    printf "%s\\n" ${data.join(' ')} | \
        xargs -P $task.cpus -I {} \
        short_task {}

    # Alternative:
    # parallel --jobs $task.cpus "short_task {1}" ::: ${data.join(' ')}
    """
}
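To see what collate hands to a process, the short sketch below groups seven values into batches of three. The values and batch size are arbitrary choices for illustration.

```groovy
nextflow.enable.dsl = 2

workflow {

    // Each emitted item is a list holding up to three values;
    // the final, smaller batch is emitted as well.
    Channel.of( 1..7 )
        .collate( 3 )
        .view()

}
```

This emits `[1, 2, 3]`, `[4, 5, 6]`, and `[7]`, so a process receiving the channel is run three times rather than seven.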
Include a test profile
A test profile is a configuration profile that specifies a short running test data set to check the functionality of the whole pipeline. It can also demonstrate to users of your workflow the kinds of inputs and outputs to expect. Another benefit is the possibility of automated testing of your workflow, ensuring the workflow keeps working as you add new functionality.
GROOVY
profiles {

    test {
        params {
            reads     = 'https://github.com/my_repo/test/test_reads.fastq.gz'
            reference = 'https://github.com/my_repo/test/test_reference.fasta.gz'
        }
    }
}
Write modules that use existing containers
Using containers for software packaging is strongly recommended, as a container is intended to behave the same irrespective of the operating system it runs on. Writing modules which use existing containers reduces the maintenance needed for a workflow, and minimises the need to resolve package conflicts. A helpful resource for this is the bioconda channel for the package manager conda, which provides packaging for many bioinformatics tools. In addition, Biocontainers builds both Docker and Singularity images for each tool packaged on the bioconda channel. Multi-package containers (known as mulled containers) can also be created following the instructions on the Multi Package Containers repository.
GROOVY
process FASTQC {

    container "${ workflow.containerEngine == 'singularity' ?
        'https://depot.galaxyproject.org/singularity/fastqc:0.11.9--0' :
        'quay.io/biocontainers/fastqc:0.11.9--0' }"

    ...
}
Building your own container images should be used as a last resort. A preferred option is to write a conda recipe for the tool to be included in the bioconda channel. This makes the tool available via a package manager, and containers are automatically built for the tool.
Use file compression and temporary disk space when possible
Disk space is a valuable resource on most compute systems. When possible, work with compressed files. Several shell commands work directly with compressed data, such as gunzip -c and zgrep. Command substitution ( $( command ) ) and process substitution ( >( command_list ) or <( command_list ) ) can also help when working with compressed data. Lastly, named pipes can be used if the above approaches fail.
GROOVY
process COUNT_FASTA {

    input:
    path fasta   // reference.fasta.gz

    script:
    """
    zgrep -c '^>' $fasta
    """
}
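Process substitution can be sketched in the same way. The hypothetical process below (SHARED_HEADERS, with assumed input file names) compares the header lines of two gzip-compressed FASTA files without writing any uncompressed intermediate file to disk; it relies on the task script being run with bash, which is Nextflow's default shell.

```groovy
nextflow.enable.dsl = 2

// Hypothetical inputs: two gzip-compressed FASTA files
params.fasta_a = 'a.fasta.gz'
params.fasta_b = 'b.fasta.gz'

process SHARED_HEADERS {

    input:
    path fasta_a
    path fasta_b

    output:
    path "shared_headers.txt.gz"

    script:
    """
    # Each <( ... ) decompresses on the fly; comm -12 keeps lines found in both
    comm -12 \\
        <( gunzip -c $fasta_a | grep '^>' | sort ) \\
        <( gunzip -c $fasta_b | grep '^>' | sort ) \\
        | gzip > shared_headers.txt.gz
    """
}

workflow {

    SHARED_HEADERS (
        file( params.fasta_a, checkIfExists: true ),
        file( params.fasta_b, checkIfExists: true )
    )

}
```

The result is compressed again before it is written, so neither the inputs nor the output occupy uncompressed disk space.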
Another good practice is to use local temporary disk space (also known as scratch space). Often, the workDir is located on a shared disk space over a network, which slows down processes that read and write a lot to disk. Using scratch space not only speeds up disk I/O, but also saves space in the workDir since only files which match the process output directive are copied back across for caching. The process directive process.scratch can be provided with either a boolean or the path to use for scratch space.
GROOVY
process {
    scratch = '/tmp'
}
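Scratch space does not have to be configured identically for every process. As a configuration sketch, assuming a hypothetical I/O-heavy process named SORT_BAM, a boolean default can be combined with a per-process path using a withName selector:

```groovy
process {

    // Default for all processes: run tasks in a temporary directory
    // on the execution node
    scratch = true

    // Hypothetical process name: use an explicit scratch path for one step
    withName: 'SORT_BAM' {
        scratch = '/tmp'
    }
}
```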
Use consistent naming conventions
Using consistent naming conventions greatly helps readability. For example, using uppercase for process names helps distinguish them from other workflow components like channels or operators. The nf-core community provides further naming suggestions one can follow.
Key Points
- Nextflow is not sensitive to whitespace. Use it to lay out code for readability.
- Use comments and whitespace to group chunks of code to describe big picture functionality.
- Report tool versions in the scripts.
- Name channel outputs using the emit: keyword.
- Avoid params.parameter in a process. Pass all parameters using input channels.
- Input files should be passed using input channels.
- Group short running commands into a larger process.
- Include a test profile which runs the workflow on a small test data set.
- Write your processes to reuse existing containers/software bundles.
- Use compressed files and temporary disk space when possible.
- Use consistent naming conventions.