Nextflow Workflows

Overview

Teaching: 30 min
Exercises: 0 min

Questions

How do I run a Nextflow pipeline?

What is a terminal multiplexer?

How do I debug a Nextflow job?

How do I resume a pipeline?

Objectives

Understand the best way to run Nextflow pipelines

Know how to debug a Nextflow error

How to run a Nextflow pipeline

Nextflow pipelines are launched from the command line like so:

nextflow run hello_world.nf

You can also put your Nextflow pipelines on your PATH and treat them as you would other executables as long as they have executable permissions and the following shebang on the first line:

#!/usr/bin/env nextflow

Running your pipeline then becomes easier, as it can be launched from the command line like so:

hello_world.nf

Which will have the output:

N E X T F L O W  ~  version 22.03.1-edge
Launching `/home/nick/code/Nextflow_Training_2022B/code/hello_world.nf` [jolly_descartes] DSL2 - revision: 582ccc5833
executor >  local (2)
[94/4b9b86] process > make_file [100%] 1 of 1 ✔
[88/adf190] process > echo_file [100%] 1 of 1 ✔
HELLO WORLD

You can see from the output that the location of the script is shown next to Launching.
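The two requirements above (the shebang on the first line and executable permissions) can be checked with a short script. This is only a minimal sketch: the function names launch_problems and make_executable are my own and not part of Nextflow, only the owner's execute bit is inspected, and it does not check whether the script is on your PATH.

```python
import os
import stat
import tempfile

SHEBANG = "#!/usr/bin/env nextflow"


def launch_problems(script_path):
    """Return the reasons a script cannot be launched directly (empty list if none)."""
    problems = []
    with open(script_path, encoding="utf-8") as handle:
        first_line = handle.readline().rstrip("\n")
    if first_line != SHEBANG:
        problems.append("first line is not the Nextflow shebang")
    # Only the owner's execute bit is checked here, to keep the sketch simple.
    if not os.stat(script_path).st_mode & stat.S_IXUSR:
        problems.append("owner execute permission is not set")
    return problems


def make_executable(script_path):
    """Rough equivalent of `chmod +x`: set the execute bit for user, group and others."""
    mode = os.stat(script_path).st_mode
    os.chmod(script_path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)


# Demonstration on a throwaway pipeline script in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "hello_world.nf")
    with open(script, "w", encoding="utf-8") as handle:
        handle.write(SHEBANG + "\nworkflow { }\n")
    os.chmod(script, 0o644)          # start without execute permission
    print(launch_problems(script))   # one problem: not executable
    make_executable(script)
    print(launch_problems(script))   # no problems left
```

On the command line, chmod +x hello_world.nf does the same job as make_executable.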

Using terminal multiplexers

Nextflow pipelines do not run in the background by default, so it is best to use a terminal multiplexer if running a long pipeline. Terminal multiplexers allow you to have multiple windows within a single terminal window. The benefit of these for running Nextflow pipelines is that you can detach from these terminal windows and reattach them later (even through an ssh connection) to check on the pipeline’s progress.

The two most common terminal multiplexers are screen and tmux. We advise using tmux when possible, as it is just as easy to use and has some excellent features, but we will also explain how to use screen.

Tmux

Installing tmux on Ubuntu is as easy as running:

sudo apt-get install tmux

You can start a tmux session by simply running:

tmux

[Screenshot: tmux_0]

We are now in a tmux session which has been named 0. By default, tmux numbers your sessions starting from 0. You access tmux commands by first pressing the prefix key, Ctrl+b. To kill the session, you can use Ctrl+d or Ctrl+b then x. To detach from the session, use Ctrl+b then d.

To reattach to the session, use the command:

tmux attach-session -t <session_name>

So for our session:

tmux attach-session -t 0

Using the default numbered names will quickly get confusing, so we should give our sessions meaningful names instead. For example, if you want to launch a download job, you could make a named session with:

tmux new -s download

[Screenshot: tmux_download]

You can see at the bottom of the screen that we are now on a session called download. We can detach (with Ctrl+b then d) and reattach (with tmux attach-session -t download), and you will still see the same display as when you detached.

You can now make several sessions with different names and switch back and forth between them to check the progress of your pipelines. If you forget the names of your sessions you can use the following command to list them:

tmux ls
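If you launch pipelines from a script, the same tmux commands can be assembled programmatically. The sketch below only builds the argument lists and does not run anything. The helper names and the pipeline file download.nf are hypothetical, and it assumes tmux's new-session subcommand with -d (start detached) and -s (session name).

```python
import shlex


def tmux_launch_command(session_name, pipeline, nextflow_args=()):
    """Build the command that starts `nextflow run` in a detached, named tmux session."""
    # tmux receives the pipeline command as a single string, so quote it safely.
    nextflow_command = shlex.join(["nextflow", "run", pipeline, *nextflow_args])
    # -d: do not attach to the new session; -s: name of the session
    return ["tmux", "new-session", "-d", "-s", session_name, nextflow_command]


def tmux_attach_command(session_name):
    """Build the command that reattaches to a named session."""
    return ["tmux", "attach-session", "-t", session_name]


# "download.nf" is a made-up pipeline name used for illustration.
command = tmux_launch_command("download", "download.nf", ["-resume"])
print(shlex.join(command))
# prints: tmux new-session -d -s download 'nextflow run download.nf -resume'
print(shlex.join(tmux_attach_command("download")))
# prints: tmux attach-session -t download
```

To actually start the session you could pass the list to subprocess.run(command, check=True), or simply paste the printed line into your terminal.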

Screen

screen is similar to tmux, so we will quickly go over the equivalent commands.

To make a named session use:

screen -S <screen_name>

There isn’t a bar at the bottom of your new session like in tmux, so you will have to remember whether you’re within a session or not. To perform screen commands, first press Ctrl+a. To detach, press Ctrl+a then d.

Then you can reattach it with

screen -r <screen_name>

You can use

screen -r

to reattach to your session if you only have one, or to list all of your sessions if you have several.

Where is each of these jobs running?

Nextflow jobs are all run within a work directory. By default, the work directory is ./work but it can be altered by setting

workDir = "/data/dir/for/work"

in the nextflow.config file or on the command line with -w /data/dir/for/work.

For each job that Nextflow runs, it will create a subdirectory and run the job there. Let’s use one of the outputs of our simple hello_world.nf script as an example:

N E X T F L O W  ~  version 22.03.1-edge
Launching `/home/nick/code/Nextflow_Training_2022B/code/hello_world.nf` [nostalgic_kare] DSL2 - revision: 582ccc5833
executor >  local (2)
[23/9bf45d] process > make_file [100%] 1 of 1 ✔
[9d/9c5ecd] process > echo_file [100%] 1 of 1 ✔
HELLO WORLD

The characters between the square brackets on the left are the start of the hash that identifies the most recently launched job for that process. The job’s subdirectory is named after this hash, so if we want to investigate the make_file job, we can move into its subdirectory like so:

cd work/23/9bf45d111e14368b2438367f6813c2/

Here I copied 23/9bf45d and then pressed Tab to complete the 30-character hexadecimal name of the subdirectory that Nextflow created. Let’s see what is within the subdirectory:

ls

message.txt

We can see the message.txt file we created in our process. If we want to investigate the files that Nextflow creates, we must look at the hidden files:

ls -a

.
..
.command.begin
.command.err
.command.log
.command.out
.command.run
.command.sh
.exitcode
message.txt

Let us go through what each of these files does:

.command.begin: This file is created once the job has begun (no longer in a queue)

.exitcode: Once the job has completed, this file will be created with the exit code (0 means it finished successfully and other numbers are errors)

.command.out: The standard output (stdout) of the job

.command.err: The standard error (stderr) of the job

.command.log: Both the standard output and error of the job. This file is very useful for debugging as it will contain all outputs of the code. You can use tail -f .command.log to see what the job is outputting in real-time.

.command.sh: This is the code that your job is going to run. It will look like the script from your process with all variables replaced by their values.

.command.run: “Here there be monsters”. This file contains all the magic that Nextflow uses to run your job. You do not have to understand what it is doing, but when using Nextflow on a supercomputing cluster, there are a few useful things in this file including: the SLURM SBATCH commands in the header, the modules you load, and the container command (singularity or docker).
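Based on the file descriptions above, here is a minimal sketch of how you might summarise the state of a job from its work directory. The function names job_state and tail are my own, and the sketch assumes that .exitcode contains just the exit code as text. A job that was killed before it could write .exitcode will still be reported as started.

```python
import tempfile
from pathlib import Path


def job_state(work_dir):
    """Classify a job using the hidden files in its work directory."""
    work_dir = Path(work_dir)
    exit_file = work_dir / ".exitcode"
    if exit_file.exists():
        code = exit_file.read_text().strip()
        return "succeeded" if code == "0" else f"failed with exit code {code}"
    if (work_dir / ".command.begin").exists():
        return "started, no exit code yet"
    return "not started"


def tail(work_dir, filename=".command.log", count=10):
    """Return the last `count` lines of one of the job's output files."""
    path = Path(work_dir) / filename
    if not path.exists():
        return []
    return path.read_text(errors="replace").splitlines()[-count:]


# Demonstration in a temporary directory that imitates a work directory.
with tempfile.TemporaryDirectory() as tmp:
    print(job_state(tmp))    # not started
    Path(tmp, ".command.begin").touch()
    print(job_state(tmp))    # started, no exit code yet
    Path(tmp, ".exitcode").write_text("1")
    Path(tmp, ".command.log").write_text("SyntaxError: Missing parentheses in call to 'print'.\n")
    print(job_state(tmp))    # failed with exit code 1
    print(tail(tmp))
```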

Finding jobs using a trace report

If you want to find a job that wasn’t the most recent one for its process, the easiest way is to use the trace report. To output a trace report, use the -with-trace command line option like so:

nextflow run <pipeline name> -with-trace

This will create a file named trace.txt, which is tab-separated. For readability I have converted it to a table:

task_id	hash	native_id	name	status	submit	duration	realtime	%cpu	peak_rss	peak_vmem	rchar	wchar
2	77/9d803f	30901	get_version	CACHED	2022-04-12 16:04:06.504	832ms	5ms	76.9%	0	0	55 KB	229 B
3	d7/b13e5b	330369	bane_raw (1)	CACHED	2022-04-14 15:33:31.335	1.8s	1s	95.2%	11.1 MB	16.1 MB	8.1 MB	91.2 KB
4	fc/1941a3	330938	bane_raw (2)	CACHED	2022-04-14 15:33:32.308	1.5s	966ms	99.9%	11.1 MB	16.1 MB	8.1 MB	91.2 KB
5	be/adec6e	331297	bane_raw (3)	CACHED	2022-04-14 15:33:33.181	1.5s	961ms	99.9%	10.7 MB	16.1 MB	8.1 MB	91.2 KB
6	c2/ba3e2e	348200	initial_sfind (1)	COMPLETED	2022-04-14 16:14:31.258	3.4s	2.6s	92.8%	166.8 MB	2 GB	33.8 MB	96.1 KB
8	5c/67bac7	348867	initial_sfind (3)	COMPLETED	2022-04-14 16:14:34.678	3s	2.4s	96.3%	166.6 MB	2 GB	33.8 MB	96 KB
7	30/29f5ee	349396	initial_sfind (2)	COMPLETED	2022-04-14 16:14:37.689	3.1s	2.5s	94.0%	166.9 MB	2 GB	33.8 MB	96 KB
1	22/f4e019	348175	download_gleam_catalogue	COMPLETED	2022-04-14 16:14:31.240	18.3s	17.5s	25.0%	264.7 MB	1.1 GB	9.8 MB	44.3 MB
9	db/d98083	350074	fits_warp (1)	COMPLETED	2022-04-14 16:14:49.549	7.1s	5.6s	84.2%	305.7 MB	5.2 GB	55.3 MB	776.3 KB
10	23/fb45cc	350082	fits_warp (2)	COMPLETED	2022-04-14 16:14:49.552	7.2s	5.6s	85.3%	305.2 MB	5.2 GB	55.3 MB	775.9 KB
11	6a/48bd36	352365	fits_warp (3)	COMPLETED	2022-04-14 16:14:56.643	6.5s	5.3s	87.7%	547.6 MB	9.8 GB	55.3 MB	826.4 KB
12	29/304cf8	353616	make_mean_image	COMPLETED	2022-04-14 16:15:03.106	749ms	117ms	98.8%	3.2 MB	5.6 MB	677.3 KB	408.3 KB
13	e9/fe10ba	353955	bane_mean_image	COMPLETED	2022-04-14 16:15:03.865	2.2s	1.6s	94.9%	102.4 MB	3 GB	8.1 MB	102.5 KB
14	05/3b2508	354339	sfind_mean_image	COMPLETED	2022-04-14 16:15:06.029	3.6s	3s	96.6%	167.2 MB	2 GB	33.8 MB	99.2 KB
15	2d/10b294	354796	mask_images (1)	COMPLETED	2022-04-14 16:15:09.677	2.4s	1.4s	93.1%	103.9 MB	1.7 GB	21.4 MB	45.6 KB
17	06/8f74e6	354808	mask_images (2)	COMPLETED	2022-04-14 16:15:09.683	2.5s	1.5s	94.9%	96.8 MB	1.7 GB	21.4 MB	45.6 KB
20	e8/cc1f75	355540	mask_images (3)	COMPLETED	2022-04-14 16:15:12.093	2s	1.3s	95.7%	118.6 MB	1.7 GB	21.4 MB	45.6 KB
16	ec/a0a5b8	355611	source_monitor (1)	COMPLETED	2022-04-14 16:15:12.189	4.3s	3.7s	96.8%	165.4 MB	2 GB	44.4 MB	118.5 KB
18	6d/f0d725	356204	source_monitor (3)	COMPLETED	2022-04-14 16:15:14.077	4.2s	3.6s	96.8%	166.6 MB	18.3 GB	44.4 MB	118.5 KB
19	d9/c3773c	356757	source_monitor (2)	COMPLETED	2022-04-14 16:15:16.499	4.5s	3.8s	96.5%	166.8 MB	18.3 GB	44.4 MB	118.5 KB
21	c9/7bd0b6	357308	sfind_masked (1)	COMPLETED	2022-04-14 16:15:18.233	4.6s	4s	96.2%	167.4 MB	2 GB	56 MB	102.1 KB
22	a7/115d00	357951	sfind_masked (2)	COMPLETED	2022-04-14 16:15:20.968	4.5s	3.8s	96.5%	165.8 MB	2 GB	56 MB	90.7 KB
24	c3/ea3979	358988	join_fluxes	COMPLETED	2022-04-14 16:15:25.461	1.3s	698ms	98.4%	11.1 MB	16.1 MB	9.1 MB	22.5 KB
23	87/5fd3d6	358416	sfind_masked (3)	COMPLETED	2022-04-14 16:15:22.854	4.9s	4.3s	97.1%	166.6 MB	2 GB	56 MB	90.8 KB
26	b7/03301d	359719	compile_transients_candidates	COMPLETED	2022-04-14 16:15:27.770	1.2s	677ms	95.2%	10.7 MB	16.1 MB	8.9 MB	17.3 KB
25	f3/6253e7	359393	compute_stats	COMPLETED	2022-04-14 16:15:26.781	2.6s	1.9s	95.4%	41.8 MB	851.8 MB	23.3 MB	21.5 KB
27	2f/2f1f3b	360080	transients_plot	COMPLETED	2022-04-14 16:15:28.993	2.3s	1.6s	97.8%	70.5 MB	1 GB	21.1 MB	172.9 KB
28	fc/33d0e0	360250	plot_lc	COMPLETED	2022-04-14 16:15:29.353	16.1s	15.4s	93.9%	190.8 MB	2.1 GB	452.9 MB	1.6 MB

This can be used to find the hashes of all jobs with a certain name, with a FAILED status, or based on their duration, etc.
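Because the trace file is tab-separated, it is easy to filter with a few lines of code. The sketch below is one way to do it, assuming the column names shown in the table above. The helper names are my own, the sample rows are copied from the table with most columns trimmed, and the full-length directory name in the demonstration is made up.

```python
import csv
import glob
import io
import os
import tempfile


def read_trace(handle):
    """Read a tab-separated trace report into a list of dictionaries."""
    return list(csv.DictReader(handle, delimiter="\t"))


def select_tasks(rows, status=None, name_prefix=None):
    """Keep only the rows that match the requested status and/or name prefix."""
    selected = []
    for row in rows:
        if status is not None and row["status"] != status:
            continue
        if name_prefix is not None and not row["name"].startswith(name_prefix):
            continue
        selected.append(row)
    return selected


def work_dirs_for(short_hash, work_root="work"):
    """Expand a short hash such as 'c2/ba3e2e' to the matching work directories."""
    return sorted(glob.glob(os.path.join(work_root, short_hash + "*")))


# Three rows copied from the table above, with most columns trimmed.
sample = (
    "task_id\thash\tname\tstatus\n"
    "2\t77/9d803f\tget_version\tCACHED\n"
    "6\tc2/ba3e2e\tinitial_sfind (1)\tCOMPLETED\n"
    "8\t5c/67bac7\tinitial_sfind (3)\tCOMPLETED\n"
)
rows = read_trace(io.StringIO(sample))
for row in select_tasks(rows, name_prefix="initial_sfind"):
    print(row["hash"], row["name"])

# The rest of this directory name is invented purely for the demonstration.
with tempfile.TemporaryDirectory() as work_root:
    os.makedirs(os.path.join(work_root, "c2", "ba3e2e" + "0" * 24))
    print(work_dirs_for("c2/ba3e2e", work_root))
```

For a real report, open the file with open("trace.txt", newline="") and pass the handle to read_trace, then use status="FAILED" to list the jobs that need investigating.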

How to debug a Nextflow Job

To explain how to debug Nextflow, let us make a simple Nextflow pipeline with an intentional error:

process python_job {
    output:
        stdout

    """
    #!/usr/bin/env python
    for i in range(3):
        print i
    """
}

workflow {
   python_job()
}

When I run this pipeline, Nextflow will output a verbose error message:

N E X T F L O W  ~  version 22.03.1-edge
Launching `error_check.nf` [loving_wescoff] DSL2 - revision: 44ac86534d
executor >  local (1)
executor >  local (1)
[28/ddb3ee] process > python_job [100%] 1 of 1, failed: 1 ✘
Error executing process > 'python_job'

Caused by:
  Process `python_job` terminated with an error exit status (1)

Command executed:

  #!/usr/bin/env python
  for i in range(3):
      print i

Command exit status:
  1

Command output:
  (empty)

Command error:
    File ".command.sh", line 3
      print i
            ^
  SyntaxError: Missing parentheses in call to 'print'. Did you mean print(i)?

Work dir:
  /home/nick/code/Nextflow_Training_2022B/code/work/28/ddb3ee0334146697ecdb5c0d6df039
Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

This gives you lots of useful information about the error, including which process caused it, the command executed, the stderr and the work directory where the job was run. From the error message alone, you can likely see that the script uses the Python 2 print statement instead of the Python 3 print function. For such a simple example, you can make the fix and rerun the pipeline. For a more involved pipeline that takes hours to run, it is best to confirm you have fixed the problem for this job before rerunning the whole pipeline.

To debug the job, we first move into the work directory, which for me is:

cd /home/nick/code/Nextflow_Training_2022B/code/work/28/ddb3ee0334146697ecdb5c0d6df039

From here, we can edit the .command.sh file to fix the print bug like so:

#!/usr/bin/env python
for i in range(3):
    print(i)

You can then test that your fix works by running the job the same way that Nextflow does using the .command.run (not .command.sh) file like so:

bash .command.run

Or, if you are on a supercomputer and using a resource manager, you will have to submit .command.run with its job submission command, which for SLURM is:

sbatch .command.run

If you have fixed your job you should see the output:

0
1
2

Now that we are confident that we know how to fix the job, we can apply the same changes to the Nextflow pipeline and rerun the pipeline.
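If you find yourself repeating this edit-and-test loop, the local rerun step can be wrapped in a small helper. This is a sketch only: rerun_job is a hypothetical name, it covers just the local bash case (not sbatch), and the demonstration uses a stand-in .command.run written by the script itself rather than one generated by Nextflow.

```python
import subprocess
import tempfile
from pathlib import Path


def rerun_job(work_dir, shell="bash"):
    """Run .command.run inside work_dir and return (exit code, stdout, stderr)."""
    result = subprocess.run(
        [shell, ".command.run"],
        cwd=work_dir,            # run from the job's work directory, as Nextflow does
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout, result.stderr


# Stand-in for a real work directory; "sh" is used so the demo runs without bash.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, ".command.run").write_text("echo 0\necho 1\necho 2\n")
    code, out, err = rerun_job(tmp, shell="sh")
    print("exit code:", code)
    print(out, end="")
```

With a real job you would call rerun_job on the work directory printed in the error message and check that the exit code is 0 before touching the pipeline script.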

Resuming pipelines

One of the benefits of Nextflow is that you can resume pipelines that were manually stopped or stopped due to an error. We do this with the -resume argument. Note that -resume has only a single dash because it is a Nextflow option, not a pipeline parameter (pipeline parameters are set with two dashes).

Nextflow keeps track of all the processes executed in your pipeline. If you modify some parts of your script, only the changed processes will be re-executed. Processes that have not changed are skipped and their cached results are used instead.

For the cache to be used, the pipeline must be resumed from the same directory. This is because the .nextflow/ directory is created where you run the pipeline and is used to record what processes have already been executed.

To keep your pipeline resumable, avoid introducing non-deterministic behaviour. For this reason, you should avoid the merge and mix channel operators.

If we resume our previous example now that we have fixed the error:

nextflow run error_check.nf -resume

N E X T F L O W  ~  version 22.03.1-edge
Launching `error_check.nf` [astonishing_kalam] DSL2 - revision: 1d0498de14
executor >  local (1)
[02/bb7b9f] process > python_job [100%] 1 of 1 ✔

You can see that it runs the process again because Nextflow knows it failed previously. If we run the same command again:

N E X T F L O W  ~  version 22.03.1-edge
Launching `error_check.nf` [trusting_wing] DSL2 - revision: 1d0498de14
[02/bb7b9f] process > python_job [100%] 1 of 1, cached: 1 ✔

You can now see that instead of rerunning the job, Nextflow used the cached result.
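To build some intuition for why changed processes are rerun and why non-deterministic ordering hurts resuming, here is a toy model of a cache key. It is not Nextflow's actual implementation. It simply assumes a job is identified by hashing its script text together with its inputs in the order they arrive, and task_key and the file names are made up for illustration.

```python
import hashlib


def task_key(script, inputs):
    """Toy cache key: a hash of the script text and the ordered list of inputs."""
    digest = hashlib.sha256()
    for part in [script, *inputs]:
        digest.update(part.encode("utf-8"))
        digest.update(b"\0")  # separator so ("ab", "c") differs from ("a", "bc")
    return digest.hexdigest()


old_script = "for i in range(3):\n    print i\n"
new_script = "for i in range(3):\n    print(i)\n"
inputs = ["a.fits", "b.fits"]

# Same script and inputs: same key, so a cached result could be reused.
print(task_key(new_script, inputs) == task_key(new_script, inputs))   # True
# Edited script: different key, so the job has to run again.
print(task_key(old_script, inputs) == task_key(new_script, inputs))   # False
# Same inputs in a different order: different key, so the cache is missed.
print(task_key(new_script, inputs) == task_key(new_script, ["b.fits", "a.fits"]))  # False
```

The last line is the reason to be careful with operators whose output order can change between runs.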

How do I output files? publishDir

Unless specified, all files created by the pipeline will stay within the working directory. You can use publishDir to output files to a directory outside the Nextflow work directory. For example:

process final_data {
    publishDir '/home/data/'

    output:
    path 'science.data'

    '''
    echo "Some Science" > science.data
    '''
}

This will output the science.data file to /home/data/.

By default, publishDir outputs the files as symlinks so you will lose the published data once you delete the work directory. Instead you can use copy as the publishDir mode like so:

process final_data {
    publishDir '/home/data/', mode: 'copy'

    output:
    path 'science.data'

    '''
    echo "Some Science" > science.data
    '''
}

It is normally best to use copy instead of move as the pipeline will have to rerun the process if the output data is moved out of the work directory.
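The difference between the modes can be imitated with ordinary file operations. The sketch below is an illustration rather than how Nextflow implements publishDir: the publish function is hypothetical, it models only the symlink, copy and move modes discussed here, and it assumes a POSIX system where symlinks can be created freely.

```python
import os
import shutil
import tempfile


def publish(source, publish_dir, mode="symlink"):
    """Imitate publishing one output file using the given mode."""
    os.makedirs(publish_dir, exist_ok=True)
    target = os.path.join(publish_dir, os.path.basename(source))
    if mode == "symlink":
        os.symlink(os.path.abspath(source), target)
    elif mode == "copy":
        shutil.copy2(source, target)
    elif mode == "move":
        shutil.move(source, target)
    else:
        raise ValueError(f"unsupported mode in this sketch: {mode}")
    return target


with tempfile.TemporaryDirectory() as tmp:
    work = os.path.join(tmp, "work")
    os.makedirs(work)
    data = os.path.join(work, "science.data")
    with open(data, "w", encoding="utf-8") as handle:
        handle.write("Some Science\n")

    linked = publish(data, os.path.join(tmp, "linked"), mode="symlink")
    copied = publish(data, os.path.join(tmp, "copied"), mode="copy")

    shutil.rmtree(work)              # imitate cleaning up the work directory
    print(os.path.exists(linked))    # False: the symlink now points at nothing
    print(os.path.exists(copied))    # True: the copy is independent of work/
```

After a move, the file no longer exists in the work directory, which is why a resumed pipeline cannot reuse the cached result of that process.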

Key Points

You can install Nextflow scripts by putting them on the $PATH

Terminal multiplexers (screen and tmux) make running pipelines easier

Trace reports and error outputs help you find failed jobs

You can resume pipelines with -resume
