
Step 13 - Files as input/output

Let us tackle a different angle now and start dealing with files as input and output. To do this, we take the functionality from earlier and modify it so that the input is read from a file and the output is written to a file as well.

The following combination of process and workflow definition does exactly the same as before, but now reads its input from one or more files, each containing just a single integer number:

// Step - 13
process process_step13 {
  input:
    tuple val(id), file(input), val(config)
  output:
    tuple val("${id}"), file("output.txt"), val("${config}")
  script:
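    // \$a and \$result are escaped so that bash resolves them at run time,
    // while $input and ${config.term} are interpolated by Nextflow/Groovy
    // when the script is composed.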
    """
    a=`cat $input`
    let result="\$a + ${config.term}"
    echo "\$result" > output.txt
    """
}
workflow step13 {
  Channel.fromPath( params.input ) \
    | map{ el ->
      [
        el.baseName.toString(),
        el,
        [ "operator" : "-", "term" : 10 ]
      ]} \
    | process_step13 \
    | view{ [ it[0], it[1] ] }
}

While doing this, we also introduced a way to specify parameters via a configuration file (nextflow.config) or from the CLI.
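A default value for such a parameter can be declared in nextflow.config. The snippet below is only an illustrative sketch, not necessarily the exact contents of the configuration file that ships with this material:

// nextflow.config (illustrative sketch)
params {
  input = "data/input1.txt"  // default value, overridden by --input on the CLI
}

In our case we pass params.input on the CLI, for instance: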

> nextflow -q run . -entry step13 --input data/input1.txt
WARN: DSL 2 IS AN EXPERIMENTAL FEATURE UNDER DEVELOPMENT -- SYNTAX MAY CHANGE IN FUTURE RELEASE
[input1, <...>/work/17/bde087db08bba02e1a2dcce41c8892/output.txt]

Let’s dissect what is going on here…

  1. We provide data/input1.txt via --input, which automatically adds it to the params map as params.input.
  2. The content of input1.txt is used in the simple sum just as before.
  3. The output Channel contains the known triplet, but this time the second entry is not a plain value but a file.

Please note that the file output.txt is automatically stored in the (unique) work directory. We can take a look inside to verify that the calculation succeeded:

> cat $(nextflow log | cut -f 3 | tail -1 | xargs nextflow log)/output.txt
11

It seems the calculation went well (input1.txt apparently contains the value 1, and 1 + 10 = 11), although one might be surprised by two things:

  1. The output of the calculation is stored in some randomly generated work directory, whereas we might want it somewhere easier to find (see the sketch below).
  2. The process itself hard-codes the name of the output file, which may seem odd… and it is.
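For the first point, Nextflow provides the publishDir directive, which copies (or links) a process's declared outputs to a directory of your choosing, in addition to keeping them in the work directory. The sketch below simply adds the directive to the process from above; the target directory "output" and the copy mode are just example choices:

process process_step13 {
  // Publish the declared outputs to ./output as well; the directory
  // name and mode are arbitrary choices for this sketch.
  publishDir "output", mode: 'copy'

  input:
    tuple val(id), file(input), val(config)
  output:
    tuple val("${id}"), file("output.txt"), val("${config}")
  script:
    """
    a=`cat $input`
    let result="\$a + ${config.term}"
    echo "\$result" > output.txt
    """
}

Keep in mind that if several parallel tasks all declare a file called output.txt, publishing them to the same directory would make them overwrite one another, which is exactly why the second point matters.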

Taking our example a bit further and exploiting the fact that parallelism is natively supported by Nextflow, as we've seen before, we can pass multiple input files to the same workflow defined above.

> nextflow -q run . -entry step13 --input "data/input?.txt"
WARN: DSL 2 IS AN EXPERIMENTAL FEATURE UNDER DEVELOPMENT -- SYNTAX MAY CHANGE IN FUTURE RELEASE
[input3, <...>/work/7b/1d1d756ff5c66dbc56daa6793ec324/output.txt]
[input1, <...>/work/6c/c09325a112f2a749ec10cacd222e47/output.txt]
[input2, <...>/work/0e/7b5a91a27313b0218ac5fcf8f2b09a/output.txt]

Please note that we

  1. provide the path to the input files,
  2. use a wildcard (? in this case) to select multiple files,
  3. enclose the pattern in double quotes so that the shell does not expand the wildcard and Nextflow can resolve the glob itself.

This time we end up with 3 output files, each named output.txt and each stored in its own (unique) work directory. Note also that the results arrive in the channel in whatever order the parallel tasks happen to finish, not necessarily the order of the input files. A sketch of one way to give the output files distinct names follows at the end of this step.

> nextflow log | cut -f 3 | tail -1 | xargs nextflow log | xargs ls
<...>/work/0e/7b5a91a27313b0218ac5fcf8f2b09a:
input2.txt
output.txt

<...>/work/6c/c09325a112f2a749ec10cacd222e47:
input1.txt
output.txt

<...>/work/7b/1d1d756ff5c66dbc56daa6793ec324:
input3.txt
output.txt
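
Regarding the second surprise, one way to give each run a distinct output file is to derive the filename from the id that travels with the tuple. The variant below is only a sketch (the process name process_step13_named and the naming scheme are illustrations, not necessarily how this is handled later in this material):

// Hypothetical variant of the process with per-sample output filenames
process process_step13_named {
  input:
    tuple val(id), file(input), val(config)
  output:
    tuple val("${id}"), file("${id}.output.txt"), val("${config}")
  script:
    """
    a=`cat $input`
    let result="\$a + ${config.term}"
    echo "\$result" > ${id}.output.txt
    """
}

With this variant, the results of multiple inputs would be named input1.output.txt, input2.output.txt, and so on, rather than a single, repeatedly reused output.txt; the .output suffix also keeps the result distinct from the staged input file, which shares the same base name.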