
Step 22 - Toward generic processes

Dealing with computational pipelines, and looking at the examples in the previous steps, we note that the input of a process is always the same: a triplet. The output should at least contain the ID and the path to the output file or directory. We want to provide the ConfigMap as output as well, so that input and output become consistent and we can easily chain processes.

Going a step further, we might reflect on the nature of the script part of a process definition. It contains one or more commands, each with its options. For the sake of argument, let's say we need to run a single command. We already know how to provide parameters for input and output; can we go further and provide the command itself?

We could, for instance, provide the full command line instruction via the ConfigMap:

// Step - 22
process process_step22 {
    publishDir "output"
    input:
        // the generic triplet: [ id, file, ConfigMap ]
        tuple val(id), file(input), val(config)
    output:
        // mirror the triplet so that processes can be chained
        tuple val("${id}"), file("${config.output}"), val("${config}")
    script:
        // the full command-line instruction comes straight from the ConfigMap
        """
        ${config.cli}
        """
}
workflow step22 {
    Channel.fromPath( params.input ) \
        | map{ el -> [
            el.baseName.toString(),
            el,
            [
                "cli": "cat input.txt > output22.txt",
                "output": "output22.txt"
            ]
          ]} \
        | process_step22 \
        | view{ [ it[0], it[1] ] }
}
//- - -

Such that

> nextflow -q run . -entry step22 --input data/input.txt
WARN: DSL 2 IS AN EXPERIMENTAL FEATURE UNDER DEVELOPMENT -- SYNTAX MAY CHANGE IN FUTURE RELEASE
[input, <...>/work/cf/0822ca4f216cd2d616d25bc279eaf4/output22.txt]

Unsurprisingly, the content of output22.txt is the same as that of input.txt:

> cat output/output22.txt
1

This may seem silly, but let us make a few remarks anyway:

  1. The output file name is specified in two places, which is not a good idea.
  2. The input file name is specified explicitly in the cli definition. We could get around this by pointing to params.input instead, keeping in mind correct escaping and such. It could work, but it would be error-prone.
  3. One could be tempted (we were) to indeed create one generic process handler, but when looking at a pipeline run, one would not be able to distinguish the different processes from each other because the name of the process is used as an identifier.

So, while this may seem like a stupid thing to do, we use a very similar method in DiFlow, taking the above points into account.

In practice, we do not specify the command line like shown above, but rather by specifying the command and options by means of the ConfigMap for that specific process. Let us give an example:

params {
  ...
  cellranger {
    name = "cellranger"
    container = "mapping/cellranger:4.0.0-1"
    command = "cellranger"
    arguments {
      mode {
        name = "mode"
        otype = "--"
        description = "The run mode"
        value = "count"
        ...
      }
      input {
        name = "input"
        otype = "--"
        description = "Path of folder created by mkfastq or bcl2fastq"
        required = false
        ...
      }
      id {
        name = "id"
        otype = "--"
        description = "Output folder for the count files."
        value = "cr"
        required = false
        ...
      }
      ...
    }
    ...
  }
}

In reality, this is still a simplified version, because we also use variables in nextflow.config, but that is only for convenience.

The above is a representation of the command-line instruction to be provided inside a container (mapping/cellranger:4.0.0-1). The command itself is cellranger and the different options are listed as keys under arguments.
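To make this concrete, our guess (not verbatim DiFlow output) is that the above representation renders to a command line along these lines, assuming each otype is prefixed to its option name and that input, which has no value yet, is filled in at run time:

cellranger --mode count --id cr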

Every DiFlow module has its own nextflow.config that contains a representation of the CLI instruction to run as well as a pointer to the container to be used for running.

We have a Groovy function in Nextflow that takes params.cellranger and creates the CLI instruction that corresponds to it. Values for input and output are set during the pipeline run, based on the input provided and a closure for generating the output file name. To give an idea of what this CLI rendering could look like, a sketch is given below.
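What follows is only a minimal sketch, not the actual DiFlow code: the name renderCLI is ours, and we assume the arguments section is available as a plain Groovy map in which every entry carries name, otype and, optionally, value keys.

// Hypothetical sketch of a CLI renderer; not the actual DiFlow code.
// Assumptions: `config` is a plain map, each argument entry has `name`,
// `otype` and optionally `value`; arguments without a value are skipped.
def renderCLI = { config ->
    def options = config.arguments
        .findAll { key, arg -> arg.value != null }
        .collect { key, arg -> "${arg.otype}${arg.name} ${arg.value}" }
        .join(" ")
    return "${config.command} ${options}".trim()
}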

A nextflow.config file with content like above is created for every module, i.e., for every processing step in the pipeline. On the level of the pipeline those config files are sourced, i.e.:

includeConfig '<...>/nextflow.config'
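For instance, a pipeline-level nextflow.config might source the configs of several modules; the module paths below are purely illustrative:

// Illustrative pipeline-level nextflow.config; module paths are hypothetical.
includeConfig 'modules/cellranger/nextflow.config'
includeConfig 'modules/fastqc/nextflow.config'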

It may seem like a daunting task to create a config file like this for every computational step, and it is. We are not doing this manually, as that would be too error-prone and frustrating on top of that. We are just laying out the principles here; later we will see how viash can create the nextflow.config file for us.

Transforming the relevant section of nextflow.config into a command-line instruction is done by a Groovy function that simply parses the ConfigMap.
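Tying the pieces together with the hypothetical renderCLI sketch from earlier, using such a function could look like this:

// Hypothetical usage of the renderCLI sketch shown earlier.
def cli = renderCLI(params.cellranger)
// cli now holds the command line shown before: cellranger --mode count --id cr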