General Requirements and design principles

Reproducibility

I originally did not include it as a design principle for the simple reason that I think it’s obvious. This should be every researcher’s top priority.

Pipeline Parameters vs Runtime Parameters

We make a strict distinction between parameters that are defined for the FULL pipeline and those that are defined at runtime.

Pipeline Parameters

We currently have 4 pipeline parameters: Docker prefix, ddir, rdir and pdir.

Runtime Parameters

Runtime parameters differ from pipeline parameters in that they may be different for parallel runs of a process. A few examples:

Some samples may require different filter threshold than others
After concatenation, clustering may be run with different cluster parameters
etc.

In other words, it does not make sense to define those parameters for the full pipeline because they are not static.

Consistent API

When we started out with the project and chose to use NextFlow as a workflow engine, I kept on thinking that the level of abstraction should have been higher. With DSL1, all you could do was create one long list of NextFlow code, tied together by Channels.

With DSL2, it became feasible to organise stuff in separate NextFlow files and import what is required. But in larger codebases, this is not really a benefit because every modules/workflow may have its own parameters and output. No structure is imposed. Workflows are basically functions taking parameters in and returning values.

I think it makes sense to define an API and to stick to it as much as possible. This makes using the modules/workflows easier…

Flat Module Structure

We want to avoid having nested modules, but rather support a pool of modules to be mixed and matched.

As a consequence, this allows a very low threshold for including third-party modules: just add it to the collection of modules and import it in the pipeline. In order to facilitate the inclusion of such third-party modules that are developed in their own respective repositories, we added one additional layer in the hierarchy allowing for such a splitting.

Job Serialization

We avoid requiring the sources of the job available in the runtime environment, i.e., the Docker container. In other words, all code and config is serialized and sent with the process.