Link Search Menu Expand Document

Toward implementation

What is needed as information in order to run a computation step in a pipeline?

  1. First, we need data or generally speaking, input. Components/modules and pipelines should run zero-touch, so input has to be provided at startup time.

  2. Secondly, we need to know what to run en how to run it. This is in effect the definition of a modules or pipeline step.

  3. Thirdly, in many cases we will require the possibility to change parameters for individual modules in the pipeline, for instance cutoff values for a filter, or the number of clusters for a clustering algorithm. The classical way to do that is via the params object.

One might wonder if there is a difference between input and parameters pointing to input is also a kind of parametrization. The reason those are kept apart is that additional validation steps are necessary for the data. Most pipeline systems trace input/output closely whereas parameters are ways to configure the steps in the pipeline.

In terms of FRP, and especially in the DataFlow model, we also have to keep track of the forks in a parallel execution scenario. For instance, if 10 batches of data can be processed in parallel we should give all 10 of them an ID so that individual forks can be distinguished. We will see that those IDs become crucial in most pipelines.

We end up with a model for a stream/channel as follows (conceptually):

[ ID, data, config ]

were

  • ID is just a string or any object for that matter that can be compared later. We usually work with strings.
  • data is a pointer to the (input) data. With NextFlow, this should be a Path object, ideally created using the file() helper function.
  • config is a nested Map where the first level keys are chosen to be simply an identifier of the pipeline step. Other approaches can be taken here, but that’s what we did.

This can be a triplet, or a list with mixed types. In Groovy, both can be used interchangeably.

The output of a pipeline step/mudules adheres to the same structure so that pipeline steps can easily be chained.