Batteries Included: Supercharging Bioinformatics Modules with Viash

Keywords

workflow automation, modular workflows, reproducible workflows, workflow development tools, batch processing, containerized workflows, automated testing workflows, pipeline development, scalable data workflows, workflow orchestration

Part 2: Batteries included: Supercharging Bioinformatics Modules with Viash

TL;DR: Viash comes with powerful built-in features that would normally require significant additional coding: parallel batch processing for speed, container management for reproducibility, and integrated testing for reliability. These “batteries included” features save you from writing hundreds of lines of boilerplate code.

In our previous post, we introduced how Viash simplifies bioinformatics tool management by transforming scripts into self-contained components. Now, let’s explore three powerful built-in capabilities that make Viash components truly production-ready.

Reliability: Integrated Testing

The Testing Challenge in Bioinformatics

Testing bioinformatics tools traditionally requires:

Writing custom test scripts
Managing test data
Setting up test environments
Tracking expected outputs

These tasks are often skipped due to time constraints, leading to unreliable tools and hard-to-track bugs. Viash solves this by making testing a first-class citizen in the component lifecycle.

Built-in Testing with Viash

Let’s get back to our SAMtools example from the previous post in this series. To add unit tests, we can simply add a test script alongside our script and Viash config, then update the config to include testing.

First, let’s create a test script (test.sh). Test scripts can be written in your language of choice, including Python, R, Bash and JavaScript. A test script doesn’t even need to be written in the same language as your main script, as long as all the required dependencies are available! Your main script could be written in R or Python, for example, and the unit test in Bash.

#!/bin/bash

echo ">>> Testing $meta_functionality_name"
"$meta_executable" \
 --input "$meta_resources_dir/test.paired_end.sorted.bam" \
 --output "$meta_resources_dir/test.paired_end.sorted.txt"

echo ">>> Checking whether output is non-empty"
[ ! -s "$meta_resources_dir/test.paired_end.sorted.txt" ] && echo "File 'test.paired_end.sorted.txt' is empty!" && exit 1

echo ">>> Checking whether output is correct"

diff <(grep -v "^# The command" "$meta_resources_dir/test.paired_end.sorted.txt") \
   <(grep -v "^# The command" "$meta_resources_dir/ref.paired_end.sorted.txt") || \
   { echo "Output file test.paired_end.sorted.txt does not match expected output"; exit 1; }

rm "$meta_resources_dir/test.paired_end.sorted.txt"

echo ">>> All tests passed successfully."

exit 0
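The two checks in this test script (non-empty output, and output equal to a reference once the command-line header is ignored) tend to recur across components. As a minimal sketch, they could be factored into small helper functions. The names assert_nonempty and assert_same_except are hypothetical and not part of Viash.

```shell
#!/bin/bash
# Hypothetical helpers, not part of Viash: reusable versions of the two checks above.

# Fail unless the file exists and is non-empty.
assert_nonempty() {
  [ -s "$1" ] || { echo "File '$1' is missing or empty!" >&2; return 1; }
}

# Fail unless two files are identical once lines matching a pattern are dropped.
assert_same_except() {
  local pattern="$1" actual="$2" expected="$3" tmp_dir status=0
  tmp_dir=$(mktemp -d)
  grep -v "$pattern" "$actual" > "$tmp_dir/actual"
  grep -v "$pattern" "$expected" > "$tmp_dir/expected"
  diff "$tmp_dir/actual" "$tmp_dir/expected" > /dev/null || status=1
  rm -rf "$tmp_dir"
  [ "$status" -eq 0 ] || { echo "'$actual' does not match '$expected'" >&2; return 1; }
}
```

In a test script, each helper would be followed by `|| exit 1` so that a failed check stops the test with a non-zero exit code.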

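The test script relies on meta variables that Viash makes available at runtime: $meta_executable points to the component being tested, and $meta_resources_dir to the directory that holds the test resources.
Next, we update the Viash config (config.vsh.yaml) with the following test resources: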

name: samtools_stats

arguments:
…

test_resources:
  - type: bash_script
    path: test.sh
  - type: file
    path: test.paired_end.sorted.bam
  - type: file
    path: ref.paired_end.sorted.txt

engines:
…

Note that multiple test scripts and test data files can be listed in the test_resources section of your Viash config; all of them will be evaluated.

Testing the component is now as simple as executing a single CLI command.

viash ns test -q samtools_stats

Why Viash Testing is a Game-Changer

This built-in testing approach provides several key advantages:

Containerized Testing Environment: Tests run in the exact same environment as your production code, eliminating “works on my machine” problems
Consistent Resources: Test data and scripts are version-controlled alongside your main script
CI/CD Integration: Tests can be easily integrated into CI/CD pipelines, facilitating long-term project maintainability

Parallel Processing: Built-In Batch Mode

The Parallel Processing Challenge in Bioinformatics

One of the most common requirements in bioinformatics is processing multiple samples efficiently. Doing so means handling resource management, logging, and monitoring, and the typical bioinformatics answer is to use more tools and write more scripts.

With Viash, batch processing comes built-in. Let’s see how this works with our SAMtools example.

The Viash Way: Powerful Parameter Lists

First, we create a param_list file (param_list.yaml), where we define the different samples we want to process.

- id: sample_1
  input: test.paired_end.sorted_1.bam
  output: test.paired_end.sorted_1.txt
- id: sample_2
  input: test.paired_end.sorted_2.bam
  output: test.paired_end.sorted_2.txt
- id: sample_3
  input: test.paired_end.sorted_3.bam
  output: test.paired_end.sorted_3.txt
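For more than a handful of samples, writing this file by hand gets tedious. A minimal sketch that generates the entries from a directory of BAM files might look like the following. The function name make_param_list is hypothetical, and deriving the sample id and output name from the file name is an assumption made for illustration.

```shell
#!/bin/bash
# Hypothetical helper, not part of Viash: print one param_list entry per BAM file.
make_param_list() {
  local bam_dir="$1" bam id
  for bam in "$bam_dir"/*.bam; do
    [ -e "$bam" ] || continue      # skip the unexpanded glob when no BAM files exist
    id=$(basename "$bam" .bam)     # assumption: the sample id is the file name without .bam
    echo "- id: $id"
    echo "  input: $bam"
    echo "  output: $id.txt"
  done
}

# Usage (not executed here): make_param_list path/to/bams > param_list.yaml
```

The sketch writes values unquoted, so file names containing characters that are special in YAML would need quoting.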

As described in our previous blog post, Viash can build our script into a standalone Nextflow module. This lets us use Nextflow’s dataflow channels, in which each sample travels through the module as a separate event, to process samples in parallel.
Passing the param_list.yaml file to the Nextflow module on the command line processes the samples defined in the file in parallel and asynchronously.

nextflow run target/nextflow/samtools_stats/main.nf \
  --param_list param_list.yaml \
  -profile docker \
  --publish_dir test

Why Viash Batch Processing is a Game-Changer

Efficient Parallel Processing: Built-in asynchronous execution automatically distributes multiple samples across available computing resources without requiring custom parallelization code
Simple Parameter Files: Process multiple datasets simultaneously using straightforward parameter lists without needing Nextflow expertise
Flexible Parameter Management: Supports passing event-specific parameters, allowing unique configurations for each sample while maintaining workflow integrity

For a deeper dive into the capabilities of the param_list functionality, you can check out the documentation.

Reproducibility: Simplified Container Management

The Reproducibility Problem

Bioinformaticians frequently run into the frustrating “works on my machine” problem: a script runs perfectly on your system but fails on a colleague’s computer or after moving to an HPC or cloud environment.
Container technologies like Docker solve this by packaging software with its dependencies, but they introduce complexity of their own:

Tracking container versions for reproducibility becomes a burden
Writing Dockerfiles requires specialized knowledge
Managing build processes is time-consuming
Configuring proper volume mounts and permissions is error-prone

Automated Container Management with Viash

When building and running a Viash component, various Docker procedures are handled under the hood:

Generation of the appropriate Dockerfile
Building of the runtime container with optional caching for efficiency
Set-up of proper volume mounts and working directories
Automatic management of container lifecycle and cleanup
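To give a feel for the bookkeeping in the list above, here is a rough sketch of assembling a docker run command by hand for a single BAM file. It only prints the command rather than running it. The function name manual_docker_cmd and the container paths /data/in and /data/out are hypothetical, and this is not the command Viash generates.

```shell
#!/bin/bash
# Hypothetical helper: print (not run) a hand-written `docker run` command for one BAM file.
# It illustrates the manual bookkeeping only; it is not what Viash generates.
manual_docker_cmd() {
  local image="$1" input="$2" out_dir="$3" in_dir in_name
  in_dir=$(cd "$(dirname "$input")" && pwd)   # bind mounts need an absolute host path
  in_name=$(basename "$input")
  # Mount the input directory read-only and the output directory read-write.
  echo "docker run --rm" \
    "-v $in_dir:/data/in:ro" \
    "-v $out_dir:/data/out" \
    "$image" \
    "sh -c 'samtools stats /data/in/$in_name > /data/out/${in_name%.bam}.txt'"
}
```

Even this small example has to resolve absolute paths, choose mount points, and translate host paths into container paths, and it still assumes an absolute output directory and does not handle paths with spaces or file ownership.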

Viash takes container management out of your hands while giving you full control over the container specification. For example, you can add the following custom Docker setup to your Viash config (config.vsh.yaml).

engines:
  - type: docker
    image: quay.io/biocontainers/samtools:1.19.2--h50ea8bc_1
    setup:
      - type: docker
        run: |
          samtools --version 2>&1 | grep -E '^(samtools|Using htslib)' | \
          sed 's#Using ##;s# \([0-9\.]*\)$#: \1#' > /var/software_versions.txt
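To see what the setup step above writes to /var/software_versions.txt, the same grep and sed filter can be run on sample text outside the container. The function name format_versions is hypothetical, and the input lines only mimic the start of `samtools --version` output for illustration.

```shell
#!/bin/bash
# Same filter as in the Docker setup above, wrapped in a function for illustration.
format_versions() {
  grep -E '^(samtools|Using htslib)' | \
  sed 's#Using ##;s# \([0-9\.]*\)$#: \1#'
}

# Illustrative input mimicking the first lines of `samtools --version`.
printf 'samtools 1.19.2\nUsing htslib 1.19.1\nCopyright (C) 2024 Genome Research Ltd.\n' \
  | format_versions
# prints:
#   samtools: 1.19.2
#   htslib: 1.19.1
```

The filter keeps only the two version lines, drops the "Using " prefix, and rewrites each line as a name followed by a colon and the version.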

We can inspect the Dockerfile that is auto-generated by Viash as follows:

viash run src/config.vsh.yaml ---dockerfile

As a bonus, Viash simplifies debugging within the container environment with built-in debugging commands!

viash run src/config.vsh.yaml ---debug

Why Viash Container Management is a Game-Changer

Zero Docker Knowledge Required: Define dependencies without learning Docker syntax
Consistent Environments: The same container configuration works everywhere
Version Transparency: Container versions are explicitly defined in your config
Build Caching: Viash intelligently caches container builds to save time
Multiple Container Technologies: Works with Docker, Podman, or Singularity
Streamlined Container-Version Bookkeeping: Viash simplifies versioning of containers and their dependencies

By simplifying and automating container management, Viash lets you focus on your analysis rather than wrestle with container configuration details, while you keep full visibility and control when you need it.

What’s Next?

In the next post, we’ll explore how to combine Viash components into powerful workflows that can handle complex bioinformatics pipelines like RNA-seq analysis.
Ready to learn more about testing and advanced features? Check out the Viash documentation.
