
Batteries Included: Supercharging Bioinformatics Modules with Viash

Keywords

workflow automation, modular workflows, reproducible workflows, workflow development tools, batch processing, containerized workflows, automated testing workflows, pipeline development, scalable data workflows, workflow orchestration

Part 2: Batteries included: Supercharging Bioinformatics Modules with Viash

TL;DR: Viash comes with powerful built-in features that would normally require significant additional coding: parallel batch processing for speed, container management for reproducibility, and integrated testing for reliability. These “batteries included” features save you from writing hundreds of lines of boilerplate code.

In our previous post, we introduced how Viash simplifies bioinformatics tool management by transforming scripts into self-contained components. Now, let’s explore three powerful built-in capabilities that make Viash components truly production-ready.

Reliability: Integrated Testing

The Testing Challenge in Bioinformatics

Testing bioinformatics tools traditionally requires:

These tasks are often skipped due to time constraints, leading to unreliable tools and hard-to-track bugs. Viash solves this by making testing a first-class citizen in the component lifecycle.

Built-in Testing with Viash

Let’s get back to our SAMtools example from the previous post in this series. To add unit tests, we can simply add a test script alongside our script and Viash config, then update the config to include testing.

First, let’s create a test script (test.sh). Test scripts can be written in the language of your choice, including Python, R, Bash and JavaScript. A test doesn’t even need to use the same language as your main script, as long as all the required dependencies are available: your main script could be written in R or Python and the unit test in Bash.

#!/bin/bash

echo ">>> Testing $meta_functionality_name"
"$meta_executable" \
 --input "$meta_resources_dir/test.paired_end.sorted.bam" \
 --output "$meta_resources_dir/test.paired_end.sorted.txt"

echo ">>> Checking whether output is non-empty"
[ ! -s "$meta_resources_dir/test.paired_end.sorted.txt" ] && echo "File 'test.paired_end.sorted.txt' is empty!" && exit 1

echo ">>> Checking whether output is correct"

diff <(grep -v "^# The command" "$meta_resources_dir/test.paired_end.sorted.txt") \
   <(grep -v "^# The command" "$meta_resources_dir/ref.paired_end.sorted.txt") || \
   { echo "Output file test.paired_end.sorted.txt does not match ref.paired_end.sorted.txt"; exit 1; }

rm "$meta_resources_dir/test.paired_end.sorted.txt"

echo ">>> All tests passed successfully."

exit 0

This test script makes handy use of meta variables, made available by Viash in the runtime environment.
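The comparison step in the test script can be pulled into a small helper if several outputs need checking. The following is a minimal sketch, not part of Viash: the function name stats_match and the stand-in file contents are assumptions made for illustration. It compares two reports while ignoring the header line that records the command, using plain string comparison so it also runs in shells without process substitution.

```shell
#!/bin/bash

# Hypothetical helper (not provided by Viash): succeed only if two reports
# are identical once the "# The command" header line is dropped.
# Command substitution strips trailing newlines, so those are ignored too.
stats_match() {
  [ "$(grep -v '^# The command' "$1")" = "$(grep -v '^# The command' "$2")" ]
}

# Self-contained demo with stand-in files instead of real samtools output.
tmp_dir="$(mktemp -d)"
printf '# The command line was: stats a.bam\nSN\treads:\t10\n' > "$tmp_dir/test.txt"
printf '# The command line was: stats b.bam\nSN\treads:\t10\n' > "$tmp_dir/ref.txt"

if stats_match "$tmp_dir/test.txt" "$tmp_dir/ref.txt"; then
  echo "reports match"
else
  # In a real test script this is where you would exit 1.
  echo "reports differ"
fi
rm -r "$tmp_dir"
```

Inside a Viash test, the two arguments would typically be paths under "$meta_resources_dir", as in the script above.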
Next, we update the Viash config (config.vsh.yaml) with the following test resources:

name: samtools_stats

arguments:


test_resources:
  - type: bash_script
    path: test.sh
  - type: file
    path: test.paired_end.sorted.bam
  - type: file
    path: ref.paired_end.sorted.txt

engines:

Note that the test_resources section of your Viash config can list multiple test scripts and test data files; every test script will be run.

Testing the component is now as simple as executing a single CLI command.

viash ns test -q samtools_stats

Why Viash Testing is a Game-Changer

This built-in testing approach provides several key advantages:


Parallel Processing: Built-In Batch Mode

The Parallel Processing Challenge in Bioinformatics

One of the most common requirements in bioinformatics is processing multiple samples efficiently. Doing so also means taking care of resource management, logging, monitoring and more, and the typical bioinformatics answer is: use more tools, write more scripts.

With Viash, batch processing comes built-in. Let’s see how this works with our SAMtools example.

The Viash Way: Powerful Parameter Lists

First, we create a param_list file (param_list.yaml), where we define the different samples we want to process.

- id: sample_1
  input: test.paired_end.sorted_1.bam
  output: test.paired_end.sorted_1.txt
- id: sample_2
  input: test.paired_end.sorted_2.bam
  output: test.paired_end.sorted_2.txt
- id: sample_3
  input: test.paired_end.sorted_3.bam
  output: test.paired_end.sorted_3.txt

As described in our previous blog post, Viash has turned our script into a standalone Nextflow module. This lets us use Nextflow’s dataflow channels, which can carry many events at once, to process samples in parallel.
The param_list.yaml file can be passed to the Nextflow module as a parameter on the CLI, so that the samples defined in the file are processed in parallel and asynchronously.

nextflow run target/nextflow/samtools_stats/main.nf \
  --param_list param_list.yaml \
  -profile docker \
  --publish_dir test
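Writing param_list.yaml by hand becomes tedious once there are more than a handful of samples. The sketch below generates the entries from a directory of BAM files. It is an illustration under stated assumptions rather than a Viash feature: the function name make_param_list is hypothetical, the sample id is taken from the file name, and the output is named after the id with a .txt extension. File names are written unquoted, so names containing YAML special characters would need extra care.

```shell
#!/bin/bash

# Hypothetical helper (not part of Viash): print one param_list entry
# (id, input, output) per BAM file found in the given directory.
make_param_list() {
  bam_dir="$1"
  for bam in "$bam_dir"/*.bam; do
    # Skip the unexpanded pattern when the directory holds no BAM files.
    [ -e "$bam" ] || continue
    # Assumption: the sample id is the file name without its .bam extension.
    sample_id="$(basename "$bam" .bam)"
    printf '%s\n' "- id: $sample_id"
    printf '%s\n' "  input: $bam"
    printf '%s\n' "  output: $sample_id.txt"
  done
}

# Usage: make_param_list path/to/bams > param_list.yaml
```

The resulting file has the same id, input and output keys as the hand-written example and can be passed to the module with --param_list in the same way.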

Why Viash Batch Processing is a Game-Changer

For a deeper dive into the capabilities of the param_list functionality, you can check out the documentation.


Reproducibility: Simplified Container Management

The Reproducibility Problem

Bioinformaticians frequently encounter the frustrating “works on my machine” problem: scripts run perfectly on your system but fail on a colleague’s computer or after being moved to an HPC or cloud environment.
Container technologies like Docker solve this by packaging software with its dependencies, but introduce their own complexity:

Automated Container Management with Viash

When building and running a Viash component, various Docker procedures are handled under the hood:

Viash takes container management out of your hands while giving you full control over the container specification. For example, you can add the following custom Docker setup to your Viash config (config.vsh.yaml).

engines:
  - type: docker
    image: quay.io/biocontainers/samtools:1.19.2--h50ea8bc_1
    setup:
      - type: docker
        run: |
          samtools --version 2>&1 | grep -E '^(samtools|Using htslib)' | \
          sed 's#Using ##;s# \([0-9\.]*\)$#: \1#' > /var/software_versions.txt
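To see what the setup command in this config writes to /var/software_versions.txt, the same grep and sed filter can be tried outside the container. The sketch below wraps it in a function, hypothetically named format_versions, and feeds it hand-typed sample text in the shape of samtools --version output; the sample lines are illustrative and were not captured from the container.

```shell
#!/bin/bash

# Same filter as in the Docker setup step, reading version text on stdin:
# keep the samtools and htslib lines, drop the "Using " prefix, and turn
# "name version" into "name: version".
format_versions() {
  grep -E '^(samtools|Using htslib)' | \
    sed 's#Using ##;s# \([0-9\.]*\)$#: \1#'
}

# Illustrative input, typed by hand for this example.
printf '%s\n' \
  'samtools 1.19.2' \
  'Using htslib 1.19.1' \
  'Copyright (C) 2024 Genome Research Ltd.' | format_versions
```

For this sample input the filter prints "samtools: 1.19.2" and "htslib: 1.19.1" on separate lines, a simple key-value record of the tool versions baked into the image.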

We can inspect the Dockerfile that is auto-generated by Viash as follows:

viash run src/config.vsh.yaml ---dockerfile

As a bonus, Viash simplifies debugging within the container environment with built-in debugging commands!

viash run src/config.vsh.yaml ---debug

Why Viash Container Management is a Game-Changer

  1. Zero Docker Knowledge Required: Define dependencies without learning Docker syntax
  2. Consistent Environments: The same container configuration works everywhere
  3. Version Transparency: Container versions are explicitly defined in your config
  4. Build Caching: Viash intelligently caches container builds to save time
  5. Multiple Container Technologies: Works with Docker, Podman, or Singularity
  6. Streamlined Container-Version Bookkeeping: Viash simplifies container and dependency versioning

By simplifying and automating container management, Viash lets you focus on your analysis rather than wrestle with container configuration details. All while maintaining full visibility and control when you need it.


What’s Next?

In the next post, we’ll explore how to combine Viash components into powerful workflows that can handle complex bioinformatics pipelines like RNA-seq analysis.
Ready to learn more about testing and advanced features? Check out the Viash documentation.
