Batteries Included: Supercharging Bioinformatics Modules with Viash
Part 2: Batteries included: Supercharging Bioinformatics Modules with Viash
TL;DR: Viash comes with powerful built-in features that would normally require significant additional coding: parallel batch processing for speed, container management for reproducibility, and integrated testing for reliability. These “batteries included” features save you from writing hundreds of lines of boilerplate code.
In our previous post, we introduced how Viash simplifies bioinformatics tool management by transforming scripts into self-contained components. Now, let’s explore three powerful built-in capabilities that make Viash components truly production-ready.
Reliability: Integrated Testing
The Testing Challenge in Bioinformatics
Testing bioinformatics tools traditionally requires:
- Writing custom test scripts
- Managing test data
- Setting up test environments
- Tracking expected outputs
These tasks are often skipped due to time constraints, leading to unreliable tools and hard-to-track bugs. Viash solves this by making testing a first-class citizen in the component lifecycle.
Built-in Testing with Viash
Let’s get back to our SAMtools example from the previous post in this series. To add unit tests, we can simply add a test script alongside our script and Viash config, then update the config to include testing.
First, let’s create a test script (test.sh). Test scripts can be written in your language of choice, including Python, R, Bash and JavaScript. A test doesn’t even need to be written in the same language as your main script, as long as all the required dependencies are available. Your main script could be written in R or Python, with the unit test in Bash.
#!/bin/bash
echo ">>> Testing $meta_functionality_name"
"$meta_executable" \
  --input "$meta_resources_dir/test.paired_end.sorted.bam" \
  --output "$meta_resources_dir/test.paired_end.sorted.txt"
echo ">>> Checking whether output is non-empty"
[ ! -s "$meta_resources_dir/test.paired_end.sorted.txt" ] && echo "File 'test.paired_end.sorted.txt' is empty!" && exit 1
echo ">>> Checking whether output is correct"
# Ignore the header line that records the command line, which differs between runs
diff <(grep -v "^# The command" "$meta_resources_dir/test.paired_end.sorted.txt") \
  <(grep -v "^# The command" "$meta_resources_dir/ref.paired_end.sorted.txt") || \
  { echo "Output file test.paired_end.sorted.txt does not match expected output"; exit 1; }
rm "$meta_resources_dir/test.paired_end.sorted.txt"
echo ">>> All tests passed successfully."
exit 0
This test script makes handy use of meta variables, made available by Viash in the runtime environment.
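The two checks in the script can be factored into small helpers for reuse across components. What follows is a minimal sketch, not part of Viash itself: the function names assert_non_empty and assert_same_stats are hypothetical. The filter on lines starting with "# The command" is taken from the script above. It is there because samtools stats records its own command line in the output header, so that line differs between the test run and the reference run.

```shell
#!/bin/bash

# Hypothetical helper: fail unless the file exists and is non-empty.
assert_non_empty() {
  if [ ! -s "$1" ]; then
    echo "File '$1' is missing or empty!" >&2
    return 1
  fi
}

# Hypothetical helper: compare two stats files while ignoring the header
# line that records the command line used to produce each file.
assert_same_stats() {
  actual=$(grep -v "^# The command" "$1")
  expected=$(grep -v "^# The command" "$2")
  if [ "$actual" != "$expected" ]; then
    echo "Output file '$1' does not match expected output '$2'" >&2
    return 1
  fi
}
```

In a test script, each helper would be followed by "|| exit 1" so that a failed check stops the test with a non-zero exit code.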
Next, we update the Viash config (config.vsh.yaml) with the following test resources:
name: samtools_stats
arguments:
  …
test_resources:
  - type: bash_script
    path: test.sh
  - type: file
    path: test.paired_end.sorted.bam
  - type: file
    path: ref.paired_end.sorted.txt
engines:
  …
Note that multiple unit tests and test data files can be defined in the test_resources section of your Viash config; all of them will be evaluated.
Testing the component is now as simple as executing a single CLI command.
viash ns test -q samtools_stats
Why Viash Testing is a Game-Changer
This built-in testing approach provides several key advantages:
- Containerized Testing Environment: Tests run in the exact same environment as your production code, eliminating “works on my machine” problems
- Consistent Resources: Test data and scripts are version-controlled alongside your main script
- CI/CD Integration: Tests can be easily integrated into CI/CD pipelines, facilitating long-term project maintainability
Parallel Processing: Built-In Batch Mode
The Parallel Processing Challenge in Bioinformatics
One of the most common requirements in bioinformatics is processing multiple samples efficiently. Doing so brings basic needs such as resource management, logging and monitoring, and the typical bioinformatics answer is to use more tools and write more scripts.
With Viash, batch processing comes built-in. Let’s see how this works with our SAMtools example.
The Viash Way: Powerful Parameter Lists
First, we create a param_list file (param_list.yaml), where we define the different samples we want to process.
- id: sample_1
  input: test.paired_end.sorted_1.bam
  output: test.paired_end.sorted_1.txt
- id: sample_2
  input: test.paired_end.sorted_2.bam
  output: test.paired_end.sorted_2.txt
- id: sample_3
  input: test.paired_end.sorted_3.bam
  output: test.paired_end.sorted_3.txt
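Writing such a file by hand becomes tedious with many samples. As a rough sketch, the entries could be generated from a directory of BAM files. The function name make_param_list is hypothetical, and so are the conventions it applies: the id is the file name without its .bam extension, and the output is that id with a .txt suffix.

```shell
#!/bin/bash

# Hypothetical helper: print one param_list entry per BAM file in a directory.
# The id and output naming conventions are assumptions made for this sketch.
make_param_list() {
  dir=$1
  for bam in "$dir"/*.bam; do
    [ -e "$bam" ] || continue   # skip the literal pattern when nothing matches
    name=$(basename "$bam" .bam)
    printf '%s\n' "- id: $name" "  input: $bam" "  output: $name.txt"
  done
}
```

Calling make_param_list with a directory and redirecting its output to param_list.yaml would produce a file in the same shape as the one above, with entries in the shell's sorted glob order.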
Viash has turned our script into a standalone Nextflow module, as described in our previous blog post. This lets us use Nextflow’s multi-event dataflow channels to process the samples in parallel.
The param_list.yaml file can be passed as a parameter to the Nextflow module via the CLI, for parallel, asynchronous processing of the samples defined in the file.
nextflow run target/nextflow/samtools_stats/main.nf \
  --param_list param_list.yaml \
  -profile docker \
  --publish_dir test
Why Viash Batch Processing is a Game-Changer
- Efficient Parallel Processing: Built-in asynchronous execution automatically distributes multiple samples across available computing resources without requiring custom parallelization code
- Simple Parameter Files: Process multiple datasets simultaneously using straightforward parameter lists without needing Nextflow expertise
- Flexible Parameter Management: Supports passing event-specific parameters, allowing unique configurations for each sample while maintaining workflow integrity
For a deeper dive into the capabilities of the param_list functionality, you can check out the documentation.
Reproducibility: Simplified Container Management
The Reproducibility Problem
Bioinformaticians frequently encounter the frustrating “works on my machine” problem - scripts run perfectly on your system but fail on a colleague’s computer or when moved to HPC/cloud environments.
Container technologies like Docker solve this by packaging software with its dependencies, but introduce their own complexity:
- Tracking container versions for reproducibility becomes a burden
- Writing Dockerfiles requires specialized knowledge
- Managing build processes is time-consuming
- Configuring proper volume mounts and permissions is error-prone
Automated Container Management with Viash
When building and running a Viash component, various Docker procedures are handled under the hood:
- Generation of the appropriate Dockerfile
- Building of the runtime container with optional caching for efficiency
- Set-up of proper volume mounts and working directories
- Automatic management of container lifecycle and cleanup
Viash takes container management out of your hands while giving you full control over the container specification. For example, you can add the following custom Docker setup to your Viash config (config.vsh.yaml).
engines:
  - type: docker
    image: quay.io/biocontainers/samtools:1.19.2--h50ea8bc_1
    setup:
      - type: docker
        run: |
          samtools --version 2>&1 | grep -E '^(samtools|Using htslib)' | \
            sed 's#Using ##;s# \([0-9\.]*\)$#: \1#' > /var/software_versions.txt
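To see what the run step records, the same grep and sed pipeline can be tried outside the container. The sketch below wraps it in a hypothetical function, format_versions, and feeds it hand-written sample text instead of real samtools output. The sample lines are illustrative only; real output has more lines and depends on the installed version.

```shell
#!/bin/bash

# Hypothetical wrapper around the grep/sed pipeline from the config,
# reading `samtools --version`-style text from standard input.
format_versions() {
  grep -E '^(samtools|Using htslib)' | \
    sed 's#Using ##;s# \([0-9\.]*\)$#: \1#'
}

# Illustrative input, not captured from a real samtools run.
printf '%s\n' \
  'samtools 1.19.2' \
  'Using htslib 1.19.1' \
  'Copyright (C) 2024 Genome Research Ltd.' \
  | format_versions
```

For this sample input the pipeline prints "samtools: 1.19.2" and "htslib: 1.19.1" on separate lines: grep keeps the two version lines, and sed drops the "Using " prefix and turns the space before each version number into a colon and a space.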
We can inspect the Dockerfile that is auto-generated by Viash as follows:
viash run src/config.vsh.yaml ---dockerfile
As a bonus, Viash simplifies debugging within the container environment with built-in debugging commands!
viash run src/config.vsh.yaml ---debug
Why Viash Containerization Management is a Game-Changer
- Zero Docker Knowledge Required: Define dependencies without learning Docker syntax
- Consistent Environments: The same container configuration works everywhere
- Version Transparency: Container versions are explicitly defined in your config
- Build Caching: Viash intelligently caches container builds to save time
- Multiple Container Technologies: Works with Docker, Podman, or Singularity
- Streamlined Container-Version Bookkeeping: Viash simplifies versioning of containers and dependencies
By simplifying and automating container management, Viash lets you focus on your analysis rather than wrestling with container configuration details, while you keep full visibility and control when you need it.
What’s Next?
In the next post, we’ll explore how to combine Viash components into powerful workflows that can handle complex bioinformatics pipelines like RNA-seq analysis.
Ready to learn more about testing and advanced features? Check out the Viash documentation.