How Docker Can Provide Custom, Efficient Solutions for Bioinformaticians
In today’s age of data-driven bioinformatics, having reliable tools and systems to create workflows and pipelines that process and analyse data has never been more important. A typical pipeline comprises several tasks, such as alignment, quality control and variant calling, each requiring one or more tools. Each tool, however, comes with a plethora of dependencies, setup steps and system requirements, frequently unique to that tool. Ensuring they all run smoothly and efficiently together can be a painstaking endeavour, and porting such pipelines to a different machine can mean readapting everything to the new system.
One of the most popular ways to tackle this issue is Docker, a containerization platform. Docker containers behave somewhat like lightweight virtual machines: they package a tool and all of its dependencies into an isolated environment that runs consistently wherever Docker is available (unlike true virtual machines, containers share the host kernel, which keeps them small and fast to start). For instance, legacy tools such as TopHat, and the helper scripts shipped with some older suites, still depend on Python 2.x, which has long been deprecated. With a Docker container that has Python 2.x installed, these tools can run on any machine without touching the host’s Python installation.
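To make that concrete, here is a minimal sketch of what such a container definition might look like. The base image tag, the package list and the tool-installation step are illustrative assumptions, not a recipe for any specific tool:

```dockerfile
# Sketch: pin a Python 2.7 base image so a legacy tool keeps working.
FROM python:2.7-slim

# System libraries often needed to build older bioinformatics tools
# (this package list is an example, not tool-specific).
RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential wget zlib1g-dev \
    && rm -rf /var/lib/apt/lists/*

# Hypothetical step: download and install the legacy tool here.
# Replace with the actual build/install commands for your tool.
WORKDIR /opt/tools
# RUN wget https://example.org/legacy-tool.tar.gz && tar -xzf legacy-tool.tar.gz

# Inside the container, Python 2.7 is always available,
# regardless of what the host system has installed.
CMD ["python", "--version"]
```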
Docker containers are also easily ported: all the tools an application needs, such as an entire pipeline, are bundled together and can be shipped as required. The resulting image runs the same way on any system, ensuring consistency for every user.
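In practice, shipping a pipeline image takes just a few standard Docker commands (the image name below is illustrative):

```bash
# Build the pipeline image once, on any machine with Docker.
docker build -t mylab/variant-pipeline:1.0 .

# Option A: share it through a container registry.
docker push mylab/variant-pipeline:1.0

# Option B: ship it as a file, e.g. to an air-gapped HPC node.
docker save mylab/variant-pipeline:1.0 | gzip > variant-pipeline.tar.gz
# ...then, on the target machine:
docker load < variant-pipeline.tar.gz
```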
One of the most significant advantages of Docker is efficient resource utilization. Containers typically use less memory and storage than full virtual machines. They also make it easy to run several tools concurrently, so when the tools are independent, the combined wall-clock time approaches that of the longest-running tool rather than the sum of all individual runtimes.
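As a simple illustration, Docker’s standard resource flags can cap each tool’s CPU and memory so concurrent containers do not contend for resources (the image name and the tool commands below are hypothetical placeholders):

```bash
# Run two independent tools concurrently, each with its own resource cap.
docker run --cpus=4 --memory=8g mylab/variant-pipeline:1.0 run-aligner &
docker run --cpus=2 --memory=4g mylab/variant-pipeline:1.0 run-qc &

# Wait for both; total wall-clock time is roughly that of the slower tool.
wait
```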
We were recently approached by a client to build a variant-analysis pipeline for unaligned long reads from a Pacific Biosciences sequencer. Each tool required separate setup, configuration and resource allocation, leading to inconsistent environments, inefficient execution and fragmented outputs. We opted to create a unified Docker-based solution that streamlines tool execution, providing consistent environments, parallel processing, simplified commands and consolidated outputs for efficient genomic analysis.
Through this Docker solution, we were able to:
- Manage different dependencies for each tool within a single Docker image without version conflicts
- Balance CPU and memory usage to optimize performance across tools without resource contention
- Ensure tools can run in parallel without interference while handling interdependencies when they exist
- Implement standardized logging and robust error handling
- Ensure the Docker image is flexible for additional tools and scalable for larger datasets in the future
- Minimize Docker overhead and improve runtime efficiency to prevent the container from slowing down the analysis pipeline
- Store parameters in config files for easy sharing and reproducibility
- Provide the option to run only a subset of tools, per the user’s needs (a sketch of how these pieces fit together follows this list)
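To give a flavour of how several of these points combine, here is a heavily simplified sketch of a container entrypoint that loads parameters from a config file, runs a user-selected subset of tools in parallel and writes per-tool logs. All names here (pipeline.conf, the run_* commands, the /output layout) are illustrative assumptions, not our actual implementation:

```bash
#!/usr/bin/env bash
# Sketch of a config-driven container entrypoint; names and paths are hypothetical.
set -euo pipefail

CONFIG="${1:-/config/pipeline.conf}"   # simple KEY=VALUE parameter file
source "$CONFIG"                        # e.g. defines TOOLS="aligner qc"

mkdir -p /output/logs
pids=()
for tool in ${TOOLS:-aligner qc variant_caller}; do
    # Launch each requested tool in the background with its own log file.
    "run_${tool}" > "/output/logs/${tool}.log" 2>&1 &
    pids+=("$!")
done

# Wait for every tool; 'set -e' plus 'wait <pid>' fails fast if any tool errors.
for pid in "${pids[@]}"; do
    wait "$pid"
done
echo "All requested tools completed; logs are in /output/logs/"
```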
By running these tools together inside this Docker container rather than invoking each one individually, we reduced the total runtime from 204 minutes to a combined 120 minutes.
Dockerized solutions can save bioinformaticians time, resources and future headaches, enabling efficient, reproducible research.
Contact us to learn more about how we can deliver a custom Docker solution for you!