Scientific workflows are an approach to implement automated, scalable, portable, and reproducible data analyses and in-silico experiments with low development costs.
What sets this approach apart from other distributed computing paradigms is its focus on the composition of programs. As a bioinformatics example, the output of a program that aligns reads to a reference genome is often processed by another program that analyzes genomic variants. Each of the programs is assumed to be readily available and is treated as a black box.
This contrasts with distributed computing approaches that are centered around a specific data model, like key-value pairs in MapReduce, or around specific communication primitives between parallel processes, like the Message Passing Interface. These low-level abstractions for distributed computing can be used to develop distributed versions of individual programs in a workflow, e.g., a distributed variant caller in Spark. This allows for high performance but also induces high development costs. Scientific workflows, on the other hand, achieve scalability by distributing the executions of the composed programs. In practice, these programs tend to be non-distributed, i.e., executable on a single compute node, but various workflow systems also allow for the composition of distributed programs or even workflows within a workflow.
The executions can have dependencies, e.g., the variant caller consumes the output of the read aligner and cannot be started before the aligner has finished. Each execution of a program is referred to as a task. Tasks are assumed to be isolated, i.e., they do not interact with other tasks but only with their inputs and outputs as defined through dependencies. The dependencies implicitly define which tasks can be executed in parallel and which tasks need to be executed in sequence. Since this information can be inferred automatically from the workflow description, distribution can be achieved in an automated fashion, which reduces the development costs to composing the workflow and harmonizing the input and output formats of its programs.
The use of scientific workflows over low-level abstractions is appropriate where re-implementation is prohibitively expensive. This is often the case because of the complexity of high-level analyses, which can be composed of dozens of complex programs. Due to the routinely unique nature of scientific analyses, these implementations can seldom be re-used in other projects, which makes it inherently difficult to amortize the high development costs of high-performance implementations.
Achieving automation, scalability, portability, and reproducibility of a workflow induces development costs. Automation, in the sense of reducing human interactions with the workflow to a single interaction that starts an application, is a cornerstone for achieving the other properties. In conjunction with distribution, automation is fundamental for scalability, i.e., varying the amount of work and the resources used. Without reducing human interactions, a workflow’s scale is clearly limited by human resources, and without distribution, a workflow’s scale is limited by the capabilities of current compute servers.
Automation is also a basic prerequisite for reproducibility, because it requires an explicit description of all necessary steps for producing the result. Eliminating human interactions is essential to reduce the uncertainty about how a computational result was produced. By also automating the deployment of the workflow, i.e., the installation of required tools and their operating system context, this uncertainty can be eliminated almost completely.
Although computer scientists tend to take automation efforts for granted, this is not necessarily the case in other domains: some teams may lack the expertise or time to afford a short-term investment in automation for long-term benefits. Overall, automation may increase productivity by binding fewer resources for checking and sustaining the progress of a workflow. On the downside, it requires a one-time investment to formalize the decisions and actions in the workflow and possibly causes long-term costs related to the maintenance of the resulting code.
A second cornerstone of scientific workflows is abstraction, i.e., describing the workflow in an intermediate, more concise representation. This is essential for achieving portability, i.e., the ability to run the workflow in different execution environments without having to rewrite it. Portability is related to scalability, because it allows moving to more powerful compute environments, and it is essential for a broader sense of reproducibility, because reproducing results then does not require the original development environment.
Shielding the workflow definition from technical details allows for faster workflow development and workflow re-use across compute environments. However, abstraction also limits expressiveness, e.g., some workflow languages do not allow for dynamic branching or iteration. Abstraction also naturally limits optimization opportunities by hiding implementation details and thus limits performance. The relationship between abstraction and automation is asymmetric: workflow languages allow for concise representations of workflows that serve the purpose of being translated to executable representations, but automation is also possible without abstraction by scripting the workflow from scratch. However, a one-time investment to adopt a workflow management system quickly pays off if scalability, portability, and reproducibility are required.
A workflow language provides abstractions for frequent patterns in scientific workflows and a workflow management system provides the software to execute a workflow defined using these abstractions. The basic idea is to trade a small development investment, i.e., describing a workflow in terms of the abstractions of a language, for a large benefit provided by using ready-made, tested, and optimized implementations of these abstractions.
Arguably the central abstraction in every workflow language is the dependency. A dependency between tasks indicates that these tasks have to be executed in sequence, without specifying a concrete mechanism to ensure this execution order. The absence of dependencies indicates that tasks can be executed in parallel. Various approaches to define tasks and their dependencies exist. The most basic ones require the explicit construction of a directed acyclic graph where tasks correspond to vertices and dependencies are expressed through directed arcs between vertices.

Bottom-up approaches inspired by build software like make accept a set of rules that state how to produce output files from input files. When filenames are allowed to contain wildcards, a set of rules can be used to automatically derive directed acyclic graphs for varying inputs, rather than having to write a program that constructs the appropriate graph. Based on which output files are desired, only the necessary portion of the graph can be constructed and executed. However, the depth of the graph is limited by the number of rules.

A more expressive approach is based on the concept of processes connected by channels. Channels are communication queues that pass data between processes and can be composed via operators into other channels. A process polls the next value of each of its input channels and creates a task with the results. This allows for tasks that are created dynamically at runtime, based on the output of other tasks, and thus allows for powerful constructs like feedback loops. An equally expressive approach is based on functional programming, where tasks correspond to function evaluations and dependencies are expressed through function composition. In this setting, a dependency takes the form of a nested expression where the inner expression has to be evaluated before the outer expression.
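The rule-based, make-inspired derivation of a task graph can be sketched as follows; the rule syntax with `{sample}` wildcards and all function names are illustrative and not taken from any particular workflow system.

```python
# Rules map an output-filename pattern to input-filename patterns;
# "{sample}" acts as a wildcard (illustrative syntax, one wildcard per rule).
RULES = [
    ("{sample}.bam", ["{sample}.fastq", "reference.fa"]),  # align reads
    ("{sample}.vcf", ["{sample}.bam", "reference.fa"]),    # call variants
]

def match(pattern, filename):
    """Return the wildcard binding if the filename matches the pattern,
    otherwise None."""
    if "{" not in pattern:
        return {} if pattern == filename else None
    prefix, rest = pattern.split("{", 1)
    name, suffix = rest.split("}", 1)
    if (filename.startswith(prefix) and filename.endswith(suffix)
            and len(filename) > len(prefix) + len(suffix)):
        return {name: filename[len(prefix):len(filename) - len(suffix)]}
    return None

def build_dag(targets, rules, dag=None):
    """Derive the portion of the task graph needed to produce the targets,
    as a mapping from each file to the files it is produced from."""
    dag = {} if dag is None else dag
    for target in targets:
        if target in dag:
            continue
        for out_pattern, in_patterns in rules:
            binding = match(out_pattern, target)
            if binding is not None:
                inputs = [p.format(**binding) for p in in_patterns]
                dag[target] = inputs
                build_dag(inputs, rules, dag)  # recurse into the inputs
                break
        else:
            dag[target] = []  # no rule matches: assumed to be a raw input file
    return dag
```

Calling `build_dag(["s1.vcf", "s2.vcf"], RULES)` expands only the portion of the graph needed for the two requested variant files, with one graph level per rule, which illustrates why the depth of a rule-derived graph is bounded by the number of rules.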
A second central abstraction in scientific workflows concerns the physical compute resources used to execute a task. Tasks in the workflow are defined against a task environment that is provided by all compute nodes and makes the underlying physical resources exchangeable. A typical task environment consists of a command-line interface to the operating system, a set of executables, and a file system.
For instance, the workflow could expect to run in a POSIX-compliant environment, which provides standard means for working with files and executing programs.
Current workflow languages abstract from the way the task environment is accessed and allow workflow tasks to be scripted directly in the desired environment.
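At the level of a single task, this environment abstraction can be sketched in a few lines; the helper below is a hypothetical simplification of what a workflow engine does after staging a task's input files into a working directory, assuming a POSIX-compliant /bin/sh.

```python
import os
import subprocess
import tempfile

def run_task(script, workdir, env_vars=None):
    """Execute a task's shell script inside its staged working directory.
    Assumes a POSIX environment; failures surface as exceptions so that
    the engine can retry the task or abort downstream tasks."""
    result = subprocess.run(
        ["/bin/sh", "-c", script],
        cwd=workdir,
        env={**os.environ, **(env_vars or {})},
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        raise RuntimeError("task failed: " + result.stderr)
    return result.stdout

# Hypothetical usage: stage an input file into a fresh working
# directory, then run a task that counts its lines into an output file.
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "reads.txt"), "w") as f:
    f.write("read1\nread2\n")
run_task("wc -l < reads.txt > count.txt", workdir)
```

The task script only sees its working directory and the standard POSIX utilities, which is exactly what makes the underlying compute node exchangeable.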
A downside of hiding technical details is that it can be difficult to assess the capabilities and limitations of a specific language. In addition, different workflow languages provide different abstractions, or different concepts for the same abstraction, as exemplified by the dependency abstraction above. This increases the risk of discovering the limitations of a language only after having invested development time into its adoption.
The problem is aggravated by the large variety of languages currently available, which also implies a risk of losing prior development efforts when being forced to move to another workflow language, e.g., because of a research grant.
Similar problems arise when being forced to adopt a different workflow management system, for instance because of a change in the software stack that operates a compute infrastructure. For the specification of task graphs, different standardization efforts have recently been undertaken, such as the Common Workflow Language or the Workflow Description Language. Despite the seeming simplicity of the workflow concept, a wide range of use cases and requirements exists in practice, which slows down the development and adoption of standards.
Workflow Management Systems
A workflow management system executes a workflow specification written in a workflow language. It handles the technical details that the specification abstracts from.
Algorithmically, enforcing dependencies is not a very difficult task. In the case where dependencies are expressed as a directed acyclic graph, a valid schedule can be computed in linear time using an algorithm that iteratively assigns tasks without incoming dependencies to compute resources, similar to a topological sort of the graph. In contrast, problems concerned with the construction of schedules that are not only valid but in some sense optimal are usually NP-complete. In practice, a workflow management system also cannot simply adhere to a given schedule, because task completion times cannot be planned exactly. Instead, task execution can be implemented using asynchronous callbacks that notify the workflow management system about task completion or failure and trigger the execution of downstream tasks or another attempt of the failed task.
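The linear-time construction of a valid schedule can be sketched with Kahn's algorithm for topological sorting; the dictionary representation of the graph and the task names are illustrative.

```python
from collections import deque

def valid_schedule(dag):
    """Compute a valid execution order for a task graph given as a mapping
    {task: [tasks it depends on]}; every task must appear as a key.
    Runs in time linear in the number of tasks and dependencies."""
    indegree = {task: len(deps) for task, deps in dag.items()}
    dependents = {task: [] for task in dag}
    for task, deps in dag.items():
        for dep in deps:
            dependents[dep].append(task)
    # Start with the tasks that have no incoming dependencies.
    ready = deque(task for task, n in indegree.items() if n == 0)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for succ in dependents[task]:
            indegree[succ] -= 1
            if indegree[succ] == 0:
                ready.append(succ)
    if len(order) != len(dag):
        raise ValueError("dependency cycle detected")
    return order

# A small bioinformatics-flavored example: alignment precedes variant
# calling and quality control; annotation follows variant calling.
schedule = valid_schedule({
    "align": [],
    "call": ["align"],
    "qc": ["align"],
    "annotate": ["call"],
})
```

A real system would not execute such an order verbatim; instead, the completion callbacks described above would decrement the in-degrees at runtime and submit tasks as they become ready.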
A lot of technical details are related to the implementation of the task environment abstraction. This comprises negotiating and allocating computational resources in scenarios where a distributed resource manager is involved but also moving input and output files (“data staging”), software deployment, and subsequent clean up operations. The implementation of these steps depends on the concrete resource manager, the operating system of the compute nodes, and the mechanism for handling data movements, e.g., a distributed file system in a cluster or an object storage provided by a cloud environment. For software deployment, current workflow management systems support the transparent use of different container engines.
Container engines like Docker, Singularity, and rkt have simplified the deployment of task environments by using Linux kernel features that allow for bundling file systems, executables, and operating system utilities in a single file, as well as for isolating the resources used by processes within these containers. This provides a light-weight alternative to virtual machines for the controlled deployment of consistent task environments.
The large number of workflow languages is partly a result of the large number of available workflow management systems. During the last two decades, hundreds of workflow management systems have been developed and made available to the general public, but currently only a handful support the Common Workflow Language.
Often, workflow management systems define their own language or application programming interface. The landscape of workflow management systems covers large projects like Pegasus, which have been developed over long periods of time, and many smaller projects developed mostly by individuals for specific use cases.
There is also an evident distinction between systems by and for academic users and systems from industrial applications. The latter have often been developed in-house and open-sourced later to benefit from external development efforts, such as Spotify’s Luigi or Airbnb’s Airflow. Systems from an academic background are usually documented by accompanying publications, while industrial systems rely on software documentation and more colloquial blog entries. Both sides also differ in their preferred technology stacks: academic systems prefer batch and grid scheduling solutions like HTCondor, while many industrial systems are centered around Apache software like the Hadoop ecosystem. The Hi-WAY workflow management system is a notable exception, stemming from an academic background but adopting Hadoop, a software solution for distributed computing that is widely adopted in commercial applications.