I’ve reached the end of my first year as a PhD student. This is a brief summary of my topic, progress, and future directions.
Increasing data collection rates in science necessitate the development of a convenient software stack to run data analyses pipelines on distributed hardware. Since many researchers rely on complex third-party software in their analysis pipelines, it can be cumbersome or even prohibitive to use an application programming interface for distributed computing like Spark to implement these analysis pipelines.
The scientific workflow paradigm has proven effective as a means to parallelize compute-intensive data analyses that rely on arbitrary software. Conceptually, a scientific workflow is a directed acyclic graph, where each vertex corresponds to the invocation of an arbitrary program (a task) that processes the output data of its parent tasks.
Because of the unlimited variety of programs that can be used as tasks and their complex composition into analysis pipelines, it is difficult to predict the resource requirements and duration of a workflow and its steps. This makes it challenging to schedule the work satisfactorily to achieve short workflow execution durations by high resource utilization. Predicting resource requirements, e.g., data transfer and task execution times, is difficult because of the complexity of the infrastructure and the software stack managing it. On the hardware side, multicore architectures and deep memory hierarchies can cause performance variation across heterogeneous compute nodes or performance interference effects caused by parallel processes competing for the resources on a compute node. On the software side, distributed file systems and resource managers often exhibit complex behavior that also depends on several configuration parameters.
I plan to use machine learning to automatically determine and model the resource requirements of tasks and workflows on the basis of runtime monitoring data. Since workflows are usually running for hours or days, I would like to build a system that observes the execution process and adapts by replanning or even by provisioning different resources in a cloud environment. Additional information can be drawn from prior runs of the workflow or the individual tasks it contains.
Using these models, runtime estimates can be provided for different resource configurations that support scientists in fixing dates and planning the scale or the cost of their experiments.
As a starting point, I use the SAASFEE software stack  for executing distributed scientific workflows. The first step is to instrument the execution engine to collect and gather runtime monitoring data. The second step is to implement a component that analyzes and extrapolates this data to future situations. Finally, this information has to be used by an optimizer to adapt the execution plan.
This approach is based on two central assumptions, which are yet to be experimentally confirmed.
- Estimation based planning offers large performance gains compared to greedy approaches to load balancing or scheduling.
- The content of a task’s input data has a minor impact on its behavior.
Regarding the first assumption, I believe that workflow structure is an important aspect that makes thorough planning worth it. For instance, the popular Heterogeneous Earliest Finish Time  scheduling heuristic prioritizes tasks that are part of the critical path and has been shown  to outperform other scheduling heuristics. Agullo et al.  have recently compared static and dynamic scheduling strategies in heterogeneous computing environments, i.e., computing on CPUs and GPUs, and found that static schedule quality compensates for a lack of flexibility at runtime.
The second assumption arises from the fact that programs and data are treated as a black box, as opposed for instance to database query optimization. While it is clear that processing larger files can be expected to take longer, I expect the contents of the file to have only a minor impact. When processing very many data items in a data parallel way, the data chunks might be expected to often have similar properties, or to be often subject to simple transformations that do not depend on the actual data.
I spent my first year on mapping out the research on statistical performance prediction. I’m almost done structuring the material and writing it down as a survey paper. The literature review shows a research gap in detailed models of program behavior, which is especially interesting since the tasks in a scientific workflow can be long running. First, most of the papers focus on predicting the execution duration, whereas disk and main memory requirements are seldomly considered. While predicting these quantities is essentially the same problem (from the black box point of view acquired in this thesis) their actual use in planning the workflow execution seems promising. Second, coarse aggregates of resource usage, such as execution duration for CPU are still state-of-the-art. Since tasks can be long running and program invocations costly, it seems reasonable to collect data at a more detailed level, e.g., on a time series basis. A project that aims at a similar direction is the PANORAMA project  at Ewa Deelman’s group.
POSSIBLE NEXT STEPS
- Synthetic workflows for evaluation of workflow schedulers
- Simulation experiments to examine the relationship between prediction accuracy and scheduling quality
- Regression models to predict task resource usage
- Workflow-tailored cloud resource provisioning
- Multicore interference-aware scheduling
- Active learning approaches to scheduling
- Time series for more detailed resource usage characterization of long-running tasks
- Exploiting intra-task parallelism as a degree of freedom for scheduling
- Container-technology in workflow systems
- Knowledge transfer between workflow runs, normalization of observations