My literature review has been accepted for publication in Information Systems. It covers methods to predict the resource usage of batch computing jobs and is available here.
This work lays the foundations of my PhD thesis, providing an overview of what has been solved and how, the research gaps, and the people and publication venues that form the communities my research relates to. Although it was a challenging project for a first publication, I learned a lot in the process. Being forced to construct my own view of the field was probably the most valuable thing.
A challenge was that this investment took years to pay off. I’m not sure whether I’d recommend writing a survey as the first step of a PhD thesis. Getting a workshop paper out early is more rewarding and, more importantly, lets you discuss your ideas with different people in the field. On the other hand, you need to review the state of the art for your thesis anyway, and deferring it to the end is not an option either. Since I had to do the work in any case, I thought I might as well publish it. In hindsight, it was of course not that easy. With my mind set on a survey paper, I needed to delve much deeper than would have sufficed to get me started.
In a nutshell, the paper covers the following:
- An introduction to some basic problems associated with scheduling and resource allocation in batch systems. For instance, large-scale compute jobs (a simulation or a data analysis) can be run on a variety of configurations: different numbers of compute nodes, at different compute sites, etc. Predicting execution durations, wait times, and the resources needed to run the job helps in selecting good configurations.
- Four principal performance factors, i.e., sources of performance variation extracted from the literature. Workload patterns describe the way a job uses resources, e.g., whether it’s compute-intensive or I/O-intensive, and possibly how this changes over time. Resource heterogeneity affects performance because job performance depends on the hardware it runs on. Scale describes performance variations as the amount of work, e.g., the amount of data to process, varies or, similarly, as the amount of resources varies. Contention describes performance variations due to collocation decisions, e.g., compute-intensive tasks competing for on-chip resources or data transfers competing for network bandwidth.
- A detailed taxonomy of predictive performance modeling. I classified approaches according to explanatory variables, predicted metrics, the principal performance factors taken into account, prediction method, and challenges and limitations.
- A review of 16 approaches at the task level, i.e., predicting performance for programs running on a single compute node, and 17 approaches at the job level, i.e., predicting performance for compute jobs that span multiple compute nodes.
- An overview of various performance monitoring techniques that can be used to collect training data for the prediction models.
- A discussion of common limitations and challenges in the approaches, and their implications for research needs.
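To make the idea of predictive performance modeling a bit more concrete, here is a minimal sketch that considers only the scale factor: it fits a straight line to past runtimes as a function of input size and uses it to estimate the runtime of a larger job. This is an illustration, not one of the approaches reviewed in the paper. The monitoring records, the linear relationship, and the names `fit_line` and `predict_runtime` are all made up for this example.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct x values")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def predict_runtime(model, input_gb):
    """Estimate the runtime in seconds for a given input size in GB."""
    intercept, slope = model
    return intercept + slope * input_gb


# Hypothetical monitoring records: (input size in GB, runtime in seconds).
history = [(1, 70), (2, 130), (4, 250), (8, 490)]

# Explanatory variable: input size. Predicted metric: runtime.
model = fit_line(*zip(*history))

print(predict_runtime(model, 16))
```

Real approaches have to deal with the other three factors as well, which is where a single explanatory variable and a linear model stop being sufficient.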
At the last conferences I attended, I found that interest in predictive resource allocation is still growing. I hope my survey will help other researchers get an overview of the field more quickly and make the work in it easier to access.