Characterization and Analysis of Dynamic Parallelism in Unstructured GPU Applications

Jin Wang and Sudhakar Yalamanchili. “Characterization and Analysis of Dynamic Parallelism in Unstructured GPU Applications.” 2014 IEEE International Symposium on Workload Characterization (IISWC). October 2014.

Abstract

GPUs have proven very effective for structured applications. However, emerging data-intensive applications are increasingly unstructured – irregular in their memory and control flow behavior over massive data sets. While the irregularity in these applications can result in poor workload balance among fine-grained threads or coarse-grained blocks, one can still observe dynamically formed pockets of structured data parallelism that can locally make effective use of the GPU's compute and memory bandwidth.

In this study, we seek to characterize such dynamically formed parallelism and evaluate implementations designed to exploit it using CUDA Dynamic Parallelism (CDP) – an execution model where parallel workloads are launched dynamically from within kernels when pockets of structured parallelism are detected. We characterize and evaluate such implementations by analyzing their impact on control and memory behavior, as measured on commodity hardware. In particular, the study targets a comprehensive understanding of the overhead of current CDP support in GPUs in terms of kernel launch, memory footprint and algorithm overhead. Experiments show that while the CDP implementations can potentially achieve a 1.13x-2.73x speedup over non-CDP implementations, the non-trivial overhead results in an average 1.21x slowdown in overall performance.

Download

paper

Citation

@inproceedings{wang_iiswc2014,
author={Jin Wang and Sudhakar Yalamanchili},
booktitle={2014 IEEE International Symposium on Workload Characterization},
title={Characterization and Analysis of Dynamic Parallelism in Unstructured GPU Applications},
year={2014},
month={October},
}