Dynamic Thread Block Launch: A Lightweight Execution Mechanism to Support Irregular Applications on GPUs
There has been considerable success in harnessing the superior compute throughput and memory bandwidth of GPUs to accelerate traditional scientific and engineering computations dominated by structured control and data flows across large data sets. These applications can be effectively mapped to the rigid 1D-3D massively parallel thread block structures underlying modern bulk synchronous parallel (BSP) programming languages for GPUs. However, emerging data-intensive applications in analytics, planning, retail forecasting, and similar domains are dominated by sophisticated algorithms characterized by irregular control, data, and memory access flows, which makes it challenging to harness GPU accelerators effectively.
Despite the above observation, there still exist segments of the computation within many irregular applications that locally exhibit structured control and memory access behaviors. These Dynamically Formed pockets of data Parallelism (DFP) occur in a data-dependent, nested, time-varying manner. The CUDA Dynamic Parallelism (CDP) model extends the base CUDA programming model with device-side nested kernel launch capabilities so that programmers can exploit this dynamic evolution of parallelism in applications. However, while these extensions address the productivity and algorithmic issues, harnessing modern high-performance hardware accelerators such as GPUs remains difficult in most cases because of the non-trivial overhead involved. To this end, we propose and construct a new benchmark suite, Dragon_li, to facilitate the investigation and study of irregular applications that feature fine-grained dynamic parallelism on modern GPU architectures.
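To make the idea of a dynamically formed pocket of parallelism concrete, the following is a minimal CPU-side sketch in Python, not GPU code and not taken from the papers. It assumes a graph stored in CSR form and a hypothetical degree cutoff (`threshold`): vertices whose neighbor lists are long enough are treated as pockets that a nested launch could process in parallel, while the remaining vertices stay in the flat one-thread-per-vertex loop. The function names and the step-counting cost model are illustrative assumptions only.

```python
# Toy CPU-side model (not GPU code) of pockets of parallelism in a CSR graph.
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def degrees(row_offsets):
    """Out-degree of each vertex, derived from CSR row offsets."""
    return [row_offsets[v + 1] - row_offsets[v] for v in range(len(row_offsets) - 1)]


def flat_warp_steps(degs, warp_size=WARP_SIZE):
    """Flat mapping: one thread per vertex walks its neighbor list serially.
    A warp runs in lockstep, so it needs as many steps as its largest degree."""
    return sum(max(degs[i:i + warp_size]) for i in range(0, len(degs), warp_size))


def nested_launch_plan(degs, threshold):
    """Nested mapping: vertices with degree >= threshold (hypothetical cutoff)
    hand their neighbor list to a child launch; the rest stay in the parent."""
    children = [(v, d) for v, d in enumerate(degs) if d >= threshold]
    parent_degs = [d if d < threshold else 0 for d in degs]
    return children, parent_degs


# 64 vertices: vertex 0 has 100 neighbors, vertex 40 has 50, all others have 2.
degs_in = [2] * 64
degs_in[0], degs_in[40] = 100, 50
row_offsets = [0]
for d in degs_in:
    row_offsets.append(row_offsets[-1] + d)

degs = degrees(row_offsets)
children, parent_degs = nested_launch_plan(degs, threshold=32)
print(flat_warp_steps(degs))         # 150: two warps stall on their one heavy vertex
print(children)                      # [(0, 100), (40, 50)]: pockets given to child launches
print(flat_warp_steps(parent_degs))  # 4: serial steps left in the parent
```

The model ignores the cost of the child launches themselves, which is exactly the overhead the text identifies as the obstacle for fine-grained pockets.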
To address the challenges of dynamic parallelism, we propose a refinement to the traditional BSP execution model: Dynamic Thread Block Launch (DTBL). This is a lightweight mechanism for dynamically spawning nested thread blocks (TBs) and aggregating them to existing native kernels that are already executing. The mechanism supports the nested launching of thread blocks, rather than kernels, to execute dynamically occurring parallel work elements. The finer granularity of a thread block provides effective and efficient control of smaller-scale, dynamically occurring pockets of structured parallelism during the computation. With a specification of the DTBL semantics, an implementation of the microarchitecture extensions, and the necessary runtime and compiler support, DTBL achieves an average 1.21x speedup over the original flat implementation and an average 1.40x speedup over the implementation with device-side kernel launches using CDP when running a set of irregular data-intensive CUDA applications on GPGPU-Sim.
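The difference between launching kernels and launching thread blocks can be illustrated with a small bookkeeping model. This is a hedged sketch in Python, not the DTBL hardware mechanism or its API: it assumes, for illustration, that a dynamically launched group of thread blocks can be aggregated to an existing kernel when the function and the thread block size match, and the names `cdp_kernel_count` and `dtbl_aggregate` are hypothetical.

```python
# Toy bookkeeping model (not GPU code): kernels created by nested launches.

def cdp_kernel_count(requests):
    """CDP-style model: every device-side launch request becomes its own child kernel."""
    return len(requests)


def dtbl_aggregate(native_kernels, requests):
    """DTBL-style model: add the requested thread blocks to an existing kernel with
    the same (function, block size), an assumed matching rule; otherwise start a
    new kernel entry for them."""
    kernels = dict(native_kernels)  # (function, block_size) -> thread block count
    for function, block_size, num_blocks in requests:
        key = (function, block_size)
        kernels[key] = kernels.get(key, 0) + num_blocks
    return kernels


# One native kernel is already running with 8 thread blocks of 128 threads.
native = {("expand", 128): 8}
# Four nested launch requests: (function, block size, number of thread blocks).
requests = [("expand", 128, 1), ("expand", 128, 2), ("expand", 64, 1), ("expand", 128, 1)]

aggregated = dtbl_aggregate(native, requests)
print(cdp_kernel_count(requests))     # 4 child kernels in the CDP-style model
print(aggregated)                     # {('expand', 128): 12, ('expand', 64): 1}
print(len(aggregated) - len(native))  # 1 additional kernel in the DTBL-style model
```

In this model, three of the four requests simply extend the running kernel, so only the request with a different thread block size needs a kernel of its own.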
- J. Wang, N. Rubin, A. Sidelnik and S. Yalamanchili. “Dynamic Thread Block Launch: A Lightweight Execution Mechanism to Support Irregular Applications on GPUs.” The 42nd International Symposium on Computer Architecture (ISCA). June 2015.
- J. Wang and S. Yalamanchili. “Characterization and Analysis of Dynamic Parallelism in Unstructured GPU Applications.” 2014 IEEE International Symposium on Workload Characterization (IISWC). October 2014. (Best paper nominee)
The papers are provided for personal use and are subject to copyright of the publishers.