GPU Application Development, Debugging, and Performance Tuning with GPU Ocelot
Andrew Kerr, Gregory Diamos, Sudhakar Yalamanchili. “GPU Application Development, Debugging, and Performance Tuning with GPU Ocelot.” GPU Computing GEMS Jade Edition, 1st Edition, Chapter 30. September 2011.
Abstract
To address these challenges, we have implemented GPU Ocelot [1], a dynamic compilation framework for NVIDIA’s CUDA programming language and API that links with unmodified CUDA applications, analyzes data-parallel GPU kernels, and launches them on available processors. GPU Ocelot consists of (1) an implementation of the CUDA Runtime API, (2) a complete internal representation of PTX kernels coupled to control- and data-flow analysis procedures, (3) a functional emulator for PTX, (4) a translator to multicore x86-based CPUs for efficient execution, and (5) a backend to NVIDIA GPUs via the CUDA Driver API. Ocelot supports an extensible trace generation framework in which application behavior such as control-flow uniformity, memory access patterns, and data sharing may be observed at instruction-level granularity.
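As a concrete illustration of point (1), the program below is an ordinary CUDA application that requires no source changes to run under Ocelot; relinking it against Ocelot's CUDA Runtime implementation (rather than NVIDIA's libcudart) is, in principle, sufficient to execute the kernel on the PTX emulator or the multicore CPU backend. The program is a generic sketch written for this page, not code from the chapter, and the exact build and link procedure depends on the Ocelot installation.

// saxpy.cu - an ordinary CUDA program, unmodified for Ocelot.
#include <cstdio>
#include <vector>

__global__ void saxpy(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 16;
    std::vector<float> hx(n, 1.0f), hy(n, 2.0f);

    // Standard CUDA Runtime API calls; under Ocelot these are serviced
    // by its own CUDA Runtime implementation rather than libcudart.
    float *dx = 0, *dy = 0;
    cudaMalloc((void**)&dx, n * sizeof(float));
    cudaMalloc((void**)&dy, n * sizeof(float));
    cudaMemcpy(dx, hx.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    saxpy<<<(n + 255) / 256, 256>>>(2.0f, dx, dy, n);

    cudaMemcpy(hy.data(), dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", hy[0]);  // expect 4.0

    cudaFree(dx);
    cudaFree(dy);
    return 0;
}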
This chapter will discuss some implementation details of GPU Ocelot, particularly its PTX emulator, and how GPU Ocelot may be used to prototype, debug, and tune CUDA applications for efficient execution on GPUs. This gem will explain how users may benefit from the rich application profiling and correctness tools built into Ocelot, as well as how to extend Ocelot’s trace generator interface to perform custom workload characterization and profiling. Additionally, we will discuss GPU Ocelot’s role as a dynamic compilation framework for heterogeneous many-core compute systems that leverage GPUs and multicore CPUs.
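The trace generator interface mentioned above is extended by subclassing Ocelot's trace::TraceGenerator and registering an instance with the runtime. The sketch below counts dynamic instructions and active threads per kernel launch; the header paths, member names (kernel.name, e.active), and the registration call ocelot::addTraceGenerator() reflect a typical Ocelot source layout and may differ across versions, so they should be checked against the chapter and the Ocelot headers in use.

// A hedged sketch of a custom trace generator for the PTX emulator.
#include <ocelot/trace/interface/TraceGenerator.h>
#include <ocelot/trace/interface/TraceEvent.h>
#include <ocelot/api/interface/ocelot.h>

#include <iostream>
#include <string>

// Counts dynamic instructions and active threads for each kernel launch.
class InstructionCounter : public trace::TraceGenerator {
public:
    // Called once per kernel launch, before the first instruction executes.
    void initialize(const executive::ExecutableKernel& kernel) {
        _kernelName = kernel.name;
        _instructions = 0;
        _activeThreads = 0;
    }

    // Called for every dynamic instruction the emulator executes.
    void event(const trace::TraceEvent& e) {
        ++_instructions;
        _activeThreads += e.active.count();  // threads active for this instruction
    }

    // Called after the kernel finishes.
    void finish() {
        std::cout << _kernelName << ": " << _instructions
                  << " dynamic instructions, " << _activeThreads
                  << " active-thread instructions\n";
    }

private:
    std::string _kernelName;
    long long _instructions;
    long long _activeThreads;
};

// Register the generator before any kernels launch, e.g. at the top of main():
//   static InstructionCounter counter;
//   ocelot::addTraceGenerator(counter);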
Download
GPU Application Development, Debugging, and Performance Tuning with GPU Ocelot [Link]
Citation
@incollection{kerr2011ocelot,
  author    = {Andrew Kerr and Gregory Diamos and Sudhakar Yalamanchili},
  title     = {GPU Application Development, Debugging, and Performance Tuning with GPU Ocelot},
  booktitle = {GPU Computing GEMS Jade Edition},
  edition   = {1st},
  editor    = {Wen-mei W. Hwu},
  chapter   = {30},
  publisher = {Morgan Kaufmann},
  month     = sep,
  year      = {2011}
}