GPU Application Development, Debugging, and Performance Tuning with GPU Ocelot
Andrew Kerr, Gregory Diamos, Sudhakar Yalamanchili. “GPU Application Development, Debugging, and Performance Tuning with GPU Ocelot.” GPU Computing GEMS Jade Edition, 1st Edition, Chapter 30. September 2011.
Abstract
To address these challenges, we have implemented GPU Ocelot [1], a dynamic compilation framework for NVIDIA’s CUDA programming language and API that links with unmodified CUDA applications, analyzes data-parallel GPU kernels, and launches them on available processors. GPU Ocelot consists of (1) an implementation of the CUDA Runtime API, (2) a complete internal representation of PTX kernels coupled to control- and data-flow analysis procedures, (3) a functional emulator for PTX, (4) a translator to multicore x86-based CPUs for efficient execution, and (5) a backend to NVIDIA GPUs via the CUDA Driver API. Ocelot supports an extensible trace generation framework in which application behavior such as control-flow uniformity, memory access patterns, and data sharing may be observed at instruction-level granularity.
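As a concrete illustration of point (1), the program below is an ordinary CUDA application that requires no source changes to run under Ocelot; relinking it against Ocelot's CUDA Runtime implementation (rather than NVIDIA's libcudart) is, in principle, sufficient to execute the kernel on the PTX emulator or the multicore CPU backend. The program is a generic sketch written for this page, not code from the chapter, and the exact build and link procedure depends on the Ocelot installation.

// saxpy.cu - an ordinary CUDA program, unmodified for Ocelot.
#include <cstdio>
#include <vector>

__global__ void saxpy(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 16;
    std::vector<float> hx(n, 1.0f), hy(n, 2.0f);

    // Standard CUDA Runtime API calls; under Ocelot these are serviced
    // by its own CUDA Runtime implementation rather than libcudart.
    float *dx = 0, *dy = 0;
    cudaMalloc((void**)&dx, n * sizeof(float));
    cudaMalloc((void**)&dy, n * sizeof(float));
    cudaMemcpy(dx, hx.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    saxpy<<<(n + 255) / 256, 256>>>(2.0f, dx, dy, n);

    cudaMemcpy(hy.data(), dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", hy[0]);  // expect 4.0

    cudaFree(dx);
    cudaFree(dy);
    return 0;
}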
This chapter will discuss some implementation details of GPU Ocelot, particularly its PTX emulator, and how GPU Ocelot may be used to prototype, debug, and tune CUDA applications for efficient execution on GPUs. This gem will explain how users may benefit from the rich application profiling and correctness tools built into Ocelot, as well as how to extend Ocelot’s trace generator interface to perform custom workload characterization and profiling. Additionally, we will discuss GPU Ocelot’s role as a dynamic compilation framework for heterogeneous many-core compute systems that leverage GPUs and multicore CPUs.
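The trace generator interface mentioned above is extended by subclassing Ocelot's trace::TraceGenerator and registering an instance with the runtime. The sketch below counts dynamic instructions and active threads per kernel launch; the header paths, member names (kernel.name, e.active), and the registration call ocelot::addTraceGenerator() reflect a typical Ocelot source layout and may differ across versions, so they should be checked against the chapter and the Ocelot headers in use.

// A hedged sketch of a custom trace generator for the PTX emulator.
#include <ocelot/trace/interface/TraceGenerator.h>
#include <ocelot/trace/interface/TraceEvent.h>
#include <ocelot/api/interface/ocelot.h>

#include <iostream>
#include <string>

// Counts dynamic instructions and active threads for each kernel launch.
class InstructionCounter : public trace::TraceGenerator {
public:
    // Called once per kernel launch, before the first instruction executes.
    void initialize(const executive::ExecutableKernel& kernel) {
        _kernelName = kernel.name;
        _instructions = 0;
        _activeThreads = 0;
    }

    // Called for every dynamic instruction the emulator executes.
    void event(const trace::TraceEvent& e) {
        ++_instructions;
        _activeThreads += e.active.count();  // threads active for this instruction
    }

    // Called after the kernel finishes.
    void finish() {
        std::cout << _kernelName << ": " << _instructions
                  << " dynamic instructions, " << _activeThreads
                  << " active-thread instructions\n";
    }

private:
    std::string _kernelName;
    long long _instructions;
    long long _activeThreads;
};

// Register the generator before any kernels launch, e.g. at the top of main():
//   static InstructionCounter counter;
//   ocelot::addTraceGenerator(counter);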
Download
GPU Application Development, Debugging, and Performance Tuning with GPU Ocelot [Link]
Citation
@incollection{kerr2011ocelot,
  author    = {Andrew Kerr and Gregory Diamos and Sudhakar Yalamanchili},
  title     = {GPU Application Development, Debugging, and Performance Tuning with GPU Ocelot},
  booktitle = {GPU Computing GEMS Jade Edition},
  edition   = {1st},
  editor    = {Wen-mei W. Hwu},
  chapter   = {30},
  publisher = {Morgan Kaufmann},
  month     = sep,
  year      = {2011}
}