Exploring The Latency and Bandwidth Tolerance of CUDA Applications

Sudnya Padalikar and Gregory Diamos. “Exploring The Latency and Bandwidth Tolerance of CUDA Applications.” NFinTes Tech Report. December 2009.

Abstract

CUDA applications represent a new body of parallel programs. Although several paradigms exist for programming distributed systems and many-core processors, many users struggle to write programs that scale across systems with different hardware characteristics. This paper explores how well CUDA applications scale on systems with varying interconnect latency and bandwidth; if applications tolerate such variation, this hardware detail can be hidden from the programmer, making parallel programming more accessible to non-experts. We use a combination of the Ocelot PTX emulator and a discrete event simulator to evaluate the UIUC Parboil benchmarks on three distinct GPU configurations. We find that these applications are sensitive to neither interconnect latency nor bandwidth, and that integrated GPU-CPU systems are unlikely to perform any better than discrete GPUs or GPU clusters.

Citation

@techreport{exploring-the-latency-and-bandwidth-tolerance-of-cuda-applications,
title = {Exploring The Latency and Bandwidth Tolerance of CUDA Applications},
institution = {Georgia Institute of Technology},
author = {Sudnya Padalikar and Gregory Diamos},
year = {2009},
month = {December},
}