Software Reliability Enhancements for GPU Applications

Si Li, Naila Farooqui and Sudhakar Yalamanchili. “Software Reliability Enhancements for GPU Applications.” Sixth Workshop on Programmability Issues for Heterogeneous Multicores (MULTIPROG-2013), held in conjunction with the 8th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC). January 2013.

Abstract

As the role of highly parallel accelerators becomes more important in high-performance computing, so does the need to ensure their reliable operation. In applications where precision and correctness are a necessity, bit-level reliable operation is required. While mechanisms for error detection and correction exist, their cost-effective implementation in massively parallel accelerators is still an active area of research. In this paper we present an alternative software-based approach for improving the reliability of massively parallel bulk synchronous processors such as modern GPUs. Specifically, we propose a set of software reliability enhancements via transparent code patching of GPU applications. Reliability enhancements can be applied selectively at runtime, can be customized by the user, and are transparent to the application. Runtime overhead ranges from 1% to 737% depending on the nature of the enhancement. We provide an analysis of benefits and limitations.

Citation

@inproceedings{li-software-reliability,
author={Li, Si and Farooqui, Naila and Yalamanchili, Sudhakar},
booktitle={Sixth Workshop on Programmability Issues for Heterogeneous Multicores (MULTIPROG-2013)},
title={Software Reliability Enhancements for GPU Applications},
year={2013},
month={jan},
}