Software-based Dynamic Reliability Management for GPU Applications
Si Li, Vilas Sridharan, Sudhanva Gurumurthi and Sudhakar Yalamanchili. “Software-based Dynamic Reliability Management for GPU Applications.” IEEE Workshop on Silicon Errors in Logic – System Effects (SELSE 2015). March 2015.
Abstract
In this paper we advocate a framework for dynamic reliability management (DRM) for GPU applications based on the idea of plug-n-play software-based reliability enhancement (SRE). The approach entails first assessing the vulnerability of GPU kernels to soft errors in program-visible structures. This assessment is performed on a low-level intermediate program representation rather than the application source. Second, this assessment guides selective injection of code implementing SRE techniques to protect the most vulnerable data. Code injection occurs transparently at runtime using a just-in-time (JIT) compiler. Thus, reliability enhancement is selective, transparent, on-demand, and customizable. We argue this flexible, automated software-based DRM framework can provide an important, cost-effective approach to scaling reliability of large systems. We present the results of a proof-of-concept implementation on NVIDIA GPUs demonstrating the ability to traverse a range of performance-reliability tradeoffs.
Download
paper [PDF]
Citation
@inproceedings{li-selse2015,
author={Si Li and Vilas Sridharan and Sudhanva Gurumurthi and Sudhakar Yalamanchili},
booktitle={IEEE Workshop on Silicon Errors in Logic – System Effects (SELSE 2015)},
title={Software-based Dynamic Reliability Management for GPU Applications},
year={2015},
month={March},
}