Commodity Converged Fabrics for Global Address Spaces in Accelerator Clouds

Jeffrey Young and Sudhakar Yalamanchili, “Commodity Converged Fabrics for Global Address Spaces in Accelerator Clouds,” The 14th IEEE Conference on High Performance Computing and Communications (HPCC), June 2012.

Abstract

Hardware support for Global Address Spaces (GAS) has previously focused on providing efficient access across remote memories, typically using custom interconnects or high-level software layers. New technologies, such as Extoll, HyperShare, and NumaConnect, now allow for cheaper ways to build GAS support into the data center, thus making high-performance coherent and non-coherent remote memory access available for standard data center applications.

At the same time, data center designers are currently experimenting with a greater use of accelerators like GPUs to enhance traditionally CPU-oriented processes, such as data warehousing queries for in-core databases. However, there are very few workable approaches for these accelerator clusters that both use commodity interconnects and also support simple multi-node programming models, such as GAS. We propose a new commodity-based approach for supporting non-coherent GAS in accelerator clouds using the HyperTransport Consortium’s HyperTransport over Ethernet (HToE) specification. This work details a system model for using HToE for accelerated data warehousing applications and investigates potential bottlenecks and design optimizations for an HToE network adapter, or HyperTransport Ethernet Adapter (HTEA).

Using a detailed network simulator model and timing measured for queries run on high-end GPUs, we find that the addition of wider de-encapsulation pipelines and the use of bulk acknowledgments in the HTEA can improve overall throughput and reduce latency for multiple senders using a common accelerator. Furthermore, we show that the bandwidth of one receiving HTEA can vary from 2.8 Gbps to 24.45 Gbps, depending on the optimizations used, and the inter-HTEA latency for one packet is 1,480 ns. A brief analysis of the path from remote memory to accelerators also demonstrates that the bandwidth of today’s GPUs can easily handle a stream-based computation model using HToE.
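
To put the reported figures in perspective, the following is a rough back-of-the-envelope sketch in Python, not part of the paper's simulator. It uses only the three numbers quoted in the abstract (2.8 Gbps, 24.45 Gbps, and 1,480 ns). The function names are hypothetical, and treating the one-packet inter-HTEA latency as the window for a bandwidth-delay estimate is a simplifying assumption.

```python
# Figures quoted in the abstract; everything else here is illustrative.
MIN_BANDWIDTH_GBPS = 2.8       # receiving HTEA, least favorable configuration
MAX_BANDWIDTH_GBPS = 24.45     # receiving HTEA, most favorable configuration
INTER_HTEA_LATENCY_NS = 1480   # inter-HTEA latency for one packet


def bits_in_flight(bandwidth_gbps, latency_ns):
    """Bits sent during one latency period.

    Gbps is 1e9 bits/s and ns is 1e-9 s, so the scale factors cancel
    and the product is already in bits.
    """
    return bandwidth_gbps * latency_ns


def optimization_gain(low_gbps, high_gbps):
    """Ratio between the highest and lowest reported bandwidths."""
    return high_gbps / low_gbps


if __name__ == "__main__":
    for gbps in (MIN_BANDWIDTH_GBPS, MAX_BANDWIDTH_GBPS):
        in_flight_bytes = bits_in_flight(gbps, INTER_HTEA_LATENCY_NS) / 8
        print(f"{gbps} Gbps: about {in_flight_bytes:.0f} bytes in flight per latency period")
    gain = optimization_gain(MIN_BANDWIDTH_GBPS, MAX_BANDWIDTH_GBPS)
    print(f"Bandwidth ratio between configurations: about {gain:.1f}x")
```

Under these assumptions, the gap between the two configurations is close to an order of magnitude, which is why the choice of HTEA optimizations matters for multiple senders sharing one accelerator.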

Download

Paper [PDF] Slides [PDF]

Citation

@inproceedings{young_commodity_converged_hpcc_12,
  author    = {Young, Jeffrey and Yalamanchili, Sudhakar},
  title     = {Commodity Converged Fabrics for Global Address Spaces in Accelerator Clouds},
  booktitle = {HPCC-ICESS},
  year      = {2012},
  pages     = {303--310}
}