People
- Haicheng Wu
- Sudhakar Yalamanchili
Collaborators
- Greg Diamos
- Tim Sheard
- Molham Aref
- Martin Bravenboer
- Daniel Zinn
Red Fox: An Execution Environment for Relational Query Processing on GPUs
The enterprise software stack is a collection of infrastructure technologies supporting bookkeeping, analytics, planning, and forecasting applications for enterprise data. The task of constructing these applications is challenged by the increasing sophistication of the analysis required and the enormous volumes of data being generated and subject to analysis. Consequently, there is a growing demand for increased productivity in developing applications for sophisticated data analysis tasks such as optimized search, probabilistic computation, and deductive reasoning. This demand is accompanied by the exponential growth in the target data sets and commensurate demands for increased throughput to keep pace with the growth in data volumes. While programming languages and environments have emerged to address productivity and algorithmic issues, the ability to harness modern high performance hardware accelerators such as Graphics Processing Units (GPUs) is nascent at best. This is impeded in large part by the semantic gap between programming languages, models, and environments designed for productivity, and GPU hardware optimized for massive parallelism, speed, and energy efficiency.
Towards this end we propose a system called Red Fox that combines the productivity of declarative languages with the throughput performance of modern high performance GPUs to accelerate relational queries. Queries and constraints in the system are expressed in a high-level logic programming language called LogiQL. The relational data is stored as a key-value store to support a range of workloads corresponding to queries over data sets. The target systems are cloud systems comprising high performance multicore processors accelerated with general-purpose graphics processing units (GPGPUs or simply GPUs). Our baseline implementation executes on stock multicore blades and employs a runtime system, LogicBlox, that parcels out work units (e.g., a relational query over a data partition) to cores and manages out-of-core data sets. The envisioned accelerated configuration attaches GPUs to each node so that they can also execute such work units, where a relational query comprises a mix of relational algebra, arithmetic, and logic operators. The challenge is that while such queries appear to exhibit significant data parallelism, this parallelism is generally more unstructured and irregular than that encountered in traditional scientific computations. The result is a set of significant algorithmic and engineering challenges in harnessing the throughput capabilities of GPUs.
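To make the irregularity concrete, consider even the simplest relational primitive, a selection: its output size is data dependent, so a typical GPU formulation needs a marking pass, a prefix sum to compute output positions, and a scatter, rather than the fixed-shape loops common in dense scientific kernels. The sketch below illustrates this pattern only; the names (mark_matches, scatter_matches, select_gt) are illustrative assumptions and are not part of the Red Fox code base.

```cuda
#include <thrust/scan.h>
#include <thrust/execution_policy.h>

// Pass 1: mark the tuples that satisfy the selection predicate.
__global__ void mark_matches(const float* vals, int* flags, float lo, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) flags[i] = vals[i] > lo;            // 1 if the tuple survives
}

// Pass 3: write surviving tuples to their scan-computed output positions.
__global__ void scatter_matches(const float* vals, const int* flags,
                                const int* pos, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && flags[i]) out[pos[i]] = vals[i];  // data-dependent write
}

// Host-side driver over one data partition already resident in GPU memory
// (all pointers are device pointers).
void select_gt(const float* d_vals, float* d_out,
               int* d_flags, int* d_pos, float lo, int n) {
    int threads = 256, blocks = (n + threads - 1) / threads;
    mark_matches<<<blocks, threads>>>(d_vals, d_flags, lo, n);
    // Pass 2: output positions are only known after a prefix sum of the flags.
    thrust::exclusive_scan(thrust::device, d_flags, d_flags + n, d_pos);
    scatter_matches<<<blocks, threads>>>(d_vals, d_flags, d_pos, d_out, n);
}
```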
In Red Fox, an application is specified in LogiQL, a declarative programming model for database and business analytics applications. This development model for enterprise applications combines transactions with analytics by using a single expressive, declarative language amenable to efficient evaluation schemes, automatic parallelization, and transactional semantics. The application then passes through a series of compilation stages that progressively lowers LogiQL into primitive operators, primitive operators into computation kernels, and finally kernels into binaries that execute on the GPU. A set of algorithm skeletons for each operator is provided to the compiler as a CUDA template library. During compilation, the skeletons are instantiated to match the data structure format, types, and low-level operations used by a specific operator. The application is then serialized into a binary format that is executed on the GPU by a runtime implementation. Different optimizations can be applied during the compilation of the relational queries. One major optimization module already implemented is Kernel Weaver, which automatically applies kernel fusion to primitives to reduce data movement overhead through the CPU-GPU memory hierarchy.
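The following is a minimal sketch of what an instantiable operator skeleton might look like, and of how folding an arithmetic projection into a selection keeps each tuple in registers rather than materializing an intermediate relation in global memory between primitives. The template and functor names (select_map_skeleton, Row, GtZero, Discount, launch_example) are assumptions for illustration and do not reflect the actual Red Fox template library or Kernel Weaver output.

```cuda
#include <cuda_runtime.h>

// Skeleton: one thread per tuple; the predicate and arithmetic transform are
// supplied at compile time when a specific operator is instantiated.
template <typename Tuple, typename Pred, typename Transform>
__global__ void select_map_skeleton(const Tuple* in, Tuple* out, int* flags,
                                    Pred pred, Transform f, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        Tuple t = in[i];            // the tuple stays in registers ...
        bool keep = pred(t);        // ... while the selection predicate ...
        flags[i] = keep;            // (compaction happens in a later pass)
        if (keep) out[i] = f(t);    // ... and the fused projection are applied
    }
}

// Instantiation for one concrete operator: keep rows with a positive value
// and apply a 10% discount, in a single kernel launch.
struct Row { unsigned key; float value; };
struct GtZero   { __device__ bool operator()(const Row& r) const { return r.value > 0.0f; } };
struct Discount { __device__ Row  operator()(Row r) const { r.value *= 0.9f; return r; } };

void launch_example(const Row* d_in, Row* d_out, int* d_flags, int n) {
    int threads = 256, blocks = (n + threads - 1) / threads;
    select_map_skeleton<<<blocks, threads>>>(d_in, d_out, d_flags,
                                             GtZero{}, Discount{}, n);
}
```

A Kernel Weaver-style fusion of further downstream primitives would follow the same pattern: fold their per-tuple work into the functors so that intermediate results are neither written back to GPU global memory nor shipped across PCIe between kernel launches.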
Our current efforts are focused on two main thrusts:
1. The management of large data sets whose size exceeds available memory.
2. The efficient and effective compilation of relational queries operating over data partitions to make use of the throughput of GPUs.
Publications
- H. Wu, D. Zinn, M. Aref, and S. Yalamanchili. Multipredicate Join Algorithms for Accelerating Relational Graph Processing on GPUs. 2014 International Workshop on Accelerating Data Management Systems Using Modern Processor and Storage Architectures (ADMS). September 2014. paper
- H. Wu, G. Diamos, T. Sheard, M. Aref, S. Baxter, M. Garland, and S. Yalamanchili. Red Fox: An Execution Environment for Relational Query Processing on GPUs. 2014 International Symposium on Code Generation and Optimization (CGO). February 2014. paper
- S. Yalamanchili. Scaling Data Warehousing Applications using GPUs. Second International Workshop on Performance Analysis of Workload Optimized Systems (FastPath), held with ISPASS-2013. April 2013. slides
- G. Diamos, H. Wu, J. Wang, A. Lele, and S. Yalamanchili. Relational Algorithms for Multi-Bulk-Synchronous Processors (short paper). 18th Symposium on Principles and Practice of Parallel Programming (PPoPP). February 2013. paper
- H. Wu, G. Diamos, S. Cadambi, and S. Yalamanchili. Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation. 45th International Symposium on Microarchitecture (MICRO). December 2012. paper
- J. Young, H. Wu, and S. Yalamanchili. Satisfying Data-Intensive Queries Using GPU Clusters. 2nd Workshop on High-Performance Computing meets Databases (HPCDB), held with SC-2012. December 2012. paper
- H. Wu, G. Diamos, J. Wang, S. Cadambi, S. Yalamanchili, and S. Chakradhar. Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission. Multicore and GPU Programming Models, Languages and Compilers Workshop (PLC), held with IPDPS 2012. May 2012. paper
The papers are provided for personal use and are subject to the publishers' copyright.
Latest TPC-H Scale Factor 1 (SF 1) Performance (03/11/2014)
Query Plan Google Protocol Buffer Format
Query Plans of All TPC-H Queries
Query | Time w/ PCIe transfer (ms) | Time w/o PCIe transfer (ms) |
1 | 227.910424 | 182.9421 |
2 | 26.653246 | 23.504672 |
3 | 99.674513 | 70.728145 |
4 | 57.86555 | 37.301764 |
5 | 80.037306 | 56.318716 |
6 | 111.093176 | 78.355798 |
7 | 78.719872 | 48.757664 |
8 | 120.996111 | 92.740311 |
9 | 147.406996 | 109.94709 |
10 | 120.871551 | 95.064347 |
11 | 14.164316 | 12.40432 |
12 | 245.977377 | 210.386331 |
13 | 75.271428 | 55.799332 |
14 | 136.013636 | 106.979594 |
15 | 70.576322 | 41.722534 |
16 | 15.575172 | 14.291652 |
17 | 101.674666 | 80.16481 |
18 | 34.447502 | 19.361732 |
19 | 187.019087 | 140.990239 |
20 | 50.3217 | 23.257632 |
21 | 86.271616 | 63.418206 |
22 | 18.703622 | 15.110592 |