Beyond the socket
Open Access
- 14 October 2017
- conference paper
- Published by Association for Computing Machinery (ACM)
Abstract
GPUs achieve high throughput and power efficiency by employing many small single-instruction, multiple-thread (SIMT) cores. To minimize scheduling logic and performance variance, they utilize a uniform memory system and leverage the strong data parallelism exposed by the programming model. With Moore's law slowing, for GPUs to continue scaling performance (which largely depends on SIMT core count) they are likely to embrace multi-socket designs, where transistors are more readily available. However, when moving to such designs, maintaining the illusion of a uniform memory system is increasingly difficult. In this work we investigate multi-socket non-uniform memory access (NUMA) GPU designs and show that significant changes are needed to both the GPU interconnect and cache architectures to achieve performance scalability. We show that application phase effects can be exploited to allow GPU sockets to dynamically optimize their individual interconnect and cache policies, minimizing the impact of NUMA effects. Our NUMA-aware GPU outperforms a single GPU by 1.5×, 2.3×, and 3.2× while achieving 89%, 84%, and 76% of theoretical application scalability in 2-, 4-, and 8-socket designs, respectively. Implementable today, NUMA-aware multi-socket GPUs are a promising candidate for scaling GPU performance beyond a single socket.
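Since the abstract reports both absolute speedups and the fraction of theoretical application scalability achieved, the theoretical (application-limited) scaling implied by each configuration can be recovered as speedup divided by efficiency. The sketch below is an illustrative calculation over the numbers quoted above, not an analysis from the paper itself:

```python
# Reported NUMA-aware GPU speedups over a single GPU, and the fraction of
# theoretical application scalability achieved, as quoted in the abstract.
sockets = [2, 4, 8]
speedup = [1.5, 2.3, 3.2]
efficiency = [0.89, 0.84, 0.76]

# Implied theoretical application scaling = speedup / efficiency.
# Note these fall short of linear (2x, 4x, 8x): the ceiling is set by the
# applications themselves, not only by NUMA overheads.
implied = [s / e for s, e in zip(speedup, efficiency)]
for n, t in zip(sockets, implied):
    print(f"{n} sockets: implied theoretical scaling ~ {t:.2f}x")
```

For example, the 2-socket result implies a theoretical application ceiling of roughly 1.69×, so the reported 1.5× is close to the best the workloads allow.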
Funding Information
- Ministry of Economy and Competitiveness of Spain (TIN2012-34557, TIN2015-65316-P, BES-2013-063925)