Beyond the socket
Open Access
- 14 October 2017
- conference paper
- Published by Association for Computing Machinery (ACM)
Abstract
GPUs achieve high throughput and power efficiency by employing many small single-instruction, multiple-thread (SIMT) cores. To minimize scheduling logic and performance variance, they utilize a uniform memory system and leverage the strong data parallelism exposed by the programming model. With Moore's law slowing, for GPUs to continue scaling performance (which largely depends on SIMT core count) they are likely to embrace multi-socket designs, where transistors are more readily available. However, when moving to such designs, maintaining the illusion of a uniform memory system is increasingly difficult. In this work we investigate multi-socket non-uniform memory access (NUMA) GPU designs and show that significant changes are needed to both the GPU interconnect and cache architectures to achieve performance scalability. We show that application phase effects can be exploited to allow GPU sockets to dynamically optimize their individual interconnect and cache policies, minimizing the impact of NUMA effects. Our NUMA-aware GPU outperforms a single GPU by 1.5×, 2.3×, and 3.2× while achieving 89%, 84%, and 76% of theoretical application scalability in 2-, 4-, and 8-socket designs, respectively. Implementable today, NUMA-aware multi-socket GPUs are a promising candidate for scaling GPU performance beyond a single socket.
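Since the abstract reports both absolute speedups and the fraction of theoretical application scalability achieved, the theoretical (application-limited) scaling implied by each configuration can be recovered as speedup divided by efficiency. The sketch below is an illustrative calculation over the numbers quoted above, not an analysis from the paper itself:

```python
# Reported NUMA-aware GPU speedups over a single GPU, and the fraction of
# theoretical application scalability achieved, as quoted in the abstract.
sockets = [2, 4, 8]
speedup = [1.5, 2.3, 3.2]
efficiency = [0.89, 0.84, 0.76]

# Implied theoretical application scaling = speedup / efficiency.
# Note these fall short of linear (2x, 4x, 8x): the ceiling is set by the
# applications themselves, not only by NUMA overheads.
implied = [s / e for s, e in zip(speedup, efficiency)]
for n, t in zip(sockets, implied):
    print(f"{n} sockets: implied theoretical scaling ~ {t:.2f}x")
```

For example, the 2-socket result implies a theoretical application ceiling of roughly 1.69×, so the reported 1.5× is close to the best the workloads allow.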
Funding Information
- Ministry of Economy and Competitiveness of Spain (TIN2012-34557, TIN2015-65316-P, BES-2013-063925)