A Conceptual Framework for HPC Operational Data Analytics
- 1 September 2021
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE) in 2021 IEEE International Conference on Cluster Computing (CLUSTER)
Abstract
This paper provides a broad framework for understanding trends in Operational Data Analytics (ODA) for High-Performance Computing (HPC) facilities. The goal of ODA is to allow for the continuous monitoring, archiving, and analysis of near real-time performance data, providing immediately actionable information for multiple operational uses. In this work, we combine two models to provide a comprehensive HPC ODA framework: one is an evolutionary model of analytics capabilities that consists of four types, which are descriptive, diagnostic, predictive and prescriptive, while the other is a four-pillar model for energy-efficient HPC operations that covers facility, system hardware, system software, and applications. This new framework is then overlaid with a description of current development and production deployments of ODA within leading-edge HPC facilities. Finally, we perform a comprehensive survey of ODA works and classify them according to our framework, in order to demonstrate its effectiveness.
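The two models named in the abstract can be read as the axes of a 4x4 grid, with each surveyed ODA work placed in one cell. The following is a minimal sketch of that reading, not the authors' implementation: the enum names, the `build_grid` helper, and the example entries are all hypothetical, and only the four analytics types and four pillars come from the abstract.

```python
from collections import defaultdict
from enum import Enum


class AnalyticsType(Enum):
    """The four analytics types listed in the abstract."""
    DESCRIPTIVE = "descriptive"
    DIAGNOSTIC = "diagnostic"
    PREDICTIVE = "predictive"
    PRESCRIPTIVE = "prescriptive"


class Pillar(Enum):
    """The four pillars of energy-efficient HPC operations listed in the abstract."""
    FACILITY = "facility"
    SYSTEM_HARDWARE = "system hardware"
    SYSTEM_SOFTWARE = "system software"
    APPLICATIONS = "applications"


def build_grid(works):
    """Group (title, analytics type, pillar) triples by grid cell (hypothetical helper)."""
    grid = defaultdict(list)
    for title, analytics_type, pillar in works:
        grid[(analytics_type, pillar)].append(title)
    return grid


# Hypothetical placeholder entries; these are NOT classifications taken from the paper.
example_works = [
    ("work A", AnalyticsType.DESCRIPTIVE, Pillar.FACILITY),
    ("work B", AnalyticsType.PREDICTIVE, Pillar.APPLICATIONS),
    ("work C", AnalyticsType.DESCRIPTIVE, Pillar.FACILITY),
]

grid = build_grid(example_works)
n_cells = len(AnalyticsType) * len(Pillar)  # size of the full classification grid
```

Keying the grid on an (analytics type, pillar) pair keeps the two models independent, so a survey entry only needs one label from each axis.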
Funding Information
- Battelle