Semi-supervised anomaly detection on a Tier-0 HPC system
- 17 May 2022
- conference paper
- Published by Association for Computing Machinery (ACM) in Proceedings of the 19th ACM International Conference on Computing Frontiers
Abstract
Automated, data-driven methodologies are being introduced to assist system administrators in managing increasingly complex modern HPC systems. Anomaly detection (AD) is integral to improving overall system availability: it eases the administrators' burden and shortens the time between an anomaly and its resolution. This work improves upon the current state-of-the-art (SoA) AD model by accounting for temporal dependencies in the data, which it does by including long short-term memory (LSTM) cells in the architecture of the AD model. The proposed model is evaluated on a complete ten-month history of a Tier-0 system (Marconi100 at CINECA, consisting of 985 nodes). It achieves an area under the curve (AUC) of 0.758, compared with 0.747 for the state-of-the-art approach.
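The AUC figures quoted in the abstract can be read more concretely with a small worked example. The sketch below assumes the curve in question is the ROC curve (the abstract does not say) and that each sample carries a binary label and a scalar anomaly score. The function name `roc_auc` and all labels and scores are hypothetical illustrations, not data or code from the paper.

```python
def roc_auc(labels, scores):
    """ROC AUC computed as the probability that a randomly chosen
    anomalous sample receives a higher score than a randomly chosen
    normal sample (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]  # anomalous
    neg = [s for y, s in zip(labels, scores) if y == 0]  # normal
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = anomaly) and anomaly scores for eight samples.
labels = [0, 0, 0, 1, 0, 1, 0, 1]
scores = [0.10, 0.35, 0.20, 0.80, 0.60, 0.55, 0.15, 0.90]

auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

In this toy data, 14 of the 15 anomalous/normal pairs are ordered correctly, so the AUC is 14/15, roughly 0.933. An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect one, which is the scale on which 0.758 and 0.747 should be compared.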
This publication has 5 references indexed in Scilit:
- Revealing power, energy and thermal dynamics of a 200PF pre-exascale supercomputer. Published by Association for Computing Machinery (ACM), 2021
- A Conceptual Framework for HPC Operational Data Analytics. Published by Institute of Electrical and Electronics Engineers (IEEE), 2021
- Anomaly Detection and Anticipation in High Performance Computing Systems. IEEE Transactions on Parallel and Distributed Systems, 2021
- Future of HPC: Diversifying Heterogeneity. Published by Institute of Electrical and Electronics Engineers (IEEE), 2021
- Monitoring of Exascale data processing. Published by Institute of Electrical and Electronics Engineers (IEEE), 2019