Semi-supervised anomaly detection on a Tier-0 HPC system

Abstract
Automated, data-driven methodologies are being introduced to assist system administrators in managing increasingly complex modern HPC systems. Anomaly detection (AD) is integral to improving overall system availability: it eases the system administrators' burden and reduces the time between an anomaly and its resolution. This work improves upon the current state-of-the-art (SoA) AD model by accounting for temporal dependencies in the data, which it does by including long short-term memory (LSTM) cells in the architecture of the AD model. The proposed model is evaluated on a complete ten-month history of a Tier-0 system (Marconi100 at CINECA, consisting of 985 nodes). It achieves an area under the curve (AUC) of 0.758, compared with an AUC of 0.747 for the state-of-the-art approach.
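To make the reported metric concrete, the following is a minimal sketch of how an AUC can be computed from per-sample anomaly scores and ground-truth labels, using the pairwise (Mann-Whitney) formulation of the area under the ROC curve. The function name `roc_auc` and the toy labels and scores are illustrative assumptions; this is not the evaluation code or data used in the work.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney U).

    labels: 1 marks an anomalous sample, 0 a normal one.
    scores: anomaly scores, where a higher score means "more anomalous".
    Returns the fraction of (anomalous, normal) pairs in which the
    anomalous sample receives the higher score; ties count as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one anomalous and one normal sample")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (not from the Marconi100 dataset).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(toy_labels, toy_scores))  # 0.75
```

An AUC of 0.5 corresponds to scores that rank anomalous and normal samples no better than chance, and 1.0 to a perfect ranking. The quadratic pairwise loop is chosen here for readability; a rank-based implementation would be preferable for large datasets.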
