LogDiver
- 15 June 2015
- conference paper
- Published by Association for Computing Machinery (ACM)
Abstract
This paper presents LogDiver, a tool for the analysis of application-level resiliency in extreme-scale computing systems. The tool has been implemented to handle data generated by system monitoring tools in Blue Waters, the petascale machine in production at the University of Illinois' National Center for Supercomputing Applications. The tool is able: i) to filter, extract, and classify error data from different sources of information, such as system logs, hardware sensors and workload logs; ii) to extract signals from the categorized errors; iii) to consolidate user application data and decode application and job exit status, highlighting the reasons for the application/job exit; and iv) to correlate application failures with errors using a mix of empirical and analytical techniques. To the best of our knowledge, this is the first tool capable of measuring application-level resiliency in extreme-scale machines. We also demonstrate the power of the tool by showing that XK applications are more vulnerable to failures when compared to XE applications.
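Illustrative sketch (not taken from the paper): step iv of the abstract, correlating application failures with errors, could be approximated by matching each failed run against error events on the same nodes within the run's lifetime. Every name below (ErrorEvent, AppRun, correlate, the slack parameter) and all sample data are hypothetical, and the simple node-and-time-window match is an assumed stand-in for the "mix of empirical and analytical techniques" the abstract mentions, not LogDiver's actual method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorEvent:
    """A categorized error (hypothetical schema, not LogDiver's)."""
    timestamp: float   # seconds since some common epoch
    node: str          # identifier of the node that logged the error
    category: str      # e.g. "memory", "gpu", "interconnect"


@dataclass(frozen=True)
class AppRun:
    """One application run (hypothetical schema, not LogDiver's)."""
    app_id: str
    start: float
    end: float
    nodes: frozenset   # nodes the run executed on
    exit_code: int     # assumption: nonzero means the run failed


def correlate(runs, errors, slack=60.0):
    """Map each failed run to the errors seen on its nodes.

    An error matches when it occurred on one of the run's nodes between
    the run's start and `slack` seconds after its end. The slack value
    is an assumed tolerance for delayed log entries.
    """
    matches = {}
    for run in runs:
        if run.exit_code == 0:
            continue  # only failed runs are of interest
        matches[run.app_id] = [
            e for e in errors
            if e.node in run.nodes
            and run.start <= e.timestamp <= run.end + slack
        ]
    return matches


# Made-up sample data, for illustration only.
runs = [
    AppRun("app-1", 0.0, 100.0, frozenset({"n1", "n2"}), 1),
    AppRun("app-2", 0.0, 100.0, frozenset({"n3"}), 0),
    AppRun("app-3", 200.0, 300.0, frozenset({"n1"}), 1),
]
errors = [
    ErrorEvent(95.0, "n2", "memory"),
    ErrorEvent(50.0, "n3", "gpu"),
    ErrorEvent(500.0, "n1", "interconnect"),
]
result = correlate(runs, errors)
```

In this sample, the failed run app-1 is matched to the memory error on n2, the successful run app-2 is skipped, and the failed run app-3 has no matching error, which is the kind of case where exit-status decoding (step iii) would be needed to explain the failure.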
Funding Information
- National Science Foundation (CNS 10-18503 CISE)
- Air Force Research Laboratory (FA8750-11-2-0084)