Minimizing Thermal Variation Across System Components

Abstract
Overheating is a serious concern in modern supercomputing systems. Elevated temperatures reduce the reliability and lifetime of the underlying hardware and increase its power consumption. Previous studies on mitigating thermal hotspots at the hardware and runtime-system levels have typically traded performance for reduced operating temperatures. In this paper, we first show that in a large-scale system, physical attributes cause an uneven temperature distribution. We then develop a model that characterizes the thermal behaviour of a complex system using several machine learning methods. We propose to improve application placement by incorporating thermal awareness into the decision-making process. Specifically, our system predicts the thermal condition of the machine for a given application mapping and uses these predictions to mitigate thermal hotspots without any performance loss. We provide two versions of our prediction mechanism. On a two-node configuration, these models achieve success rates of 72.5% and 78.8%, respectively; that is, their scheduling decisions yield task placements with a lower maximum average temperature. Overall, the more aggressive scheme reduces the average peak temperature by up to 11.9°C (2.3°C on average) without any performance degradation.
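The placement idea summarized above can be sketched as follows. This is a minimal illustration, not the paper's actual model: the learned thermal predictor is replaced by a hypothetical toy linear model (`NODE_BIAS`, the per-watt heating coefficient, and all task/node names are assumptions), and the scheduler simply enumerates task-to-node mappings and keeps the one with the lowest predicted peak temperature.

```python
from itertools import permutations

# Hypothetical stand-in for the learned thermal model. Each node has a
# physical "cooling bias" reflecting the uneven temperature distribution
# across the machine, plus heat contributed by the tasks it runs.
NODE_BIAS = {"node0": 30.0, "node1": 38.0}  # assumed: node1 sits in a hotter spot


def predict_peak_temp(placement, task_power):
    """Return the maximum predicted node temperature (deg C) for a placement.

    placement: dict task -> node; task_power: dict task -> watts.
    The 0.1 deg C per watt coefficient is an illustrative assumption.
    """
    node_temp = dict(NODE_BIAS)
    for task, node in placement.items():
        node_temp[node] += 0.1 * task_power[task]
    return max(node_temp.values())


def thermal_aware_place(tasks, nodes, task_power):
    """Pick the task-to-node mapping with the lowest predicted peak temperature."""
    best_placement, best_peak = None, float("inf")
    for perm in permutations(nodes, len(tasks)):
        placement = dict(zip(tasks, perm))
        peak = predict_peak_temp(placement, task_power)
        if peak < best_peak:
            best_placement, best_peak = placement, peak
    return best_placement, best_peak
```

Under these assumptions, a power-hungry task is steered to the physically cooler node: `thermal_aware_place(["hot", "cool"], ["node0", "node1"], {"hot": 200, "cool": 50})` maps `hot` to `node0` because that assignment minimizes the predicted peak temperature, without reducing the work performed.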