Trojan data layouts

26 October 2011

conference paper
Published by Association for Computing Machinery (ACM) in Proceedings of the 2nd ACM Symposium on Cloud Computing - SOCC '11

Abstract

MapReduce is becoming ubiquitous in large-scale data analysis. Several recent works have shown that the performance of Hadoop MapReduce could be improved, for instance, by creating indexes in a non-invasive manner. However, they ignore the impact of the data layout used inside the data blocks of the Hadoop Distributed File System (HDFS). In this paper, we analyze different data layouts in detail in the context of MapReduce and argue that Row, Column, and PAX layouts can lead to poor system performance. We propose a new data layout, coined Trojan Layout, that internally organizes data blocks into attribute groups according to the workload in order to improve data access times. A salient feature of Trojan Layout is that it fully preserves the fault-tolerance properties of MapReduce. We implement our Trojan Layout idea in HDFS 0.20.3 and call the resulting system Trojan HDFS. We exploit the fact that HDFS stores multiple replicas of each data block on different computing nodes. Trojan HDFS automatically creates a different Trojan Layout per replica to better fit the workload. As a result, we are able to schedule incoming MapReduce jobs to data block replicas with the most suitable Trojan Layout. We evaluate our approach using three real-world workloads. We compare Trojan Layouts against Hadoop using Row and PAX layouts. The results demonstrate that Trojan Layout allows MapReduce jobs to read their input data up to 4.8 times faster than Row layout and up to 3.5 times faster than PAX layout.
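
The idea of keeping a different attribute grouping on each replica and routing a job to the best-fitting one can be illustrated with a minimal sketch. This is not the paper's implementation or cost model: the function names (`bytes_read`, `pick_replica`), the attribute widths, the node names, and the assumption that a job must read every attribute group containing at least one attribute it needs are all hypothetical, chosen only to make the principle concrete.

```python
from typing import Dict, FrozenSet, List, Set

# A layout is a partition of the attributes into groups (hypothetical model).
Layout = List[FrozenSet[str]]


def bytes_read(layout: Layout, needed: Set[str], width: Dict[str, int]) -> int:
    """Bytes read per tuple, assuming a group is read whole if the job
    needs at least one of its attributes."""
    return sum(
        sum(width[attr] for attr in group)
        for group in layout
        if group & needed
    )


def pick_replica(replicas: Dict[str, Layout], needed: Set[str],
                 width: Dict[str, int]) -> str:
    """Return the node whose replica layout minimizes bytes read.
    Ties go to the first node in dictionary order."""
    return min(replicas, key=lambda node: bytes_read(replicas[node], needed, width))


# Illustrative attribute widths in bytes (made up for this example).
width = {"id": 8, "name": 32, "price": 8, "comment": 100}

# Three replicas of the same block, each with a different grouping.
replicas = {
    "node1": [frozenset({"id", "name", "price", "comment"})],  # row-like: one group
    "node2": [frozenset({"id", "price"}), frozenset({"name", "comment"})],
    "node3": [frozenset({"id", "name"}), frozenset({"price"}), frozenset({"comment"})],
}

# A job touching only id and price fits node2's grouping best.
print(pick_replica(replicas, {"id", "price"}, width))
# A job touching only name fits node3's grouping best.
print(pick_replica(replicas, {"name"}, width))
```

Under these assumptions, a single row-like group forces every job to read the full tuple, while a grouping that matches the job's attribute set reads only what is needed. The sketch ignores tuple-reconstruction cost, compression, and data locality, all of which a real scheduler would have to weigh.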

Cited by 60 articles