Trojan data layouts
- 26 October 2011
- conference paper
- Published by Association for Computing Machinery (ACM) in Proceedings of the 2nd ACM Symposium on Cloud Computing - SOCC '11
Abstract
MapReduce is becoming ubiquitous in large-scale data analysis. Several recent works have shown that the performance of Hadoop MapReduce could be improved, for instance, by creating indexes in a non-invasive manner. However, they ignore the impact of the data layout used inside data blocks of Hadoop Distributed File System (HDFS). In this paper, we analyze different data layouts in detail in the context of MapReduce and argue that Row, Column, and PAX layouts can lead to poor system performance. We propose a new data layout, coined Trojan Layout, that internally organizes data blocks into attribute groups according to the workload in order to improve data access times. A salient feature of Trojan Layout is that it fully preserves the fault-tolerance properties of MapReduce. We implement our Trojan Layout idea in HDFS 0.20.3 and call the resulting system Trojan HDFS. We exploit the fact that HDFS stores multiple replicas of each data block on different computing nodes. Trojan HDFS automatically creates a different Trojan Layout per replica to better fit the workload. As a result, we are able to schedule incoming MapReduce jobs to data block replicas with the most suitable Trojan Layout. We evaluate our approach using three real-world workloads. We compare Trojan Layouts against Hadoop using Row and PAX layouts. The results demonstrate that Trojan Layout allows MapReduce jobs to read their input data up to 4.8 times faster than Row layout and up to 3.5 times faster than PAX layout.
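The idea of keeping a different attribute grouping on each replica and routing a job to the best-fitting one can be illustrated with a small sketch. This is not the paper's algorithm or the Trojan HDFS code: the cost model (a job reads in full every attribute group that contains at least one attribute it references), the attribute names and widths, the node names, and the functions `bytes_read_per_row` and `pick_replica` are all assumptions made for illustration. The model also ignores seek and tuple-reconstruction costs.

```python
from typing import Dict, FrozenSet, List

# A layout is a list of attribute groups; each group is stored contiguously
# inside the data block (hypothetical representation for this sketch).
Layout = List[FrozenSet[str]]


def bytes_read_per_row(layout: Layout, widths: Dict[str, int],
                       query_attrs: FrozenSet[str]) -> int:
    """Assumed cost model: every group that shares at least one attribute
    with the query must be read in full."""
    return sum(
        sum(widths[attr] for attr in group)
        for group in layout
        if group & query_attrs
    )


def pick_replica(replicas: Dict[str, Layout], widths: Dict[str, int],
                 query_attrs: FrozenSet[str]) -> str:
    """Return the node whose replica layout minimizes bytes read per row."""
    return min(
        replicas,
        key=lambda node: bytes_read_per_row(replicas[node], widths, query_attrs),
    )


# Hypothetical schema: attribute widths in bytes (48 bytes per full row).
widths = {"id": 4, "name": 32, "price": 8, "qty": 4}

# Three replicas of the same block, each with a different attribute grouping.
replicas = {
    "node_a": [frozenset({"id", "name"}), frozenset({"price", "qty"})],
    "node_b": [frozenset({"id", "price"}), frozenset({"name", "qty"})],
    "node_c": [frozenset({"id"}), frozenset({"name", "price", "qty"})],
}

# A job that only touches price and qty is best served by node_a,
# where it reads 12 bytes per row instead of the full 48.
best = pick_replica(replicas, widths, frozenset({"price", "qty"}))
print(best)
```

Because all three replicas hold the same rows and differ only in how attributes are grouped, any replica can still serve any job, which is consistent with the abstract's point that fault tolerance is preserved; the choice of replica affects only how much data is read.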