Proximity-Aware Local-Recoding Anonymization with MapReduce for Scalable Big Data Privacy Preservation in Cloud
- 26 September 2014
- journal article
- Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Computers
- Vol. 64 (8), 2293-2307
- https://doi.org/10.1109/tc.2014.2360516
Abstract
Cloud computing provides promising, scalable IT infrastructure to support the processing of a variety of big data applications in sectors such as healthcare and business. Data sets in such applications, such as electronic health records, often contain privacy-sensitive information, which raises privacy concerns if the information is released or shared with third parties in the cloud. A practical and widely adopted technique for data privacy preservation is to anonymize data via generalization to satisfy a given privacy model. However, most existing privacy-preserving approaches are tailored to small-scale data sets and often fall short when encountering big data, due to their insufficiency or poor scalability. In this paper, we investigate the local-recoding problem for big data anonymization against proximity privacy breaches and attempt to identify a scalable solution to this problem. Specifically, we present a proximity privacy model that allows for semantic proximity of sensitive values and multiple sensitive attributes, and model the problem of local recoding as a proximity-aware clustering problem. A scalable two-phase clustering approach, consisting of a t-ancestors clustering (similar to k-means) algorithm and a proximity-aware agglomerative clustering algorithm, is proposed to address the above problem. We design the algorithms with MapReduce to gain high scalability by performing data-parallel computation in the cloud. Extensive experiments on real-life data sets demonstrate that our approach significantly improves the capability of defending against proximity privacy breaches, as well as the scalability and time-efficiency of local-recoding anonymization, over existing approaches.
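The two-phase idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the seed-based partitioning below is a simple stand-in for t-ancestors clustering, the merge cost is an assumed heuristic, and all names (`RECORDS`, `map_assign`, `reduce_group`, `merge_cost`, `agglomerate`, `recode`, `anonymize`) and the toy data are hypothetical. Phase one assigns records to coarse partitions in a map/reduce style; phase two greedily merges records inside each partition until every cluster holds at least k records, preferring merges that keep the quasi-identifier range narrow while spreading the sensitive values apart; each cluster is then locally recoded to a range.

```python
from collections import defaultdict

# Hypothetical toy records: (age, salary_in_thousands).
# Age is the quasi-identifier; salary is a numeric sensitive attribute.
RECORDS = [(23, 30), (25, 31), (27, 90), (41, 50),
           (43, 52), (45, 95), (62, 40), (64, 41)]

def map_assign(record, seeds):
    """Map step: emit (index of the nearest seed, record)."""
    age = record[0]
    nearest = min(range(len(seeds)), key=lambda i: abs(age - seeds[i]))
    return nearest, record

def reduce_group(pairs):
    """Reduce step: collect the records that share a partition key."""
    groups = defaultdict(list)
    for key, record in pairs:
        groups[key].append(record)
    return dict(groups)

def merge_cost(a, b, weight=0.1):
    """Assumed heuristic: a narrow age range is cheap, and a wide spread
    of sensitive values lowers the cost further (proximity awareness)."""
    merged = a + b
    ages = [r[0] for r in merged]
    sens = [r[1] for r in merged]
    return (max(ages) - min(ages)) - weight * (max(sens) - min(sens))

def agglomerate(records, k):
    """Greedy bottom-up merging until every cluster has at least k records.
    A partition with fewer than k records ends as one undersized cluster."""
    clusters = [[r] for r in records]
    while len(clusters) > 1 and any(len(c) < k for c in clusters):
        small = min((c for c in clusters if len(c) < k), key=len)
        others = [c for c in clusters if c is not small]
        partner = min(others, key=lambda c: merge_cost(small, c))
        clusters = [c for c in others if c is not partner] + [small + partner]
    return clusters

def recode(cluster):
    """Local recoding: replace each age with the cluster's (min, max) range."""
    ages = [r[0] for r in cluster]
    lo, hi = min(ages), max(ages)
    return [((lo, hi), r[1]) for r in cluster]

def anonymize(records, seeds, k):
    """Phase 1: partition by nearest seed. Phase 2: cluster and recode."""
    partitions = reduce_group(map_assign(r, seeds) for r in records)
    out = []
    for part in partitions.values():
        for cluster in agglomerate(part, k):
            out.extend(recode(cluster))
    return out

if __name__ == "__main__":
    for row in anonymize(RECORDS, seeds=[25, 43, 63], k=2):
        print(row)
```

In a real MapReduce deployment the map and reduce functions would run on separate workers over data splits; here they are ordinary Python functions so that the data flow stays visible.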
Funding Information
- Australian Research Council (LP140100816)
- National Science Foundation of China (91318301)
- NSERC
This publication has 33 references indexed in Scilit:
- Minimal MapReduce algorithms. Association for Computing Machinery (ACM), 2013
- A Scalable Two-Phase Top-Down Specialization Approach for Data Anonymization Using MapReduce on Cloud. IEEE Transactions on Parallel and Distributed Systems, 2013
- Clustering very large multi-dimensional datasets with MapReduce. Association for Computing Machinery (ACM), 2011
- Achieving anonymity via clustering. ACM Transactions on Algorithms, 2010
- XColor: Protecting general proximity privacy. Institute of Electrical and Electronics Engineers (IEEE), 2010
- A General Proximity Privacy Principle. International Conference on Data Engineering, 2009
- Workload-aware anonymization techniques for large-scale datasets. ACM Transactions on Database Systems, 2008
- K-Anonymization as Spatial Indexing: Toward Scalable and Incremental Anonymization. Institute of Electrical and Electronics Engineers (IEEE), 2007
- Approximation schemes for Euclidean k-medians and related problems. Association for Computing Machinery (ACM), 1998
- Cumulative Frequency Functions. The Annals of Mathematical Statistics, 1942