On-the-Fly Hierarchies for Numerical Attributes in Data Anonymization

Data analysts often prefer access to data in the form of original tuples (i.e., microdata) rather than pre-aggregated statistics, since microdata offer greater flexibility and availability of information. Two problems must be addressed before releasing microdata. First, individuals' privacy needs to be adequately protected; in general, the data are anonymized before sharing. Second, the utility of the anonymized microdata should be maintained, so that common aggregate queries can be answered with reasonable accuracy. Most existing work on microdata anonymization is based on attribute generalization. Though popular, these approaches have a limitation: generalizing attributes makes it difficult to answer typical aggregate queries with reasonable accuracy. This dissertation investigates new techniques to address the limitations of existing approaches. We propose to anonymize microdata through permutation-based approaches. First, we extend existing privacy goals to better fit the protection requirements of numerical data, and develop a scheme that achieves this privacy goal through sensitive attribute permutation. Second, we propose a stronger privacy goal under which an attacker can learn from the microdata only that an individual's sensitive attribute follows a pre-specified target distribution, and nothing more. We combine sensitive attribute permutation and generalization techniques to achieve this goal. To obtain better query answers when the target distribution is far from that of the original microdata, we further provide mechanisms that let users control the tradeoff between privacy and accuracy. Third, we extend our techniques to anonymize graph data and support accurate answering of queries that involve graph properties. Specifically, we partition the nodes and relabel (a form of permutation) the nodes within each partition. Finally, we study anonymization techniques that support personalized privacy, which allows individuals to flexibly control the privacy protection they desire.
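The abstract does not say how records are grouped before permutation, so the following is only a minimal sketch of the general idea of sensitive attribute permutation, not the dissertation's scheme. It assumes a caller-supplied grouping function (`group_key`, a hypothetical name) and shuffles the sensitive values among the records of each group. Because the values themselves are not altered, any aggregate computed over whole groups is unchanged.

```python
import random
from collections import Counter


def permute_sensitive(records, group_key, sensitive, seed=None):
    """Return a copy of `records` in which the `sensitive` attribute has been
    shuffled among the records that share the same `group_key` value.

    `group_key` is an assumed, caller-supplied function; how groups should be
    formed to meet a given privacy goal is outside the scope of this sketch.
    """
    rng = random.Random(seed)

    # Collect the positions of the records that belong to each group.
    groups = {}
    for idx, rec in enumerate(records):
        groups.setdefault(group_key(rec), []).append(idx)

    # Work on copies so the input records are left untouched.
    out = [dict(rec) for rec in records]
    for indices in groups.values():
        values = [records[i][sensitive] for i in indices]
        rng.shuffle(values)
        for i, value in zip(indices, values):
            out[i][sensitive] = value
    return out


def group_sum(records, group_key, sensitive):
    """Sum of the sensitive attribute per group (a simple aggregate query)."""
    totals = Counter()
    for rec in records:
        totals[group_key(rec)] += rec[sensitive]
    return dict(totals)


if __name__ == "__main__":
    data = [
        {"age": 23, "salary": 100},
        {"age": 27, "salary": 200},
        {"age": 25, "salary": 300},
        {"age": 41, "salary": 400},
        {"age": 45, "salary": 500},
    ]
    by_decade = lambda rec: rec["age"] // 10
    released = permute_sensitive(data, by_decade, "salary", seed=0)
    # Per-group sums are identical before and after permutation.
    print(group_sum(data, by_decade, "salary") == group_sum(released, by_decade, "salary"))
```

Queries whose ranges cut across a group are answered only approximately, which is where the privacy and accuracy tradeoff mentioned in the abstract comes in.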

Microdata (statistics)

Data anonymization

k-Anonymity

Information loss

Citations (0)

Data Anonymization Approach for Incomplete Microdata

Journal of Software (2014)

Qiyuan Gong Ming Yang Junzhou Luo

To protect privacy against linking attacks, the quasi-identifier attributes of microdata should be anonymized in privacy-preserving data publishing. Although many algorithms have been proposed in this area, they assume complete data, and few can handle incomplete microdata. Most existing algorithms simply delete records with missing values, which causes large information loss and makes the characteristics of the data differ before and after anonymization. This paper proposes a data anonymization approach for incomplete microdata called KAIM (k-anonymity for incomplete microdata), based on the k-anonymity model. KAIM keeps records with missing values and tries to assign records that are missing values on the same attribute to the same group for generalization. It uses the change in information entropy before and after generalizing a group as the distance measure, clusters the records with an improved k-member algorithm, and then generalizes the records within each group with a local recoding algorithm based on generalization hierarchies. Extensive experiments on real datasets show that KAIM incurs only 43.8% of the information loss of existing algorithms for incomplete microdata, which preserves the characteristics of the data across anonymization as far as possible.

Microdata (statistics)

Data anonymization

Information loss

k-Anonymity

Data publishing

Record Linkage

Unique identifier

10.3724/sp.j.1001.2013.04411

Citations (4)

Utility enhanced anonymization for incomplete microdata

2016 IEEE 20th International Conference on Computer Supported Cooperative Work in Design (CSCWD) (2016)

Qiyuan Gong Ming Yang Zhouguo Chen Junzhou Luo

Although a variety of anonymization approaches have been proposed to achieve anonymity during data sharing, few of them can handle incomplete microdata, i.e., microdata with missing values. Directly applying existing approaches to incomplete microdata incurs extensive information loss because of the missing values. In this paper, we formulate this problem as missing value pollution and analyze its influence on generalization-based algorithms. We then propose two top-down algorithms, Enhanced Mondrian and Semi-Partition, which achieve high data utility on incomplete microdata. Extensive experiments on real-world data show the effectiveness of our approach.

Microdata (statistics)

Data anonymization

Mondrian

Information loss

10.1109/cscwd.2016.7565966

Citations (6)

An Extensive Study on Data Anonymization Algorithms Based on K-Anonymity

IOP Conference Series Materials Science and Engineering (2017)

M S Simi K. Sankara Nayaki M. Sudheep Elayidom

For business and research work involving data analysis and cloud services that need high-quality data, many organizations release large volumes of microdata. Such data exclude an individual's explicit identifiers, such as name and address, but include specific information such as date of birth, PIN code, sex, and marital status, which can be combined with other public data to recognize a person. This implication attack can be exploited to acquire sensitive information from social network platforms, putting a person's privacy in grave danger. K-anonymization modifies the microdata to prevent such attacks. As data volumes grow, anonymizing them effectively remains challenging. After a series of trials and systematic comparison, in this paper we present the three best algorithms along with their efficiency and effectiveness. The study helps researchers identify the relationship between the value of k, the degree of anonymization, and the choice of quasi-identifier, with a focus on execution time.

Microdata (statistics)

Data anonymization

k-Anonymity

Information loss

Information sensitivity

10.1088/1757-899x/225/1/012279

Citations (17)

A framework for efficient data anonymization under privacy and accuracy constraints

ACM Transactions on Database Systems (2009)

Gabriel Ghinita Panagiotis Karras Panos Kalnis Nikos Mamoulis

Recent research studied the problem of publishing microdata without revealing sensitive information, leading to the privacy-preserving paradigms of k-anonymity and l-diversity. k-anonymity protects against the identification of an individual's record. l-diversity, in addition, safeguards against the association of an individual with specific sensitive information. However, existing approaches suffer from at least one of the following drawbacks: (i) l-diversification is solved by techniques developed for the simpler k-anonymization problem, causing unnecessary information loss. (ii) The anonymization process is inefficient in terms of computational and I/O cost. (iii) Previous research focused exclusively on the privacy-constrained problem and ignored the equally important accuracy-constrained (or dual) anonymization problem. In this article, we propose a framework for efficient anonymization of microdata that addresses these deficiencies. First, we focus on one-dimensional (i.e., single-attribute) quasi-identifiers, and study the properties of optimal solutions under the k-anonymity and l-diversity models for the privacy-constrained (i.e., direct) and the accuracy-constrained (i.e., dual) anonymization problems. Guided by these properties, we develop efficient heuristics to solve the one-dimensional problems in linear time. Finally, we generalize our solutions to multidimensional quasi-identifiers using space-mapping techniques. Extensive experimental evaluation shows that our techniques clearly outperform the existing approaches in terms of execution time and information loss.

k-Anonymity

Data anonymization

Microdata (statistics)

Information loss

Heuristics

Data publishing

10.1145/1538909.1538911

Citations (80)

Fast data anonymization with low information loss

Gabriel Ghinita Panagiotis Karras Panos Kalnis Nikos Mamoulis

Recent research studied the problem of publishing microdata without revealing sensitive information, leading to the privacy preserving paradigms of k-anonymity and l-diversity. k-anonymity protects against the identification of an individual's record. l-diversity, in addition, safeguards against the association of an individual with specific sensitive information. However, existing approaches suffer from at least one of the following drawbacks: (i) The information loss metrics are counter-intuitive and fail to capture data inaccuracies inflicted for the sake of privacy. (ii) l-diversity is solved by techniques developed for the simpler k-anonymity problem, which introduces unnecessary inaccuracies. (iii) The anonymization process is inefficient in terms of computation and I/O cost. In this paper we propose a framework for efficient privacy preservation that addresses these deficiencies. First, we focus on one-dimensional (i.e., single attribute) quasi-identifiers, and study the properties of optimal solutions for k-anonymity and l-diversity, based on meaningful information loss metrics. Guided by these properties, we develop efficient heuristics to solve the one-dimensional problems in linear time. Finally, we generalize our solutions to multi-dimensional quasi-identifiers using space-mapping techniques. Extensive experimental evaluation shows that our techniques clearly outperform the state-of-the-art, in terms of execution time and information loss.

k-Anonymity

Data anonymization

Information loss

Heuristics

Microdata (statistics)

Data publishing

Information sensitivity

Identification

Citations (288)

Efficient systematic clustering method for k-anonymization

Acta Informatica (2011)

Enamul Kabir Hua Wang Elisa Bertino

k-Anonymity

Information loss

Data anonymization

10.1007/s00236-010-0131-6

Citations (77)

GPU Algorithms for K-Anonymity in Microdata

Roberto Di Pietro Leonardo Jero Flavio Lombardi Agustí Solanas

GPU computing, nowadays widely and readily available on the cloud, has opened up novel opportunities for the parallelization of computationally intensive tasks, such as data anonymization. The development of effective techniques that help to guarantee data anonymity is a critical enabler for data sharing activities, as well as for enforcing compliance (consider the European GDPR). In this scenario, we focus on personal data stored in microdata sets. Before releasing such microdata to the general public, statistical agencies and similar bodies have to sanitize them by using a variety of Microdata Protection Techniques (MPTs) that aim at keeping data utility while preserving some kind of anonymity. In particular, microaggregation is a specific MPT that arose in the field of statistical disclosure control. We analyze the microaggregation anonymization issues and propose three GPU-based parallel approaches for a well-known microaggregation technique: the Maximum Distance to Average Vector (MDAV) algorithm. The experimental results demonstrate the feasibility of our proposal and emphasize the benefits of using GPUs to speed up the execution of privacy-preserving algorithms for microdata.
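For orientation, here is a minimal sequential sketch of MDAV-style microaggregation in plain Python. It follows the commonly described form of the algorithm (repeatedly group the record farthest from the centroid, and the record farthest from that one, each with its nearest neighbours) and is not the GPU implementation proposed in the paper. The function names `mdav_groups` and `microaggregate` are hypothetical, and squared Euclidean distance on numeric tuples is an assumption.

```python
def _centroid(points):
    """Component-wise mean of a non-empty list of numeric tuples."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def _sqdist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mdav_groups(points, k):
    """Partition the indices of `points` into groups of at least k records."""
    if k < 1 or len(points) < k:
        raise ValueError("need k >= 1 and at least k points")
    remaining = list(range(len(points)))
    groups = []

    def take_group(seed_idx):
        # The k remaining records closest to the seed record form one group.
        seed = points[seed_idx]
        nearest = sorted(remaining, key=lambda i: _sqdist(points[i], seed))[:k]
        for i in nearest:
            remaining.remove(i)
        groups.append(nearest)

    def farthest_from_centroid():
        c = _centroid([points[i] for i in remaining])
        return max(remaining, key=lambda i: _sqdist(points[i], c))

    # Main loop: build two groups per iteration while enough records remain.
    while len(remaining) >= 3 * k:
        r = farthest_from_centroid()
        s = max(remaining, key=lambda i: _sqdist(points[i], points[r]))
        take_group(r)
        take_group(s)

    # Between 2k and 3k-1 records left: one more group around the farthest record.
    if len(remaining) >= 2 * k:
        take_group(farthest_from_centroid())

    # Whatever is left (between k and 2k-1 records) becomes the last group.
    if remaining:
        groups.append(list(remaining))
    return groups


def microaggregate(points, k):
    """Replace every record by the centroid of its MDAV group."""
    out = [None] * len(points)
    for group in mdav_groups(points, k):
        c = _centroid([points[i] for i in group])
        for i in group:
            out[i] = c
    return out
```

Each iteration is dominated by distance computations over all remaining records, which is the part that lends itself to parallel execution.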

Microdata (statistics)

k-Anonymity

Data anonymization

Data Sharing

10.1109/cns.2019.8802735

Citations (0)

Data Anonymization Approach for Incomplete Microdata

Journal of Software (2013)

Qi Gong

To protect privacy against linking attacks, the quasi-identifier attributes of microdata should be anonymized in privacy-preserving data publishing. Although many algorithms have been proposed in this area, few of them can handle incomplete microdata. Most existing algorithms simply delete records with missing values, causing large information loss. This paper proposes a novel data anonymization approach for incomplete microdata called KAIM (k-anonymity for incomplete microdata), based on the k-member algorithm and an information entropy distance. Instead of deleting any records, KAIM clusters records with similar characteristics together to minimize information loss, and then generalizes all records with a local recoding scheme. Results of extensive experiments based on real datasets show that KAIM incurs only 43.8% of the information loss of previous algorithms for incomplete microdata, indicating that KAIM preserves the utility of the anonymized dataset much better than existing algorithms.

Microdata (statistics)

Data anonymization

Information loss

k-Anonymity

Data publishing

Record Linkage

Unique identifier

Citations (2)

A Hybrid Method for k-Anonymization

Jun-Lin Lin Meng-Cheng Wei Chih-Wen Li Kuo-Chiang Hsieh

K-anonymity is a model to protect publicly released microdata from individual identification. It requires that each record be identical to at least $k-1$ other records in the anonymized dataset with respect to a set of privacy-related attributes. Although it is easy to anonymize the original dataset to satisfy the requirement of $k$-anonymity, it is important to ensure that the anonymized dataset preserves as much of the information in the original dataset as possible. To minimize the information loss due to anonymization, it is crucial to group similar data together and then anonymize each group individually. This work compares the performance of two recently proposed clustering-based techniques for $k$-anonymization, and proposes a hybrid of both techniques that achieves less information loss than either of the original techniques. Experimental results show that the proposed hybrid technique reduces not only the total information loss but also the variance of information loss among groups.
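The requirement stated above (each record matches at least $k-1$ others on the privacy-related attributes) can be checked directly by counting. The sketch below is illustrative only; the function name `is_k_anonymous` and the sample attribute names are assumptions, not part of the paper.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values that occurs in
    `records` is shared by at least k records."""
    counts = Counter(
        tuple(rec[attr] for attr in quasi_identifiers) for rec in records
    )
    return all(count >= k for count in counts.values())


if __name__ == "__main__":
    # Hypothetical generalized table: age ranges and truncated postal codes.
    table = [
        {"age": "20-29", "zip": "130**", "disease": "flu"},
        {"age": "20-29", "zip": "130**", "disease": "cold"},
        {"age": "30-39", "zip": "148**", "disease": "flu"},
        {"age": "30-39", "zip": "148**", "disease": "asthma"},
    ]
    print(is_k_anonymous(table, ["age", "zip"], 2))  # True: each group has 2 records
    print(is_k_anonymous(table, ["age", "zip"], 3))  # False: no group has 3 records
```

The check says nothing about information loss; many different groupings pass it, which is why the choice of grouping matters.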

Information loss

k-Anonymity

Microdata (statistics)

Data anonymization

Data set

Identification

10.1109/apscc.2008.65

Citations (15)