    On-the-Fly Hierarchies for Numerical Attributes in Data Anonymization
    Citations (4)
    References (20)
    Related Papers (10)
    Keywords:
    Microdata (statistics)
    k-Anonymity
    Data anonymization
    Closeness
    On the fly
    Hierarchical clustering
    Information loss
    Data analysts often prefer access to data in the form of original tuples (i.e., microdata) rather than pre-aggregated statistics, since microdata offer advantages in information flexibility and availability. Two problems should be addressed before releasing microdata. First, individuals' privacy needs to be adequately protected; in general, the data are anonymized before sharing. Second, the utility of the anonymized microdata should be maintained, so that common aggregate queries can be answered with reasonable accuracy. Most existing work on microdata anonymization is based on attribute generalization. Though popular, these approaches have a limitation: generalizing attributes makes it difficult to answer typical aggregate queries with reasonable accuracy. This dissertation investigates new techniques to address the limitations of existing approaches. We propose to anonymize microdata through permutation-based approaches. In particular, we first extend existing privacy goals to better fit the protection requirements of numerical data, and develop a scheme that achieves this privacy goal through sensitive attribute permutation. Second, we propose a stronger privacy goal under which an attacker can learn from the microdata only that an individual's sensitive attribute follows a pre-specified target distribution, and nothing more. We combine sensitive attribute permutation and generalization techniques to achieve this goal. To get better query answers when the target distribution is far from that of the original microdata, we further provide mechanisms that let users control the tradeoff between privacy and accuracy. Third, we extend our techniques to anonymize graph data and support the accurate answering of queries that involve graph properties. Specifically, we partition the nodes and relabel (a form of permutation) the nodes within the same partition. Finally, we study anonymization techniques that support personalized privacy, which allows individuals to flexibly control the privacy protection they desire.
    Microdata (statistics)
    Data anonymization
    k-Anonymity
    Information loss
    Citations (0)
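The abstract above does not spell out its permutation scheme, so the following is only a minimal sketch of the general idea, under stated assumptions: records are partitioned by a grouping attribute, and the sensitive values are shuffled among the records of each group. The function names (`permute_sensitive`, `group_sum`), the field names, and the sample data are hypothetical. The point it illustrates is that per-group aggregates such as sums are unchanged by the permutation, whereas generalization would blur them.

```python
import random
from collections import defaultdict


def permute_sensitive(records, group_key, sensitive, seed=None):
    """Return a copy of records with the sensitive attribute shuffled
    among records that share the same group_key value."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for idx, rec in enumerate(records):
        groups[rec[group_key]].append(idx)

    released = [dict(rec) for rec in records]  # do not mutate the input
    for indices in groups.values():
        values = [records[i][sensitive] for i in indices]
        rng.shuffle(values)
        for i, value in zip(indices, values):
            released[i][sensitive] = value
    return released


def group_sum(records, group_key, sensitive):
    """Sum of the sensitive attribute per group (a typical aggregate query)."""
    totals = defaultdict(int)
    for rec in records:
        totals[rec[group_key]] += rec[sensitive]
    return dict(totals)


# Hypothetical sample data, for illustration only.
records = [
    {"age_group": "20-29", "salary": 30000},
    {"age_group": "20-29", "salary": 42000},
    {"age_group": "20-29", "salary": 51000},
    {"age_group": "30-39", "salary": 58000},
    {"age_group": "30-39", "salary": 64000},
    {"age_group": "30-39", "salary": 90000},
]
released = permute_sensitive(records, "age_group", "salary", seed=7)
```

Comparing `group_sum(records, "age_group", "salary")` with the same call on `released` gives identical totals, since only the assignment of values to individuals inside a group changes.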
    To protect privacy against linking attacks, the quasi-identifier attributes of microdata should be anonymized in privacy-preserving data publishing. Although many algorithms have been proposed in this area, they assume complete data, and few can handle incomplete microdata: most simply delete records with missing values, which causes large information loss and leaves the characteristics of the data inconsistent before and after anonymization. This paper proposes KAIM (k-anonymity for incomplete microdata), a data anonymization approach for incomplete microdata based on the k-anonymity model. Instead of deleting records, KAIM keeps those with missing values and tries to assign records that are missing the same attribute to the same group for generalization. It uses the change in information entropy before and after a group is generalized as the distance measure, clusters the records with an improved k-member algorithm, and then generalizes each group with a local recoding algorithm based on generalization hierarchies. Extensive experiments on real data show that KAIM incurs only 43.8% of the information loss of existing algorithms for incomplete microdata, so it preserves the utility of the anonymized dataset much better than those algorithms.
    Microdata (statistics)
    Data anonymization
    Information loss
    k-Anonymity
    Data publishing
    Record Linkage
    Unique identifier
    Although a variety of anonymization approaches have been proposed to achieve anonymity during data sharing, few of them can handle incomplete microdata, i.e., microdata with missing values. Directly applying existing approaches to incomplete microdata incurs extensive information loss because of the missing values. In this paper, we formulate this problem as missing value pollution and analyze its influence on generalization-based algorithms. We then propose two top-down algorithms, Enhanced Mondrian and Semi-Partition, which achieve high data utility on incomplete microdata. Extensive experiments on real-world data show the effectiveness of our approach.
    Microdata (statistics)
    Data anonymization
    Mondrian
    Information loss
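For orientation, here is a minimal sketch of the basic top-down, Mondrian-style partitioning that this entry builds on. It is not Enhanced Mondrian or Semi-Partition, whose handling of missing values the abstract does not describe. The sketch assumes complete, numeric quasi-identifiers, uses the raw (unnormalized) value range to pick the split attribute, and splits strictly at the median; the function and field names are hypothetical.

```python
def mondrian_partition(records, qi_attrs, k):
    """Recursively split records at the median of the widest-range attribute,
    as long as both halves keep at least k records."""
    def value_range(attr):
        vals = [r[attr] for r in records]
        return max(vals) - min(vals)

    # Try attributes from the widest range to the narrowest.
    for attr in sorted(qi_attrs, key=value_range, reverse=True):
        ordered = sorted(records, key=lambda r: r[attr])
        median = ordered[len(ordered) // 2][attr]
        left = [r for r in ordered if r[attr] < median]
        right = [r for r in ordered if r[attr] >= median]
        if len(left) >= k and len(right) >= k:
            return (mondrian_partition(left, qi_attrs, k)
                    + mondrian_partition(right, qi_attrs, k))
    return [records]  # no allowable split: this is a final group


def generalize(group, qi_attrs):
    """Replace each quasi-identifier by the (min, max) interval of its group."""
    return {a: (min(r[a] for r in group), max(r[a] for r in group))
            for a in qi_attrs}


# Hypothetical sample data, for illustration only.
people = [{"age": a, "zip": z} for a, z in [
    (25, 10001), (27, 10002), (31, 10001), (35, 10005),
    (42, 10003), (47, 10004), (51, 10002), (58, 10005),
]]
groups = mondrian_partition(people, ["age", "zip"], k=2)
intervals = [generalize(g, ["age", "zip"]) for g in groups]
```

Every resulting group holds at least `k` records, and each record would be released with the intervals of its group in place of the exact quasi-identifier values.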
    For business and research work involving data analysis and cloud services that need high-quality data, many organizations release large volumes of microdata. Such data exclude an individual's explicit identifiers, such as name and address, but contain specific information, such as date of birth, PIN code, sex, and marital status, that can be combined with other public data to recognize a person. This implication attack can be exploited to acquire sensitive information from social network platforms, putting a person's privacy in grave danger. K-anonymization is used to prevent such attacks by modifying the microdata. As data volumes grow, anonymizing them effectively remains challenging. After a series of trials and a systematic comparison, this paper presents the three best algorithms along with their efficiency and effectiveness. The study helps researchers identify the relationship among the value of k, the degree of anonymization, and the choice of quasi-identifier, with a focus on execution time.
    Microdata (statistics)
    Data anonymization
    k-Anonymity
    Information loss
    Information sensitivity
    Recent research studied the problem of publishing microdata without revealing sensitive information, leading to the privacy-preserving paradigms of k-anonymity and l-diversity. k-anonymity protects against the identification of an individual's record. l-diversity, in addition, safeguards against the association of an individual with specific sensitive information. However, existing approaches suffer from at least one of the following drawbacks: (i) l-diversification is solved by techniques developed for the simpler k-anonymization problem, causing unnecessary information loss. (ii) The anonymization process is inefficient in terms of computational and I/O cost. (iii) Previous research focused exclusively on the privacy-constrained problem and ignored the equally important accuracy-constrained (or dual) anonymization problem. In this article, we propose a framework for efficient anonymization of microdata that addresses these deficiencies. First, we focus on one-dimensional (i.e., single-attribute) quasi-identifiers, and study the properties of optimal solutions under the k-anonymity and l-diversity models for the privacy-constrained (i.e., direct) and the accuracy-constrained (i.e., dual) anonymization problems. Guided by these properties, we develop efficient heuristics to solve the one-dimensional problems in linear time. Finally, we generalize our solutions to multidimensional quasi-identifiers using space-mapping techniques. Extensive experimental evaluation shows that our techniques clearly outperform the existing approaches in terms of execution time and information loss.
    k-Anonymity
    Data anonymization
    Microdata (statistics)
    Information loss
    Heuristics
    Data publishing
    Citations (80)
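To make the one-dimensional setting concrete, the sketch below groups a single numeric quasi-identifier into k-anonymous intervals. It is a naive baseline and not the paper's heuristic: it assumes that groups are formed from consecutive values in sorted order, cuts the sorted list into chunks of `k`, and merges a short final chunk into its predecessor. The function name `group_sorted` and the sample ages are hypothetical.

```python
def group_sorted(values, k):
    """Group a single numeric quasi-identifier into intervals of >= k values.

    Returns a list of (low, high, count) tuples, one per group.
    """
    if k < 1 or len(values) < k:
        raise ValueError("need k >= 1 and at least k values")
    ordered = sorted(values)
    chunks = [ordered[i:i + k] for i in range(0, len(ordered), k)]
    if len(chunks[-1]) < k:
        # The last chunk is too small: merge it into the previous one.
        tail = chunks.pop()
        chunks[-1].extend(tail)
    return [(chunk[0], chunk[-1], len(chunk)) for chunk in chunks]


# Hypothetical sample data, for illustration only.
ages = [41, 23, 35, 52, 25, 34, 31]
intervals = group_sorted(ages, k=3)
# intervals == [(23, 31, 3), (34, 52, 4)]
```

Apart from the initial sort, the grouping pass is linear in the number of values. The width of each interval gives a rough sense of the information lost by publishing the interval instead of the exact value.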
    Recent research studied the problem of publishing microdata without revealing sensitive information, leading to the privacy preserving paradigms of k-anonymity and l-diversity. k-anonymity protects against the identification of an individual's record. l-diversity, in addition, safeguards against the association of an individual with specific sensitive information. However, existing approaches suffer from at least one of the following drawbacks: (i) The information loss metrics are counter-intuitive and fail to capture data inaccuracies inflicted for the sake of privacy. (ii) l-diversity is solved by techniques developed for the simpler k-anonymity problem, which introduces unnecessary inaccuracies. (iii) The anonymization process is inefficient in terms of computation and I/O cost. In this paper we propose a framework for efficient privacy preservation that addresses these deficiencies. First, we focus on one-dimensional (i.e., single attribute) quasi-identifiers, and study the properties of optimal solutions for k-anonymity and l-diversity, based on meaningful information loss metrics. Guided by these properties, we develop efficient heuristics to solve the one-dimensional problems in linear time. Finally, we generalize our solutions to multi-dimensional quasi-identifiers using space-mapping techniques. Extensive experimental evaluation shows that our techniques clearly outperform the state-of-the-art, in terms of execution time and information loss.
    k-Anonymity
    Data anonymization
    Information loss
    Heuristics
    Microdata (statistics)
    Data publishing
    Information sensitivity
    Identification
    Citations (288)
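The difference between the two privacy models can be illustrated with a small check. The sketch below tests distinct l-diversity, the simplest variant: every group of records that share the same quasi-identifier values must contain at least `l` different sensitive values. Which l-diversity variant the paper targets is not stated in the abstract, so this choice is an assumption, and the function and field names are hypothetical.

```python
from collections import defaultdict


def is_distinct_l_diverse(rows, qi_attrs, sensitive, l):
    """True if every equivalence class (rows sharing all quasi-identifier
    values) contains at least l distinct sensitive values."""
    classes = defaultdict(set)
    for row in rows:
        key = tuple(row[a] for a in qi_attrs)
        classes[key].add(row[sensitive])
    return all(len(values) >= l for values in classes.values())


# Hypothetical sample data: both classes have two records (2-anonymous),
# but everyone in the first class has the same disease.
table = [
    {"age": "20-29", "zip": "100**", "disease": "flu"},
    {"age": "20-29", "zip": "100**", "disease": "flu"},
    {"age": "30-39", "zip": "100**", "disease": "flu"},
    {"age": "30-39", "zip": "100**", "disease": "asthma"},
]
diverse_1 = is_distinct_l_diverse(table, ["age", "zip"], "disease", l=1)
diverse_2 = is_distinct_l_diverse(table, ["age", "zip"], "disease", l=2)
```

Here `diverse_2` is false even though each class has two records: an attacker who knows that a person falls in the first class learns the disease without identifying the exact record.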
    GPU computing, nowadays widely and readily available on the cloud, has opened up novel opportunities for the parallelization of computationally intensive tasks such as data anonymization. The development of effective techniques that help to guarantee data anonymity is a critical enabler for data sharing activities, as well as for enforcing compliance (consider the European GDPR). In this scenario, we focus on personal data stored in microdata sets. Before releasing such microdata to the general public, statistical agencies and the like have to sanitize them by using a variety of Microdata Protection Techniques (MPTs) that aim at keeping data utility while preserving some kind of anonymity. In particular, microaggregation is a specific MPT that arose in the field of statistical disclosure control. We analyze the microaggregation anonymization issues and propose three GPU-based parallel approaches for a well-known microaggregation technique: the Maximum Distance to Average Vector (MDAV) algorithm. The experimental results demonstrate the feasibility of our proposal and emphasize the benefits of using GPUs to speed up the execution of privacy-preserving algorithms for microdata.
    Microdata (statistics)
    k-Anonymity
    Data anonymization
    Data Sharing
    Citations (0)
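As a point of reference for what is being parallelized, the following is a sequential, pure-Python sketch of MDAV as it is commonly described. It is not one of the three GPU approaches from the paper, and the helper names and sample points are hypothetical. While at least 3k records remain, it takes the record farthest from the centroid and the record farthest from that one, and builds a group of k records around each; the leftovers form one or two final groups.

```python
import math


def _centroid(points):
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def _farthest(points, ref):
    return max(points, key=lambda p: math.dist(p, ref))


def _take_group(remaining, seed, k):
    """Remove and return the k records of `remaining` nearest to `seed`."""
    remaining.sort(key=lambda p: math.dist(p, seed))
    group = remaining[:k]
    del remaining[:k]
    return group


def mdav(points, k):
    """Partition points into groups of at least k records each."""
    if k < 1 or len(points) < k:
        raise ValueError("need k >= 1 and at least k points")
    remaining = [tuple(p) for p in points]
    groups = []
    while len(remaining) >= 3 * k:
        r = _farthest(remaining, _centroid(remaining))
        s = _farthest(remaining, r)
        groups.append(_take_group(remaining, r, k))
        groups.append(_take_group(remaining, s, k))
    if len(remaining) >= 2 * k:
        r = _farthest(remaining, _centroid(remaining))
        groups.append(_take_group(remaining, r, k))
    groups.append(remaining)  # between k and 2k-1 records are left
    return groups


def microaggregate(points, k):
    """Replace each group by its centroid, returning (centroid, size) pairs."""
    return [(_centroid(g), len(g)) for g in mdav(points, k)]


# Hypothetical sample data, for illustration only.
data = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9), (5, 5), (4, 6)]
clusters = mdav(data, k=3)
```

The repeated distance computations over all remaining records dominate the running time, which is the kind of workload that lends itself to parallel execution.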
    K-anonymity is a model for protecting publicly released microdata from individual identification. It requires that each record be identical to at least $k-1$ other records in the anonymized dataset with respect to a set of privacy-related attributes. Although it is easy to anonymize the original dataset so that it satisfies $k$-anonymity, it is important that the anonymized dataset preserve as much information from the original dataset as possible. To minimize the information loss due to anonymization, it is crucial to group similar data together and then anonymize each group individually. This work compares the performance of two recently proposed clustering-based techniques for k-anonymization, and proposes a hybrid of both techniques that achieves less information loss than either of the original techniques. Experimental results show that the proposed hybrid technique reduces not only the total information loss but also the variance of information loss among groups.
    Information loss
    k-Anonymity
    Microdata (statistics)
    Data anonymization
    Data set
    Identification
    Citations (15)
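The k-anonymity requirement stated above can be checked directly: count how many records share each combination of privacy-related (quasi-identifier) values and verify that every count is at least k. The sketch below is a minimal illustration; the function name, field names, and sample table are hypothetical.

```python
from collections import Counter


def is_k_anonymous(rows, qi_attrs, k):
    """True if every combination of quasi-identifier values is shared by
    at least k records (each record matches at least k-1 others)."""
    counts = Counter(tuple(row[a] for a in qi_attrs) for row in rows)
    return all(count >= k for count in counts.values())


# Hypothetical sample data, for illustration only.
released_rows = [
    {"age": "20-29", "zip": "100**"},
    {"age": "20-29", "zip": "100**"},
    {"age": "30-39", "zip": "100**"},
    {"age": "30-39", "zip": "100**"},
    {"age": "30-39", "zip": "100**"},
]
ok_for_2 = is_k_anonymous(released_rows, ["age", "zip"], k=2)
ok_for_3 = is_k_anonymous(released_rows, ["age", "zip"], k=3)
```

In this table the smallest group has two records, so `ok_for_2` is true and `ok_for_3` is false.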