Disclaimer: These notes are taken for the CS577 Data Privacy course offered by Dr. Erman Ayday in the 2021/2022 Fall Semester at Bilkent University.
k-Anonymity
Anonymization is required whenever one party shares a data set with another in order to protect the privacy of the individuals whose sensitive information it contains. When data holders release a data set, they often remove identifying attributes such as name, ID, sex, age, and address. However, simply removing some or all of these attributes does not ensure privacy, since attackers can intersect multiple data sets to de-anonymize the records. An attacker can also collect partial information through multiple queries and combine the results with statistics and heuristics to reveal sensitive data. Therefore, a more effective anonymization model, k-anonymity, was proposed by Sweeney [1]. The model is based on limiting the ability of other parties to link information through quasi-identifiers.
Quasi-identifiers are attributes that do not explicitly single out an individual by themselves, unlike a name or social security number, but whose combination can uniquely identify a person. Birthdate and gender are good examples of quasi-identifiers. The k-anonymity model masks the quasi-identifiers to varying degrees so that every individual in the data set becomes indistinguishable from at least k-1 other individuals whose information is also included in the data set.
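To make the definition concrete, here is a minimal Python sketch that checks whether a table satisfies k-anonymity by counting the size of each equivalence class (each combination of quasi-identifier values). The toy table, column names, and the is_k_anonymous helper are illustrative choices, not part of [1]:

```python
import pandas as pd

# Toy released table: "ZIP" and "Age" are the (already generalized)
# quasi-identifiers, "Disease" is the sensitive attribute. All names
# and values here are illustrative.
df = pd.DataFrame({
    "ZIP":     ["130**", "130**", "148**", "148**", "148**"],
    "Age":     ["<30",   "<30",    "3*",   "3*",    "3*"],
    "Disease": ["Flu",   "Cancer", "Flu",  "Flu",   "Cancer"],
})

def is_k_anonymous(df, quasi_ids, k):
    # A table is k-anonymous iff every equivalence class (every
    # combination of quasi-identifier values) contains >= k records.
    return bool((df.groupby(quasi_ids).size() >= k).all())

print(is_k_anonymous(df, ["ZIP", "Age"], k=2))  # True: both classes have >= 2 rows
```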
It is important to note that k-anonymity is still vulnerable to a number of attacks:
Unsorted matching attack: An attacker can link records across releases if each release preserves the same record order as the original table. This weakness is easily corrected by randomly shuffling the order of the records before release.
Complementary release attack: When the same data set is subsequently released with different masking schemes, any attribute that may link one released version to another should also be treated as a quasi-identifier to prevent linking across versions.
Temporal inference attack: When records are added to or removed from the data set over time, k-anonymity might no longer be satisfied across releases. To prevent information leakage in this case, all attributes of earlier releases should be treated as quasi-identifiers in subsequent releases, as in the solution proposed for complementary release attacks.
l-Diversity
Machanavajjhala et al. [2] showed that an attacker can infer the value of a sensitive attribute for a record if the diversity of that attribute within an equivalence class is low (homogeneity attack). In addition, the attacker may possess background knowledge that reduces the privacy k-anonymity provides (background knowledge attack). To overcome these shortcomings, Machanavajjhala et al. proposed l-diversity to increase the privacy protection of released data sets. In l-diversity, more quasi-identifiers are masked or discarded entirely until each block of records sharing the same quasi-identifier values contains at least l different values of the sensitive attribute. In the case of multiple sensitive attributes, l-diversity must be satisfied for each of them; thus, as the number of sensitive attributes increases, more masking of the quasi-identifiers is required to achieve the required diversity.
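Reusing the toy table from the k-anonymity sketch above, this basic (distinct) form of l-diversity can be checked with one more illustrative helper:

```python
def is_l_diverse(df, quasi_ids, sensitive, l):
    # Distinct l-diversity: every equivalence class must contain at
    # least l different values of the sensitive attribute.
    return bool((df.groupby(quasi_ids)[sensitive].nunique() >= l).all())

print(is_l_diverse(df, ["ZIP", "Age"], "Disease", l=2))  # True: 2 diseases per class
```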
There are multiple versions of l-diversity; the most fundamental are entropy l-diversity and recursive (c,l)-diversity. Although each version formulates the diversity requirement on the sensitive attribute within each block differently, all of them aim to improve on the privacy protection of k-anonymity.
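As a sketch of one of these formulations: entropy l-diversity requires the entropy of the sensitive attribute's distribution in every block to be at least log(l). The illustrative check below, run on the same toy table, shows that it is stricter than the distinct version: the table is distinct 2-diverse, but the unbalanced Flu/Cancer split in one block has entropy below log 2, so entropy 2-diversity fails.

```python
import numpy as np

def is_entropy_l_diverse(df, quasi_ids, sensitive, l):
    # Entropy l-diversity: in every equivalence class, the entropy of
    # the sensitive-attribute distribution must be at least log(l).
    for _, block in df.groupby(quasi_ids):
        p = block[sensitive].value_counts(normalize=True)
        if -(p * np.log(p)).sum() < np.log(l):
            return False
    return True

print(is_entropy_l_diverse(df, ["ZIP", "Age"], "Disease", l=2))  # False
```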
t-Closeness
Both k-anonymity and l-diversity assume categorical attributes. However, in the case of numerical attributes, an attacker who learns only a narrow range around the true value may already learn enough (similarity attack). Observing that an attacker gains partial information about a block whenever the distribution of an attribute within the block differs from the global distribution of the same attribute and this difference is statistically significant, Li et al. [3] propose t-closeness as an improved model for privacy protection. The model requires that the distance between the global distribution of an attribute and its local distribution in each block be less than a threshold t. The distance between distributions is measured with the Earth Mover's Distance (EMD), as it does not require both distributions to be on the same probability space, i.e., it does not assume one distribution to be the prior of the other. The authors show that t-closeness is more robust against similarity attacks than recursive and entropy l-diversity without losing efficiency.
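Below is a minimal sketch of the t-closeness check for a numerical sensitive attribute, using scipy's one-dimensional Wasserstein distance as the EMD. Note that [3] normalizes the ordered distance so that t lies in [0, 1]; for simplicity, the raw distance is used here, so t is expressed in the attribute's own units. The salary table and threshold are illustrative.

```python
import pandas as pd
from scipy.stats import wasserstein_distance  # 1-D Earth Mover's Distance

def is_t_close(df, quasi_ids, sensitive, t):
    # t-closeness: the EMD between each block's distribution of the
    # sensitive attribute and the global distribution must be <= t.
    global_values = df[sensitive]
    return all(
        wasserstein_distance(block[sensitive], global_values) <= t
        for _, block in df.groupby(quasi_ids)
    )

salaries = pd.DataFrame({
    "ZIP":    ["476**"] * 3 + ["479**"] * 3,
    "Salary": [3000, 4000, 5000, 6000, 8000, 11000],
})
# True: each block's EMD from the global salary distribution is about 2167
print(is_t_close(salaries, ["ZIP"], "Salary", t=2500))
```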
[1] L. Sweeney, “k-anonymity: A model for protecting privacy,” International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 10, no. 5, pp. 557–570, 2002.
[2] A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam, “l-diversity: Privacy beyond k-anonymity,” ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 1, no. 1, pp. 3–es, 2007.
[3] N. Li, T. Li, and S. Venkatasubramanian, “t-closeness: Privacy beyond k-anonymity and l-diversity,” in 2007 IEEE 23rd International Conference on Data Engineering, IEEE, 2007, pp. 106–115.