Disclaimer: These notes are taken for the CS577 Data Privacy course offered by Dr. Erman Ayday in the 2021/2022 Fall Semester at Bilkent University.
Revealing accurate statistics for a set of records while preserving the privacy of the individuals whose records form the set is known as statistical disclosure control in the literature. Most of the time, the statistics are calculated by joining or intersecting multiple data sets. Even if one uses only a single data set to calculate all the necessary statistics, the results become meaningful when they are considered together with external information. In such cases, preserving the privacy of individuals becomes tricky. Once a data set is publicly shared by authorities, control over which other sources of information it will be analyzed with is lost. Hence, it is impractical to decide the correct parameters for any of k-anonymity, l-diversity, or t-closeness and rely on them to ensure some level of privacy. Some models used for statistical disclosure control are given in Table 1. While each method has its own advantages, they all suffer from serious drawbacks.
Differential privacy (DP) is based on the idea of sharing the patterns of groups in a data set without revealing information about individuals. It gives a mathematical definition of the privacy loss incurred when data is released from statistical databases. Some advantages of DP can be listed as:
Quantification of privacy loss (trade-off between accuracy and privacy),
Composition (control of the cumulative loss),
Group privacy, and
Closure under post-processing.
The formal definition of DP given in [1] is as follows. Let ε be a positive real number and K a randomized function. Also, let D1 and D2 be databases that differ in at most one record. K is said to provide ε-differential privacy if Eq. 1 is satisfied for every S ⊆ Range(K). A relaxed formulation is given by Eq. 2, where an additive term δ is also included. The choice of ε and δ depends on the application. Smaller ε values imply stronger privacy. Some popular values for ε in the literature are 0.01, 0.1, ln(2), and ln(3). The aim, in terms of data privacy, is to find an appropriate K.
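Since Eq. 1 and Eq. 2 are referenced but not reproduced in these notes, their standard forms from the literature (reconstructed here with the notation above) are:

Pr[K(D1) ∈ S] ≤ exp(ε) · Pr[K(D2) ∈ S]    (Eq. 1)

Pr[K(D1) ∈ S] ≤ exp(ε) · Pr[K(D2) ∈ S] + δ    (Eq. 2)

Intuitively, Eq. 1 says that adding or removing a single record changes the probability of any set of outputs of K by at most a multiplicative factor of exp(ε), while Eq. 2 additionally tolerates a small additive failure probability δ.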
For a query function f, the sensitivity of f is given by Eq. 3. The technique works best when the sensitivity of f is small. Sensitivity measures how much the value of f can change when a single record is added or removed, and therefore how much additive noise K must introduce to mask that change. Hence, the noise needed to ensure differential privacy depends on the sensitivity of the function and on the parameters ε and δ. In practice, one tries to design K around a small number of insensitive (low-sensitivity) queries. Note that many standard data mining tasks, such as principal component analysis, perceptron training, decision tree generation, and half-space learning, can be expressed in terms of such insensitive queries.
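For concreteness, the sensitivity in Eq. 3 is typically defined (reconstructed here from the standard definition) as

Δf = max over databases D1, D2 differing in one record of ‖f(D1) − f(D2)‖₁    (Eq. 3)

so, for example, a counting query such as “how many records satisfy property P?” has sensitivity 1, since adding or removing one record changes the count by at most one.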
A well-known K is the Laplace mechanism, developed by Dwork et al. [3]. The method is named after the distribution from which the additive noise is drawn, given in Eq. 4. The power of this mechanism lies in its computational simplicity. The formal definition of the mechanism is given in Eq. 5. When λ = 1/ε, the density at a point μ is proportional to exp(-ε|μ|), and this distribution has its highest density at 0, which is good for accuracy.
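A minimal Python sketch of the Laplace mechanism may help; the counting query and function names below are illustrative assumptions, not taken from the original sources:

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon):
    """Release true_value plus Laplace noise with scale lambda = sensitivity / epsilon."""
    scale = sensitivity / epsilon
    return true_value + np.random.laplace(loc=0.0, scale=scale)

# Example: a counting query has sensitivity 1, since one record changes the count by at most 1.
ages = np.array([23, 35, 47, 52, 61, 29, 44])
true_count = int(np.sum(ages > 40))            # exact answer: 4
private_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.1)
print(true_count, private_count)
```

With λ = Δf/ε, releasing f(X) plus this noise satisfies ε-differential privacy for the query f.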
However, for some queries, such as those mapping databases to strings or strategies, adding noise to the output is not meaningful. In such cases we may need the exponential mechanism, a technique proposed by McSherry and Talwar [4] in 2007 to design differentially private algorithms. Let u(X, y) be a utility function that measures the quality of output y for a given database X, and let Δu denote its sensitivity. The exponential mechanism outputs y with probability proportional to exp(ε u(X, y) / 2), which ensures εΔu-differential privacy, i.e., ε-differential privacy whenever Δu ≤ 1.
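A minimal sketch of the exponential mechanism in Python, assuming a finite candidate set and a caller-supplied utility function (the grade-picking example is an illustrative assumption):

```python
import numpy as np

def exponential_mechanism(database, candidates, utility, sensitivity, epsilon):
    """Sample y with probability proportional to exp(epsilon * u(X, y) / (2 * sensitivity))."""
    scores = np.array([utility(database, y) for y in candidates], dtype=float)
    # Shift scores by their maximum before exponentiating for numerical stability.
    weights = np.exp(epsilon * (scores - scores.max()) / (2.0 * sensitivity))
    probabilities = weights / weights.sum()
    return candidates[np.random.choice(len(candidates), p=probabilities)]

# Example: privately select the most common grade; utility = count of that grade (sensitivity 1).
grades = ["A", "B", "B", "C", "B", "A"]
candidates = ["A", "B", "C", "D"]
count_utility = lambda db, y: db.count(y)
print(exponential_mechanism(grades, candidates, count_utility, sensitivity=1.0, epsilon=0.5))
```

Note that when Δu = 1, the exponent reduces to ε u(X, y)/2, matching the expression above.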
Recent studies try to integrate differential privacy into deep learning applications. [4] proposes a “private stochastic gradient descent (PSGD)” algorithm to train neural networks. Briefly, this method clips each example’s gradient and adds noise to the clipped gradients during training, while a mechanism called the moments accountant keeps track of the cumulative privacy loss over the course of training.
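A highly simplified Python sketch of this idea (per-example gradient clipping followed by Gaussian noise) is given below. It is a pedagogical illustration under assumed placeholder values for the clipping norm and noise multiplier, not the authors’ implementation, and it omits the moments-accountant bookkeeping that tracks the cumulative privacy loss:

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, lr=0.1, clip_norm=1.0, noise_multiplier=1.1):
    """One private SGD step: clip each example's gradient, sum, add Gaussian noise, average, update."""
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))   # enforce L2 norm <= clip_norm
    noisy_mean = (np.sum(clipped, axis=0)
                  + np.random.normal(0.0, noise_multiplier * clip_norm, size=params.shape)
                  ) / len(per_example_grads)
    return params - lr * noisy_mean

# Toy usage with random "gradients" for a 3-parameter model.
params = np.zeros(3)
batch_grads = [np.random.randn(3) for _ in range(8)]
print(dp_sgd_step(params, batch_grads))
```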
[1] C. Dwork, “A firm foundation for private data analysis,” Communications of the ACM, vol. 54, no. 1, pp. 86–95, 2011.
[2] C. Dwork, “Differential privacy: A survey of results,” in International Conference on Theory and Applications of Models of Computation, Springer, 2008, pp. 1–19.
[3] C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating noise to sensitivity in private data analysis,” in Theory of Cryptography Conference, Springer, 2006, pp. 265–284.
[4] F. McSherry and K. Talwar, “Mechanism design via differential privacy,” in 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS ’07), IEEE, 2007, pp. 94–103.