Disclaimer: These notes are taken for the CS577 Data Privacy course offered by Dr. Erman Ayday in the 2021/2022 Fall Semester at Bilkent University.
Private Information Retrieval (PIR)
In 2006, AOL Research publicly released more than 20 million searches. By that time, this was one of the biggest privacy violations as it contained personal identifiers of users such as credit card numbers and social security numbers. In addition to that, the data set presented some interesting search patterns by users. One example is the search query history of user17556639.
17556639 how to kill your wife
17556639 how to kill your wife
17556639 wife killer
17556639 how to kill a wife
17556639 poop
17556639 dead people
17556639 pictures of dead people
17556639 killed people
17556639 dead pictures
17556639 dead pictures
17556639 dead pictures
17556639 murder photo
17556639 steak and cheese
17556639 photo of death
17556639 photo of death
17556639 death
17556639 dead people photos
17556639 photo of dead people
17556639 www.murderdpeople.com
17556639 decapitated photos
17556639 decapitated photos
17556639 car crashes3
17556639 car crashes3
17556639 car crash photo
[1] points out that there is no way to tell if this user was involved in a criminal act or doing just research for a new TV show. What happens if the search engine or an attacker hacked into the engine's system alerts the police? Considering the fact that Google processes around 63000 queries per second, how many users per day could be reported to the authorities and lead to be a false positive? To prevent such a crisis, users may wish to hide their data access patterns from databases.
To make things simpler, let us consider a database of n documents. For a user that tries to access one of these documents, the brute-force approach to hide the search query would be each time accessing every single document on the database. For a really large n, like in the case of Amazon S3 Cloud servers, reaching every single document for each query would not be feasible or practical. Hence we need more advanced solutions. Private Information Retrieval (PIR) is a set of protocols that tries to enable a user to privately retrieve information stored in the database efficiently. We can divide PIR into three approaches: (i) information-theoretical (IT-PIR), (ii) computational (cPIR) and (iii) symmetric (sPIR).
Information Theoretical PIR (IT-PIR)
In IT-PIR, the servers cannot determine any information about your query even with unbounded computing power. Such a condition cannot be satisfied with a single server. Hence, for IT-PIR, we assume there exist non-colluding l servers each one of whom holds a copy of the database. The method proposed in [2] allows the user to obtain the desired information by asking queries to l=2^d databases for d ≥ 1. Assuming n=x^d, the user generates uniformly and independently chosen subsets, (S_1^{𝛔1}, S_2^{𝛔2}, ..., S_d^{𝛔d}), the users sends one subset per pair to the corresponding database. The database D_{𝛔1 𝛔2 ... 𝛔d} replies with the exclusive-or of these subsets. Finally, the user exclusive-ors the l = 2^d replies that it received. This scheme is called covering codes and the total communication is given by (dl+2^d-l) * n^{1/d} + l. [2] also improves this method by polynomial interpolation. A run time complexity comparison for both methods is given in Table 1.
A t-private l-server PIR is a PIR system in which the privacy of the query is information-theoretically protected, even if up to t < l of the l servers collude. If only k of the l servers need to respond and no coalition of up to t servers can learn any information about the query, they call such a system t-private k-out-of-l PIR. Going one step further, v of those k servers may return incorrect answers. Even with these incorrect answers, if the user can still obtain the correct answer then the system is called t-private v-Byzantine-robust k-out-of-l PIR. Previously it is shown that such a protocol exists for v ≤ t < k/3. [3] proposes an improved v-Byzantine-robust k-out-of-l PIR.
Computational PIR (cPIR)
The power of cPIR comes from the intractability of the mathematical problem that is used to encrypt the query and the answers. Remember from Module Recap 2 that encryption often requires expansion. Therefore encrypting everything becomes limited by the bandwidth, similar to the limitations of the brute force approach in IT-PIR. In cPIR schemes, the databases are restricted to perform only polynomial-time computations and the privacy of the user's requests is relaxed [4]. The scheme proposed by [4] proceeds as follows:
With this scheme both the user and database run only polynomial time processes and in total k (t+s) bits are transferred. If st= sqrt(n), the the total communication cost is is (2sqrt(n)+1)k.Note that it is better than 𝛀(n) given in [2].
Symmetric PIR (sPIR)
Both IT-PIR and cPIR protect the privacy of the user. However, sPIR also protects the privacy of the database. Hence, sPIR is more challenging since the user is not allowed to learn anything about any other message besides the desired message. Because of that restriction, brute force approaches such as downloading all documents are not acceptable anymore. Although sPIR inherits many of its connections from other types of PIR, it is essentially a distributed form of oblivious transfer [5]. Although the limits on the communication cost of sPIR and various forms of oblivious transfer are still an open question, [5] recently showed that the capacity, i.e. maximum number of bits of desired information that can be privately retrieved per bit of downloaded information, of sPIR is 1-1/N regardless of the number of messages where N is the number of databases.
Oblivious RAM (O-RAM)
Oblivious RAM is an algorithm at the interface of a protected CPU and the physical RAM such that it acts as a RAM to the CPU by querying the physical RAM for the CPU while hiding information about the actual memory access pattern of the CPU from the physical RAM. Today O-RAM attracts more attention due to its potential high impact in privacy-preserving storage outsourcing applications, such as cloud services. One of the best O-RAM schemes known to date is a novel construction recently proposed by Goodrich and Mitzenmacher [6]. Consider a setup where the client wishes to store N blocks each of size B bytes at an untrusted server. The complexity for several O-RAM schemes is given in Table 2 for typical realistic parameters, e.g., when the server stores terabytes of data and the client has several hundred megabytes to gigabytes of local storage, and N ≥ 2^{20}.
In Goldreich and Ostrovsky's scheme [7], the data is organized into several levels as a pyramid. The blocks are continuously shuffled and re-encrypted as they are accessed. The client reads a block from each level, only one is the real block and the rest are dummy blocks. To write the modified block back to the server, filled levels are consecutively shuffled and the modified block is written into the next unfilled level. Then, the source levels are cleared. However, this is an impractical approach due to O(N) bandwidth overhead.
A practical approach is called partition O-RAM. In this approach, the aim is to make shuffling operations cheaper and reduce the worst-case complexity by partitioning the storage [9]. Note that partitioning in a naive way breaks the security as reading a block from its assigned partition results in likability. Therefore, we need to have cache slots on the client-side. Then the client access a block by executing the following steps.
Read from the partition which is previously randomly assigned.
Read and modify block data
Write it to a random cache slot
The blocks are stored in the cache slots until a background eviction process writes them back to the server partition corresponding to its cache slot. The background eviction process is described below.
Sequentially scan the cache slots.
Evict one block if possible.
Evict dummy block otherwise.
However, this method also has some disadvantages. For example, it requires the client to have local storage of the same size as the cloud server.
Path-ORAM
In Path ORAM [10] client only stores a small amount of local data in a \emph{stash} and server-side storage is treated as a binary tree where each node is a bucket that can hold up to a fixed number of blocks. Each block is mapped to a uniformly random leaf bucket in the tree such that unstashed blocks are always placed in some bucket along the path to the mapped leaf. The position of a block is the leaf sequence number and a block must be placed on the path leading to its position.
When a block is read from the server, the entire path to the mapped leaf is read into the stash. During write operation, the requested block is remapped to another leaf, and then the path that was just read is written back to the server.
Consider the example given in Figure 2. To modify block 7, one needs to execute the following steps.
Look up the block's position from the position map. For this example, block 7 is in the 2nd position.
Read the entire path. Blocks 1, 7, 10, 13, and 6 are put into the stash.
Read and modify the desired block.
Assign a new random position to the blocks in the stash.
Write each block to the furthest available intersection of its old and new position. Let us say the new random position for block 7 is 1. Then, modified block 7 should be written to block 13's old position.
If the furthest or deepest intersection is full go to the upper level until an available position is found. If even the root is full, then the block is stored in the stash.
The position map needs to store log(N) bits for each block and it is stored on the client-side. If the map is too large, then recursive path ORAM can be used where the position map is stored in another ORAM.
[1] H. Abelson, K. Ledeen, and H. Lewis Blown to bits: your life, liberty, and happiness after the digital explosion. Addison-Wesley Professional, 2012. [2] B. Chor, O. Goldreich, E. Kushilevitz, and M. Sudan, “Private information retrieval,” in Proceedings of IEEE 36th Annual Foundations of Computer Science, IEEE, 1995, pp. 41–50. [3] I. Goldberg, “Improving the robustness of private information retrieval,” in 2007 IEEE Symposium on Security and Privacy (SP’07), IEEE, 2007, pp. 131–148. [4] E. Kushilevitz and R. Ostrovsky, “Replication is not needed: Single database, computationally-private information retrieval,” in Proceedings 38th annual symposium on foundations of computer science, IEEE, 1997, pp. 364–373.
[5] H. Sun and S. A. Jafar, “The capacity of symmetric private information retrieval,” IEEE Transactions on Information Theory, vol. 65, no. 1, pp. 322–329, 2018.
[6] M. T. Goodrich and M. Mitzenmacher, “Privacy-preserving access of outsourced data via oblivious ram simulation,” in International Colloquium on Automata, Languages, and Programming, Springer, 2011, pp. 576–587.
[7] O. Goldreich and R. Ostrovsky, “Software protection and simulation on oblivious rams,” Journal of the ACM (JACM), vol. 43, no. 3, pp. 431–473, 1996.
[8] B. Pinkas and T. Reinman, “Oblivious ram revisited,” in Annual cryptology conference, Springer, 2010, pp. 502–519.
[9] E. Stefanov, E. Shi, and D. Song, “Towards practical oblivious ram,” arXiv preprint arXiv:1106.3652, 2011.
[10] E. Stefanov, M. Van Dijk, E. Shi, C. Fletcher, L. Ren, X. Yu, and S. Devadas, “Path oram: An extremely simple oblivious ram protocol,” in Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security, 2013, pp. 299–310.
Comments