Disclaimer: These notes are taken for the CS577 Data Privacy course offered by Dr. Erman Ayday in the 2021/2022 Fall Semester at Bilkent University.
Genetic material is the most unique identifier of any living creature. As genome sequencing technologies rapidly improve, it is now possible to collect, store, process, and share genomic data more quickly and efficiently. Some people consider genomic data to be just another form of traditional health data. However, the privacy issues associated with genomic data are more complex. This is not only because such data is very powerful, but also because it can reveal information about more than just the individual from whom the data was derived, such as his/her children, parents, and siblings [1].
Known Privacy Threats to Genomic Data
Re-identification Threats
In re-identification attacks, the malicious party tries to recover the identities of the individuals in a published human genome data set. The first rule of anonymizing a data set is to discard the identifiers. However, genomic data cannot be anonymized by removing explicit identifiers and quasi-identifiers alone. The malicious party can infer the phenotype of the donor, which can lead to her identity. This type of attack exploits the correlation between an individual's genomic data and other public information. Notably, the attacker may not even need the entire data set: a machine learning model trained on that set can be enough to reveal information.
Phenotype Inference
Information about an individual can also be revealed by a second sample. If an attacker can obtain even a small amount of genomic data from an individual, s/he can attempt to determine whether this individual participated in a clinical study based on the study's anonymized data published online. In general, an adversary who has access to a known participant's genome sequence can determine whether the participant was in a certain group. Such attacks are also known as attribute disclosure attacks.
One may argue that individual identification from pooled data is hard in practice. Such inference attacks depend upon the ancestry of the participants, the absolute and relative number of people in case and control groups, the number of SNPs, and the availability of the second sample. Thus, the false-positive rates are much higher in practice.
Completing genetic information from partial data is a well-studied task in genetic studies and is known as genotype imputation. This method takes advantage of the linkage disequilibrium (LD) between markers and uses reference panels with complete genetic information to restore missing genotypic values in the data of interest. The very same strategies enable an adversary to expose certain regions of interest when only partial access to the DNA data is available.
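The imputation idea can be sketched in a few lines. This is a deliberately naive illustration, not a real imputation tool: it assumes haplotypes are short strings of alleles, that missing positions are marked with `None`, and that the reference panel is a small list of complete haplotypes. The function name `impute_missing` and the toy panel are hypothetical.

```python
from collections import Counter

def impute_missing(haplotype, reference_panel):
    """Fill None positions using reference haplotypes that agree on the observed markers."""
    observed = [(i, a) for i, a in enumerate(haplotype) if a is not None]
    # Keep only reference haplotypes consistent with every observed allele.
    matches = [ref for ref in reference_panel
               if all(ref[i] == a for i, a in observed)]
    # Fall back to the whole panel if nothing matches.
    candidates = matches or reference_panel
    completed = list(haplotype)
    for i, allele in enumerate(haplotype):
        if allele is None:
            # Pick the most frequent allele at this position among the candidates.
            completed[i] = Counter(ref[i] for ref in candidates).most_common(1)[0][0]
    return completed

# Toy reference panel (hypothetical): markers 0, 1, and 2 are strongly correlated.
panel = ["ACG", "ACG", "ACG", "TTA", "TTA"]
partial = ["T", None, "A"]   # the adversary only sees markers 0 and 2
recovered = impute_missing(partial, panel)
```

Because the markers in the toy panel are correlated, observing two of them is enough to recover the hidden one, which is exactly what makes partial release of genomic data risky.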
The nature of genomic data also provides information about kinship. The privacy of individuals who prefer not to share their genetic data could be threatened by the disclosures of their relatives. For instance, if both parents are genotyped, then most of the variants of their offspring can be inferred. The genomic data of family members can also be inferred using data that has been publicly shared by blood relatives and domain-specific knowledge about genomics.
It should be noted that the "Correlation of genomic data" and "Kin privacy breach" attacks are based on different structural aspects of genomic data. Correlation attacks are based on linkage disequilibrium (LD), that is, correlation between genetic variants within an individual's genome, whereas a kin privacy breach is caused by genomic correlations among individuals. Moreover, a kin privacy breach can also be realized through phenotype information alone. For instance, a parent's skin color or height can be used to predict their child's skin color or height to some extent.
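To see why genotyped parents expose their child, consider a minimal sketch under simplifying assumptions: a single biallelic SNP, genotypes coded as the number of minor alleles (0, 1, or 2), and plain Mendelian inheritance without mutation. The function names are hypothetical.

```python
from fractions import Fraction

def allele_pass_probability(genotype):
    """Probability that a parent carrying `genotype` minor alleles passes one on."""
    if genotype not in (0, 1, 2):
        raise ValueError("genotype must be 0, 1, or 2")
    return Fraction(genotype, 2)

def child_genotype_distribution(mother, father):
    """Distribution over the child's genotype at one SNP, given both parents."""
    pm = allele_pass_probability(mother)
    pf = allele_pass_probability(father)
    return {
        0: (1 - pm) * (1 - pf),            # neither parent passes the minor allele
        1: pm * (1 - pf) + (1 - pm) * pf,  # exactly one parent passes it
        2: pm * pf,                        # both parents pass it
    }

# Two homozygous parents determine the child's genotype completely.
certain = child_genotype_distribution(0, 2)
# Two heterozygous parents leave the classic 1/4, 1/2, 1/4 uncertainty.
uncertain = child_genotype_distribution(1, 1)
```

Whenever both parents are homozygous at a position, the child's genotype there is fully determined, so the child's privacy depends on the parents' disclosure decisions.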
Other Threats
Paternity or maternity information can be revealed by analyzing DNA samples. For example, an adopted child is able to find his/her biological mother and/or father using genealogy services and an Internet search. This puts the anonymity of egg and sperm donors at risk. In addition, people who had to give their children up to social services, and who may be somewhat traumatized, may find their privacy at risk as well.
Another aspect of threats to genomic data is legal issues and forensics. Forensic techniques are becoming more promising with evolving technology. However, abuse of DNA (e.g., to stage crime scenes) has already baffled people and law enforcement agencies.
State-of-the-art Solutions in Healthcare
Personalized Medicine
Multiple parties have different concerns in personalized medicine.
Patients are concerned about the privacy of their genomes, whereas healthcare organizations are concerned about their reputation and the trust of their clients. For-profit companies, such as pharmaceutical manufacturers, are concerned about the secrecy of their disease markers (proprietary information of business importance).
A disease risk test can be expressed as a regular expression query taking into account sequencing errors and other properties of sequenced genomic data. Cryptographic schemes have been developed to delegate the intensive computation in such a scheme to a public cloud in a privacy-preserving fashion.
In personalized medicine protocols based on authorized private set intersection (A-PSI), the healthcare organization provides cryptographically authorized disease markers, while the patient supplies her genome. A regulatory authority can also certify the disease markers before they can be used in a clinical setting. Despite its potential, this protocol has certain limitations. First, it is not very efficient in terms of its communication and computation costs. Second, the model assumes that patients store their own genomes, which is not necessarily the case in practice due to the high storage requirements.
Figure 1: Cryptographic framework for personalized medicine
It has been suggested that the storage of the homomorphically encrypted variants (e.g., SNPs) can be delegated to a semi-honest third party. A healthcare organization can then request the third party to compute a disease susceptibility test (a weighted average of the risk associated with each variant) on the encrypted variants using an interactive protocol involving (i) the patient, (ii) the healthcare organization, and (iii) the third party. Additive homomorphic encryption enables a party with the public key to add ciphertexts or to multiply a ciphertext by a plaintext constant. One of the problems with such protocols is that storing homomorphically encrypted variants requires orders of magnitude more memory than storing plaintext variants. A second problem is that when an adversary knows the LD between the genome regions and the nature of the test, the privacy of the patients decreases as tests are conducted on their homomorphically encrypted variants.
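The two homomorphic operations mentioned above are enough to compute a weighted sum over encrypted variants. The sketch below uses a toy Paillier cryptosystem as one example of an additively homomorphic scheme; the text does not say which scheme the actual protocols use, so this choice is an assumption. The key is built from two small, publicly known Mersenne primes and is completely insecure. All roles (patient, third party, healthcare organization) are collapsed into one process, and the SNP values and integer weights are hypothetical. It needs Python 3.8 or later for `pow(x, -1, n)`.

```python
import math
import secrets

# Toy key material (insecure): two well-known Mersenne primes.
P = 2**31 - 1
Q = 2**61 - 1
N = P * Q
N2 = N * N
LAMBDA = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # lcm(P - 1, Q - 1)
MU = pow(LAMBDA, -1, N)  # valid because the generator is g = N + 1

def encrypt(m):
    """Paillier encryption with generator g = N + 1."""
    while True:
        r = secrets.randbelow(N - 1) + 1
        if math.gcd(r, N) == 1:
            break
    return (1 + m * N) % N2 * pow(r, N, N2) % N2

def decrypt(c):
    """Paillier decryption; only the holder of LAMBDA and MU can do this."""
    x = pow(c, LAMBDA, N2)
    return (x - 1) // N * MU % N

def encrypted_weighted_sum(enc_snps, weights):
    """Compute Enc(sum(w_i * x_i)) without ever decrypting the x_i."""
    acc = encrypt(0)
    for c, w in zip(enc_snps, weights):
        # pow(c, w) multiplies the plaintext by w; the product adds plaintexts.
        acc = acc * pow(c, w, N2) % N2
    return acc

snps = [0, 1, 2, 1]        # hypothetical minor-allele counts of a patient
weights = [3, 5, 2, 10]    # hypothetical integer risk weights
enc_snps = [encrypt(x) for x in snps]          # what the third party stores
enc_score = encrypted_weighted_sum(enc_snps, weights)
score = decrypt(enc_score) / sum(weights)      # weighted average after decryption
```

Each ciphertext here is an integer modulo N squared, so even a 0/1/2 genotype occupies as much space as the full modulus, which illustrates the storage blow-up noted above.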
Raw Aligned Genomic Data
During alignment, the position of each read relative to the reference genome is determined by finding its approximate match on the reference genome. With today's sequencing techniques, the size of such data can be up to 300 GB per individual (in the clear), which makes public key cryptography impractical for the management of such data. Symmetric stream ciphers and order-preserving encryption provide efficient solutions for storing, retrieving, and processing this large amount of data in a privacy-preserving way. Order-preserving encryption keeps the ordering information in the ciphertexts to enable range queries on the encrypted data; because it reveals this ordering, it may not be secure for most practical applications.
Honey encryption is a type of data encryption that "produces a ciphertext, which, when decrypted with an incorrect key as guessed by the attacker, presents a plausible-looking yet incorrect plaintext password or encryption key."
State-of-the-art Solutions in Research
Genome-wide Association Studies (GWAS)
GWAS is one of the most common types of studies performed to learn genome-phenome associations. Recently, it has been suggested that such information can be protected through the application of noise to the data. In particular, differential privacy was recently adapted to compose privacy-preserving query mechanisms for GWAS settings. However, differential privacy is typically based on a mechanism that adds noise (e.g., by using Laplacian noise, geometric noise, or exponential mechanism), and thus requires a very large number of research participants to guarantee acceptable levels of privacy and utility.
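As a rough illustration of the noise-adding idea, the sketch below applies the Laplace mechanism to a single count query. It assumes genotypes are coded as minor-allele counts (0, 1, or 2) and that neighboring data sets differ in one participant, so the count changes by at most 2. The function names and the sample data are hypothetical, and this is not the specific mechanism of any published GWAS scheme.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential variables."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_allele_count(genotypes, epsilon, rng=None):
    """Release the minor-allele count of a cohort with epsilon-differential privacy."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    sensitivity = 2  # one diploid participant contributes at most 2 minor alleles
    true_count = sum(genotypes)
    return true_count + laplace_noise(sensitivity / epsilon, rng)

cohort = [0, 1, 2, 1, 0, 2]  # hypothetical case-group genotypes at one SNP
noisy = private_allele_count(cohort, epsilon=1.0, rng=random.Random(0))
```

The noise scale depends only on the sensitivity and on epsilon, not on the cohort size, so the relative error shrinks as the cohort grows. This is why such mechanisms need many participants to be useful.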
A meta-analysis of summary statistics from multiple independent cohorts is required to find associations in a GWAS. However, it is possible for the same participant to be in multiple studies, which can affect the results of a meta-analysis. It has been suggested that one-way cryptographic hashing can be used to identify overlapping participants without sharing individual-level data.
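The overlap check can be sketched as follows. This is a simplified illustration: it assumes each site holds a name and a birth date for every participant, that all sites agree on a normalization rule, and that they share a secret salt out of band. These choices, and the function names, are assumptions rather than details from the text.

```python
import hashlib

def participant_digest(name, birth_date, salt):
    """One-way digest of a normalized participant identifier."""
    normalized = f"{name.strip().lower()}|{birth_date}"
    return hashlib.sha256(salt + normalized.encode("utf-8")).hexdigest()

def overlapping_digests(site_a, site_b):
    """Digests present at both sites; no raw identifiers are exchanged."""
    return set(site_a) & set(site_b)

salt = b"shared-secret-salt"  # hypothetical value agreed on by the sites
site_a = [participant_digest("Participant One", "1980-01-01", salt),
          participant_digest("Participant Two", "1975-05-05", salt)]
site_b = [participant_digest(" participant one ", "1980-01-01", salt),
          participant_digest("Participant Three", "1990-09-09", salt)]
shared = overlapping_digests(site_a, site_b)
```

Note that hashing low-entropy identifiers is open to dictionary attacks by anyone who knows the salt, so a sketch like this only protects against parties outside the collaboration.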
A cryptographic approach for privacy-preserving genome-phenome studies has also been proposed. This approach enables privacy-preserving computation of genome-phenome associations when the data is distributed among multiple sites, so that no site needs to share its data with any other site.
Sequence Comparison
Sequence comparison is widely used in bioinformatics (e.g., in gene finding, motif finding, and sequence alignment). Such comparison is computationally complex. Cryptographic tools such as fully homomorphic encryption (FHE) and secure multiparty computation (SMC) can be used for privacy-preserving sequence comparison. However, they do not scale to a full human genome. Alternatively, more scalable provably secure protocols exploiting public clouds have been proposed. Computation on the public data can be outsourced to a third-party environment (e.g., a cloud provider) while computation on sensitive private sections is performed locally, thus outsourcing most of the computationally intensive work to the third party. This computation partitioning can be achieved using program specialization, which enables concrete execution on the public data and symbolic execution on the sensitive data. The protocol exploits the fact that the genomes of any two individuals are about 99.5% identical, so genomic computations can be partitioned into computation on public data and computation on private data.
Moreover, genome sequences can be transformed into sets of offsets of different nucleotides in the sequence to efficiently compute similarity scores (e.g., Smith-Waterman computations) on outsourced distributed platforms (e.g., volunteer systems). Similar sequences have similar offsets, which provides sufficient accuracy, and many-to-one transformations provide privacy. Although this approach does not provide provable security, it does not leak significant useful information about the original sequences.
Person-level Genome Sequence Records
Several techniques have been proposed for enabling privacy for person-level genome sequences. For instance, SNPs from several genomic regions can be generalized into more general concepts. This generalization makes re-identification of an individual sequence difficult according to a prescribed level of protection. In particular, k-anonymity can be used to generalize the genomic sequences such that each sequence is indistinguishable from at least (k - 1) other sequences. However, such methods are limited in that they only work when there are a large number of sequences with a relatively small number of variations.
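The k-anonymity condition itself is easy to check. The sketch below is a simplified stand-in for the generalization described above: instead of mapping SNPs to more general concepts, it simply suppresses chosen positions with `*`, which is an assumption made for brevity. The function names and toy sequences are hypothetical.

```python
from collections import Counter

def suppress_positions(sequences, positions):
    """Replace the given positions with '*' in every sequence."""
    hidden = set(positions)
    return ["".join("*" if i in hidden else base for i, base in enumerate(seq))
            for seq in sequences]

def is_k_anonymous(sequences, k):
    """True if every released sequence is shared by at least k records."""
    counts = Counter(sequences)
    return all(n >= k for n in counts.values())

records = ["AAG", "AAC", "ATG", "ATC"]    # toy sequences, all distinct
released = suppress_positions(records, [2])
```

Here every raw record is unique, but hiding the last position leaves two groups of two identical sequences, so the released data is 2-anonymous. With more variable positions, far more suppression is needed, which matches the limitation stated above.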
It has been shown that additive homomorphic encryption can be used to share encrypted data while still retaining the ability to compute a limited set of queries (e.g., secure frequency count queries which are useful to many analytic methods for genomic data). Yet, this method leaks information in that it reveals the positions of the SNPs, which in turn reveals the type of test being conducted on the data. Moreover, privacy in this protocol comes at a high cost of computation.
Cryptographic hardware at the remote site can be used as a trusted computation base (TCB) to design a framework in which all person-level biomedical data is stored at a central remote server in encrypted form. This enables researchers to compute on shared data without sharing person-level genomic data. This approach is efficient for typical biomedical computations, but the trusted hardware tends to have relatively small memory capacities, which dictates the need for load balancing mechanisms.
State-of-the-art Solutions in Legal Issues and Forensics
Paternity Testing
Paternity testing is based on the high similarity between the genomes of a father and child (99.9%) in comparison to two unrelated human beings (99.5%). It is not known exactly which 0.5% of the human genome differs between two humans, but a properly chosen 1% sample of the genome can determine paternity with high accuracy.
*Under this section, we consider participants of the paternity test who do not want to share any information about their genomes.*
Once the genomes of both individuals are sequenced, a privacy-preserving paternity test can be carried out using PSI-Cardinality (PSI-CA), where the inputs to the protocol are the sets of nucleotides comprising the genomes. The size of the human genome, or even 1% of it, cannot be handled by current PSI and other SMC protocols. However, by exploiting domain knowledge, the computation time can be reduced to 6.8 ms and the network bandwidth usage to 6.5 KB by emulating the Restriction Fragment Length Polymorphism (RFLP) chemical test in software, which reduces the problem to finding the intersection between two sets of size 25. Since the ideal output of a privacy-preserving paternity test should be yes or no, it cannot be obtained using custom PSI protocols, whereas generic garbled circuit-based protocols can easily be modified to add this capability.
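For reference, the function that such a protocol evaluates can be written in the clear. The sketch below contains no cryptography at all: it only shows the intersection-cardinality computation and the yes/no thresholding that a secure protocol would have to reproduce without revealing the inputs. Representing each profile as a set of 25 (marker, fragment length) pairs and the threshold value are assumptions made for illustration.

```python
def rflp_match_count(profile_a, profile_b):
    """Cardinality of the intersection of two fragment profiles (PSI-CA output)."""
    return len(set(profile_a) & set(profile_b))

def paternity_decision(child, candidate, threshold):
    """Yes/no answer: only this bit should be revealed, not the match count."""
    return rflp_match_count(child, candidate) >= threshold

# Hypothetical profiles: 25 (marker index, fragment length) pairs each.
child = {(i, 100 + i) for i in range(25)}
candidate = (child - {(0, 100)}) | {(0, 999)}   # differs in a single marker
stranger = {(i, 200 + i) for i in range(25)}

likely_father = paternity_decision(child, candidate, threshold=24)
unrelated = paternity_decision(child, stranger, threshold=24)
```

A PSI-CA protocol reveals the match count, whereas the ideal test reveals only the final boolean, which is the gap that the garbled circuit-based variants close.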
Criminal Forensics
Cryptographic approaches have been developed to preserve the privacy of the records that fail to match the evidence from the crime scene. Specifically, DNA records can be encrypted using a key that depends upon certain tests, such that when DNA is collected from a crime scene, the scheme will only allow decryption of the records that match the evidence. Finally, partial homomorphic encryption can be used for privacy-preserving matching of Short Tandem Repeat (STR) DNA profiles in the honest-but-curious model. Such protocols (described in the next section) are useful for identity, paternity, ancestry, and forensic tests.
State-of-the-art Solutions in Direct-to-consumer (DTC) Services
Many DTC companies provide genealogy and ancestry testing. Partial homomorphic encryption can be cleverly used on STR profiles of individuals to conduct (i) common ancestor testing based on the Y chromosome, (ii) paternity test with one parent, (iii) paternity test with two parents, and (iv) identity testing.
Focusing on kin genomic privacy, Humbert et al. [3] build a protection mechanism against the kinship attack that uses DTC genomic data. The paper proposes a multi-dimensional optimization mechanism in which the privacy constraints of the family members are protected and at the same time the utility (amount of genomic data published by the family members) is maximized.
[1] M. Naveed, E. Ayday, E. W. Clayton, J. Fellay, C. A. Gunter, J.-P. Hubaux, B. A. Malin, and X. Wang, "Privacy in the genomic era," ACM Computing Surveys (CSUR), vol. 48, no. 1, pp. 1-44, 2015.
[2] Y. Erlich and A. Narayanan, "Routes for breaching and protecting genetic privacy," Nature Reviews Genetics, vol. 15, no. 6, pp. 409-421, 2014.
[3] M. Humbert, E. Ayday, J.-P. Hubaux, and A. Telenti, "Reconciling utility with privacy in genomics," in Proceedings of the 13th Workshop on Privacy in the Electronic Society, 2014, pp. 11-20.