GenoPri: Quantifying and Protecting the Privacy of Genomic Data

  Principal Investigator: Erman Ayday
  Funding: European Commission H2020-MSCA-IF
  Budget: € 157,845.60
  Duration: May 2016 - 2018

Abstract

Genomic data carries a great deal of sensitive information about its owner, such as their predispositions to diseases, ancestors, physical attributes, and the genomic data of their relatives (leading to interdependent privacy risks). Individuals share vast amounts of information on the Web, and some of this information can be used to infer their genomic data. Hence, there is a need to clearly understand the privacy risks to the genomic data of individuals in light of publicly available information on the Web. It is also crucial to protect the genomic privacy of individuals without compromising the use of genomic data in research and healthcare.

The two main objectives of this project are (i) to develop a new unifying framework for quantifying the genomic privacy of individuals and (ii) to establish a complete framework for privacy-preserving utilization, sharing, and verification of genomic data under real-life threat models. Graph-based, iterative algorithms that efficiently analyze big data and draw inferences from it will be the foundation of the new quantification framework. To achieve the holistic genomic privacy objective, cryptographic tools, techniques from information theory, and statistical techniques (e.g., differential privacy) will be used.

This project will be a significant step towards understanding the privacy risks to the genomic data of individuals and protecting the privacy of genomic data. It will also provide a new vision for the security and privacy of health-related data in general and will have implications in other domains such as banking and online social networks. The results of the project will also have an impact on future policies and legislation on the protection of health-related data.

Work Done

In the first workpackage, we developed a new unifying framework to quantify the genomic privacy of individuals and significantly contributed to the state of the art. First, we showed how to match profiles of users from different platforms (and hence de-anonymize individuals). The algorithm we developed can also be used to de-anonymize the profiles of individuals on genome sharing websites. Then, we showed how to infer the missing parts of the genomes of individuals. Our results show that the attacker’s inference power (on the genomic data of individuals) improves significantly when complex correlations and phenotype information are used (along with information about family bonds). We believe that this work is a significant step towards establishing a greater understanding of the privacy risks to the genomic data of individuals.
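
To illustrate why information about family bonds increases an attacker's inference power, the following minimal Python sketch applies Mendel's law of segregation to compute the distribution of a child's genotype at a single SNP from the genotypes of the parents. It shows only one ingredient of kinship-based inference and is not the graph-based framework developed in the project; the function name and the 0/1/2 genotype encoding (number of minor alleles) are assumptions made for this example.

```python
from fractions import Fraction


def child_genotype_distribution(mother: int, father: int) -> dict:
    """Distribution of a child's genotype at one SNP.

    Genotypes are encoded (an assumption of this sketch) as the number of
    minor alleles carried: 0, 1, or 2. Each parent transmits one of their
    two alleles with equal probability (Mendel's law of segregation).
    """
    p_m = Fraction(mother, 2)  # P(mother transmits the minor allele)
    p_f = Fraction(father, 2)  # P(father transmits the minor allele)
    return {
        0: (1 - p_m) * (1 - p_f),
        1: p_m * (1 - p_f) + (1 - p_m) * p_f,
        2: p_m * p_f,
    }


if __name__ == "__main__":
    # Two heterozygous parents: the child's genotype is uncertain.
    print(child_genotype_distribution(1, 1))
    # One parent with genotype 2 and one with genotype 0: the child's
    # genotype is fully determined, even if the child never shared any data.
    print(child_genotype_distribution(2, 0))
```

The second call shows the interdependent risk in its simplest form: the genomes of relatives alone can reveal a genotype that its owner never disclosed.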

In the second workpackage, we developed techniques that provide recommendations such as how much genomic data to share, which regions of the DNA to share publicly, and what and how much information to share on health-related websites, social networks, and genealogy websites without compromising the genomic privacy of individuals. First, we proposed an optimization-based framework for sharing genomic data in public datasets while protecting against the inference of kinship relationships between individuals. This work is particularly relevant to the privacy concerns that have arisen because of the way the Golden State Killer was captured recently. Our work is the first in the literature to propose a solution to this privacy leakage problem. Next, we developed a differential privacy-based framework for sharing individuals’ genomic data while preserving their privacy. Unlike existing differential privacy-based solutions for genomic data (which consider the privacy-preserving release of summary statistics), we focused on privacy-preserving sharing of actual genomic data. As opposed to traditional differential privacy-based data sharing schemes, the proposed scheme does not intentionally add noise to the data; it is based on selective sharing of data points. The proposed framework can be seen as a new formulation of differential privacy for genomic data sharing that does not rely on noise addition. We think that it will also have implications in other domains.
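
For reference, the standard definition of differential privacy that such frameworks start from can be stated as follows. This is the textbook definition, not the new formulation proposed in the project; the symbols (a randomized mechanism M, neighboring datasets D and D' that differ in one individual's record, a set of outputs S, and the privacy parameter epsilon) are the usual ones and do not come from the project text.

```latex
% A randomized mechanism \mathcal{M} is \varepsilon-differentially private if,
% for all neighboring datasets D, D' and every set of outputs S,
\[
  \Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S].
\]
```

A smaller epsilon means that the output distribution changes less when one individual's data is added or removed, and hence that less can be learned about that individual.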

In the third workpackage, we developed techniques to support privacy-compliant credibility checks of genomic data even when the data is only partially shared. We also developed techniques to address the liability issues that arise when genomic data is shared without the authorization of its owner. First, we proposed a scheme based on both homomorphic signatures and aggregate signatures that links the information about the legitimacy of the data to the consent and the phenotype (or the identity) of the individual. Thus, in order to verify the data, a party also needs to use the correct consent and phenotype of the individual who owns the data. We emphasize that the proposed scheme can be easily adopted by existing works on privacy-preserving processing of genomic data in order to obtain a complete pipeline. Next, we proposed a novel optimization-based watermarking scheme for the sharing of genomic data. In the case of unauthorized sharing of sensitive data, the proposed scheme can find the source of the leakage by checking the watermark inside the leaked data. The proposed scheme guarantees with high probability that (i) a malicious service provider (SP) that receives the data cannot identify the watermarked data points, (ii) when several malicious SPs aggregate their data, they still cannot determine the watermarked data points, (iii) even if the unauthorized sharing involves only a portion of the original data or modified data (to damage the watermark), the corresponding malicious SP can be held responsible for the leakage, and (iv) the added watermark is compliant with the nature of the corresponding data. We believe that the proposed techniques will help both users and service providers when sharing and collecting genomic data.
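
The basic idea of tracing a leak through a watermark can be sketched in a few lines of Python. The sketch below is a deliberately naive baseline, not the optimization-based scheme proposed in the project: it gives each SP a copy in which a keyed, pseudorandom set of positions is altered, and it offers none of the guarantees (i) to (iv) above, in particular no resistance to collusion. All names, the 0/1/2 encoding of data points, and the way positions are derived from a secret are assumptions made for this example.

```python
import hashlib
import random


def watermark_positions(sp_id: str, secret: str, length: int, k: int) -> list:
    """Derive k positions for one SP from a secret known only to the data owner."""
    digest = hashlib.sha256(f"{secret}:{sp_id}".encode()).digest()
    rng = random.Random(int.from_bytes(digest, "big"))
    return sorted(rng.sample(range(length), k))


def marked_value(value: int) -> int:
    """Replace a data point (encoded as 0, 1, or 2) by another valid value."""
    return (value + 1) % 3


def embed(data: list, sp_id: str, secret: str, k: int) -> list:
    """Return the copy of the data that is handed to the given SP."""
    marked = list(data)
    for i in watermark_positions(sp_id, secret, len(data), k):
        marked[i] = marked_value(data[i])
    return marked


def trace(leaked: list, original: list, sp_ids: list, secret: str, k: int):
    """Score each SP by the fraction of its watermark found in the leaked copy."""
    scores = {}
    for sp in sp_ids:
        positions = watermark_positions(sp, secret, len(original), k)
        hits = sum(leaked[i] == marked_value(original[i]) for i in positions)
        scores[sp] = hits / k
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    original = [i % 3 for i in range(200)]
    copies = {sp: embed(original, sp, "owner-secret", 20) for sp in ("A", "B", "C")}
    suspect, scores = trace(copies["B"], original, ["A", "B", "C"], "owner-secret", 20)
    print(suspect, scores)
```

Because the positions depend on a secret, an SP that holds a single copy cannot tell which data points were altered; several SPs that compare their copies can, which is one reason collusion resistance requires a more careful design.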

In the fourth workpackage, we explored the privacy risks of interactive genomic databases. Initially, we focused on genomic data sharing beacons. We proposed a novel re-identification attack and showed that the privacy risk is more serious than previously thought. Our attack needs fewer than 0.5% of the queries that existing attacks require to determine beacon membership under the same conditions. We further showed that countermeasures such as hiding certain parts of the genome or setting a query budget for the user fail to protect the privacy of the participants under our adversary model. In ongoing work, we are studying other scenarios for the identified attack and developing protection techniques that would enable dynamic access control for genome sharing beacons.
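
For readers unfamiliar with beacons: a beacon answers only yes/no queries of the form "does any participant carry a given allele at a given position?". The following Python sketch shows a simplified likelihood-ratio membership test of the kind used as a baseline in the literature; it is not the attack developed in this project. The haploid presence/absence encoding, the independence of positions, the error parameter `delta`, and all function names are assumptions made for this example.

```python
import math


def make_beacon(members: list):
    """Build a toy beacon over haploid presence/absence vectors (1 = carries allele)."""
    n_pos = len(members[0])
    carried = [any(g[i] for g in members) for i in range(n_pos)]

    def query(i: int) -> bool:
        return carried[i]

    return query


def membership_llr(query, victim: list, freqs: list, n_members: int,
                   delta: float = 1e-3) -> float:
    """Log-likelihood ratio log P(answers | member) - log P(answers | non-member).

    Only positions where the victim carries the allele are queried.
    freqs[i] is the population frequency of the allele at position i, and
    delta is the assumed probability that the victim's allele is missing
    from the beacon because of a sequencing error.
    """
    llr = 0.0
    for i, carries in enumerate(victim):
        if not carries:
            continue
        p_no_h0 = (1 - freqs[i]) ** n_members                 # nobody carries it
        p_no_h1 = delta * (1 - freqs[i]) ** (n_members - 1)   # error and no other carrier
        if query(i):
            llr += math.log((1 - p_no_h1) / (1 - p_no_h0))
        else:
            llr += math.log(p_no_h1 / p_no_h0)
    return llr


if __name__ == "__main__":
    freqs = [0.1] * 10
    victim = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
    outsider = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
    others = [[1, 0, 0, 0, 0, 0, 0, 0, 0, 0]] + [[0] * 10 for _ in range(3)]
    beacon = make_beacon([victim] + others)
    print(membership_llr(beacon, victim, freqs, n_members=5))    # positive: likely a member
    print(membership_llr(beacon, outsider, freqs, n_members=5))  # negative: likely not a member
```

Each "no" answer is strong evidence against membership, while each "yes" answer for a rare allele is weak evidence in favor of it, which is why the choice and number of queried positions matters so much for such attacks.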

In the last workpackage, we developed techniques for privacy-preserving sharing and utilization of genomic data between different entities. Notably, (i) we developed a privacy-preserving solution for compressed storage of raw genomic data that outperforms all existing techniques (in terms of both storage overhead and privacy), (ii) we developed, for the first time, a system with one-time programming functionality for genomic testing, and (iii) we developed a system for brute-force resilient management of healthcare (and also genomic) data.

Contributions

Overall, this project had a positive impact on the European Union. The project results have provided a new vision for the protection of healthcare data. The project idea and results have been presented to several research groups in the EU (including groups in Luxembourg, Belgium, and France) and in Norway. The project has also contributed toward European policies on data protection. In a talk given at the invitation of the EU, Dr. Ayday presented his research ideas about GDPR and provided recommendations.

Dr. Ayday also reached out to potential users of the project results. As a result of an industrial collaboration launched through this project, a new project on secure storage and processing of electronic health records (EHRs) has started with a prominent company in Turkey. He also collaborated with Ege University (Turkey) to use the techniques he proposed in a biometric authentication application.

This project is a significant step towards understanding the privacy risks to the genomic data of individuals and protecting the privacy of genomic data. It provides a new vision for the security and privacy of health-related data in general and has implications in other domains such as banking and online social networks. The results of the project also have an impact on future policies and legislation on the protection of health-related data.


Publications

[1] A. Halimi and E. Ayday. "Profile Matching Across Unstructured Online Social Networks: Threats and Countermeasures", arXiv preprint, arXiv:1711.01815, 2017.

[2] V. Kucuk and E. Ayday. "Profile Matching Across Unstructured Online Social Networks", In Proceedings of International Workshop on Inference & Privacy in a Hyperconnected World (collocated with SPW 2016), 2016.

[3] I. Daznabi, M. Mobayen, N. Jafari, O. Tastan, and E. Ayday. "An Inference Attack on Genomic Data Using Kinship, Complex Correlations, and Phenotype Information", IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2017.

[4] M. Humbert, E. Ayday, A. Telenti, and J. P. Hubaux. "Quantifying Interdependent Risks in Genomic Privacy", ACM Transactions on Privacy and Security (formerly known as TISSEC), Volume 20, No. 1, Feb. 2017.

[5] G. Kale, E. Ayday, and O. Tastan. "A Utility Maximizing and Privacy Preserving Approach for Protecting Kinship in Genomic Databases", Bioinformatics, 2017.

[6] M. Mobayen and E. Ayday. "Differential Privacy for Genomic Data Sharing", submitted for conference publication.

[7] E. Ayday, Q. Tang, and A. Yilmaz. "Cryptographic Solutions for Credibility and Liability Issues of Genomic Data", IEEE Transactions on Dependable and Secure Computing, Apr. 2017.

[8] A. Yilmaz and E. Ayday. "Collusion-Secure Watermarking for Sequential Data", arXiv preprint arXiv:1708.01023, 2017.

[9] N. V. Thenen, A. E. Cicek, and E. Ayday. "Inference Attacks Against Genomic Data-Sharing Beacons", in 4th International Workshop on Genome Privacy and Security, 2017.

[10] N. Thenen, E. Ayday, and E. Cicek. "Re-Identification of Individuals in Genomic Data-Sharing Beacons using High-Order Markov Chains", to appear in Bioinformatics, 2018.

[11] Z. Huang, E. Ayday, H. Lin, et al. "A Privacy-Preserving Solution for Compressed Storage and Selective Retrieval of Genomic Data", Genome Research, 26: 1687-1696, 2016.

[12] J. Choi, V. Zhao, D. Demirag, J. Clark, M. Mannan, E. Ayday, and K. Butler. "One Time Programming Made Practical", submitted for conference publication.

[13] S. E. Dilmaghani and E. Ayday. "Privacy-Preserving Personal Health Records against Brute-Force Attacks", to be submitted for conference publication.

[14] E. Ayday and M. Humbert. "Inference Attacks Against Kin Genomic Privacy", IEEE Security and Privacy Magazine, Volume 15, No. 5, Sep. 2017.

[15] E. Ayday. "Cryptographic Solutions for Genomic Privacy", In Proceedings of International Conference on Financial Cryptography and Data Security (FC), 2016.

All publications are accessible at the following link.


Invited Talks

"Privacy and Security in the Genomic Era", Invited talk at IPAM Workshop on Algorithmic Challenges in Protecting Privacy for Biomedical Data, Los Angeles, CA, USA, Jan. 2018.

"Privacy and Security in the Genomic Era", Scientific Keynote at European Cyber Week, Rennes, France, Nov. 2017.

"Holistic Privacy and Security in the Age of Big Data: From Social Networks to Digital Medicine", Stony Brook University, Stony Brook, NY, USA, May 2017.

"Holistic Privacy and Security in the Age of Big Data: From Social Networks to Digital Medicine", Texas A&M University, College Station, TX, USA, Apr. 2017.

"Holistic Privacy and Security in the Age of Big Data: From Social Networks to Digital Medicine", Case Western Reserve University, Cleveland, OH, USA, Mar. 2017.

"Holistic Privacy and Security in the Age of Big Data: From Social Networks to Digital Medicine", University of Alberta, Edmonton, Canada, Mar. 2017.

"Holistic Privacy and Security in the Age of Big Data: From Social Networks to Digital Medicine", Rochester Institute of Technology, Rochester, NY, USA, Mar. 2017.

"Holistic Privacy and Security in the Age of Big Data: From Social Networks to Digital Medicine", University of Iowa, Iowa City, IA, USA, Mar. 2017.

"Security and Privacy in the Age of Big Data", University of Lancaster, Lancaster, UK, Dec. 2016.

"Protecting Sensitive Data", invited talk at the European Parliament, Brussels, Belgium, Mar. 2016.

"Cryptographic Solutions for Genomic Privacy", keynote speech at Workshop on Encrypted Computing and Applied Homomorphic Cryptography – WAHC’16, Barbados, Feb. 2016.

"Security and Privacy in the Age of Big Data", International Cyber Security Workshop and Certificate Program, Istanbul, Turkey, May 2016.