DOI: 10.1145/3336191.3371856

Privacy- and Utility-Preserving Textual Analysis via Calibrated Multivariate Perturbations

Published: 22 January 2020 Publication History

Abstract

Accurately learning from user data while providing quantifiable privacy guarantees offers an opportunity to build better ML models while maintaining user trust. This paper presents a formal approach to privacy-preserving text perturbation using the notion of d_χ-privacy, originally designed to achieve geo-indistinguishability in location data. Our approach applies carefully calibrated noise to the vector representations of words in a high-dimensional space, as defined by word embedding models. We present a privacy proof that satisfies d_χ-privacy, where the privacy parameter $\varepsilon$ provides guarantees with respect to a distance metric defined by the word embedding space. We demonstrate how $\varepsilon$ can be selected by analyzing plausible deniability statistics, backed by large-scale analysis on GloVe and fastText embeddings. We conduct privacy audit experiments against 2 baseline models and utility experiments on 3 datasets to demonstrate the tradeoff between privacy and utility for varying values of $\varepsilon$ on different task types. Our results demonstrate practical utility (< 2% utility loss for training binary classifiers) while providing better privacy guarantees than baseline models.
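The perturbation the abstract describes can be sketched as follows. This is an illustrative reconstruction, not the authors' released code: it assumes a d_χ-privacy mechanism that adds noise with density proportional to exp(−ε·‖z‖) to a word's embedding (sampled as a uniform direction with a Gamma-distributed magnitude, the multivariate analogue of the Laplace mechanism) and then snaps the noisy vector back to the nearest vocabulary word; the function name and toy vocabulary are hypothetical.

```python
import numpy as np

def perturb_word(word, embeddings, epsilon, rng=None):
    """Sketch of d_chi-privacy text perturbation (not the paper's code):
    add noise with density proportional to exp(-epsilon * ||z||) to the
    word's embedding, then output the nearest word in embedding space."""
    rng = np.random.default_rng() if rng is None else rng
    vocab = list(embeddings)
    vecs = np.stack([np.asarray(embeddings[w], dtype=float) for w in vocab])
    v = np.asarray(embeddings[word], dtype=float)
    d = v.shape[0]
    # Uniform direction on the unit sphere ...
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    # ... with magnitude drawn from Gamma(shape=d, scale=1/epsilon):
    # together these sample from the density proportional to exp(-eps*||z||).
    magnitude = rng.gamma(shape=d, scale=1.0 / epsilon)
    noisy = v + magnitude * direction
    # Post-processing step: snap the noisy vector to the closest word.
    return vocab[int(np.argmin(np.linalg.norm(vecs - noisy, axis=1)))]
```

A plausible deniability statistic such as the probability that a word maps back to itself can then be estimated by Monte Carlo: call `perturb_word` many times and count how often the output equals the input. Large ε keeps the original word almost always; small ε spreads probability mass over semantic neighbours.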


    Published In

    WSDM '20: Proceedings of the 13th International Conference on Web Search and Data Mining
    January 2020
    950 pages
    ISBN:9781450368223
    DOI:10.1145/3336191
    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. differential privacy
    2. plausible deniability
    3. privacy

    Qualifiers

    • Research-article

    Conference

    WSDM '20

    Acceptance Rates

    Overall Acceptance Rate 498 of 2,863 submissions, 17%
