DOI: 10.1145/3336191.3371856

Privacy- and Utility-Preserving Textual Analysis via Calibrated Multivariate Perturbations

Published: 22 January 2020 Publication History

Abstract

Accurately learning from user data while providing quantifiable privacy guarantees offers an opportunity to build better ML models while maintaining user trust. This paper presents a formal approach to privacy-preserving text perturbation using the notion of d_χ-privacy, originally designed to achieve geo-indistinguishability in location data. Our approach applies carefully calibrated noise to the vector representations of words in a high-dimensional space, as defined by word embedding models. We present a privacy proof that satisfies d_χ-privacy, where the privacy parameter $\varepsilon$ provides guarantees with respect to a distance metric defined by the word embedding space. We demonstrate how $\varepsilon$ can be selected by analyzing plausible deniability statistics, backed by large-scale analysis on GloVe and fastText embeddings. We conduct privacy audit experiments against 2 baseline models and utility experiments on 3 datasets to demonstrate the tradeoff between privacy and utility for varying values of $\varepsilon$ on different task types. Our results demonstrate practical utility (< 2% utility loss for training binary classifiers) while providing better privacy guarantees than baseline models.
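The perturbation the abstract describes can be sketched as follows. This is an illustrative reconstruction, not the authors' released code: it assumes a d_χ-privacy mechanism that adds noise with density proportional to exp(−ε·‖z‖) to a word's embedding (sampled as a uniform direction with a Gamma-distributed magnitude, the multivariate analogue of the Laplace mechanism) and then snaps the noisy vector back to the nearest vocabulary word; the function name and toy vocabulary are hypothetical.

```python
import numpy as np

def perturb_word(word, embeddings, epsilon, rng=None):
    """Sketch of d_chi-privacy text perturbation (not the paper's code):
    add noise with density proportional to exp(-epsilon * ||z||) to the
    word's embedding, then output the nearest word in embedding space."""
    rng = np.random.default_rng() if rng is None else rng
    vocab = list(embeddings)
    vecs = np.stack([np.asarray(embeddings[w], dtype=float) for w in vocab])
    v = np.asarray(embeddings[word], dtype=float)
    d = v.shape[0]
    # Uniform direction on the unit sphere ...
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    # ... with magnitude drawn from Gamma(shape=d, scale=1/epsilon):
    # together these sample from the density proportional to exp(-eps*||z||).
    magnitude = rng.gamma(shape=d, scale=1.0 / epsilon)
    noisy = v + magnitude * direction
    # Post-processing step: snap the noisy vector to the closest word.
    return vocab[int(np.argmin(np.linalg.norm(vecs - noisy, axis=1)))]
```

A plausible deniability statistic such as the probability that a word maps back to itself can then be estimated by Monte Carlo: call `perturb_word` many times and count how often the output equals the input. Large ε keeps the original word almost always; small ε spreads probability mass over semantic neighbours.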


    Published In

    WSDM '20: Proceedings of the 13th International Conference on Web Search and Data Mining
    January 2020
    950 pages
    ISBN:9781450368223
    DOI:10.1145/3336191
    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. differential privacy
    2. plausible deniability
    3. privacy

    Qualifiers

    • Research-article

    Conference

    WSDM '20

    Acceptance Rates

    Overall Acceptance Rate 498 of 2,863 submissions, 17%
