DOI: 10.1145/3442381.3450022
research-article

Have You been Properly Notified? Automatic Compliance Analysis of Privacy Policy Text with GDPR Article 13

Published: 03 June 2021

Abstract

With the rapid development and wide adoption of web and mobile applications across domains, more and more personal data is provided, consciously or unconsciously, to application providers. A privacy policy is an important medium through which users learn what personal information is collected and how it is used. As data privacy protection becomes a critical social issue, countries and regions are enacting laws and regulations, the most representative being the EU General Data Protection Regulation (GDPR). It is thus important to detect where privacy policies fail to comply with regulations such as GDPR, and to provide intuitive results for data subjects (i.e., users), data collectors (i.e., service providers), and regulatory authorities. In this work, we address the problem of compliance analysis between GDPR (Article 13) and privacy policies. We formulate the task as a combination of a sentence classification step and a rule-based analysis step. We manually curate a corpus of 36,610 labeled sentences from 304 privacy policies and benchmark it with several standard sentence classifiers. We also conduct a rule-based analysis to detect compliance issues and a user study to evaluate the usability of our approach. The web-based tool AutoCompliance is publicly accessible.
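The two-step formulation in the abstract (label each sentence, then apply rules over the labels) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration only: the label names loosely paraphrase a few GDPR Article 13 items, the keyword matcher is a simple stand-in for a trained sentence classifier, and none of it reproduces the paper's label set, models, or rules.

```python
import re

# Hypothetical label set, loosely paraphrasing a few GDPR Article 13 items.
# The keyword lists are illustrative assumptions, not the paper's classifier.
KEYWORDS = {
    "controller_identity": ["data controller", "contact us at"],
    "purpose": ["purpose", "in order to"],
    "recipients": ["third parties", "share your"],
    "retention_period": ["retain", "stored for"],
    "data_subject_rights": ["right to access", "right to erasure"],
    "lodge_complaint": ["supervisory authority", "lodge a complaint"],
}


def split_sentences(policy_text):
    """Naive splitter on sentence-ending punctuation followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", policy_text.strip())
    return [p for p in parts if p]


def classify_sentence(sentence):
    """Step 1 (stand-in): return the set of labels whose keywords appear."""
    lowered = sentence.lower()
    return {
        label
        for label, keywords in KEYWORDS.items()
        if any(keyword in lowered for keyword in keywords)
    }


def find_missing_labels(policy_text):
    """Step 2: rule-based check listing required labels that no sentence covers."""
    covered = set()
    for sentence in split_sentences(policy_text):
        covered |= classify_sentence(sentence)
    return sorted(set(KEYWORDS) - covered)
```

Calling `find_missing_labels` on a policy's text returns the hypothetical categories that no sentence was tagged with, which is the kind of result a rule-based compliance step could report. In practice the keyword matcher would be replaced by a learned classifier.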



Published In

WWW '21: Proceedings of the Web Conference 2021
April 2021
4054 pages
ISBN: 9781450383127
DOI: 10.1145/3442381
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Compliance Analysis
  2. Natural Language Processing
  3. Privacy

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

WWW '21
WWW '21: The Web Conference 2021
April 19 - 23, 2021
Ljubljana, Slovenia

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
