DOI: 10.1145/3576915.3616637

Read Between the Lines: Detecting Tracking JavaScript with Bytecode Classification

Published: 21 November 2023

Abstract

Browsers and extensions that aim to block online ads and tracking scripts predominantly rely on rules from filter lists to determine which resource requests must be blocked. These filter lists are often manually curated by a community of online users. However, due to the arms race between blockers and ad-supported websites, these rules must be continuously updated to adapt to novel bypassing techniques and modified requests. This makes the detection and rule-generation process cumbersome and reactive, and can result in major delays between propagation and detection. In this paper, we address the detection problem by proposing an automated pipeline that detects tracking and advertisement JavaScript resources with high accuracy and is designed to incur minimal false positives and overhead. Our method models script detection as a text classification problem, where JavaScript resources are documents containing bytecode sequences. Since bytecode is obtained directly from the JavaScript interpreter, our technique is resilient against commonly used bypassing methods, such as URL randomization or code obfuscation. We experiment with both deep learning and traditional ML-based approaches for bytecode classification and show that our approach identifies ad/tracking scripts with 97.08% accuracy, significantly outperforming cutting-edge systems in terms of both precision and the level of required features. Our experimental analysis further highlights our system's capabilities by demonstrating how it can augment filter lists by uncovering ad/tracking scripts that are currently unknown, as well as by proactively detecting scripts that have been erroneously added by list curators.
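To make the "scripts as documents of bytecode" framing concrete, the following is a minimal sketch and not the pipeline evaluated in the paper. It assumes each script has already been reduced to a list of opcode mnemonics, turns that list into unigram and bigram tokens, and classifies it with a small multinomial naive Bayes model, swapped in here as a simple stand-in for the traditional ML approaches the abstract mentions. The opcode names, the toy training sequences, and the names `ngrams` and `NaiveBayes` are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def ngrams(opcodes, n_max=2):
    """Return all contiguous opcode n-grams for n = 1..n_max as strings."""
    return [" ".join(opcodes[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(opcodes) - n + 1)]


class NaiveBayes:
    """Multinomial naive Bayes over opcode n-grams with add-one smoothing."""

    def fit(self, docs, labels):
        self.labels = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.term_counts = {y: Counter() for y in self.labels}
        for doc, y in zip(docs, labels):
            self.term_counts[y].update(ngrams(doc))
        self.vocab = set().union(*self.term_counts.values())
        return self

    def predict(self, doc):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for y in self.labels:
            # Smoothed denominator: all tokens seen for class y plus |vocab|.
            denom = sum(self.term_counts[y].values()) + len(self.vocab)
            score = math.log(self.doc_counts[y] / total_docs)
            for term in ngrams(doc):
                if term in self.vocab:  # ignore tokens never seen in training
                    score += math.log((self.term_counts[y][term] + 1) / denom)
            scores[y] = score
        return max(scores, key=scores.get)


# Toy, hand-written opcode sequences (illustrative only, not real crawl data).
train_docs = [
    ["LdaGlobal", "GetNamedProperty", "Star", "CallProperty",
     "LdaGlobal", "GetNamedProperty", "CallProperty", "Return"],
    ["LdaGlobal", "GetNamedProperty", "CallProperty", "Star",
     "CallProperty", "Return"],
    ["LdaSmi", "Star", "LdaSmi", "Add", "Star", "Return"],
    ["LdaSmi", "Mul", "Star", "LdaSmi", "Add", "Return"],
]
train_labels = ["tracking", "tracking", "benign", "benign"]

model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict(["LdaGlobal", "GetNamedProperty", "CallProperty", "Return"]))
print(model.predict(["LdaSmi", "Add", "Return"]))
```

Because the features are opcode n-grams rather than URLs or source text, renaming identifiers or randomizing the script URL leaves the input to such a classifier unchanged, which is the intuition behind the resilience claim above. A realistic system would need far more data, a held-out evaluation, and a stronger model.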

    Published In

    CCS '23: Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security
    November 2023
    3722 pages
    ISBN:9798400700507
    DOI:10.1145/3576915

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. ad/tracking blocking
    2. measurement
    3. privacy
    4. web security

    Qualifiers

    • Research-article

    Conference

    CCS '23

    Acceptance Rates

    Overall Acceptance Rate 1,261 of 6,999 submissions, 18%
