research-article

Filter before you parse: faster analytics on raw data with Sparser

Published: 01 July 2018

Abstract

Exploratory big data applications often run on raw unstructured or semi-structured data formats, such as JSON files or text logs. These applications can spend 80–90% of their execution time parsing the data. In this paper, we propose a new approach for reducing this overhead: apply filters on the data's raw bytestream before parsing. This technique, which we call raw filtering, leverages the features of modern hardware and the high selectivity of queries found in many exploratory applications. With raw filtering, a user-specified query predicate is compiled into a set of filtering primitives called raw filters (RFs). RFs are fast, SIMD-based operators that occasionally yield false positives, but never false negatives. We combine multiple RFs into an RF cascade to decrease the false positive rate and maximize parsing throughput. Because the best RF cascade is data-dependent, we propose an optimizer that dynamically selects the combination of RFs with the best expected throughput, achieving within 10% of the global optimum cascade while adding less than 1.2% overhead. We implement these techniques in a system called Sparser, which automatically manages a parsing cascade given a data stream in a supported format (e.g., JSON, Avro, Parquet) and a user query. We show that many real-world applications are highly selective and benefit from Sparser. Across diverse workloads, Sparser accelerates state-of-the-art parsers such as Mison by up to 22× and improves end-to-end application performance by up to 9×.
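The raw-filtering idea described in the abstract can be illustrated with a minimal sketch. It is not Sparser's implementation: plain Python byte-substring search stands in for the SIMD-based operators, the cascade is fixed by hand rather than chosen by an optimizer, and the names `make_substring_rf` and `run_cascade` as well as the sample records are hypothetical. It also assumes the searched values appear unescaped in the raw JSON text; otherwise a substring test could miss a match.

```python
import json


def make_substring_rf(needle: bytes):
    """Build a raw filter (RF): pass a record if its raw bytes contain `needle`.

    Assumption: the value is stored unescaped, so a true match always
    contains the needle (no false negatives under that assumption).
    """
    return lambda raw: needle in raw


def run_cascade(records, raw_filters, parse, predicate):
    """Run every RF on the unparsed bytes, then parse and re-check survivors."""
    results = []
    parsed_count = 0
    for raw in records:
        if not all(rf(raw) for rf in raw_filters):
            continue  # rejected by the cascade without being parsed
        parsed_count += 1
        obj = parse(raw)
        # RFs may yield false positives, so the full predicate is re-applied.
        if predicate(obj):
            results.append(obj)
    return results, parsed_count


# Hypothetical sample data, one JSON record per element.
records = [
    b'{"user": "alice", "lang": "en", "text": "hello world"}',
    b'{"user": "bob", "lang": "fr", "text": "bonjour"}',
    b'{"user": "carol", "lang": "en", "text": "I know alice"}',
    b'{"user": "alice", "lang": "de", "text": "hallo"}',
]


def predicate(obj):
    """Query: user == "alice" AND lang == "en"."""
    return obj["user"] == "alice" and obj["lang"] == "en"


cascade = [make_substring_rf(b"alice"), make_substring_rf(b'"en"')]
matches, parsed_count = run_cascade(records, cascade, json.loads, predicate)
```

In this sketch only two of the four records reach the parser. The third record passes both RFs because "alice" appears in its text field, which makes it a false positive that the re-applied predicate discards after parsing.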

Published In

Proceedings of the VLDB Endowment, Volume 11, Issue 11
July 2018
507 pages
ISSN: 2150-8097

Publisher

VLDB Endowment

