DOI: 10.1145/3543507.3583515
research-article

Human-in-the-loop Regular Expression Extraction for Single Column Format Inconsistency

Published: 30 April 2023

Abstract

Format inconsistency is one of the most frequent data quality issues encountered during data cleaning. Existing automated approaches commonly lack applicability and generalisability, while approaches that rely on human input typically require specialized skills such as writing regular expressions. This paper proposes a novel hybrid human-machine system, “Data-Scanner-4C”, which leverages crowdsourcing to address syntactic format inconsistencies in a single column effectively. We first ask crowd workers to create examples from single-column data through “data selection” and “result validation” tasks. Then, we propose and use a novel rule-based learning algorithm to infer the regular expressions that propagate formats from the created examples to the entire column. Our system integrates crowdsourcing and algorithmic format extraction techniques in a single workflow. Human experts are no longer required to write regular expressions, which reduces both the time required and the opportunity for error. We conducted experiments on both synthetic and real-world datasets, and our results show that the proposed approach is applicable and effective across data types and formats.
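
As a rough illustration of the general idea of propagating a format from examples to a whole column, the sketch below generalises an example value into a regular expression and then splits the column into values that follow that format and values that do not. It is not the paper's rule-based learning algorithm, which the abstract does not detail. The function names (`char_class`, `infer_pattern`, `split_column`) and the coarse character-class rules are assumptions made purely for illustration, and the example strings stand in for what crowd workers might select.

```python
import re
import string
from itertools import groupby


def char_class(ch: str) -> str:
    """Map one character to a coarse regex token (hypothetical rule set)."""
    if ch in string.digits:
        return r"\d"
    if ch in string.ascii_letters:
        return "[A-Za-z]"
    if ch.isspace():
        return r"\s"
    # Anything else (punctuation, separators) is kept as a literal.
    return re.escape(ch)


def infer_pattern(example: str) -> str:
    """Generalise one example string into a regex by collapsing token runs."""
    tokens = [char_class(ch) for ch in example]
    parts = []
    for token, run in groupby(tokens):
        n = len(list(run))
        parts.append(token if n == 1 else f"{token}{{{n}}}")
    return "".join(parts)


def split_column(column, examples):
    """Return (matched, unmatched) values given example-derived formats."""
    patterns = [re.compile(infer_pattern(e)) for e in examples]
    matched, unmatched = [], []
    for value in column:
        # fullmatch requires the whole value to follow the inferred format.
        if any(p.fullmatch(value) for p in patterns):
            matched.append(value)
        else:
            unmatched.append(value)
    return matched, unmatched


# Hypothetical column with mixed date formats and one example value.
column = ["2023-04-30", "30/04/2023", "2021-12-01", "April 30, 2023"]
matched, unmatched = split_column(column, examples=["2023-04-30"])
```

Here `matched` collects the values that share the example's format, while `unmatched` collects the candidates for format repair. A real system would need more flexible generalisation (variable-length runs, optional parts) than this fixed-length sketch provides.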

    Published In

    WWW '23: Proceedings of the ACM Web Conference 2023
    April 2023
    4293 pages
    ISBN:9781450394161
    DOI:10.1145/3543507
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Crowdsourcing
    2. Format Inconsistency
    3. Human-in-the-loop
    4. Regular Expression

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Funding Sources

    • ARC Discovery Project

    Conference

    WWW '23
    WWW '23: The ACM Web Conference 2023
    April 30 - May 4, 2023
    Austin, TX, USA

    Acceptance Rates

    Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

    Article Metrics

    • Total Citations: 0
    • Total Downloads: 119
    • Downloads (Last 12 months): 33
    • Downloads (Last 6 weeks): 4
    Reflects downloads up to 29 Jan 2025