DOI: 10.1145/3314344.3332496
research-article

Examining the challenges in development data pipeline

Published: 03 July 2019

Abstract

The developing world has increasingly relied on data-driven policies. Numerous development agencies have pushed for on-ground data collection to support the development work they pursue, and many governments have launched their own efforts for frequent information gathering. Overall, the amount of data collected is tremendous, yet there are significant barriers to useful analysis. Most of these barriers manifest in data cleaning and merging, and require a data engineer to support some parts of the analysis. In this paper, we investigate the challenges of cleaning development data through an interview-based study. We conducted face-to-face interviews with 13 stakeholders, eight from international development organizations and five government workers from Pakistan, including both managers and data analysts. From our analysis of the interviews, we identified common challenges faced in processing development data, including correcting open text fields, merging hierarchical data, and extracting data from textual formats such as PDF. We construct a basic taxonomy of data cleaning challenges and identify areas where support tools can improve the process. Ultimately, the objective is to empower regular data users to easily do the data cleaning and scrubbing that analysis requires.

    Published In

    COMPASS '19: Proceedings of the 2nd ACM SIGCAS Conference on Computing and Sustainable Societies
    July 2019
    290 pages
    ISBN: 9781450367141
    DOI: 10.1145/3314344

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. HCI4D
    2. ICTD
    3. data analysis
    4. data cleaning
    5. data collection
    6. global development

    Qualifiers

    • Research-article

    Conference

    COMPASS '19

    Acceptance Rates

    COMPASS '19 Paper Acceptance Rate 25 of 50 submissions, 50%;
    Overall Acceptance Rate 25 of 50 submissions, 50%
