research-article

Examining the challenges in development data pipeline

Authors:

Aditya Vashistha,

Richard Anderson

COMPASS '19: Proceedings of the 2nd ACM SIGCAS Conference on Computing and Sustainable Societies

Pages 13 - 21

https://doi.org/10.1145/3314344.3332496

Published: 03 July 2019

Abstract

The developing world has increasingly relied on data-driven policies. Numerous development agencies have pushed for on-the-ground data collection to support their development work, and many governments have launched their own efforts to gather information frequently. The amount of data collected is tremendous, yet significant barriers stand in the way of useful analysis. Most of these barriers manifest in data cleaning and merging, and require a data engineer to support parts of the analysis. In this paper, we investigate the challenges of cleaning development data through an interview-based study. We conducted face-to-face interviews with 13 stakeholders, eight from international development organizations and five government workers from Pakistan, including both managers and data analysts. From our analysis of the interviews, we identified common challenges in processing development data, including correcting open text fields, merging hierarchical data, and extracting data from textual formats such as PDF. We construct a basic taxonomy of data cleaning challenges and identify areas where support tools can improve the process. Ultimately, the objective is to empower regular data users to do the necessary data cleaning and scrubbing for analysis with ease.
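To make the "correcting open text fields" challenge concrete, the following is a minimal sketch, not the authors' method: it maps free-text entries onto a canonical list using approximate string matching from Python's standard `difflib` module. The district names, the `normalize` helper, and the 0.8 similarity cutoff are all assumptions chosen for illustration.

```python
import difflib

# Hypothetical canonical values; a real pipeline would load an official list.
CANONICAL = ["Lahore", "Faisalabad", "Rawalpindi", "Multan"]


def normalize(value, choices=CANONICAL, cutoff=0.8):
    """Return the closest canonical spelling, or None if nothing is close enough."""
    # Collapse stray whitespace and unify capitalization before matching.
    cleaned = " ".join(value.split()).title()
    matches = difflib.get_close_matches(cleaned, choices, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    for raw in ["lahore ", "Faislabad", "Karachi"]:
        print(repr(raw), "->", normalize(raw))
```

Entries that return `None` would still need manual review, which is one place where a support tool for non-engineers could help.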

Index Terms

Examining the challenges in development data pipeline
1. Information systems
  1. Data management systems
    1. Information integration
      1. Data cleaning

Published In

COMPASS '19: Proceedings of the 2nd ACM SIGCAS Conference on Computing and Sustainable Societies

July 2019

290 pages

ISBN:9781450367141

DOI:10.1145/3314344

Program Chairs: Jay Chen (NYU-AD), Jennifer Mankoff (University of Washington), Carla Gomes (Cornell)

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGCAS: ACM Special Interest Group on Computers and Society

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

COMPASS '19

Sponsor:

SIGCAS

COMPASS '19: ACM SIGCAS Conference on Computing and Sustainable Societies

July 3 - 5, 2019

Accra, Ghana

Acceptance Rates

COMPASS '19 Paper Acceptance Rate 25 of 50 submissions, 50%;

Overall Acceptance Rate 25 of 50 submissions, 50%
