Research Article

Efficiently Mitigating the Impact of Data Drift on Machine Learning Pipelines

Published: 30 August 2024

Abstract

Despite the increasing success of Machine Learning (ML) techniques in real-world applications, maintaining them over time remains challenging. In particular, the prediction accuracy of a deployed ML model can suffer when serving data diverges significantly from the training data over time, a phenomenon known as data drift. Traditional data drift solutions focus primarily on detecting drift and then retraining the ML model, without discerning whether the detected drift actually harms model performance. In this paper, we observe that not all data drifts lead to degradation in prediction accuracy. We then introduce a novel approach for identifying the portions of the serving data distribution where drift can be potentially harmful to model performance, which we term Data Distributions with Low Accuracy (DDLA). Using decision trees, our approach precisely pinpoints low-accuracy zones of an ML model, including black-box models. By focusing on these DDLAs, we effectively assess the impact of data drift on model performance and make informed decisions in the ML pipeline. In contrast to existing data drift techniques, we advocate retraining only in cases of harmful drift that detrimentally affects model performance. Through extensive experimental evaluations on various datasets and models, our findings demonstrate that our approach significantly improves cost-efficiency over baselines while achieving comparable accuracy.
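The idea sketched in the abstract can be illustrated with a small, hypothetical example: a shallow surrogate decision tree is fit to predict where a black-box model is correct, its low-accuracy leaves serve as stand-ins for DDLAs, and a serving batch is flagged as harmful drift only when its mass inside those leaves grows. All names, thresholds (the 0.8 leaf-accuracy cutoff, the `tol` margin), and the `harmful_drift` helper are illustrative assumptions, not the paper's actual algorithm.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=4000, n_features=5, random_state=0)
X_train, y_train, X_val, y_val = X[:3000], y[:3000], X[3000:], y[3000:]

# 1. Train the (black-box) model to be monitored.
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# 2. Fit a shallow surrogate tree that predicts whether the model is
#    correct on held-out data; its leaves partition the feature space.
correct = (model.predict(X_val) == y_val).astype(int)
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_val, correct)

# 3. Leaves whose held-out accuracy falls below a threshold play the
#    role of DDLAs (illustrative cutoff: 0.8).
leaf_ids = surrogate.apply(X_val)
ddla_leaves = {
    leaf for leaf in np.unique(leaf_ids)
    if correct[leaf_ids == leaf].mean() < 0.8
}

# 4. At serving time, drift is treated as harmful only when the share of
#    serving points landing in DDLA leaves grows noticeably.
def harmful_drift(X_serving, baseline_frac, tol=0.1):
    frac = np.isin(surrogate.apply(X_serving), list(ddla_leaves)).mean()
    return bool(frac > baseline_frac + tol)

baseline = np.isin(leaf_ids, list(ddla_leaves)).mean()
# Benign perturbations leave the DDLA mass unchanged; a shift toward
# low-accuracy regions would trip the check and trigger retraining.
print(harmful_drift(X_val + rng.normal(0, 2.0, X_val.shape), baseline))
```

The design point this mirrors: a drift detector that only compares marginal distributions would retrain on any shift, whereas conditioning the decision on DDLA membership lets benign shifts pass without the cost of retraining.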


Published In

Proceedings of the VLDB Endowment, Volume 17, Issue 11 (July 2024), 1039 pages

Publisher

VLDB Endowment
