Research Article

Efficiently Mitigating the Impact of Data Drift on Machine Learning Pipelines

Published: 30 August 2024

Abstract

Despite the increasing success of Machine Learning (ML) techniques in real-world applications, maintaining them over time remains challenging. In particular, the prediction accuracy of a deployed ML model can suffer when serving data diverges significantly from the training data over time, a phenomenon known as data drift. Traditional data drift solutions focus primarily on detecting drift and then retraining the ML model, without discerning whether the detected drift actually harms model performance. In this paper, we observe that not all data drifts lead to degradation in prediction accuracy. We then introduce a novel approach for identifying the portions of the serving data distribution where drift can be potentially harmful to model performance, which we term Data Distributions with Low Accuracy (DDLA). Using decision trees, our approach precisely pinpoints low-accuracy zones of an ML model, including black-box models. By focusing on these DDLAs, we effectively assess the impact of data drift on model performance and make informed decisions in the ML pipeline. In contrast to existing data drift techniques, we advocate retraining only in cases of harmful drift that detrimentally affects model performance. Through extensive experimental evaluations on various datasets and models, our findings demonstrate that our approach significantly improves cost-efficiency over baselines while achieving comparable accuracy.
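The idea sketched in the abstract can be illustrated with a small, hypothetical example: a shallow surrogate decision tree is fit to predict where a black-box model is correct, its low-accuracy leaves serve as stand-ins for DDLAs, and a serving batch is flagged as harmful drift only when its mass inside those leaves grows. All names, thresholds (the 0.8 leaf-accuracy cutoff, the `tol` margin), and the `harmful_drift` helper are illustrative assumptions, not the paper's actual algorithm.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=4000, n_features=5, random_state=0)
X_train, y_train, X_val, y_val = X[:3000], y[:3000], X[3000:], y[3000:]

# 1. Train the (black-box) model to be monitored.
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# 2. Fit a shallow surrogate tree that predicts whether the model is
#    correct on held-out data; its leaves partition the feature space.
correct = (model.predict(X_val) == y_val).astype(int)
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_val, correct)

# 3. Leaves whose held-out accuracy falls below a threshold play the
#    role of DDLAs (illustrative cutoff: 0.8).
leaf_ids = surrogate.apply(X_val)
ddla_leaves = {
    leaf for leaf in np.unique(leaf_ids)
    if correct[leaf_ids == leaf].mean() < 0.8
}

# 4. At serving time, drift is treated as harmful only when the share of
#    serving points landing in DDLA leaves grows noticeably.
def harmful_drift(X_serving, baseline_frac, tol=0.1):
    frac = np.isin(surrogate.apply(X_serving), list(ddla_leaves)).mean()
    return bool(frac > baseline_frac + tol)

baseline = np.isin(leaf_ids, list(ddla_leaves)).mean()
# Benign perturbations leave the DDLA mass unchanged; a shift toward
# low-accuracy regions would trip the check and trigger retraining.
print(harmful_drift(X_val + rng.normal(0, 2.0, X_val.shape), baseline))
```

The design point this mirrors: a drift detector that only compares marginal distributions would retrain on any shift, whereas conditioning the decision on DDLA membership lets benign shifts pass without the cost of retraining.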


Published In

Proceedings of the VLDB Endowment, Volume 17, Issue 11 (July 2024), 1039 pages

Publisher

VLDB Endowment
