Research Article

Outlier Summarization via Human Interpretable Rules

Published: 30 May 2024

Abstract

Outlier detection is crucial for preventing financial fraud, network intrusions, and device failures. Users often expect systems to automatically summarize and interpret outlier detection results, both to reduce human effort and to convert outliers into actionable insights. However, existing methods fail to effectively help users identify the root causes of outliers: they pinpoint only data attributes, without considering that outliers in the same subspace may have different causes.
To fill this gap, we propose STAIR, which learns concise, human-understandable rules that summarize and explain outlier detection results at a finer granularity. These rules consider both attributes and their associated values. STAIR employs an interpretation-aware optimization objective to generate a small number of rules with minimal complexity, yielding strong interpretability. STAIR's learning algorithm produces a rule set by iteratively splitting large rules, and it is optimal in maximizing this objective at each iteration. Moreover, to effectively handle high-dimensional, highly complex datasets that are hard to summarize with simple rules, we propose a localized variant of STAIR, called L-STAIR. Taking data locality into consideration, it simultaneously partitions the data and learns a set of localized rules for each partition. Our experimental study on many outlier benchmark datasets shows that STAIR significantly reduces the complexity of the rules required to summarize outlier detection results, making them more amenable for humans to understand and evaluate.
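
To make the rule format concrete: one generic baseline for deriving such attribute-value rules is to fit a shallow surrogate decision tree to an outlier detector's labels and read each root-to-leaf path off as a human-readable rule. The sketch below illustrates only this baseline, assuming scikit-learn; it is not STAIR's interpretation-aware objective, and the synthetic data, the feature names x1/x2, and the leaf_rules helper are hypothetical.

    # Minimal sketch of rule-based outlier summarization (a generic
    # surrogate-tree baseline, NOT the STAIR algorithm from the paper).
    import numpy as np
    from sklearn.ensemble import IsolationForest
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0.0, 1.0, size=(500, 2)),   # inliers
                   rng.normal(5.0, 0.5, size=(25, 2))])   # a small outlying cluster

    # Label points with an off-the-shelf detector (-1 = outlier in scikit-learn).
    labels = (IsolationForest(random_state=0).fit_predict(X) == -1).astype(int)

    # A shallow surrogate tree keeps every extracted rule short, i.e. readable.
    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, labels)

    def leaf_rules(clf, feature_names):
        """Walk the fitted tree; yield one (predicates, predicted class) per leaf."""
        t = clf.tree_
        out = []
        def walk(node, preds):
            if t.children_left[node] == -1:                  # reached a leaf
                out.append((preds, int(np.argmax(t.value[node]))))
                return
            name, thr = feature_names[t.feature[node]], t.threshold[node]
            walk(t.children_left[node],  preds + [f"{name} <= {thr:.2f}"])
            walk(t.children_right[node], preds + [f"{name} > {thr:.2f}"])
        walk(0, [])
        return out

    for preds, cls in leaf_rules(tree, ["x1", "x2"]):
        if cls == 1:                                         # rules covering outliers
            print(" AND ".join(preds) or "TRUE", "=> outlier")

Note the difference from the paper's approach: STAIR directly optimizes for a small rule set with minimal complexity, whereas the tree depth here only indirectly caps rule length, and nothing like L-STAIR's data partitioning is modeled.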


Published In

Proceedings of the VLDB Endowment, Volume 17, Issue 7 (March 2024), 260 pages

Publisher

VLDB Endowment

