skip to main content
10.1145/3219819.3219999acmotherconferencesArticle/Chapter ViewAbstractPublication PageskddConference Proceedingsconference-collections
research-article
Public Access

SUSTain: Scalable Unsupervised Scoring for Tensors and its Application to Phenotyping

Published: 19 July 2018 Publication History

Abstract

This paper presents a new method, which we call SUSTain, that extends real-valued matrix and tensor factorizations to data where values are integers. Such data are common when the values correspond to event counts or ordinal measures. The conventional approach is to treat integer data as real, and then apply real-valued factorizations. However, doing so fails to preserve important characteristics of the original data, thereby making it hard to interpret the results. Instead, our approach extracts factor values from integer datasets as scores that are constrained to take values from a small integer set. These scores are easy to interpret: a score of zero indicates no feature contribution and higher scores indicate distinct levels of feature importance. At its core, SUSTain relies on: a) a problem partitioning into integer-constrained subproblems, so that they can be optimally solved in an efficient manner; and b) organizing the order of the subproblems' solution, to promote reuse of shared intermediate results. We propose two variants, SUSTain_M and SUSTain_T, to handle both matrix and tensor inputs, respectively. We evaluate SUSTain against several state-of-the-art baselines on both synthetic and real Electronic Health Record (EHR) datasets. Comparing to those baselines, SUSTain shows either significantly better fit or orders of magnitude speedups that achieve a comparable fit (up to 425× faster). We apply SUSTain to EHR datasets to extract patient phenotypes (i.e., clinically meaningful patient clusters). Furthermore, 87% of them were validated as clinically meaningful phenotypes related to heart failure by a cardiologist.

Supplementary Material

MP4 File (perros_sustain_unsupervised.mp4)

References

[1]
2017. Clinical Classifications Software (CCS) for ICD-9-CM. https://www.hcup-us. ahrq.gov/toolssoftware/ccs/ccs.jsp. (2017). Accessed: 2017-02--11.
[2]
Evrim Acar, Canan Aykut-Bingol, Haluk Bingol, Rasmus Bro, and Bülent Yener. 2007. Multiway analysis of epilepsy tensors. Bioinformatics 23, 13 (2007), i10--i18.
[3]
Brett W Bader and Tamara G Kolda. 2007. Efficient MATLAB computations with sparse and factored tensors. SIAM Journal on Scientific Computing 30, 1 (2007), 205--231.
[4]
Brett W. Bader, Tamara G. Kolda, and others. 2015. MATLAB Tensor Toolbox Version 2.6. Available online. (February 2015). http://www.sandia.gov/~tgkolda/ TensorToolbox/
[5]
Stephen Boyd and Lieven Vandenberghe. 2004. Convex optimization. Cambridge university press.
[6]
Stephen Breen and Xiao-Wen Chang. 2012. Column Reordering for Box- Constrained Integer Least Squares Problems. (April 2012).
[7]
John Brewer. 1978. Kronecker products and matrix calculus in system theory. IEEE Transactions on circuits and systems 25, 9 (1978), 772--781.
[8]
Rasmus Bro and Nicholaos D Sidiropoulos. 1998. Least squares algorithms under unimodality and non-negativity constraints. Journal of Chemometrics 12, 4 (1998), 223--247.
[9]
J Douglas Carroll and Jih-Jie Chang. 1970. Analysis of individual differences in multidimensional scaling via an N-way generalization of "Eckart-Young" decomposition. Psychometrika 35, 3 (1970), 283--319.
[10]
Xiao-Wen Chang and Tianyang Zhou. 2007. MILES: MATLAB package for solving mixed integer least squares problems. GPS Solutions 11, 4 (2007), 289--294. Last updated: June 2016.
[11]
Edward Choi, Andy Schuetz, Walter F Stewart, and Jimeng Sun. 2016. Using recurrent neural network models for early detection of heart failure onset. Journal of the American Medical Informatics Association 24, 2 (2016), 361--370.
[12]
Andrzej Cichocki and Anh-Huy Phan. 2009. Fast local algorithms for large scale nonnegative matrix and tensor factorizations. IEICE transactions on fundamentals of electronics, communications and computer sciences 92, 3 (2009), 708--721.
[13]
Joshua C Denny, Marylyn D Ritchie, Melissa A Basford, Jill M Pulley, Lisa Bastarache, Kristin Brown-Gentry, Deede Wang, Dan R Masys, Dan M Roden, and Dana C Crawford. 2010. PheWAS: demonstrating the feasibility of a phenomewide scan to discover gene--disease associations. Bioinformatics 26, 9 (2010), 1205--1210.
[14]
Chris Ding, Tao Li, Wei Peng, and Haesun Park. 2006. Orthogonal nonnegative matrix t-factorizations for clustering. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 126--135.
[15]
Bo Dong, Matthew M Lin, and Haesun Park. 2017. Integer Matrix Approximation and Data Mining. Journal of scientific computing (Sept. 2017), 1--27.
[16]
Nicolas Gillis and others. 2011. Nonnegative matrix factorization: Complexity, algorithms and applications. Unpublished doctoral dissertation, Université catholique de Louvain. Louvain-La-Neuve: CORE (2011).
[17]
Gene H Golub and Charles F Van Loan. 2013. Matrix Computations. Vol. 3. JHU Press.
[18]
Richard A Harshman. 1970. Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multi-modal factor analysis. (1970).
[19]
J Henderson, J C Ho, A N Kho, J C Denny, B A Malin, J Sun, and J Ghosh. 2017. Granite: Diversified, Sparse Tensor Factorization for Electronic Health Record-Based Phenotyping. In 2017 IEEE International Conference on Healthcare Informatics (ICHI). 214--223.
[20]
Frank L Hitchcock. 1927. The expression of a tensor or a polyadic as a sum of products. Studies in Applied Mathematics 6, 1--4 (1927), 164--189.
[21]
Joyce C Ho, Joydeep Ghosh, Steve R Steinhubl,Walter F Stewart, Joshua C Denny, Bradley A Malin, and Jimeng Sun. 2014. Limestone: high-throughput candidate phenotype generation via tensor factorization. Journal of biomedical informatics 52 (Dec. 2014), 199--211.
[22]
Joyce C Ho, Joydeep Ghosh, and Jimeng Sun. 2014. Marble: High-throughput Phenotyping from Electronic Health Records via Sparse Nonnegative Tensor Factorization. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '14). ACM, New York, NY, USA, 115--124.
[23]
Ngoc-Diep Ho. 2008. Nonnegative matrix factorization algorithms and applications. Ph.D. Dissertation. PhD thesis, Université catholique de Louvain.
[24]
Shalmali Joshi, Suriya Gunasekar, David Sontag, and Ghosh Joydeep. 2016. Identifiable Phenotyping using Constrained Non-Negative Matrix Factorization. In Machine Learning for Healthcare Conference. 17--41.
[25]
R. Kannan, G. Ballard, and H. Park. 2018. MPI-FAUN: An MPI-Based Framework for Alternating-Updating Nonnegative Matrix Factorization. IEEE Transactions on Knowledge and Data Engineering 30, 3 (March 2018), 544--558.
[26]
Alexandros Karatzoglou, Xavier Amatriain, Linas Baltrunas, and Nuria Oliver. 2010. Multiverse recommendation: n-dimensional tensor factorization for contextaware collaborative filtering. In Proceedings of the fourth ACM conference on Recommender systems. ACM, 79--86.
[27]
Jingu Kim, Yunlong He, and Haesun Park. 2014. Algorithms for Nonnegative Matrix and Tensor Factorizations: A Unified View Based on Block Coordinate Descent Framework. Journal of Global Optimization 58, 2 (Feb. 2014), 285--319.
[28]
Jingu Kim and Haesun Park. 2011. Fast nonnegative matrix factorization: An active-set-like method and comparisons. SIAM Journal on Scientific Computing 33, 6 (2011), 3261--3281.
[29]
Tamara G Kolda and BrettWBader. 2009. Tensor decompositions and applications. SIAM review 51, 3 (2009), 455--500.
[30]
Tamara G Kolda and Dianne P O'Leary. 1998. A Semidiscrete Matrix Decomposition for Latent Semantic Indexing Information Retrieval. ACM Transactions on Information and System Security 16, 4 (Oct. 1998), 322--346.
[31]
Tamara G Kolda and Jimeng Sun. 2008. Scalable tensor decompositions for multiaspect data mining. In Data Mining, 2008. ICDM'08. Eighth IEEE International Conference on. IEEE, 363--372.
[32]
Mehmet Koyuturk, Ananth Grama, and Naren Ramakrishnan. 2005. Compression, clustering, and pattern discovery in very high-dimensional discrete-attribute data sets. IEEE Transactions on Knowledge and Data Engineering 17, 4 (2005), 447--461.
[33]
Mehmet Koyutürk, Ananth Grama, and Naren Ramakrishnan. 2006. Nonorthogonal decomposition of binary matrices for bounded-error data compression and analysis. ACM Transactions on Mathematical Software (TOMS) 32, 1 (2006), 33--69.
[34]
Joseph B Leader, Sarah A Pendergrass, Anurag Verma, David J Carey, Dustin N Hartzel, Marylyn D Ritchie, and H Lester Kirchner. 2015. Contrasting association results between existing PheWAS phenotype definition methods and five validated electronic phenotypes. In AMIA Annual Symposium Proceedings, Vol. 2015. American Medical Informatics Association, 824.
[35]
D D Lee and H S Seung. 1999. Learning the parts of objects by non-negative matrix factorization. Nature 401, 6755 (Oct. 1999), 788--791.
[36]
Defu Lian, Rui Liu, Yong Ge, Kai Zheng, Xing Xie, and Longbing Cao. 2017. Discrete Content-aware Matrix Factorization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 325--334.
[37]
Pauli Miettinen. 2011. Boolean tensor factorizations. In Data Mining (ICDM), 2011 IEEE 11th International Conference on. IEEE, 447--456.
[38]
P Miettinen, T Mielikäinen, A Gionis, G Das, and H Mannila. 2008. The Discrete Basis Problem. IEEE transactions on knowledge and data engineering 20, 10 (Oct. 2008), 1348--1362.
[39]
Gerhard Nahler. 2009. anatomical therapeutic chemical classification system (ATC). Springer Vienna, Vienna, 8--8.
[40]
D O'Leary and S Peleg. 1983. Digital Image Compression by Outer Product Expansion. IEEE Transactions on Communications 31, 3 (March 1983), 441--444.
[41]
Ioakeim Perros, Robert Chen, Richard Vuduc, and Jimeng Sun. 2015. Sparse hierarchical tucker factorization and its application to healthcare. In Data Mining (ICDM), 2015 IEEE International Conference on. IEEE, 943--948.
[42]
Ioakeim Perros, Evangelos E Papalexakis, Fei Wang, Richard Vuduc, Elizabeth Searles, Michael Thompson, and Jimeng Sun. 2017. SPARTan: Scalable PARAFAC2 for Large &Sparse Data. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '17). ACM, 375--384.
[43]
Rachel L Richesson, Jimeng Sun, Jyotishman Pathak, Abel N Kho, and Joshua C Denny. 2016. Clinical phenotyping in selected national networks: demonstrating the need for high-throughput, portable, and computational methods. Artificial intelligence in medicine 71 (July 2016), 57--61.
[44]
Bao-Hong Shen, Shuiwang Ji, and Jieping Ye. 2009. Mining Discrete Patterns via Binary Matrix Factorization. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '09). ACM, New York, NY, USA, 757--766.
[45]
Vergil N Slee. 1978. The International classification of diseases: ninth revision (ICD-9). Annals of internal medicine 88, 3 (1978), 424--426.
[46]
Age Smilde, Rasmus Bro, and Paul Geladi. 2005. Multi-way analysis: applications in the chemical sciences. John Wiley &Sons.
[47]
Reza Takapoui, Nicholas Moehle, Stephen Boyd, and Alberto Bemporad. 2017. A simple effective heuristic for embedded mixed-integer quadratic programming. Internat. J. Control (2017), 1--11.
[48]
Berk Ustun and Cynthia Rudin. 2017. Optimized Risk Scores. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1125--1134.
[49]
X w. Chang and Q Han. 2008. Solving Box-Constrained Integer Least Squares Problems. IEEE Transactions on Wireless Communications 7, 1 (Jan. 2008), 277--287.
[50]
Yichen Wang, Robert Chen, Joydeep Ghosh, Joshua C Denny, Abel Kho, You Chen, Bradley A Malin, and Jimeng Sun. 2015. Rubik: Knowledge Guided Tensor Factorization and Completion for Health Data Analytics. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '15). ACM, New York, NY, USA, 1265--1274.
[51]
Siqi Wu, Antony Joseph, Ann S Hammonds, Susan E Celniker, Bin Yu, and Erwin Frise. 2016. Stability-driven nonnegative matrix factorization to interpret spatial gene expression and build local gene networks. Proceedings of the National Academy of Sciences 113, 16 (2016), 4290--4295.
[52]
Pranjul Yadav, Michael Steinbach, Vipin Kumar, and Gyorgy Simon. 2018. Mining Electronic Health Records (EHRs): A Survey. ACM Computing Surveys (CSUR) 50, 6 (2018), 85.
[53]
Z Zhang, T Li, C Ding, and X Zhang. 2007. Binary Matrix Factorization with Applications. In Seventh IEEE International Conference on Data Mining (ICDM 2007). 391--400.

Cited By

View all

Index Terms

  1. SUSTain: Scalable Unsupervised Scoring for Tensors and its Application to Phenotyping

      Recommendations

      Comments

      Information & Contributors

      Information

      Published In

      cover image ACM Other conferences
      KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
      July 2018
      2925 pages
      ISBN:9781450355520
      DOI:10.1145/3219819
      This work is licensed under a Creative Commons Attribution-ShareAlike International 4.0 License.

      Sponsors

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 19 July 2018

      Permissions

      Request permissions for this article.

      Check for updates

      Author Tags

      1. matrix factorization
      2. phenotyping
      3. tensor factorization
      4. unsupervised learning

      Qualifiers

      • Research-article

      Funding Sources

      Conference

      KDD '18
      Sponsor:

      Acceptance Rates

      KDD '18 Paper Acceptance Rate 107 of 983 submissions, 11%;
      Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months)85
      • Downloads (Last 6 weeks)14
      Reflects downloads up to 09 Jan 2025

      Other Metrics

      Citations

      Cited By

      View all

      View Options

      View options

      PDF

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      Login options

      Media

      Figures

      Other

      Tables

      Share

      Share

      Share this Publication link

      Share on social media