DOI: 10.1145/3394486.3403094
Research Article · Public Access

Geodesic Forests

Published: 20 August 2020

Abstract

Together with the curse of dimensionality, nonlinear dependencies in large data sets persist as major challenges in data mining tasks. A reliable way to accurately preserve nonlinear structure is to compute geodesic distances between data points. Manifold learning methods, such as Isomap, aim to preserve geodesic distances in a Riemannian manifold. However, as manifold learning algorithms operate on the ambient dimensionality of the data, the essential step of geodesic distance computation is sensitive to high-dimensional noise. Therefore, a direct application of these algorithms to high-dimensional, noisy data often yields unsatisfactory results and does not accurately capture nonlinear structure.
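
For orientation, here is a minimal sketch of the classical graph-based geodesic estimate that Isomap relies on, not the method proposed in this paper: build a k-nearest-neighbor graph in the ambient space and take shortest-path lengths as geodesic estimates. The toy data, the choice of k, and the function name are illustrative; the ambient-space k-NN graph built here is exactly what high-dimensional noise corrupts.

```python
# Sketch: graph-based geodesic distance estimation (the Isomap-style step
# described above). Shortest-path lengths over an ambient-space k-NN graph
# approximate manifold geodesics; high-dimensional noise corrupts this graph,
# which is the failure mode the abstract refers to.
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def knn_geodesic_distances(X, n_neighbors=10):
    # Sparse graph with Euclidean edge weights to each point's k nearest neighbors.
    knn = kneighbors_graph(X, n_neighbors=n_neighbors, mode="distance")
    # All-pairs shortest paths over the symmetrized graph (Dijkstra).
    return shortest_path(knn, method="D", directed=False)

# Toy example: a 1-D helix embedded in 50 ambient dimensions plus noise.
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.0, 3.0 * np.pi, 200))
X = np.zeros((200, 50))
X[:, 0], X[:, 1], X[:, 2] = np.cos(t), np.sin(t), t
X += 0.05 * rng.standard_normal(X.shape)  # ambient noise on every dimension
D_hat = knn_geodesic_distances(X, n_neighbors=10)
```
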
We propose geodesic forests (GF), an unsupervised random forest approach for estimating geodesic distances in linear and nonlinear manifolds with noise. GF operates on low-dimensional sparse linear combinations of features, rather than the full observed dimensionality. To choose the optimal split in a computationally efficient fashion, we developed Fast-BIC, a fast Bayesian Information Criterion statistic for Gaussian mixture models.
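
As a rough sketch of this kind of split criterion (an illustration under our own simplifications, not the paper's exact Fast-BIC derivation, which avoids refitting at every candidate cut), one can score each threshold along a one-dimensional projection by the BIC of the two-component Gaussian fit it induces and keep the threshold with the lowest score. The function names below are ours.

```python
import numpy as np

def gaussian_loglik(x):
    # Log-likelihood of x under one Gaussian with plug-in (MLE) mean and variance.
    var = max(float(np.var(x)), 1e-12)  # guard against a zero-variance side
    return -0.5 * len(x) * (np.log(2.0 * np.pi * var) + 1.0)

def bic_best_split(z):
    """Scan candidate thresholds along sorted 1-D projections z and return the
    (BIC score, threshold) of the best cut, modeling the two sides as a
    hard-assignment two-component Gaussian mixture. Naive O(n^2) version;
    a fast variant would update running sums instead of refitting each side."""
    z = np.sort(np.asarray(z, dtype=float))
    n = len(z)
    best_bic, best_threshold = np.inf, None
    for i in range(2, n - 1):  # keep at least two points on each side
        left, right = z[:i], z[i:]
        loglik = (gaussian_loglik(left) + i * np.log(i / n)
                  + gaussian_loglik(right) + (n - i) * np.log((n - i) / n))
        n_params = 5  # two means, two variances, one mixing weight
        bic = -2.0 * loglik + n_params * np.log(n)
        if bic < best_bic:
            best_bic, best_threshold = bic, 0.5 * (z[i - 1] + z[i])
    return best_bic, best_threshold

# Example: a clearly bimodal 1-D projection.
rng = np.random.default_rng(0)
z = np.concatenate([rng.normal(-2.0, 1.0, 100), rng.normal(3.0, 1.0, 100)])
score, threshold = bic_best_split(z)
```
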
We additionally propose geodesic precision and geodesic recall as novel evaluation metrics that quantify how well the geodesic distances of a latent manifold are preserved. Empirical results on simulated and real data demonstrate that GF is robust to high-dimensional noise, whereas other methods, such as Isomap, UMAP, and FLANN, quickly deteriorate in such settings. Notably, GF is able to estimate geodesic distances better than other approaches on a real connectome dataset.
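
The abstract does not spell out the formal definitions of these metrics; one plausible neighborhood-based reading, given here only as an assumption-laden sketch, scores how many of each point's k nearest neighbors under the true latent geodesic distances are recovered among its k nearest neighbors under the estimated distances.

```python
import numpy as np

def geodesic_recall_at_k(D_true, D_est, k=10):
    """Average fraction of each point's k nearest neighbors under the true
    latent geodesic distances that also appear among its k nearest neighbors
    under the estimated distances. Illustrative definition only."""
    n = D_true.shape[0]
    recalls = []
    for i in range(n):
        true_nn = set(np.argsort(D_true[i])[1:k + 1])  # rank 0 is the point itself
        est_nn = set(np.argsort(D_est[i])[1:k + 1])
        recalls.append(len(true_nn & est_nn) / k)
    return float(np.mean(recalls))
```

With equal neighborhood sizes this quantity coincides with the analogous precision; presumably the paper's precision and recall differ in which neighbor count is held fixed.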

Published In

KDD '20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
August 2020, 3664 pages
ISBN: 9781450379984
DOI: 10.1145/3394486

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

  1. manifold learning
  2. noisy data
  3. random forest
  4. unsupervised

Qualifiers

  • Research-article

Funding Sources

  • DARPA

Conference

KDD '20

Acceptance Rates

Overall acceptance rate: 1,133 of 8,635 submissions, 13%
