DOI: 10.1145/3394486.3403094
Research Article · Public Access

Geodesic Forests

Published: 20 August 2020

Abstract

Together with the curse of dimensionality, nonlinear dependencies in large data sets persist as major challenges in data mining tasks. A reliable way to accurately preserve nonlinear structure is to compute geodesic distances between data points. Manifold learning methods, such as Isomap, aim to preserve geodesic distances in a Riemannian manifold. However, as manifold learning algorithms operate on the ambient dimensionality of the data, the essential step of geodesic distance computation is sensitive to high-dimensional noise. Therefore, a direct application of these algorithms to high-dimensional, noisy data often yields unsatisfactory results and does not accurately capture nonlinear structure.
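
For orientation, here is a minimal sketch of the classical graph-based geodesic estimate that Isomap relies on, not the method proposed in this paper: build a k-nearest-neighbor graph in the ambient space and take shortest-path lengths as geodesic estimates. The toy data, the choice of k, and the function name are illustrative; the ambient-space k-NN graph built here is exactly what high-dimensional noise corrupts.

```python
# Sketch: graph-based geodesic distance estimation (the Isomap-style step
# described above). Shortest-path lengths over an ambient-space k-NN graph
# approximate manifold geodesics; high-dimensional noise corrupts this graph,
# which is the failure mode the abstract refers to.
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def knn_geodesic_distances(X, n_neighbors=10):
    # Sparse graph with Euclidean edge weights to each point's k nearest neighbors.
    knn = kneighbors_graph(X, n_neighbors=n_neighbors, mode="distance")
    # All-pairs shortest paths over the symmetrized graph (Dijkstra).
    return shortest_path(knn, method="D", directed=False)

# Toy example: a 1-D helix embedded in 50 ambient dimensions plus noise.
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.0, 3.0 * np.pi, 200))
X = np.zeros((200, 50))
X[:, 0], X[:, 1], X[:, 2] = np.cos(t), np.sin(t), t
X += 0.05 * rng.standard_normal(X.shape)  # ambient noise on every dimension
D_hat = knn_geodesic_distances(X, n_neighbors=10)
```
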
We propose geodesic forests (GF), an unsupervised random forest approach for estimating geodesic distances in linear and nonlinear manifolds with noise. GF operates on low-dimensional sparse linear combinations of features, rather than the full observed dimensionality. To choose the optimal split in a computationally efficient fashion, we developed Fast-BIC, a fast Bayesian Information Criterion statistic for Gaussian mixture models.
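
As a rough sketch of this kind of split criterion (an illustration under our own simplifications, not the paper's exact Fast-BIC derivation, which avoids refitting at every candidate cut), one can score each threshold along a one-dimensional projection by the BIC of the two-component Gaussian fit it induces and keep the threshold with the lowest score. The function names below are ours.

```python
import numpy as np

def gaussian_loglik(x):
    # Log-likelihood of x under one Gaussian with plug-in (MLE) mean and variance.
    var = max(float(np.var(x)), 1e-12)  # guard against a zero-variance side
    return -0.5 * len(x) * (np.log(2.0 * np.pi * var) + 1.0)

def bic_best_split(z):
    """Scan candidate thresholds along sorted 1-D projections z and return the
    (BIC score, threshold) of the best cut, modeling the two sides as a
    hard-assignment two-component Gaussian mixture. Naive O(n^2) version;
    a fast variant would update running sums instead of refitting each side."""
    z = np.sort(np.asarray(z, dtype=float))
    n = len(z)
    best_bic, best_threshold = np.inf, None
    for i in range(2, n - 1):  # keep at least two points on each side
        left, right = z[:i], z[i:]
        loglik = (gaussian_loglik(left) + i * np.log(i / n)
                  + gaussian_loglik(right) + (n - i) * np.log((n - i) / n))
        n_params = 5  # two means, two variances, one mixing weight
        bic = -2.0 * loglik + n_params * np.log(n)
        if bic < best_bic:
            best_bic, best_threshold = bic, 0.5 * (z[i - 1] + z[i])
    return best_bic, best_threshold

# Example: a clearly bimodal 1-D projection.
rng = np.random.default_rng(0)
z = np.concatenate([rng.normal(-2.0, 1.0, 100), rng.normal(3.0, 1.0, 100)])
score, threshold = bic_best_split(z)
```
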
We additionally propose geodesic precision and geodesic recall as novel evaluation metrics that quantify how well the geodesic distances of a latent manifold are preserved. Empirical results on simulated and real data demonstrate that GF is robust to high-dimensional noise, whereas other methods, such as Isomap, UMAP, and FLANN, quickly deteriorate in such settings. Notably, GF is able to estimate geodesic distances better than other approaches on a real connectome dataset.
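
The abstract does not spell out the formal definitions of these metrics; one plausible neighborhood-based reading, given here only as an assumption-laden sketch, scores how many of each point's k nearest neighbors under the true latent geodesic distances are recovered among its k nearest neighbors under the estimated distances.

```python
import numpy as np

def geodesic_recall_at_k(D_true, D_est, k=10):
    """Average fraction of each point's k nearest neighbors under the true
    latent geodesic distances that also appear among its k nearest neighbors
    under the estimated distances. Illustrative definition only."""
    n = D_true.shape[0]
    recalls = []
    for i in range(n):
        true_nn = set(np.argsort(D_true[i])[1:k + 1])  # rank 0 is the point itself
        est_nn = set(np.argsort(D_est[i])[1:k + 1])
        recalls.append(len(true_nn & est_nn) / k)
    return float(np.mean(recalls))
```

With equal neighborhood sizes this quantity coincides with the analogous precision; presumably the paper's precision and recall differ in which neighbor count is held fixed.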

Published In

KDD '20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
August 2020, 3664 pages
ISBN: 9781450379984
DOI: 10.1145/3394486

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

  1. manifold learning
  2. noisy data
  3. random forest
  4. unsupervised

Qualifiers

  • Research-article

Funding Sources

  • DARPA

Conference

KDD '20

Acceptance Rates

Overall acceptance rate: 1,133 of 8,635 submissions, 13%
