Article
DOI: 10.1145/872757.872795

Efficient similarity search and classification via rank aggregation

Published: 09 June 2003

Abstract

We propose a novel approach to performing efficient similarity search and classification in high dimensional data. In this framework, the database elements are vectors in a Euclidean space. Given a query vector in the same space, the goal is to find elements of the database that are similar to the query. In our approach, a small number of independent "voters" rank the database elements based on similarity to the query. These rankings are then combined by a highly efficient aggregation algorithm. Our methodology leads both to techniques for computing approximate nearest neighbors and to a conceptually rich alternative to nearest neighbors.

One instantiation of our methodology is as follows. Each voter projects all the vectors (database elements and the query) on a random line (different for each voter), and ranks the database elements based on the proximity of the projections to the projection of the query. The aggregation rule picks the database element that has the best median rank. This combination has several appealing features. On the theoretical side, we prove that with high probability, it produces a result that is a (1 + ε) factor approximation to the Euclidean nearest neighbor. On the practical side, it turns out to be extremely efficient, often exploring no more than 5% of the data to obtain very high-quality results. This method is also database-friendly, in that it accesses data primarily in a pre-defined order without random accesses, and, unlike other methods for approximate nearest neighbors, requires almost no extra storage. Also, we extend our approach to deal with the k nearest neighbors.

We conduct two sets of experiments to evaluate the efficacy of our methods. Our experiments include two scenarios where nearest neighbors are typically employed: similarity search and classification problems. In both cases, we study the performance of our methods with respect to several evaluation criteria, and conclude that they are uniformly excellent, both in terms of quality of results and in terms of efficiency.
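
The instantiation described in the abstract (random-line voters combined by best median rank) can be illustrated with a minimal Python sketch. This is not the authors' implementation: it is a brute-force version that builds every voter's full ranking, so it does not reproduce the efficient aggregation algorithm or its sequential access pattern. The function names, the default of 15 voters, and the use of a normalized Gaussian vector to pick each random line are assumptions made for illustration.

```python
import random
from statistics import median


def random_unit_vector(dim, rng):
    """Pick a random direction by normalizing a Gaussian vector (an assumed choice)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v]


def dot(u, v):
    """Inner product, used to project a vector onto a line through the origin."""
    return sum(a * b for a, b in zip(u, v))


def median_rank_nearest(database, query, num_voters=15, seed=0):
    """Return the index of the database vector with the best (lowest) median rank.

    Brute-force illustration: every voter ranks the whole database.
    """
    rng = random.Random(seed)
    dim = len(query)
    ranks = [[] for _ in database]  # ranks[i] collects one rank per voter
    for _ in range(num_voters):
        line = random_unit_vector(dim, rng)  # a different random line per voter
        q_proj = dot(query, line)
        # The voter orders elements by how close their projection is to the query's.
        order = sorted(
            range(len(database)),
            key=lambda i: abs(dot(database[i], line) - q_proj),
        )
        for rank, i in enumerate(order):
            ranks[i].append(rank)
    # Aggregation rule: pick the element whose median rank across voters is best.
    return min(range(len(database)), key=lambda i: median(ranks[i]))
```

Calling `median_rank_nearest(database, query)` with a list of equal-length numeric vectors returns the position of the selected element. Using the median rather than the mean means a minority of voters whose random line happens to place a far-away element close to the query cannot, on their own, decide the result.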

Published In

SIGMOD '03: Proceedings of the 2003 ACM SIGMOD international conference on Management of data
June 2003
702 pages
ISBN: 158113634X
DOI: 10.1145/872757

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SIGMOD/PODS03

Acceptance Rates

SIGMOD '03 Paper Acceptance Rate 53 of 342 submissions, 15%;
Overall Acceptance Rate 785 of 4,003 submissions, 20%
