Research article · DOI: 10.1145/2902251.2902285

On the Complexity of Inner Product Similarity Join

Published: 15 June 2016

Abstract

A number of tasks in classification, information retrieval, recommendation systems, and record linkage reduce to the core problem of inner product similarity join (IPS join): identifying pairs of vectors in a collection that have a sufficiently large inner product. IPS join is well understood when vectors are normalized and some approximation of inner products is allowed. However, the general case where vectors may have any length appears much more challenging. Recently, new upper bounds based on asymmetric locality-sensitive hashing (ALSH) and asymmetric embeddings have emerged, but little has been known on the lower bound side. In this paper we initiate a systematic study of inner product similarity join, showing new lower and upper bounds. Our main results are:

  • Approximation hardness of IPS join in subquadratic time, assuming the strong exponential time hypothesis.
  • New upper and lower bounds for (A)LSH-based algorithms. In particular, we show that asymmetry can be avoided by relaxing the LSH definition to only consider the collision probability of distinct elements.
  • A new indexing method for IPS based on linear sketches, implying that our hardness results are not far from being tight.
Our technical contributions include new asymmetric embeddings that may be of independent interest. At the conceptual level we strive to provide greater clarity, for example by distinguishing between signed and unsigned variants of IPS join and shedding new light on the effect of asymmetry.
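To make the core problem concrete, the sketch below is a minimal illustration, not taken from the paper; the NumPy setup and the names ips_join_naive and asymmetric_embed are assumptions made here for exposition. It shows the brute-force quadratic-time IPS join that any subquadratic algorithm must beat, together with one simple asymmetric embedding of the general flavor discussed above: data vectors and query vectors are mapped differently onto the unit sphere so that, for each fixed query, ranking by inner product becomes ranking by cosine similarity, to which standard LSH for angular distance applies.

    import numpy as np

    def ips_join_naive(X, Y, s):
        """Exact IPS join: return all index pairs (i, j) with <X[i], Y[j]> >= s.
        Quadratic time; included only to pin down the problem being studied."""
        G = X @ Y.T                          # all pairwise inner products
        return list(zip(*np.where(G >= s)))

    def asymmetric_embed(X, Y):
        """Illustrative asymmetric embedding onto the unit sphere (an assumption
        of this sketch, not the paper's construction). Data vectors are scaled so
        that max ||x|| <= 1 and padded with sqrt(1 - ||x||^2); query vectors are
        normalized and padded with 0. Then <P(x), Q(y)> = <x, y> / (M * ||y||),
        so for each fixed query y the ordering of data points by inner product
        equals the ordering by cosine similarity of the embedded unit vectors."""
        M = np.linalg.norm(X, axis=1).max()
        Xs = X / M                                           # now ||x|| <= 1
        pad = 1.0 - np.sum(Xs**2, axis=1, keepdims=True)
        P = np.hstack([Xs, np.sqrt(np.maximum(pad, 0.0))])   # clip roundoff
        Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
        Q = np.hstack([Yn, np.zeros((Y.shape[0], 1))])
        return P, Q, M

In this sketch, a subquadratic candidate-generation step would hash the rows of P and Q with an angular LSH family (e.g. SimHash) and verify the candidate pairs exactly; the hardness results above delimit how much of the brute-force cost such schemes can hope to remove.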

Published In

PODS '16: Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems
June 2016
504 pages
ISBN:9781450341912
DOI:10.1145/2902251
  • General Chair: Tova Milo
  • Program Chair: Wang-Chiew Tan

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. LSH
  2. approximation hardness
  3. asymmetric embeddings
  4. inner product
  5. joins

Qualifiers

  • Research-article

Conference

SIGMOD/PODS '16: International Conference on Management of Data
June 26 - July 1, 2016
San Francisco, California, USA

Acceptance Rates

PODS '16 paper acceptance rate: 31 of 94 submissions (33%)
Overall acceptance rate: 642 of 2,707 submissions (24%)
