Research article · DOI: 10.1145/2902251.2902285

On the Complexity of Inner Product Similarity Join

Published: 15 June 2016

Abstract

A number of tasks in classification, information retrieval, recommendation systems, and record linkage reduce to the core problem of inner product similarity join (IPS join): identifying pairs of vectors in a collection that have a sufficiently large inner product. IPS join is well understood when vectors are normalized and some approximation of inner products is allowed. However, the general case where vectors may have any length appears much more challenging. Recently, new upper bounds based on asymmetric locality-sensitive hashing (ALSH) and asymmetric embeddings have emerged, but little has been known on the lower bound side. In this paper we initiate a systematic study of inner product similarity join, showing new lower and upper bounds. Our main results are:

  • Approximation hardness of IPS join in subquadratic time, assuming the strong exponential time hypothesis.
  • New upper and lower bounds for (A)LSH-based algorithms. In particular, we show that asymmetry can be avoided by relaxing the LSH definition to only consider the collision probability of distinct elements.
  • A new indexing method for IPS based on linear sketches, implying that our hardness results are not far from being tight.
Our technical contributions include new asymmetric embeddings that may be of independent interest. At the conceptual level we strive to provide greater clarity, for example by distinguishing between signed and unsigned variants of IPS join and shedding new light on the effect of asymmetry.
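To make the core problem concrete, the sketch below is a minimal illustration, not taken from the paper; the NumPy setup and the names ips_join_naive and asymmetric_embed are assumptions made here for exposition. It shows the brute-force quadratic-time IPS join that any subquadratic algorithm must beat, together with one simple asymmetric embedding of the general flavor discussed above: data vectors and query vectors are mapped differently onto the unit sphere so that, for each fixed query, ranking by inner product becomes ranking by cosine similarity, to which standard LSH for angular distance applies.

    import numpy as np

    def ips_join_naive(X, Y, s):
        """Exact IPS join: return all index pairs (i, j) with <X[i], Y[j]> >= s.
        Quadratic time; included only to pin down the problem being studied."""
        G = X @ Y.T                          # all pairwise inner products
        return list(zip(*np.where(G >= s)))

    def asymmetric_embed(X, Y):
        """Illustrative asymmetric embedding onto the unit sphere (an assumption
        of this sketch, not the paper's construction). Data vectors are scaled so
        that max ||x|| <= 1 and padded with sqrt(1 - ||x||^2); query vectors are
        normalized and padded with 0. Then <P(x), Q(y)> = <x, y> / (M * ||y||),
        so for each fixed query y the ordering of data points by inner product
        equals the ordering by cosine similarity of the embedded unit vectors."""
        M = np.linalg.norm(X, axis=1).max()
        Xs = X / M                                           # now ||x|| <= 1
        pad = 1.0 - np.sum(Xs**2, axis=1, keepdims=True)
        P = np.hstack([Xs, np.sqrt(np.maximum(pad, 0.0))])   # clip roundoff
        Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
        Q = np.hstack([Yn, np.zeros((Y.shape[0], 1))])
        return P, Q, M

In this sketch, a subquadratic candidate-generation step would hash the rows of P and Q with an angular LSH family (e.g. SimHash) and verify the candidate pairs exactly; the hardness results above delimit how much of the brute-force cost such schemes can hope to remove.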

Published In

PODS '16: Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems
June 2016
504 pages
ISBN:9781450341912
DOI:10.1145/2902251
  • General Chair: Tova Milo
  • Program Chair: Wang-Chiew Tan

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. LSH
  2. approximation hardness
  3. asymmetric embeddings
  4. inner product
  5. joins

Qualifiers

  • Research-article

Conference

SIGMOD/PODS '16: International Conference on Management of Data
June 26 - July 1, 2016
San Francisco, California, USA

Acceptance Rates

PODS '16 paper acceptance rate: 31 of 94 submissions (33%)
Overall acceptance rate: 642 of 2,707 submissions (24%)
