[PDF][PDF] A cost model for similarity queries in metric spaces

P Ciaccia, M Patella, P Zezula - Proceedings of the seventeenth ACM …, 1998 - dl.acm.org
Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on …, 1998dl.acm.org
Wu consider tho problem of estimating CPU (distance computntlons) nnd I/O costs for
processing range and k-nearest neighbors qucrics over metric spaces. Unlike the specific
case of vector spaces, where information on data distribution has been exploited to derive
cost models for predicting the performanco of multi-dimensional access methods, in a
generic metric space there is no such a possibility, which makes the problem quite different
and requires a novel approach. We insist that the distance distribution of objects can be …
Abstract
Wu consider tho problem of estimating CPU (distance computntlons) nnd I/O costs for processing range and k-nearest neighbors qucrics over metric spaces. Unlike the specific case of vector spaces, where information on data distribution has been exploited to derive cost models for predicting the performanco of multi-dimensional access methods, in a generic metric space there is no such a possibility, which makes the problem quite different and requires a novel approach. We insist that the distance distribution of objects can be profitably used to solve the problem, and consequently develop a concrete cost model for the M-tree access method [lo]. Our results rely on the assumption that the indexed dataset comes from a metric space which is “homogeneous” enough (in a probabilistic sense) to allow reliable cost estimations even if the distance distribution with respect to a specific query object is unknown. We experimentally validate the modol ovor both real and synthetic datasets, and show how the model can be used to tune the M-tree in order to minimlzo a combination of CPU and I/O costs. Finally, we sketch how the same approach can be applied to derive a cost model for the up-tree index structure [8].
ACM Digital Library