skip to main content
article

Indexing methods for approximate dictionary searching: Comparative analysis

Published: 28 May 2011 Publication History

Abstract

The primary goal of this article is to survey state-of-the-art indexing methods for approximate dictionary searching. To improve understanding of the field, we introduce a taxonomy that classifies all methods into direct methods and sequence-based filtering methods. We focus on infrequently updated dictionaries, which are used primarily for retrieval. Therefore, we consider indices that are optimized for retrieval rather than for update. The indices are assumed to be associative, that is, capable of storing and retrieving auxiliary information, such as string identifiers. All solutions are lossless and guarantee retrieval of strings within a specified edit distance k. Benchmark results are presented for the practically important cases of k=1, 2, and 3. We concentrate on natural language datasets, which include synthetic English and Russian dictionaries, as well as dictionaries of frequent words extracted from the ClueWeb09 collection. In addition, we carry out experiments with dictionaries containing DNA sequences. The article is concluded with a discussion of benchmark results and directions for future research.

Supplementary Material

BZ2 File (datasetsclueweb09_v1.0.tar.bz2)
Supplemental materials file 3 of 4
BZ2 File (datasetsdna_v1.0.tar.bz2)
Supplemental materials file 2 of 4
BZ2 File (datasetssynth_v1.0.tar.bz2)
Supplemental materials file 1 of 4
README (readme.txt)
README file for the supplemental material
GZ File (articleproject_v1.0.tar.gz)
Supplemental materials file 4 of 4

References

[1]
Amir, A., Keselman, D., Landau, G. M., Lewenstein, M., Lewenstein, N., and Rodeh, M. 2000. Text indexing and dictionary matching with one error. J. Algorithms 37, 2, 309--325.
[2]
Angell, R. C., Freund, G. E., and Willett, P. 1983. Automatic spelling correction using a trigram similarity measure. Inf. Process. Manage. 19, 4, 443--453.
[3]
Baeza-Yates, R., Cunto, W., Manber, U., and Wu, S. 1994. Proximity matching using fixed-queries trees. In Proceedings of the 5th Annual Symposium on Pattern Matching. Springer, Berlin, 198--212.
[4]
Baeza-Yates, R. and Gonnet, G. 1992. A new approach to text searching. Commun. ACM 35, 10, 74--82.
[5]
Baeza-Yates, R. and Navarro, G. 1998. Fast approximate string matching in a dictionary. In Proceedings of the String Processing and Information Retrieval: A South American Symposium (SPIRE'98). IEEE, Los Alamitos, CA, 14--22.
[6]
Baeza-Yates, R. and Navarro, G. 1999. Faster approximate string matching. Algorithmica 23, 2, 127--158.
[7]
Baeza-Yates, R. and Salinger, A. 2005. Experimental analysis of a fast intersection algorithm for sorted sequences. In Proceedings of the String Processing and Information Retrieval (SPIRE'05). Springer, Berlin, 13--24.
[8]
Baeza-Yates, R. A. and Gonnet, G. H. 1990. All-against-all sequence matching: Preliminary version. http://www.dcc.uchile.cl/~rbaeza/cv/reports.html.
[9]
Baeza-Yates, R. A. and Gonnet, G. H. 1999. A fast algorithm on average for all-against-all sequence matching. In Proceedings of the String Processing and Information Retrieval Symposium & International Workshop on Groupware (SPIRE'99). IEEE, Los Alamitos, CA. 16--23.
[10]
Barbay, J., López-Ortiz, A., and Lu, T. 2006. Faster adaptive set intersections for text searching. In Proceedings of the 5th International Workshop on Experimental Algorithms (WEA'06). Springer, Berlin, 146--157.
[11]
Beckmann, N., Kriegel, H., Schneider, R., and Seeger, B. 1990. The R*-tree: an efficient and robust access method for points and rectangles. In Proceedings of 1990 ACM SIGMOD International Conference on Management of Data. ACM, New York, 322--331.
[12]
Belazzougui, D. 2009. Faster and space-optimal edit distance 1 dictionary. In Proceedings of the 20th Annual Symposium on Combinatorial Pattern Matching (CPM'09). Springer, Berlin, 154--167.
[13]
Benoit, D., Demaine, E. D., Munro, J. I., Raman, R., Raman, V., and Rao, S. S. 2005. Representing trees of higher degree. Algorithmica 43, 4, 275--292.
[14]
Bentley, J. 1975. Multidimensional binary search trees used for associative searching. Commun. ACM 18, 9, 509--517.
[15]
Berchtold, S., Keim, D. A., and Kriegel, H. P. 1996. The X-tree: An index structure for high-dimensional data. In Proceedings of the 22th International Conference on Very Large Data Bases. Morgan Kaufmann, San Francisco, CA, 28--39.
[16]
Berman, A. 1994. A new data structure for fast approximate matching. Tech. rep. 1994-03-02, Department of Computer Science and Engineering, University of Washington.
[17]
Bieganski, P., Riedl, J., Cartis, J., and Retzel, E. 1994. Generalized suffix trees for biological sequence data: applications and implementation. In Proceedings of the Twenty-Seventh Hawaii International Conference on System Sciences. IEEE, Los Alamitos, CA, 35--44.
[18]
Böhm, C., Berchtold, S., and Keim, D. A. 2001. Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases. ACM Comput. Surv. 33, 3, 322--373.
[19]
Boitsov, L. 2004. Classification and experimental comparison of modern dictionary fuzzy search algorithms (in Russian). In Proceedings of the 6th Russian Conference on Digital Libraries (RCDL'04).
[20]
Botelho, F., Pagh, R., and Ziviani, N. 2007. Simple and space-efficient minimal perfect hash functions. In Proceedings of the 10th International Workshop on Algorithms and Data Structures. Springer, Berlin, 139--150.
[21]
Brill, E. and Moore, R. C. 2000. An improved error model for noisy channel spelling correction. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, Stroudsburg, PA, 286--293.
[22]
Burkhard, W. and Keller, R. 1973. Some approaches to best-match file searching. Commun. ACM 16, 4, 230--236.
[23]
Burkhardt, S. and Kärkkäinen, J. 2002. One-gapped q-gram filter for Levenshtein distance. In Proceedings of the 13th Symposium on Combinatorial Pattern Matching (CPM'02). Springer, Berlin, 225--234.
[24]
Cánovas, R. and Navarro, G. 2010. Practical compressed suffix trees. In Proceedings of the 9th International Symposium on Experimental Algorithms (SEA'10). Springer, Berlin, 94--105.
[25]
Carterette, B. and Can, F. 2005. Comparing inverted files and signature files for searching in a large lexicon. Inf. Process. Manage. 41, 3, 613--633.
[26]
Chan, H.-L., Lam, T.-W., Sung, W.-K., Tam, S.-L., and Wong, S.-S. 2006. Compressed indexes for approximate string matching. In Proceedings of the 14th Annual European Symposium (ESA'06). Springer-Verlag, Berlin, 208--219.
[27]
Chávez, E., Marroquín, J. L., and Navarro, G. 2001. Fixed Queries Array: A fast and economical data structure for proximity searching. Multimedia Tools Appl. 14, 2, 113--135.
[28]
Chávez, E. and Navarro, G. 2003. Probabilistic proximity search: Fighting the curse of dimensionality in metric spaces. Inf. Process. Lett. 85, 1, 39--46.
[29]
Chávez, E., Navarro, G., Baeza-Yates, R., and Marroquin, J. L. 2001. Searching in metric spaces. ACM Comput. Surv. 33, 3, 273--321.
[30]
Cho, J. and Rajagopalan, S. 2002. A fast regular expression indexing engine. In Proceedings of the 18th International Conference on Data Engineering. IEEE, Los Alamitos, CA, 419--430.
[31]
Claude, F., Fariña, A., and Navarro, G. 2009. Re-Pair compression of inverted lists. CoRR abs/0911.3318.
[32]
Coelho, L. and Oliveira, A. 2006. Dotted Suffix Trees a structure for approximate text indexing. In Proceedings of the 13th International Conference on String Processing and Information Retrieval (SPIRE'06). Springer, Berlin, 329--336.
[33]
Cole, R., Gottlieb, L.-A., and Lewenstein, M. 2004. Dictionary matching and indexing with errors and don't cares. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing (STOC'04). ACM, New York, 91--100.
[34]
Collins, P. 2009. Definately* the most misspelled word in the english language (*it should be definitely). The Daily Record. June 15, 2009. http://www.dailyrecord.co.uk/news/editors-choice/2009/06/15/definately-the-most-misspelled-word-in-the-english-language-it-should-be-definitely-86908-21441847/.
[35]
Covington, M. A. 1996. An algorithm to align words for historical comparison. Comput. Ling. 22, 4, 481--496.
[36]
Daciuk, J., Mihov, S., Watson, B. W., and Watson, R. E. 2000. Incremental construction of minimal acyclic finite-state automata. Comput. Ling. 26, 1, 3--16.
[37]
Damerau, F. 1964. A technique for computer detection and correction of spelling errors. Commun. ACM 7, 3, 171--176.
[38]
D'Amore, R. J. and Mah, C. P. 1985. One-time complete indexing of text: theory and practice. In Proceedings of the 8th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'85). ACM, New York, 155--164.
[39]
Demaine, E. D., López-Ortiz, A., and Munro, J. I. 2001. Experiments on adaptive set intersections for text retrieval systems. In Revised Papers from the 3rd International Workshop on Algorithm Engineering and Experimentation (ALENEX'01). Springer-Verlag, Berlin, 91--104.
[40]
Dietzfelbinger, M., Karlin, A., Mehlhorn, K., Meyer auf der Heide, F., Rohnert, H., and Tarjan, R. E. 1994. Dynamic perfect hashing: Upper and lower bounds. SIAM J. Comput. 23, 4, 738--761.
[41]
Dömölki, B. 1968. A universal compiler system based on production rules. BIT Numer. Math. 8, 262--275.
[42]
Doster, W. 1977. Contextual postprocessing system for cooperation with a multiple-choice character-recognition system. IEEE Trans. Comput. 26, 1090--1101.
[43]
Du, M. W. and Chang, S. C. 1994. An approach to designing very fast approximate string matching algorithms. IEEE Trans. Knowl. Data Eng. 6, 4, 620--633.
[44]
Elias, P. 1974. Efficient storage and retrieval by content and address of static files. J. ACM 21, 2, 246--260.
[45]
Faloutsos, C. 1996. Searching Multimedia Databases by Content. Kluwer Academic Publisher, Dordrecht, The Netherlands.
[46]
Ferragina, P., Muthukrishnan, S., and de Berg, M. 1999. Multi-method dispatching: a geometric approach with applications to string matching problems. In Proceedings of the 31st Annual ACM Symposium on Theory of Computing (STOC'99). ACM, New York, 483--491.
[47]
Ferragina, P. and Venturini, R. 2007. Compressed permuterm index. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'07). ACM, New York, 535--542.
[48]
Figueroa, K. and Fredriksson, K. 2007. Simple space-time trade-offs for AESA. In Proceedings of the 6th International Workshop on Experimental Algorithms (WEA'07). Springer, Berlin, 229--241.
[49]
Fischer, J., Mäkinen, V., and Navarro, G. 2008. An(other) entropy-bounded compressed suffix tree. In Proceedings of the 19th Annual Symposium on Combinatorial Pattern Matching (CPM'08). Springer-Verlag, Berlin, 152--165.
[50]
Ford, G., Hause, S., Le, D. X., and Thoma, G. R. 2001. Pattern matching techniques for correcting low confidence OCR words in a known context. In Proceedings of the 8th SPIE International Conference on Document Recognition and Retrieval. SPIE, Bellingham, WA, 241--249.
[51]
Fredman, M. L., Komlós, J., and Szemerédi, E. 1984. Storing a sparse table with o(1) worst case access time. J. ACM 31, 3, 538--544.
[52]
Fredriksson, K. 2007. Engineering efficient metric indexes. Pattern Recogn. Lett. 28, 1, 75--84.
[53]
Friedman, J. H., Bentley, J. L., and Finkel, R. A. 1977. An algorithm for finding best matches in logarithmic expected time. ACM Trans. Math. Software 3, 3, 209--226.
[54]
Gaede, V. and Günther, O. 1998. Multidimensional access methods. ACM Comput. Surv. 30, 2, 170--231.
[55]
Giegerich, R., Kurtz, S., and Stoye, J. 2003. Efficient implementation of lazy suffix trees. Softw., Pract. Exper. 33, 11, 1035--1049.
[56]
Giladi, E., Walker, M. G., Wang, J. Z., and Volkmuth, W. 2002. SST: an algorithm for finding near-exact sequence matches in time proportional to the logarithm of the database size. Bioinformatics 18, 6, 873--877.
[57]
Glass, J. R. 2003. A probabilistic framework for segment-based speech recognition. Comput. Speech Lang. 17, 2-3, 137 -- 152.
[58]
Gollapudi, S. and Panigrahy, R. 2006. A dictionary for approximate string search and longest prefix search. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management (CIKM'06). ACM, New York, 768--775.
[59]
Gorin, R. E. 1971. SPELL: Spelling check and correction program. http://pdp-10.trailing-edge.com/decuslib10-03/01/43,50270/spell.doc.html.
[60]
Gravano, L., Ipeirotis, P., Jagadish, H. V., Koudas, N., Muthukrishnan, S., Pietarinen, L., and Srivastava, D. 2001. Using q-grams in a DBMS for approximate string processing. IEEE Data Eng. Bull. 24, 4, 28--34.
[61]
Grossi, R. and Luccio, F. 1989. Simple and efficient string matching with k mismatches. Inf. Process. Lett. 33, 3, 113--120.
[62]
Grossi, R. and Vitter, J. S. 2005. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput. 35, 2, 378--407.
[63]
Gusfield, D. 1999. Algorithms on Strings, Trees, and Sequences -- Computer Science and Computational Biology. Cambridge University Press, Cambridge, UK.
[64]
Guttman, A. 1984. R-trees: a dynamic index structure for spatial searching. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM Press, New York, 47--57.
[65]
Gwehenberger, G. 1968. Anwendung einer binären verweiskettenmethode beim aufbau von listen. Elektronische Rechenanlagen 10, 223--226.
[66]
Hall, P. and Dowling, G. 1980. Approximate string matching. ACM Computing Surveys 12, 4, 381--402.
[67]
Hellerstein, J. and Pfeffer, A. 1994. The RD-tree: An index structure for sets. Tech. rep. 1252, University of Wisconsin at Madison.
[68]
Hoeffding, W. 1963. Probability inequalities for sums of bounded random variables. J. Am. Stat. Assoc. 58, 301, 13--30.
[69]
Hyyrö, H. 2003a. Bit-parallel approximate string matching algorithms with transposition. In Proceedings of the 10th International Symposium on String Processing. Springer-Verlag, Berlin, 95--107.
[70]
Hyyrö, H. 2003b. Practical methods for approximate string matching. Ph.D. thesis, Department of Computer Science, University of Tampere, Finland.
[71]
Hyyrö, H. 2005. Bit-parallel approximate string matching algorithms with transposition. J. Discrete Algorithms 3, 2-4, 215--229.
[72]
James, E. B. and Partridge, D. P. 1973. Adaptive correction of program statements. Commun. ACM 16, 1, 27--37.
[73]
Jeong, I.-S., Park, K.-W., Kang, S.-H., and Lim, H.-S. 2010. An efficient similarity search based on indexing in large dna databases. Comput. Biol. Chem. 34, 2, 131--136.
[74]
Jokinen, P., Tarhio, J., and Ukkonen, E. 1996. A comparison of approximate string matching algorithms. Software Pract. Exp. 26, 12, 1439--1458.
[75]
Jokinen, P. and Ukkonen, E. 1991. Two algorithms for approximate string matching in static texts. In Proceedings of the 16th International Symposium on Mathematical Foundations of Computer Science. Springer, Berlin, 240--248.
[76]
Kahveci, T., Ljosa, V., and Singh, A. K. 2004. Speeding up whole-genome alignment by indexing frequency vectors. Bioinformatics 20, 13, 2122--2134.
[77]
Kahveci, T. and Singh, A. 2001. An efficient index structure for string databases. In Proceedings of the 27th International Conference on Very Large Databases. Morgan Kaufmann, San Francisco, CA, 351--360.
[78]
Klovstad, J. and Mondshein, L. 1975. The CASPERS linguistic analysis system. IEEE Trans. Acoust. Speech Signal Process. 23, 118--123.
[79]
Knuth, D. 1973. The Art of Computer Programming. Sorting and Searching, 1st ed., vol. 3. Addison-Wesley, Upper Saddle River, NJ.
[80]
Knuth, D. 1997. The Art of Computer Programming. Sorting and Searching, 2d ed., vol. 3. Addison-Wesley, Upper Saddle River, NJ.
[81]
Kondrak, G. 2003. Phonetic alignment and similarity. Computers and the Humanities 37, 3, 273--291.
[82]
Kuenning, G., Gorin, R. E., Willisson, P., Buehring, W., and Stevens, K. 1988. International spell: a fast screen-oriented spelling checker. http://www.lasr.cs.ucla.edu/geoff/ispell.html.
[83]
Kukich, K. 1992. Technique for automatically correcting words in text. ACM Comput. Surv. 24, 2, 377--439.
[84]
Kurtz, S. 1996. Approximate string searching under weighted edit distance. In Proceedings of the 3rd South American Workshop on String Processing. Carleton University Press, Ottawa, Ontario, 156--170.
[85]
Kurtz, S. 1999. Reducing the space requirement of suffix trees. Softw. Pract. Exper. 29, 13, 1149--1171.
[86]
Levenshtein, V. 1966. Binary codes capable of correcting deletions, insertions, and reversals. Doklady Akademii Nauk SSSR, 163, 4, 845--848.
[87]
Lowrance, R. and Wagner, R. 1975. An extension of the string-to-string correction problem. J. ACM 22, 2, 177--183.
[88]
Maly, K. 1976. Compressed tries. Commun. ACM 19, 7, 409--415.
[89]
Manber, U. and Myers, G. 1990. Suffix arrays: a new method for on-line string searches. In Proceedings of the 1st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'90). Society for Industrial and Applied Mathematics, Philadephia, PA, 319--327.
[90]
Manber, U. and Wu, S. 1994a. An algorithm for approximate membership checking with application to password security. Inf. Process. Lett. 50, 4, 191--197.
[91]
Manber, U. and Wu, S. 1994b. GLIMPSE: a tool to search through entire file systems. In Proceedings of the USENIX Winter 1994 Technical Conference (WTEC'94). USENIX, Berkeley, CA, 4--4.
[92]
McCreight, E. M. 1976. A space-economical suffix tree construction algorithm. J. ACM 23, 2, 262--272.
[93]
Melichar, B. 1996. String matching with k differences by finite automata. In Proceedings of the International Congress on Pattern Recognition. IEEE, Los Alamitos, CA, 256--260.
[94]
Micó, M. L., Oncina, J., and Vidal-Ruiz, E. 1994. A new version of the nearest-neighbour approximating and eliminating search algorithm (AESA) with linear preprocessing time and memory requirements. Pattern Recogn. Lett. 15, 1, 9--17.
[95]
Mihov, S. and Schulz, K. U. 2004. Fast approximate string search in large dictionaries. Comput. Ling. 30, 4, 451--477.
[96]
Mochizuki, H. and Hayashi, Y. 2000. An efficient retrieval algorithm of compound words using extendible hashing. Int. J. Comput. Process. Oriental Lang. 13, 1, 15--33.
[97]
Moore, T. and Edelman, B. 2010. Measuring the perpetrators and funders of typosquatting. In Proceedings of the 14th International Conference on Financial Cryptography and Data Security. Springer, Berlin.
[98]
Mor, M. and Fraenkel, A. S. 1982. A hash code method for detecting and correcting spelling errors. Commun. ACM 25, 12, 935--938.
[99]
Morrison, D. R. 1968. PATRICIA—practical algorithm to retrieve information coded in alphanumeric. J. ACM 15, 4, 514--534.
[100]
Muth, R. and Manber, U. 1996. Approximate multiple string search. In Proceedings of the 7th Annual Symposium on Combinatorial Pattern Matching. Springer, Berlin, 75--86.
[101]
Myers, E. 1994. A sublinear algorithm for approximate keyword searching. Algorithmica 12, 4/5, 345--374.
[102]
Myers, E. 1999. A fast bit-vector algorithm for approximate string matching based on dynamic programming. J. ACM 46, 3, 395--415.
[103]
Navarro, G. 1997a. Multiple approximate string matching by counting. In Proceedings of the 4th South American Workshop on String Processing. Carleton University Press, Ottawa, Ontario, 95--111.
[104]
Navarro, G. 1997b. A partial deterministic automaton for approximate string matching. In Proceedings of the 4th South American Workshop on String Processing. Carleton University Press, Ottawa, Ontario, 112--124.
[105]
Navarro, G. 2001a. A guided tour to approximate string matching. ACM Comput. Surv. 33, 1, 31--88.
[106]
Navarro, G. 2001b. NR-grep: a fast and flexible pattern matching tool. Software Pract. Exp. 31, 13, 1265--1312.
[107]
Navarro, G. and Baeza-Yates, R. 1998. A practical q-gram index for text retrieval allowing errors. CLEI Electron. J. 1, 2.
[108]
Navarro, G. and Baeza-Yates, R. 2000. A hybrid indexing method for approximate string matching. J. Discrete Algorithms, 1, 1, 205--239.
[109]
Navarro, G., Baeza-Yates, R., Sutinen, E., and Tarhio, J. 2001. Indexing methods for approximate string matching. IEEE Data Engin. Bull. 24, 4, 19--27.
[110]
Navarro, G., Paredes, R., and Chávez, E. 2002. t-Spanners as a data structure for metric space searching. In Proceedings of the 9th International Symposium on String Processing and Information Retrieval (SPIRE'02), Springer, Berlin, 195--200.
[111]
Navarro, G. and Raffinot, M. 2000. Fast and flexible string matching by combining bit-parallelism and suffix automata. J. Exp. Algorithmics, 5, 4.
[112]
Navarro, G. and Salmela, L. 2009. Indexing variable length substrings for exact and approximate matching. In Proceedings of the 16th International Symposium on String Processing and Information Retrieval (SPIRE'09). Springer, Berlin, 214--221.
[113]
Navarro, G., Sutinen, E., and Tarhio, J. 2005. Indexing text with approximate q-grams. J. Discrete Algorithms 3, 2-4, 157--175.
[114]
Needleman, S. B. and Wunsch, C. D. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 3, 443--453.
[115]
Ng, K. and Zue, V. W. 2000. Subword-based approaches for spoken document retrieval. Speech Commun. 32, 3, 157--186.
[116]
Nievergelt, J., Hinterberger, H., and Sevcik, K. C. 1984. The grid file: An adaptable, symmetric multikey file structure. ACM Trans. Database Syst. 9, 1, 38--71.
[117]
Nilsson, S. and Karlsson, G. 1999. IP-address lookup using LC-tries. IEEE J. Sel. Areas Commun. 17, 6, 1083--1092.
[118]
Orenstein, J. 1982. Multidimensional tries used for associative searching. Inf. Process. Lett. 14, 4, 150--157.
[119]
Owolabi, O. 1996. Dictionary organizations for efficient similarity retrieval. J. Syst. Software 34, 2, 127--132.
[120]
Owolabi, O. and McGregor, D. R. 1988. Fast approximate string matching. Software Pract. Exp. 18, 4, 387--393.
[121]
Ozturk, O. and Ferhatosmanoglu, H. 2003. Effective indexing and filtering for similarity search in large biosequence databases. In Proceedings of the 3rd IEEE Symposium on BioInformatics and BioEngineering (BIBE'03). IEEE, Los Alamitos, CA 359.
[122]
Pagh, R. and Rodler, F. 2001. Cuckoo hashing. In Proceedings of the 9th Annual European Symposium on Algorithms (ESA '01). Springer, Berlin, 121--133.
[123]
Peterson, J. L. 1980. Computer programs for detecting and correcting spelling errors. Commun. ACM 23, 12, 676--687.
[124]
Peterson, J. L. 1986. A note on undetected typing errors. Commun. ACM 29, 7, 633--637.
[125]
Raman, R., Raman, V., and Satti, S. R. 2007. Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Trans. Algorithms 3, 4, 43.
[126]
Riseman, E. M. and Hanson, A. R. 1974. A contextual postprocessing system for error correction using binary n-grams. IEEE Trans. Comput. 23, 5, 480--493.
[127]
Robinson, J. 1981. The K-D-B-Tree: A search structure for large multidimensional dynamic indexes. In Proceedings of 1981 ACM SIGMOD International Conference on Management of Data. ACM, New York, 10--18.
[128]
Roussopoulos, N. and Leifker, D. 1985. Direct spatial search on pictorial databases using packed R-trees. SIGMOD Rec. 14, 4, 17--31.
[129]
Russo, L. and Oliveira, A. 2005. An efficient algorithm for generating super condensed neighborhoods. In Proceedings of the 16th Annual Combinatorial Pattern Matching Symposium (CPM'05). Springer, Berlin, 104--115.
[130]
Russo, L. M. S., Navarro, G., and Oliveira, A. L. 2008. Fully-compressed suffix trees. In Proceedings of the 8th Latin American Symposium on Theoretical Informatics (LATIN'08), Springer, Berlin, 362--373.
[131]
Russo, L. M. S., Navarro, G., Oliveira, A. L., and Morales, P. 2009. Approximate string matching with compressed indexes. Algorithms 2, 3, 1105--1136.
[132]
Sadakane, K. 2007. Compressed suffix trees with full functionality. Theor. Comp. Sys. 41, 4, 589--607.
[133]
Sakoe, H. and Chiba, S. 1971. A dynamic programming approach to continuous speech recognition. In Proceedings of the 7th International Congress on Acoustics. 65--68.
[134]
Samet, H. 1995. Spatial data structures. In Modern Database Systems in The Object Model, Interoperability and Beyond. Addison-Wesley, Upper Saddle River, NJ, 361--385.
[135]
Samet, H. 2005. Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann, San Francisco, CA.
[136]
Sankoff, D. 2000. The early introduction of dynamic programming into computational biology. Bioinformatics 16, 1, 41--47.
[137]
Schek, H. J. 1978. The reference string indexing method. In Proceedings of the 2nd Conference of the European Cooperation in Informatics. Springer-Verlag, Berlin, 432--459.
[138]
Scholer, F., Williams, H. E., Yiannis, J., and Zobel, J. 2002. Compression of inverted indexes for fast query evaluation. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in information Retrieval. ACM, New York, 222--229.
[139]
Schuegraf, E. J. 1973. Selection of equifrequent work fragments for information retrieval. Inf. Storage Retrieval 9, 12, 697--711.
[140]
Sellers, P. 1974. On the computation of evolutionary distances. SIAM J. Appl. Math. 26, 787--793.
[141]
Sellis, T. K., Roussopoulos, N., and Faloutsos, C. 1987. The R<sup>+</sup>-Tree: A dynamic index for multi-dimensional objects. In Proceedings of the 13th International Conference on Very Large Data Bases (VLDB'87). Morgan Kaufmann, San Francisco, CA, 507--518.
[142]
Siegler, M., Witbrock, M. J., Slattery, S. T., Seymore, K., Jones, R. E., and Hauptmann, A. G. 1997. Experiments in Spoken Document Retrieval at CMU. In Proceedings of the 6th Text REtrieval Conference. National Institute of Standard and Technology, Gaithersburg, MD, 291--302.
[143]
Skut, W. 2004. Incremental construction of minimal acyclic sequential transducers from unsorted data. In Proceedings of the 20th International Conference on Computational Linguistics (COLING'04). Association for Computational Linguistics, Stroudsburg, PA, 15.
[144]
Sung, W.-K. 2008. Indexed approximate string matching. In Encyclopedia of Algorithms. 408--410.
[145]
Sutinen, E. and Tarhio, J. 1996. Filtration with q-samples in approximate string matching. In Proceedings of the 7th Annual Symposium on Combinatorial Pattern Matching (CPM '96). Springer-Verlag, Berlin, 50--63.
[146]
Uhlmann, J. 1991. Satisfying general proximity similarity queries with metric trees. Inf. Process. Lett. 40, 175--179.
[147]
Ukkonen, E. 1985a. Algorithms for approximate string matching. Inf. Control 64, 1-3, 100--118.
[148]
Ukkonen, E. 1985b. Finding approximate patterns in strings. J. Algorithms 6, 1, 132--137.
[149]
Ukkonen, E. 1993. Approximate string matching over suffix trees. In Proceedings of the 4th Annual Symposium on Combinatorial Pattern Matching, Springer-Verlag, Berlin, 228--242.
[150]
Ukkonen, E. 1995. On-line construction of suffix trees. Algorithmica 14, 249--260.
[151]
Velichko, V. and Zagoruyko, N. 1970. Automatic recognition of 200 words. Int. J. Man-Machine Stud. 2, 3, 223--234.
[152]
Veronis, J. 1988. Computerized correction of phonographic errors. Computers and the Humanities 22, 1, 43--56.
[153]
Vidal, E. 1986. An algorithm for finding nearest neighbors in (approximately) constant average time. Pattern Recognit. Lett. 4, 3, 145--147.
[154]
Vintsyuk, T. 1968. Speech discrimination by dynamic programming. Cybernetics 4, 1, 52--57.
[155]
Wagner, R. A. and Fischer, M. J. 1974. The string-to-string correction problem. J. ACM 21, 1, 168--173.
[156]
Wang, B., Xie, L., and Wang, G. 2009. A two-tire index structure for approximate string matching with block moves. In Proceedings of the 2nd International Workshop on Managing Data Quality in Collaborative Information Systems and 1st International Workshop on Data and Process Provenance. Springer, Berlin, 197--211.
[157]
Weber, R., Schek, H. J., and Blott, S. 1998. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In Proceedings of the 24th International Conference on Very Large Data Bases. Morgan Kaufmann, San Francisco, CA, 194--205.
[158]
Weiner, P. 1973. Linear pattern matching algorithms. In Proceedings the 14th IEEE Symposium on Switching and Automata Theory. IEEE, Los Alamitos, CA, 1--11.
[159]
Wilbur, W. J., Kim, W., and Xie, N. 2006. Spelling correction in the search engine. Information Retrieval 9, 5, 543--564.
[160]
Willet, P. 1979. Document retrieval experiments using indexing vocabularies of varying size. II. Hashing, truncation, digram and trigram encoding of index terms. J. Doc. 35, 4, 296--305.
[161]
Woodland, P., Odell, J., Valtchev, V., and Young, S. 1994. Large vocabulary continuous speech recognition using HTK. In Proceedings of the IEEE International Conference on Acoustic, Speech, and Signal Processing. IEEE, Los Alamitos, CA, 3, 125--128.
[162]
Wu, S. and Manber, U. 1992a. Agrep -- a fast approximate matching tool. In Proceedings of the USENIX Winter Technical Conference. USENIX, Berkeley, CA, 153--162.
[163]
Wu, S. and Manber, U. 1992b. Fast text searching allowing errors. Commun. ACM 35, 10, 83--91.
[164]
Wu, S., Manber, U., and Myers, E. 1996. A subquadratic algorithm for approximate limited expression matching. Algorithmica 15, 1, 50--67.
[165]
Zaliznyak, M. 2006. Telefonica's dispute over an accented domain in Chile. Multilingual Search. Sep 30, 2006. http://www.multilingual-search.com/telefonica%C2%B4s-dispute-over-an-accented-domain-in-chile/30/09/2006.
[166]
Zamora, E., Pollock, J., and Zamora, A. 1981. The use of trigram analysis for spelling error detection. Inf. Process. Manage. 17, 6, 305--316.
[167]
Zezula, P., Amato, G., Dohnal, V., and Batko, M. 2005. Similarity Search: The Metric Space Approach (Advances in Database Systems). Springer-Verlag, Berlin.
[168]
Zobel, J. and Dart, P. 1995. Finding approximate matches in large lexicons. Software Pract. Exp. 25, 2, 331--345.
[169]
Zobel, J. and Dart, P. 1996. Phonetic string matching: lessons from information retrieval. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, 166--172.
[170]
Zobel, J., Moffat, A., and Ramamohanarao, K. 1998. Inverted files versus signature files for text indexing. ACM Trans. Database Syst. 23, 4, 453--490.
[171]
Zobel, J., Moffat, A., and Sacks-davis, R. 1993. Searching large lexicons for partially specified terms using compressed inverted files. In Proceedings of the 19th international Conference on Very Large Data Bases. Morgan Kaufmann, San Francisco, CA, 290--301.
[172]
Zue, V., Glass, J., Phillips, M., and Seneff, S. 1989. The MIT SUMMIT speech recognition system: a progress report. In Proceedings of the Workshop on Speech and Natural Language (HLT '89). Association for Computational Linguistics, Morristown, NJ, 179--189.

Cited By

View all

Index Terms

  1. Indexing methods for approximate dictionary searching: Comparative analysis

                              Recommendations

                              Reviews

                              Jerzy W. Jaromczyk

                              Anyone who has ever searched for a word (pattern) in a file or checked documents for spelling errors appreciates efficient algorithms that can quickly find relevant occurrences of the pattern, even if it is inexact or incomplete. This paper surveys a class of such algorithms: indexing methods for approximate searching. The paper is divided into sections that discuss fundamental concepts and definitions, describe the methods and algorithms, and provide a comparative analysis of the experimental results. The bibliography lists more than 180 references, with publication times spanning decades, from the 1960s to 2010. In spite of the voluminous material, thanks to the taxonomy introduced by the author, the presentation is clear and well organized. A rich appendix includes additional material, such as proofs of selected theorems. Numerous tables and figures augment the presentation. Furthermore, the main observations and key points are concisely itemized as lists, such as the "bird view" (Boytsov's term) of key conclusions from the experimental analysis of the search methods. The paper thoroughly describes intensive experiments that include detailed investigations of results that, on the surface, may appear to be counterintuitive. For example, with FB-tries, retrieval times for longer patterns are better than for short ones; Boytsov conducted additional experiments to understand this phenomenon. This paper would be of interest not only to computer scientists (for whom numerous open questions are listed in the conclusion section), but also to researchers and practitioners from a variety of areas, who want to select the best search methods for such tasks as searching in dictionaries of DNA sequences. Online Computing Reviews Service

                              Access critical reviews of Computing literature here

                              Become a reviewer for Computing Reviews.

                              Comments

                              Information & Contributors

                              Information

                              Published In

                              cover image ACM Journal of Experimental Algorithmics
                              ACM Journal of Experimental Algorithmics  Volume 16, Issue
                              2011
                              411 pages
                              ISSN:1084-6654
                              EISSN:1084-6654
                              DOI:10.1145/1963190
                              Issue’s Table of Contents
                              Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

                              Publisher

                              Association for Computing Machinery

                              New York, NY, United States

                              Publication History

                              Published: 28 May 2011
                              Published in JEA Volume 16

                              Author Tags

                              1. k-errata tree
                              2. q-gram
                              3. q-sample
                              4. Damerau-Levenshtein distance
                              5. Levenshtein distance
                              6. NR-grep
                              7. agrep
                              8. approximate searching
                              9. frequency distance
                              10. frequency vector trie
                              11. metric trees
                              12. neighborhood generation
                              13. trie

                              Qualifiers

                              • Article

                              Contributors

                              Other Metrics

                              Bibliometrics & Citations

                              Bibliometrics

                              Article Metrics

                              • Downloads (Last 12 months)82
                              • Downloads (Last 6 weeks)11
                              Reflects downloads up to 08 Feb 2025

                              Other Metrics

                              Citations

                              Cited By

                              View all

                              View Options

                              Login options

                              Full Access

                              View options

                              PDF

                              View or Download as a PDF file.

                              PDF

                              eReader

                              View online with eReader.

                              eReader

                              Figures

                              Tables

                              Media

                              Share

                              Share

                              Share this Publication link

                              Share on social media