article

Indexing methods for approximate dictionary searching: Comparative analysis

Author:

Leonid BoytsovAuthors Info & Claims

Journal of Experimental Algorithmics (JEA), Volume 16

Article No.: 1.1, Pages 1.1 - 1.91

https://doi.org/10.1145/1963190.1963191

Published: 28 May 2011 Publication History

Get Access

Abstract

The primary goal of this article is to survey state-of-the-art indexing methods for approximate dictionary searching. To improve understanding of the field, we introduce a taxonomy that classifies all methods into direct methods and sequence-based filtering methods. We focus on infrequently updated dictionaries, which are used primarily for retrieval. Therefore, we consider indices that are optimized for retrieval rather than for update. The indices are assumed to be associative, that is, capable of storing and retrieving auxiliary information, such as string identifiers. All solutions are lossless and guarantee retrieval of strings within a specified edit distance k. Benchmark results are presented for the practically important cases of k=1, 2, and 3. We concentrate on natural language datasets, which include synthetic English and Russian dictionaries, as well as dictionaries of frequent words extracted from the ClueWeb09 collection. In addition, we carry out experiments with dictionaries containing DNA sequences. The article is concluded with a discussion of benchmark results and directions for future research.

Supplementary Material

BZ2 File (datasetsclueweb09_v1.0.tar.bz2)

Supplemental materials file 3 of 4

Download
38.07 MB

BZ2 File (datasetsdna_v1.0.tar.bz2)

Supplemental materials file 2 of 4

Download
79.27 MB

BZ2 File (datasetssynth_v1.0.tar.bz2)

Supplemental materials file 1 of 4

Download
57.04 MB

README (readme.txt)

README file for the supplemental material

Download
12.01 KB

GZ File (articleproject_v1.0.tar.gz)

Supplemental materials file 4 of 4

Download
3.54 MB

References

[1]

Amir, A., Keselman, D., Landau, G. M., Lewenstein, M., Lewenstein, N., and Rodeh, M. 2000. Text indexing and dictionary matching with one error. J. Algorithms 37, 2, 309--325.

Digital Library

Google Scholar

[2]

Angell, R. C., Freund, G. E., and Willett, P. 1983. Automatic spelling correction using a trigram similarity measure. Inf. Process. Manage. 19, 4, 443--453.

Crossref

Google Scholar

[3]

Baeza-Yates, R., Cunto, W., Manber, U., and Wu, S. 1994. Proximity matching using fixed-queries trees. In Proceedings of the 5th Annual Symposium on Pattern Matching. Springer, Berlin, 198--212.

Digital Library

Google Scholar

[4]

Baeza-Yates, R. and Gonnet, G. 1992. A new approach to text searching. Commun. ACM 35, 10, 74--82.

Digital Library

Google Scholar

[5]

Baeza-Yates, R. and Navarro, G. 1998. Fast approximate string matching in a dictionary. In Proceedings of the String Processing and Information Retrieval: A South American Symposium (SPIRE'98). IEEE, Los Alamitos, CA, 14--22.

Google Scholar

[6]

Baeza-Yates, R. and Navarro, G. 1999. Faster approximate string matching. Algorithmica 23, 2, 127--158.

Crossref

Google Scholar

[7]

Baeza-Yates, R. and Salinger, A. 2005. Experimental analysis of a fast intersection algorithm for sorted sequences. In Proceedings of the String Processing and Information Retrieval (SPIRE'05). Springer, Berlin, 13--24.

Digital Library

Google Scholar

[8]

Baeza-Yates, R. A. and Gonnet, G. H. 1990. All-against-all sequence matching: Preliminary version. http://www.dcc.uchile.cl/~rbaeza/cv/reports.html.

Google Scholar

[9]

Baeza-Yates, R. A. and Gonnet, G. H. 1999. A fast algorithm on average for all-against-all sequence matching. In Proceedings of the String Processing and Information Retrieval Symposium & International Workshop on Groupware (SPIRE'99). IEEE, Los Alamitos, CA. 16--23.

Digital Library

Google Scholar

[10]

Barbay, J., López-Ortiz, A., and Lu, T. 2006. Faster adaptive set intersections for text searching. In Proceedings of the 5th International Workshop on Experimental Algorithms (WEA'06). Springer, Berlin, 146--157.

Digital Library

Google Scholar

[11]

Beckmann, N., Kriegel, H., Schneider, R., and Seeger, B. 1990. The R&ast;-tree: an efficient and robust access method for points and rectangles. In Proceedings of 1990 ACM SIGMOD International Conference on Management of Data. ACM, New York, 322--331.

Digital Library

Google Scholar

[12]

Belazzougui, D. 2009. Faster and space-optimal edit distance 1 dictionary. In Proceedings of the 20th Annual Symposium on Combinatorial Pattern Matching (CPM'09). Springer, Berlin, 154--167.

Crossref

Google Scholar

[13]

Benoit, D., Demaine, E. D., Munro, J. I., Raman, R., Raman, V., and Rao, S. S. 2005. Representing trees of higher degree. Algorithmica 43, 4, 275--292.

Digital Library

Google Scholar

[14]

Bentley, J. 1975. Multidimensional binary search trees used for associative searching. Commun. ACM 18, 9, 509--517.

Digital Library

Google Scholar

[15]

Berchtold, S., Keim, D. A., and Kriegel, H. P. 1996. The X-tree: An index structure for high-dimensional data. In Proceedings of the 22th International Conference on Very Large Data Bases. Morgan Kaufmann, San Francisco, CA, 28--39.

Digital Library

Google Scholar

[16]

Berman, A. 1994. A new data structure for fast approximate matching. Tech. rep. 1994-03-02, Department of Computer Science and Engineering, University of Washington.

Google Scholar

[17]

Bieganski, P., Riedl, J., Cartis, J., and Retzel, E. 1994. Generalized suffix trees for biological sequence data: applications and implementation. In Proceedings of the Twenty-Seventh Hawaii International Conference on System Sciences. IEEE, Los Alamitos, CA, 35--44.

Google Scholar

[18]

Böhm, C., Berchtold, S., and Keim, D. A. 2001. Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases. ACM Comput. Surv. 33, 3, 322--373.

Digital Library

Google Scholar

[19]

Boitsov, L. 2004. Classification and experimental comparison of modern dictionary fuzzy search algorithms (in Russian). In Proceedings of the 6th Russian Conference on Digital Libraries (RCDL'04).

Google Scholar

[20]

Botelho, F., Pagh, R., and Ziviani, N. 2007. Simple and space-efficient minimal perfect hash functions. In Proceedings of the 10th International Workshop on Algorithms and Data Structures. Springer, Berlin, 139--150.

Digital Library

Google Scholar

[21]

Brill, E. and Moore, R. C. 2000. An improved error model for noisy channel spelling correction. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, Stroudsburg, PA, 286--293.

Digital Library

Google Scholar

[22]

Burkhard, W. and Keller, R. 1973. Some approaches to best-match file searching. Commun. ACM 16, 4, 230--236.

Digital Library

Google Scholar

[23]

Burkhardt, S. and Kärkkäinen, J. 2002. One-gapped q-gram filter for Levenshtein distance. In Proceedings of the 13th Symposium on Combinatorial Pattern Matching (CPM'02). Springer, Berlin, 225--234.

Digital Library

Google Scholar

[24]

Cánovas, R. and Navarro, G. 2010. Practical compressed suffix trees. In Proceedings of the 9th International Symposium on Experimental Algorithms (SEA'10). Springer, Berlin, 94--105.

Digital Library

Google Scholar

[25]

Carterette, B. and Can, F. 2005. Comparing inverted files and signature files for searching in a large lexicon. Inf. Process. Manage. 41, 3, 613--633.

Digital Library

Google Scholar

[26]

Chan, H.-L., Lam, T.-W., Sung, W.-K., Tam, S.-L., and Wong, S.-S. 2006. Compressed indexes for approximate string matching. In Proceedings of the 14th Annual European Symposium (ESA'06). Springer-Verlag, Berlin, 208--219.

Digital Library

Google Scholar

[27]

Chávez, E., Marroquín, J. L., and Navarro, G. 2001. Fixed Queries Array: A fast and economical data structure for proximity searching. Multimedia Tools Appl. 14, 2, 113--135.

Digital Library

Google Scholar

[28]

Chávez, E. and Navarro, G. 2003. Probabilistic proximity search: Fighting the curse of dimensionality in metric spaces. Inf. Process. Lett. 85, 1, 39--46.

Digital Library

Google Scholar

[29]

Chávez, E., Navarro, G., Baeza-Yates, R., and Marroquin, J. L. 2001. Searching in metric spaces. ACM Comput. Surv. 33, 3, 273--321.

Digital Library

Google Scholar

[30]

Cho, J. and Rajagopalan, S. 2002. A fast regular expression indexing engine. In Proceedings of the 18th International Conference on Data Engineering. IEEE, Los Alamitos, CA, 419--430.

Digital Library

Google Scholar

[31]

Claude, F., Fariña, A., and Navarro, G. 2009. Re-Pair compression of inverted lists. CoRR abs/0911.3318.

Google Scholar

[32]

Coelho, L. and Oliveira, A. 2006. Dotted Suffix Trees a structure for approximate text indexing. In Proceedings of the 13th International Conference on String Processing and Information Retrieval (SPIRE'06). Springer, Berlin, 329--336.

Digital Library

Google Scholar

[33]

Cole, R., Gottlieb, L.-A., and Lewenstein, M. 2004. Dictionary matching and indexing with errors and don't cares. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing (STOC'04). ACM, New York, 91--100.

Digital Library

Google Scholar

[34]

Collins, P. 2009. Definately&ast; the most misspelled word in the english language (&ast;it should be definitely). The Daily Record. June 15, 2009. http://www.dailyrecord.co.uk/news/editors-choice/2009/06/15/definately-the-most-misspelled-word-in-the-english-language-it-should-be-definitely-86908-21441847/.

Google Scholar

[35]

Covington, M. A. 1996. An algorithm to align words for historical comparison. Comput. Ling. 22, 4, 481--496.

Digital Library

Google Scholar

[36]

Daciuk, J., Mihov, S., Watson, B. W., and Watson, R. E. 2000. Incremental construction of minimal acyclic finite-state automata. Comput. Ling. 26, 1, 3--16.

Digital Library

Google Scholar

[37]

Damerau, F. 1964. A technique for computer detection and correction of spelling errors. Commun. ACM 7, 3, 171--176.

Digital Library

Google Scholar

[38]

D'Amore, R. J. and Mah, C. P. 1985. One-time complete indexing of text: theory and practice. In Proceedings of the 8th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'85). ACM, New York, 155--164.

Digital Library

Google Scholar

[39]

Demaine, E. D., López-Ortiz, A., and Munro, J. I. 2001. Experiments on adaptive set intersections for text retrieval systems. In Revised Papers from the 3rd International Workshop on Algorithm Engineering and Experimentation (ALENEX'01). Springer-Verlag, Berlin, 91--104.

Digital Library

Google Scholar

[40]

Dietzfelbinger, M., Karlin, A., Mehlhorn, K., Meyer auf der Heide, F., Rohnert, H., and Tarjan, R. E. 1994. Dynamic perfect hashing: Upper and lower bounds. SIAM J. Comput. 23, 4, 738--761.

Digital Library

Google Scholar

[41]

Dömölki, B. 1968. A universal compiler system based on production rules. BIT Numer. Math. 8, 262--275.

Crossref

Google Scholar

[42]

Doster, W. 1977. Contextual postprocessing system for cooperation with a multiple-choice character-recognition system. IEEE Trans. Comput. 26, 1090--1101.

Digital Library

Google Scholar

[43]

Du, M. W. and Chang, S. C. 1994. An approach to designing very fast approximate string matching algorithms. IEEE Trans. Knowl. Data Eng. 6, 4, 620--633.

Digital Library

Google Scholar

[44]

Elias, P. 1974. Efficient storage and retrieval by content and address of static files. J. ACM 21, 2, 246--260.

Digital Library

Google Scholar

[45]

Faloutsos, C. 1996. Searching Multimedia Databases by Content. Kluwer Academic Publisher, Dordrecht, The Netherlands.

Digital Library

Google Scholar

[46]

Ferragina, P., Muthukrishnan, S., and de Berg, M. 1999. Multi-method dispatching: a geometric approach with applications to string matching problems. In Proceedings of the 31st Annual ACM Symposium on Theory of Computing (STOC'99). ACM, New York, 483--491.

Digital Library

Google Scholar

[47]

Ferragina, P. and Venturini, R. 2007. Compressed permuterm index. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'07). ACM, New York, 535--542.

Digital Library

Google Scholar

[48]

Figueroa, K. and Fredriksson, K. 2007. Simple space-time trade-offs for AESA. In Proceedings of the 6th International Workshop on Experimental Algorithms (WEA'07). Springer, Berlin, 229--241.

Digital Library

Google Scholar

[49]

Fischer, J., Mäkinen, V., and Navarro, G. 2008. An(other) entropy-bounded compressed suffix tree. In Proceedings of the 19th Annual Symposium on Combinatorial Pattern Matching (CPM'08). Springer-Verlag, Berlin, 152--165.

Digital Library

Google Scholar

[50]

Ford, G., Hause, S., Le, D. X., and Thoma, G. R. 2001. Pattern matching techniques for correcting low confidence OCR words in a known context. In Proceedings of the 8th SPIE International Conference on Document Recognition and Retrieval. SPIE, Bellingham, WA, 241--249.

Google Scholar

[51]

Fredman, M. L., Komlós, J., and Szemerédi, E. 1984. Storing a sparse table with o(1) worst case access time. J. ACM 31, 3, 538--544.

Digital Library

Google Scholar

[52]

Fredriksson, K. 2007. Engineering efficient metric indexes. Pattern Recogn. Lett. 28, 1, 75--84.

Digital Library

Google Scholar

[53]

Friedman, J. H., Bentley, J. L., and Finkel, R. A. 1977. An algorithm for finding best matches in logarithmic expected time. ACM Trans. Math. Software 3, 3, 209--226.

Digital Library

Google Scholar

[54]

Gaede, V. and Günther, O. 1998. Multidimensional access methods. ACM Comput. Surv. 30, 2, 170--231.

Digital Library

Google Scholar

[55]

Giegerich, R., Kurtz, S., and Stoye, J. 2003. Efficient implementation of lazy suffix trees. Softw., Pract. Exper. 33, 11, 1035--1049.

Crossref

Google Scholar

[56]

Giladi, E., Walker, M. G., Wang, J. Z., and Volkmuth, W. 2002. SST: an algorithm for finding near-exact sequence matches in time proportional to the logarithm of the database size. Bioinformatics 18, 6, 873--877.

Crossref

Google Scholar

[57]

Glass, J. R. 2003. A probabilistic framework for segment-based speech recognition. Comput. Speech Lang. 17, 2-3, 137 -- 152.

Crossref

Google Scholar

[58]

Gollapudi, S. and Panigrahy, R. 2006. A dictionary for approximate string search and longest prefix search. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management (CIKM'06). ACM, New York, 768--775.

Digital Library

Google Scholar

[59]

Gorin, R. E. 1971. SPELL: Spelling check and correction program. http://pdp-10.trailing-edge.com/decuslib10-03/01/43,50270/spell.doc.html.

Google Scholar

[60]

Gravano, L., Ipeirotis, P., Jagadish, H. V., Koudas, N., Muthukrishnan, S., Pietarinen, L., and Srivastava, D. 2001. Using q-grams in a DBMS for approximate string processing. IEEE Data Eng. Bull. 24, 4, 28--34.

Google Scholar

[61]

Grossi, R. and Luccio, F. 1989. Simple and efficient string matching with k mismatches. Inf. Process. Lett. 33, 3, 113--120.

Crossref

Google Scholar

[62]

Grossi, R. and Vitter, J. S. 2005. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput. 35, 2, 378--407.

Digital Library

Google Scholar

[63]

Gusfield, D. 1999. Algorithms on Strings, Trees, and Sequences -- Computer Science and Computational Biology. Cambridge University Press, Cambridge, UK.

Digital Library

Google Scholar

[64]

Guttman, A. 1984. R-trees: a dynamic index structure for spatial searching. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM Press, New York, 47--57.

Digital Library

Google Scholar

[65]

Gwehenberger, G. 1968. Anwendung einer binären verweiskettenmethode beim aufbau von listen. Elektronische Rechenanlagen 10, 223--226.

Google Scholar

[66]

Hall, P. and Dowling, G. 1980. Approximate string matching. ACM Computing Surveys 12, 4, 381--402.

Digital Library

Google Scholar

[67]

Hellerstein, J. and Pfeffer, A. 1994. The RD-tree: An index structure for sets. Tech. rep. 1252, University of Wisconsin at Madison.

Google Scholar

[68]

Hoeffding, W. 1963. Probability inequalities for sums of bounded random variables. J. Am. Stat. Assoc. 58, 301, 13--30.

Crossref

Google Scholar

[69]

Hyyrö, H. 2003a. Bit-parallel approximate string matching algorithms with transposition. In Proceedings of the 10th International Symposium on String Processing. Springer-Verlag, Berlin, 95--107.

Crossref

Google Scholar

[70]

Hyyrö, H. 2003b. Practical methods for approximate string matching. Ph.D. thesis, Department of Computer Science, University of Tampere, Finland.

Google Scholar

[71]

Hyyrö, H. 2005. Bit-parallel approximate string matching algorithms with transposition. J. Discrete Algorithms 3, 2-4, 215--229.

Crossref

Google Scholar

[72]

James, E. B. and Partridge, D. P. 1973. Adaptive correction of program statements. Commun. ACM 16, 1, 27--37.

Digital Library

Google Scholar

[73]

Jeong, I.-S., Park, K.-W., Kang, S.-H., and Lim, H.-S. 2010. An efficient similarity search based on indexing in large dna databases. Comput. Biol. Chem. 34, 2, 131--136.

Digital Library

Google Scholar

[74]

Jokinen, P., Tarhio, J., and Ukkonen, E. 1996. A comparison of approximate string matching algorithms. Software Pract. Exp. 26, 12, 1439--1458.

Digital Library

Google Scholar

[75]

Jokinen, P. and Ukkonen, E. 1991. Two algorithms for approximate string matching in static texts. In Proceedings of the 16th International Symposium on Mathematical Foundations of Computer Science. Springer, Berlin, 240--248.

Google Scholar

[76]

Kahveci, T., Ljosa, V., and Singh, A. K. 2004. Speeding up whole-genome alignment by indexing frequency vectors. Bioinformatics 20, 13, 2122--2134.

Digital Library

Google Scholar

[77]

Kahveci, T. and Singh, A. 2001. An efficient index structure for string databases. In Proceedings of the 27th International Conference on Very Large Databases. Morgan Kaufmann, San Francisco, CA, 351--360.

Digital Library

Google Scholar

[78]

Klovstad, J. and Mondshein, L. 1975. The CASPERS linguistic analysis system. IEEE Trans. Acoust. Speech Signal Process. 23, 118--123.

Crossref

Google Scholar

[79]

Knuth, D. 1973. The Art of Computer Programming. Sorting and Searching, 1st ed., vol. 3. Addison-Wesley, Upper Saddle River, NJ.

Google Scholar

[80]

Knuth, D. 1997. The Art of Computer Programming. Sorting and Searching, 2d ed., vol. 3. Addison-Wesley, Upper Saddle River, NJ.

Google Scholar

[81]

Kondrak, G. 2003. Phonetic alignment and similarity. Computers and the Humanities 37, 3, 273--291.

Crossref

Google Scholar

[82]

Kuenning, G., Gorin, R. E., Willisson, P., Buehring, W., and Stevens, K. 1988. International spell: a fast screen-oriented spelling checker. http://www.lasr.cs.ucla.edu/geoff/ispell.html.

Google Scholar

[83]

Kukich, K. 1992. Technique for automatically correcting words in text. ACM Comput. Surv. 24, 2, 377--439.

Digital Library

Google Scholar

[84]

Kurtz, S. 1996. Approximate string searching under weighted edit distance. In Proceedings of the 3rd South American Workshop on String Processing. Carleton University Press, Ottawa, Ontario, 156--170.

Google Scholar

[85]

Kurtz, S. 1999. Reducing the space requirement of suffix trees. Softw. Pract. Exper. 29, 13, 1149--1171.

Digital Library

Google Scholar

[86]

Levenshtein, V. 1966. Binary codes capable of correcting deletions, insertions, and reversals. Doklady Akademii Nauk SSSR, 163, 4, 845--848.

Google Scholar

[87]

Lowrance, R. and Wagner, R. 1975. An extension of the string-to-string correction problem. J. ACM 22, 2, 177--183.

Digital Library

Google Scholar

[88]

Maly, K. 1976. Compressed tries. Commun. ACM 19, 7, 409--415.

Digital Library

Google Scholar

[89]

Manber, U. and Myers, G. 1990. Suffix arrays: a new method for on-line string searches. In Proceedings of the 1st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'90). Society for Industrial and Applied Mathematics, Philadephia, PA, 319--327.

Digital Library

Google Scholar

[90]

Manber, U. and Wu, S. 1994a. An algorithm for approximate membership checking with application to password security. Inf. Process. Lett. 50, 4, 191--197.

Digital Library

Google Scholar

[91]

Manber, U. and Wu, S. 1994b. GLIMPSE: a tool to search through entire file systems. In Proceedings of the USENIX Winter 1994 Technical Conference (WTEC'94). USENIX, Berkeley, CA, 4--4.

Digital Library

Google Scholar

[92]

McCreight, E. M. 1976. A space-economical suffix tree construction algorithm. J. ACM 23, 2, 262--272.

Digital Library

Google Scholar

[93]

Melichar, B. 1996. String matching with k differences by finite automata. In Proceedings of the International Congress on Pattern Recognition. IEEE, Los Alamitos, CA, 256--260.

Digital Library

Google Scholar

[94]

Micó, M. L., Oncina, J., and Vidal-Ruiz, E. 1994. A new version of the nearest-neighbour approximating and eliminating search algorithm (AESA) with linear preprocessing time and memory requirements. Pattern Recogn. Lett. 15, 1, 9--17.

Digital Library

Google Scholar

[95]

Mihov, S. and Schulz, K. U. 2004. Fast approximate string search in large dictionaries. Comput. Ling. 30, 4, 451--477.

Digital Library

Google Scholar

[96]

Mochizuki, H. and Hayashi, Y. 2000. An efficient retrieval algorithm of compound words using extendible hashing. Int. J. Comput. Process. Oriental Lang. 13, 1, 15--33.

Crossref

Google Scholar

[97]

Moore, T. and Edelman, B. 2010. Measuring the perpetrators and funders of typosquatting. In Proceedings of the 14th International Conference on Financial Cryptography and Data Security. Springer, Berlin.

Digital Library

Google Scholar

[98]

Mor, M. and Fraenkel, A. S. 1982. A hash code method for detecting and correcting spelling errors. Commun. ACM 25, 12, 935--938.

Digital Library

Google Scholar

[99]

Morrison, D. R. 1968. PATRICIA—practical algorithm to retrieve information coded in alphanumeric. J. ACM 15, 4, 514--534.

Digital Library

Google Scholar

[100]

Muth, R. and Manber, U. 1996. Approximate multiple string search. In Proceedings of the 7th Annual Symposium on Combinatorial Pattern Matching. Springer, Berlin, 75--86.

Digital Library

Google Scholar

[101]

Myers, E. 1994. A sublinear algorithm for approximate keyword searching. Algorithmica 12, 4/5, 345--374.

Google Scholar

[102]

Myers, E. 1999. A fast bit-vector algorithm for approximate string matching based on dynamic programming. J. ACM 46, 3, 395--415.

Digital Library

Google Scholar

[103]

Navarro, G. 1997a. Multiple approximate string matching by counting. In Proceedings of the 4th South American Workshop on String Processing. Carleton University Press, Ottawa, Ontario, 95--111.

Google Scholar

[104]

Navarro, G. 1997b. A partial deterministic automaton for approximate string matching. In Proceedings of the 4th South American Workshop on String Processing. Carleton University Press, Ottawa, Ontario, 112--124.

Google Scholar

[105]

Navarro, G. 2001a. A guided tour to approximate string matching. ACM Comput. Surv. 33, 1, 31--88.

Digital Library

Google Scholar

[106]

Navarro, G. 2001b. NR-grep: a fast and flexible pattern matching tool. Software Pract. Exp. 31, 13, 1265--1312.

Digital Library

Google Scholar

[107]

Navarro, G. and Baeza-Yates, R. 1998. A practical q-gram index for text retrieval allowing errors. CLEI Electron. J. 1, 2.

Google Scholar

[108]

Navarro, G. and Baeza-Yates, R. 2000. A hybrid indexing method for approximate string matching. J. Discrete Algorithms, 1, 1, 205--239.

Google Scholar

[109]

Navarro, G., Baeza-Yates, R., Sutinen, E., and Tarhio, J. 2001. Indexing methods for approximate string matching. IEEE Data Engin. Bull. 24, 4, 19--27.

Google Scholar

[110]

Navarro, G., Paredes, R., and Chávez, E. 2002. t-Spanners as a data structure for metric space searching. In Proceedings of the 9th International Symposium on String Processing and Information Retrieval (SPIRE'02), Springer, Berlin, 195--200.

Digital Library

Google Scholar

[111]

Navarro, G. and Raffinot, M. 2000. Fast and flexible string matching by combining bit-parallelism and suffix automata. J. Exp. Algorithmics, 5, 4.

Digital Library

Google Scholar

[112]

Navarro, G. and Salmela, L. 2009. Indexing variable length substrings for exact and approximate matching. In Proceedings of the 16th International Symposium on String Processing and Information Retrieval (SPIRE'09). Springer, Berlin, 214--221.

Digital Library

Google Scholar

[113]

Navarro, G., Sutinen, E., and Tarhio, J. 2005. Indexing text with approximate q-grams. J. Discrete Algorithms 3, 2-4, 157--175.

Crossref

Google Scholar

[114]

Needleman, S. B. and Wunsch, C. D. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 3, 443--453.

Crossref

Google Scholar

[115]

Ng, K. and Zue, V. W. 2000. Subword-based approaches for spoken document retrieval. Speech Commun. 32, 3, 157--186.

Digital Library

Google Scholar

[116]

Nievergelt, J., Hinterberger, H., and Sevcik, K. C. 1984. The grid file: An adaptable, symmetric multikey file structure. ACM Trans. Database Syst. 9, 1, 38--71.

Digital Library

Google Scholar

[117]

Nilsson, S. and Karlsson, G. 1999. IP-address lookup using LC-tries. IEEE J. Sel. Areas Commun. 17, 6, 1083--1092.

Digital Library

Google Scholar

[118]

Orenstein, J. 1982. Multidimensional tries used for associative searching. Inf. Process. Lett. 14, 4, 150--157.

Crossref

Google Scholar

[119]

Owolabi, O. 1996. Dictionary organizations for efficient similarity retrieval. J. Syst. Software 34, 2, 127--132.

Digital Library

Google Scholar

[120]

Owolabi, O. and McGregor, D. R. 1988. Fast approximate string matching. Software Pract. Exp. 18, 4, 387--393.

Digital Library

Google Scholar

[121]

Ozturk, O. and Ferhatosmanoglu, H. 2003. Effective indexing and filtering for similarity search in large biosequence databases. In Proceedings of the 3rd IEEE Symposium on BioInformatics and BioEngineering (BIBE'03). IEEE, Los Alamitos, CA 359.

Digital Library

Google Scholar

[122]

Pagh, R. and Rodler, F. 2001. Cuckoo hashing. In Proceedings of the 9th Annual European Symposium on Algorithms (ESA '01). Springer, Berlin, 121--133.

Digital Library

Google Scholar

[123]

Peterson, J. L. 1980. Computer programs for detecting and correcting spelling errors. Commun. ACM 23, 12, 676--687.

Digital Library

Google Scholar

[124]

Peterson, J. L. 1986. A note on undetected typing errors. Commun. ACM 29, 7, 633--637.

Digital Library

Google Scholar

[125]

Raman, R., Raman, V., and Satti, S. R. 2007. Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Trans. Algorithms 3, 4, 43.

Digital Library

Google Scholar

[126]

Riseman, E. M. and Hanson, A. R. 1974. A contextual postprocessing system for error correction using binary n-grams. IEEE Trans. Comput. 23, 5, 480--493.

Digital Library

Google Scholar

[127]

Robinson, J. 1981. The K-D-B-Tree: A search structure for large multidimensional dynamic indexes. In Proceedings of 1981 ACM SIGMOD International Conference on Management of Data. ACM, New York, 10--18.

Digital Library

Google Scholar

[128]

Roussopoulos, N. and Leifker, D. 1985. Direct spatial search on pictorial databases using packed R-trees. SIGMOD Rec. 14, 4, 17--31.

Digital Library

Google Scholar

[129]

Russo, L. and Oliveira, A. 2005. An efficient algorithm for generating super condensed neighborhoods. In Proceedings of the 16th Annual Combinatorial Pattern Matching Symposium (CPM'05). Springer, Berlin, 104--115.

Digital Library

Google Scholar

[130]

Russo, L. M. S., Navarro, G., and Oliveira, A. L. 2008. Fully-compressed suffix trees. In Proceedings of the 8th Latin American Symposium on Theoretical Informatics (LATIN'08), Springer, Berlin, 362--373.

Digital Library

Google Scholar

[131]

Russo, L. M. S., Navarro, G., Oliveira, A. L., and Morales, P. 2009. Approximate string matching with compressed indexes. Algorithms 2, 3, 1105--1136.

Crossref

Google Scholar

[132]

Sadakane, K. 2007. Compressed suffix trees with full functionality. Theor. Comp. Sys. 41, 4, 589--607.

Digital Library

Google Scholar

[133]

Sakoe, H. and Chiba, S. 1971. A dynamic programming approach to continuous speech recognition. In Proceedings of the 7th International Congress on Acoustics. 65--68.

Google Scholar

[134]

Samet, H. 1995. Spatial data structures. In Modern Database Systems in The Object Model, Interoperability and Beyond. Addison-Wesley, Upper Saddle River, NJ, 361--385.

Digital Library

Google Scholar

[135]

Samet, H. 2005. Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann, San Francisco, CA.

Digital Library

Google Scholar

[136]

Sankoff, D. 2000. The early introduction of dynamic programming into computational biology. Bioinformatics 16, 1, 41--47.

Crossref

Google Scholar

[137]

Schek, H. J. 1978. The reference string indexing method. In Proceedings of the 2nd Conference of the European Cooperation in Informatics. Springer-Verlag, Berlin, 432--459.

Digital Library

Google Scholar

[138]

Scholer, F., Williams, H. E., Yiannis, J., and Zobel, J. 2002. Compression of inverted indexes for fast query evaluation. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in information Retrieval. ACM, New York, 222--229.

Digital Library

Google Scholar

[139]

Schuegraf, E. J. 1973. Selection of equifrequent work fragments for information retrieval. Inf. Storage Retrieval 9, 12, 697--711.

Crossref

Google Scholar

[140]

Sellers, P. 1974. On the computation of evolutionary distances. SIAM J. Appl. Math. 26, 787--793.

Crossref

Google Scholar

[141]

Sellis, T. K., Roussopoulos, N., and Faloutsos, C. 1987. The R<sup>+</sup>-Tree: A dynamic index for multi-dimensional objects. In Proceedings of the 13th International Conference on Very Large Data Bases (VLDB'87). Morgan Kaufmann, San Francisco, CA, 507--518.

Digital Library

Google Scholar

[142]

Siegler, M., Witbrock, M. J., Slattery, S. T., Seymore, K., Jones, R. E., and Hauptmann, A. G. 1997. Experiments in Spoken Document Retrieval at CMU. In Proceedings of the 6th Text REtrieval Conference. National Institute of Standard and Technology, Gaithersburg, MD, 291--302.

Google Scholar

[143]

Skut, W. 2004. Incremental construction of minimal acyclic sequential transducers from unsorted data. In Proceedings of the 20th International Conference on Computational Linguistics (COLING'04). Association for Computational Linguistics, Stroudsburg, PA, 15.

Digital Library

Google Scholar

[144]

Sung, W.-K. 2008. Indexed approximate string matching. In Encyclopedia of Algorithms. 408--410.

Google Scholar

[145]

Sutinen, E. and Tarhio, J. 1996. Filtration with q-samples in approximate string matching. In Proceedings of the 7th Annual Symposium on Combinatorial Pattern Matching (CPM '96). Springer-Verlag, Berlin, 50--63.

Digital Library

Google Scholar

[146]

Uhlmann, J. 1991. Satisfying general proximity similarity queries with metric trees. Inf. Process. Lett. 40, 175--179.

Crossref

Google Scholar

[147]

Ukkonen, E. 1985a. Algorithms for approximate string matching. Inf. Control 64, 1-3, 100--118.

Digital Library

Google Scholar

[148]

Ukkonen, E. 1985b. Finding approximate patterns in strings. J. Algorithms 6, 1, 132--137.

Crossref

Google Scholar

[149]

Ukkonen, E. 1993. Approximate string matching over suffix trees. In Proceedings of the 4th Annual Symposium on Combinatorial Pattern Matching, Springer-Verlag, Berlin, 228--242.

Digital Library

Google Scholar

[150]

Ukkonen, E. 1995. On-line construction of suffix trees. Algorithmica 14, 249--260.

Digital Library

Google Scholar

[151]

Velichko, V. and Zagoruyko, N. 1970. Automatic recognition of 200 words. Int. J. Man-Machine Stud. 2, 3, 223--234.

Crossref

Google Scholar

[152]

Veronis, J. 1988. Computerized correction of phonographic errors. Computers and the Humanities 22, 1, 43--56.

Crossref

Google Scholar

[153]

Vidal, E. 1986. An algorithm for finding nearest neighbors in (approximately) constant average time. Pattern Recognit. Lett. 4, 3, 145--147.

Digital Library

Google Scholar

[154]

Vintsyuk, T. 1968. Speech discrimination by dynamic programming. Cybernetics 4, 1, 52--57.

Crossref

Google Scholar

[155]

Wagner, R. A. and Fischer, M. J. 1974. The string-to-string correction problem. J. ACM 21, 1, 168--173.

Digital Library

Google Scholar

[156]

Wang, B., Xie, L., and Wang, G. 2009. A two-tire index structure for approximate string matching with block moves. In Proceedings of the 2nd International Workshop on Managing Data Quality in Collaborative Information Systems and 1st International Workshop on Data and Process Provenance. Springer, Berlin, 197--211.

Digital Library

Google Scholar

[157]

Weber, R., Schek, H. J., and Blott, S. 1998. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In Proceedings of the 24th International Conference on Very Large Data Bases. Morgan Kaufmann, San Francisco, CA, 194--205.

Digital Library

Google Scholar

[158]

Weiner, P. 1973. Linear pattern matching algorithms. In Proceedings the 14th IEEE Symposium on Switching and Automata Theory. IEEE, Los Alamitos, CA, 1--11.

Digital Library

Google Scholar

[159]

Wilbur, W. J., Kim, W., and Xie, N. 2006. Spelling correction in the search engine. Information Retrieval 9, 5, 543--564.

Digital Library

Google Scholar

[160]

Willet, P. 1979. Document retrieval experiments using indexing vocabularies of varying size. II. Hashing, truncation, digram and trigram encoding of index terms. J. Doc. 35, 4, 296--305.

Crossref

Google Scholar

[161]

Woodland, P., Odell, J., Valtchev, V., and Young, S. 1994. Large vocabulary continuous speech recognition using HTK. In Proceedings of the IEEE International Conference on Acoustic, Speech, and Signal Processing. IEEE, Los Alamitos, CA, 3, 125--128.

Google Scholar

[162]

Wu, S. and Manber, U. 1992a. Agrep -- a fast approximate matching tool. In Proceedings of the USENIX Winter Technical Conference. USENIX, Berkeley, CA, 153--162.

Google Scholar

[163]

Wu, S. and Manber, U. 1992b. Fast text searching allowing errors. Commun. ACM 35, 10, 83--91.

Digital Library

Google Scholar

[164]

Wu, S., Manber, U., and Myers, E. 1996. A subquadratic algorithm for approximate limited expression matching. Algorithmica 15, 1, 50--67.

Digital Library

Google Scholar

[165]

Zaliznyak, M. 2006. Telefonica's dispute over an accented domain in Chile. Multilingual Search. Sep 30, 2006. http://www.multilingual-search.com/telefonica%C2%B4s-dispute-over-an-accented-domain-in-chile/30/09/2006.

Google Scholar

[166]

Zamora, E., Pollock, J., and Zamora, A. 1981. The use of trigram analysis for spelling error detection. Inf. Process. Manage. 17, 6, 305--316.

Crossref

Google Scholar

[167]

Zezula, P., Amato, G., Dohnal, V., and Batko, M. 2005. Similarity Search: The Metric Space Approach (Advances in Database Systems). Springer-Verlag, Berlin.

Crossref

Google Scholar

[168]

Zobel, J. and Dart, P. 1995. Finding approximate matches in large lexicons. Software Pract. Exp. 25, 2, 331--345.

Digital Library

Google Scholar

[169]

Zobel, J. and Dart, P. 1996. Phonetic string matching: lessons from information retrieval. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, 166--172.

Digital Library

Google Scholar

[170]

Zobel, J., Moffat, A., and Ramamohanarao, K. 1998. Inverted files versus signature files for text indexing. ACM Trans. Database Syst. 23, 4, 453--490.

Digital Library

Google Scholar

[171]

Zobel, J., Moffat, A., and Sacks-davis, R. 1993. Searching large lexicons for partially specified terms using compressed inverted files. In Proceedings of the 19th international Conference on Very Large Data Bases. Morgan Kaufmann, San Francisco, CA, 290--301.

Digital Library

Google Scholar

[172]

Zue, V., Glass, J., Phillips, M., and Seneff, S. 1989. The MIT SUMMIT speech recognition system: a progress report. In Proceedings of the Workshop on Speech and Natural Language (HLT '89). Association for Computational Linguistics, Morristown, NJ, 179--189.

Digital Library

Google Scholar

Cited By

View all

Deshmane AMartínez-de-Albéniz V(2025)Play it Again, Sam? Reference-Point Formation and Product Differentiation in the Music IndustryManagement Science10.1287/mnsc.2022.03319Online publication date: 27-Jan-2025
https://doi.org/10.1287/mnsc.2022.03319
Spisak NAthènes GDupic TMora TWalczak A(2024)Combining mutation and recombination statistics to infer clonal families in antibody repertoireseLife10.7554/eLife.8618113Online publication date: 9-Aug-2024
https://doi.org/10.7554/eLife.86181
Gamallo-Fernandez PRama-Maneiro EVidal JLama M(2024)VERONA: A python library for benchmarking deep learning in business process monitoringSoftwareX10.1016/j.softx.2024.10173426(101734)Online publication date: May-2024
https://doi.org/10.1016/j.softx.2024.101734
Show More Cited By

Index Terms

Recommendations

A guided tour to approximate string matching

We survey the current techniques to cope with the problem of string matching that allows errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online ...
Dictionary matching: review of the aho-corasick algorithm and vision for large dictionaries
ICIST '18: Proceedings of the 8th International Conference on Information Systems and Technologies

Pattern-matching techniques have recently been applied to network security applications such as intrusion detection, virus protection, and spam filters. The widely used the Aho-Corasick algorithm can simultaneously match multiple patterns while ...
Text indexing with errors

In this paper we address the problem of constructing an index for a text document or a collection of documents to answer various questions about the occurrences of a pattern when allowing a constant number of errors. In particular, our index can be ...

Reviews

Reviewer: Jerzy W. Jaromczyk

Anyone who has ever searched for a word (pattern) in a file or checked documents for spelling errors appreciates efficient algorithms that can quickly find relevant occurrences of the pattern, even if it is inexact or incomplete. This paper surveys a class of such algorithms: indexing methods for approximate searching. The paper is divided into sections that discuss fundamental concepts and definitions, describe the methods and algorithms, and provide a comparative analysis of the experimental results. The bibliography lists more than 180 references, with publication times spanning decades, from the 1960s to 2010. In spite of the voluminous material, thanks to the taxonomy introduced by the author, the presentation is clear and well organized. A rich appendix includes additional material, such as proofs of selected theorems. Numerous tables and figures augment the presentation. Furthermore, the main observations and key points are concisely itemized as lists, such as the "bird view" (Boytsov's term) of key conclusions from the experimental analysis of the search methods. The paper thoroughly describes intensive experiments that include detailed investigations of results that, on the surface, may appear to be counterintuitive. For example, with FB-tries, retrieval times for longer patterns are better than for short ones; Boytsov conducted additional experiments to understand this phenomenon. This paper would be of interest not only to computer scientists (for whom numerous open questions are listed in the conclusion section), but also to researchers and practitioners from a variety of areas, who want to select the best search methods for such tasks as searching in dictionaries of DNA sequences. Online Computing Reviews Service

Access critical reviews of Computing literature here

Become a reviewer for Computing Reviews.

Comments

Information & Contributors

Information

Published In

cover image ACM Journal of Experimental Algorithmics

ACM Journal of Experimental Algorithmics Volume 16, Issue

2011

411 pages

ISSN:1084-6654

EISSN:1084-6654

DOI:10.1145/1963190

Issue’s Table of Contents

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 28 May 2011

Published in JEA Volume 16

Author Tags

Qualifiers

Article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

45
Total Citations
View Citations
3,740
Total Downloads

Downloads (Last 12 months)82
Downloads (Last 6 weeks)11

Reflects downloads up to 08 Feb 2025

Other Metrics

View Author Metrics

Citations

Cited By

View all

Deshmane AMartínez-de-Albéniz V(2025)Play it Again, Sam? Reference-Point Formation and Product Differentiation in the Music IndustryManagement Science10.1287/mnsc.2022.03319Online publication date: 27-Jan-2025
https://doi.org/10.1287/mnsc.2022.03319
Spisak NAthènes GDupic TMora TWalczak A(2024)Combining mutation and recombination statistics to infer clonal families in antibody repertoireseLife10.7554/eLife.8618113Online publication date: 9-Aug-2024
https://doi.org/10.7554/eLife.86181
Gamallo-Fernandez PRama-Maneiro EVidal JLama M(2024)VERONA: A python library for benchmarking deep learning in business process monitoringSoftwareX10.1016/j.softx.2024.10173426(101734)Online publication date: May-2024
https://doi.org/10.1016/j.softx.2024.101734
Hur BVerspoor KBaldwin THardefeldt LPfeiffer CMansfield CScarborough RGilkerson J(2024)Using natural language processing and patient journey clustering for temporal phenotyping of antimicrobial therapies for cat bite abscessesPreventive Veterinary Medicine10.1016/j.prevetmed.2023.106112223(106112)Online publication date: Feb-2024
https://doi.org/10.1016/j.prevetmed.2023.106112
Zellner LRauch SSontheim JSeidl T(2024)On Diverse and Precise Recommendations for Small and Medium-Sized EnterprisesAdvances in Knowledge Discovery and Data Mining10.1007/978-981-97-2262-4_10(118-130)Online publication date: 25-Apr-2024
https://doi.org/10.1007/978-981-97-2262-4_10
Giordano VChiarello FMelluso NFantoni GBonaccorsi A(2023)Text and Dynamic Network Analysis for Measuring Technological Convergence: A Case Study on Defense Patent DataIEEE Transactions on Engineering Management10.1109/TEM.2021.307823170:4(1490-1503)Online publication date: Apr-2023
https://doi.org/10.1109/TEM.2021.3078231
Cahuantzi RChen XGüttel S(2023)A Comparison of LSTM and GRU Networks for Learning Symbolic SequencesIntelligent Computing10.1007/978-3-031-37963-5_53(771-785)Online publication date: 20-Aug-2023
https://doi.org/10.1007/978-3-031-37963-5_53
Gonthier C(2022)An easy way to improve scoring of memory span tasks: The edit distance, beyond “correct recall in the correct serial position”Behavior Research Methods10.3758/s13428-022-01908-255:4(2021-2036)Online publication date: 6-Jul-2022
https://doi.org/10.3758/s13428-022-01908-2
Wheeler DBoyle JBarsell DGlasgow TMcClernon FOliver JFuemmeler B(2022)Associations of Alcohol and Tobacco Retail Outlet Rates with Neighborhood DisadvantageInternational Journal of Environmental Research and Public Health10.3390/ijerph1903113419:3(1134)Online publication date: 20-Jan-2022
https://doi.org/10.3390/ijerph19031134
Petty THannig JHuszar TIyer H(2022)A New String Edit Distance and ApplicationsAlgorithms10.3390/a1507024215:7(242)Online publication date: 12-Jul-2022
https://doi.org/10.3390/a15070242
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Abstract

Supplementary Material

References

Cited By

Index Terms

Recommendations

A guided tour to approximate string matching

Dictionary matching: review of the aho-corasick algorithm and vision for large dictionaries

Text indexing with errors

Reviews

Access critical reviews of Computing literature here

Comments

Information

Published In

Publisher

Publication History

Author Tags

Qualifiers

Contributors

Other Metrics

Bibliometrics

Article Metrics

Other Metrics

Citations

Cited By

Login options

Full Access

View options

PDF

eReader

Share

Share this Publication link

Share on social media

Affiliations