Optimal signature extraction and information loss

Published: 01 September 1987

Abstract

Signature files seem to be a promising access method for text and attributes. In this method, the documents (or records) are stored sequentially in one file (the "text file"), while abstractions of the documents ("signatures") are stored sequentially in another file (the "signature file"). To resolve a query, the signature file is scanned first, and many nonqualifying documents are immediately rejected. We develop a framework that includes primary key hashing, multiattribute hashing, and signature files. Our goal is to find the optimal signature extraction method.
The main contribution of this paper is a set of optimal and efficient suboptimal algorithms for assigning words to signatures in several environments. A second contribution is the use of information theory to study the relationship between the false drop probability Fd and the information that is lost during signature extraction. We give tight lower bounds on the achievable Fd and show that a simple relationship holds between the two quantities in the case of optimal signature extraction with uniform occurrence and query frequencies. We also examine hashing as a method of mapping words to signatures (instead of the optimal assignment), and show that the same relationship holds between Fd and loss, indicating that an invariant may exist between these two quantities for every signature extraction method.
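
As a rough illustration of the access method described above, the following minimal sketch maps words to signatures by hashing, scans the signature file first, and only then checks the surviving candidates against the text file. It is not the paper's optimal assignment algorithm. The function names, the 8-bit signature width, and the use of SHA-256 are all assumptions made for illustration. A document that passes the signature test but does not contain the query word is counted as a false drop.

```python
import hashlib


def word_signature(word: str, bits: int = 8) -> int:
    """Map a word to a `bits`-bit signature by hashing.

    Assumption: SHA-256 stands in for whatever hash function is used;
    the abstract does not name one.
    """
    digest = hashlib.sha256(word.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % (1 << bits)


def build_signature_file(text_file, bits=8):
    """Build one entry per document: the set of its word signatures."""
    return [{word_signature(w, bits) for w in doc.split()} for doc in text_file]


def search(word, text_file, signature_file, bits=8):
    """Scan the signature file first, then verify candidates in the text file.

    Returns (true_matches, false_drops) as lists of document indices.
    """
    sig = word_signature(word, bits)
    true_matches, false_drops = [], []
    for i, doc_sigs in enumerate(signature_file):
        if sig not in doc_sigs:
            continue  # rejected without reading the document itself
        doc_words = {w.lower() for w in text_file[i].split()}
        if word.lower() in doc_words:
            true_matches.append(i)
        else:
            false_drops.append(i)  # signature matched, document did not
    return true_matches, false_drops


if __name__ == "__main__":
    text_file = [
        "signature files for text",
        "hashing maps words to signatures",
        "information theory",
    ]
    signature_file = build_signature_file(text_file)
    print(search("hashing", text_file, signature_file))
```

In this sketch, shrinking `bits` makes the signature file smaller but lets more distinct words share a signature, so more nonqualifying documents survive the first scan. This is the trade-off between lost information and false drops that the abstract refers to.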

Published In

ACM Transactions on Database Systems, Volume 12, Issue 3
Sept. 1987
199 pages
ISSN: 0362-5915
EISSN: 1557-4644
DOI: 10.1145/27629

Publisher

Association for Computing Machinery

New York, NY, United States
