A probabilistic approach to automatic keyword indexing. Part I. On the distribution of specialty words in a technical literature

SP Harter. Journal of the American Society for Information Science, 1975. Wiley Online Library.
Abstract
The problem studied in this research is that of developing a set of formal statistical rules for the purpose of identifying the keywords of a document, that is, words likely to be useful as index terms for that document. The research was prompted by the observation, made by a number of writers, that non‐specialty words, words which possess little value for indexing purposes, tend to be distributed at random in a collection of documents. In contrast, specialty words are not so distributed.
In Part I of the study, a mixture of two Poisson distributions is examined in detail as a model of specialty word distribution, and formulas expressing the three parameters of the model in terms of empirical frequency statistics are derived. The fit of the model is tested on an experimental document collection and found to be acceptable for the purposes of the study. A measure intended to identify specialty words, consistent with the 2‐Poisson model, is proposed and evaluated.
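To make the three-parameter model concrete, the following is a minimal sketch of a mixture of two Poisson distributions, with a textbook method-of-moments fit based on the first three factorial moments of a word's within-document counts. It is an illustration under stated assumptions and not necessarily the formulas derived in the paper: the function names, the parameter names (`pi` for the mixing proportion, `lam1` and `lam2` for the two Poisson means) and the choice of factorial moments as the "empirical frequency statistics" are all assumptions made here.

```python
import math
from typing import Sequence, Tuple


def poisson_pmf(k: int, lam: float) -> float:
    """Probability that a Poisson(lam) variable equals k."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def two_poisson_pmf(k: int, pi: float, lam1: float, lam2: float) -> float:
    """Mixture: with probability pi the count is Poisson(lam1), else Poisson(lam2)."""
    return pi * poisson_pmf(k, lam1) + (1.0 - pi) * poisson_pmf(k, lam2)


def factorial_moments(counts: Sequence[int]) -> Tuple[float, float, float]:
    """Sample estimates of E[X], E[X(X-1)], E[X(X-1)(X-2)].

    `counts` holds one word's occurrence count in each document.
    """
    n = len(counts)
    m1 = sum(counts) / n
    m2 = sum(c * (c - 1) for c in counts) / n
    m3 = sum(c * (c - 1) * (c - 2) for c in counts) / n
    return m1, m2, m3


def fit_from_moments(m1: float, m2: float, m3: float) -> Tuple[float, float, float]:
    """Solve for (pi, lam1, lam2) with lam1 > lam2 from three factorial moments.

    For the mixture, the k-th factorial moment is pi*lam1**k + (1-pi)*lam2**k,
    so lam1 and lam2 are the roots of x**2 - s*x + p = 0 with s and p as below.
    """
    denom = m2 - m1 ** 2  # equals variance minus mean; about 0 for a single Poisson
    if denom <= 0:
        raise ValueError("no overdispersion: a single Poisson fits these moments")
    s = (m3 - m1 * m2) / denom  # lam1 + lam2
    p = s * m1 - m2             # lam1 * lam2
    disc = s * s - 4.0 * p
    if disc <= 0:
        raise ValueError("moments are not consistent with two distinct means")
    root = math.sqrt(disc)
    lam1 = (s + root) / 2.0
    lam2 = (s - root) / 2.0
    pi = (m1 - lam2) / (lam1 - lam2)
    return pi, lam1, lam2


def fit_two_poisson(counts: Sequence[int]) -> Tuple[float, float, float]:
    """Convenience wrapper: estimate the three parameters from raw counts."""
    return fit_from_moments(*factorial_moments(counts))
```

In this sketch, a word whose counts show no overdispersion (variance roughly equal to the mean) is rejected by the fit, which matches the observation that non‐specialty words behave like a single random (Poisson) process. On small samples, moment estimates can be unstable or fall outside the valid range, so the results should be checked before use.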