The IndexReader class provides methods for accessing and manipulating an inverted index.
IMPORTANT NOTE: Be aware of whether a method takes or returns analyzed or unanalyzed terms.
"Analysis" refers to processing by a Lucene Analyzer, which typically includes tokenization, stemming, stopword removal, etc.
For example, if a method expects an unanalyzed term but is called with an analyzed one, it will analyze the term a second time. Sometimes the result of re-analyzing an already analyzed term is itself a valid term in the index, in which case the method returns incorrect results without triggering any warning or error.
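To make the double-analysis pitfall concrete, here is a toy sketch that does not use Lucene at all. The `toy_analyze` function, the `toy_index` dictionary, and the `toy_lookup_df` helper are hypothetical stand-ins invented for illustration; a real Analyzer behaves differently, but the failure mode is the same.

```python
def toy_analyze(term: str) -> str:
    """Toy stand-in for an analyzer: lowercase, then strip one trailing 's'.

    This is NOT Lucene's analyzer; it only mimics the fact that a stemming
    rule can fire again on its own output.
    """
    term = term.lower()
    return term[:-1] if term.endswith("s") and len(term) > 1 else term


# Hypothetical dictionary of analyzed terms and their document frequencies.
toy_index = {"clas": 12, "cla": 3}


def toy_lookup_df(term: str, analyze: bool = True) -> int:
    """Look up df, optionally analyzing the term first."""
    key = toy_analyze(term) if analyze else term
    return toy_index.get(key, 0)


once = toy_analyze("class")                    # analyzed form of the raw term
correct = toy_lookup_df("class")               # raw term, analyzed once
double = toy_lookup_df(once)                   # analyzed term, analyzed again: wrong entry
skipped = toy_lookup_df(once, analyze=False)   # analyzed term, analysis skipped
print(once, correct, double, skipped)
```

The mistaken call still returns a number, just for the wrong dictionary entry, which is why this kind of error goes unnoticed.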
Initialize the class as follows:
from pyserini.index.lucene import IndexReader
# Initialize from a pre-built index:
index_reader = IndexReader.from_prebuilt_index('robust04')
# Initialize from an index path:
index_reader = IndexReader('indexes/index-robust04-20191213/')
Use terms() to grab an iterator over all terms in the collection, i.e., the dictionary.
Note that these terms are analyzed.
Here, we only print out the first 10:
import itertools
for term in itertools.islice(index_reader.terms(), 10):
    print(f'{term.term} (df={term.df}, cf={term.cf})')
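The same iterator can feed simple aggregate queries over the dictionary. Below is a minimal sketch that picks the terms with the highest document frequency; the helper name `top_terms_by_df` and the `ToyTerm` stand-in are hypothetical, and the only assumption about the real objects is that they expose `.term`, `.df`, and `.cf` as in the loop above.

```python
import heapq
from collections import namedtuple


def top_terms_by_df(terms, k=10):
    """Return the k entries with the highest df from an iterable of term objects.

    heapq.nlargest keeps only k items in memory at a time, which helps when
    the dictionary is large.
    """
    return heapq.nlargest(k, terms, key=lambda t: t.df)


# Stand-in objects so the sketch runs without an index (values are made up).
ToyTerm = namedtuple("ToyTerm", ["term", "df", "cf"])
toy_terms = [ToyTerm("citi", 40, 95), ToyTerm("plan", 75, 120), ToyTerm("zebra", 2, 2)]

top2 = top_terms_by_df(iter(toy_terms), k=2)
print([(t.term, t.df) for t in top2])
```

With a real index this would be called as `top_terms_by_df(index_reader.terms(), 10)`; note that it has to walk the entire dictionary, which can take a while on large collections.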
Here's how to fetch term statistics for a particular (unanalyzed) query term, "cities" in this case:
term = 'cities'
# Look up its document frequency (df) and collection frequency (cf).
# Note, we use the unanalyzed form:
df, cf = index_reader.get_term_counts(term)
print(f'term "{term}": df={df}, cf={cf}')
What if we want to fetch term statistics for an analyzed term?
This can be accomplished by setting the analyzer argument to None:
term = 'cities'
# Analyze the term.
analyzed = index_reader.analyze(term)
print(f'The analyzed form of "{term}" is "{analyzed[0]}"')
# Skip term analysis:
df, cf = index_reader.get_term_counts(analyzed[0], analyzer=None)
print(f'term "{term}": df={df}, cf={cf}')
Here's how to fetch and traverse postings:
# Fetch and traverse postings for an unanalyzed term:
postings_list = index_reader.get_postings_list(term)
for posting in postings_list:
    print(f'docid={posting.docid}, tf={posting.tf}, pos={posting.positions}')
# Fetch and traverse postings for an analyzed term:
postings_list = index_reader.get_postings_list(analyzed[0], analyzer=None)
for posting in postings_list:
    print(f'docid={posting.docid}, tf={posting.tf}, pos={posting.positions}')
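A postings list also lets you recompute term statistics by hand: the number of postings is the document frequency and the sum of the tf values is the collection frequency. The sketch below uses a hypothetical helper, `summarize_postings`, and made-up `ToyPosting` objects that only mirror the `.docid`, `.tf`, and `.positions` attributes used above. Treating a missing term as `None` is an assumption made here for safety.

```python
from collections import namedtuple


def summarize_postings(postings_list):
    """Derive (df, cf) from a postings list: df counts postings, cf sums their tf.

    Returns (0, 0) for None or an empty list.
    """
    if not postings_list:
        return 0, 0
    return len(postings_list), sum(p.tf for p in postings_list)


# Stand-in postings so the sketch runs without an index (values are made up).
ToyPosting = namedtuple("ToyPosting", ["docid", "tf", "positions"])
toy_postings = [
    ToyPosting(docid=3, tf=2, positions=[4, 17]),
    ToyPosting(docid=8, tf=1, positions=[0]),
]

df_check, cf_check = summarize_postings(toy_postings)
print(f'df={df_check}, cf={cf_check}')
```

On a real index, the pair computed this way should agree with what get_term_counts reports for the same term, which makes it a handy sanity check on analyzed versus unanalyzed input.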
Here's how to fetch the document vector for a document:
doc_vector = index_reader.get_document_vector('FBIS4-67701')
print(doc_vector)
The result is a dictionary where the keys are the analyzed terms and the values are the term frequencies.
If you want to know the positions of each term in the document, you can use get_term_positions:
term_positions = index_reader.get_term_positions('FBIS4-67701')
print(term_positions)
The result is a dictionary where the keys are the analyzed terms and the values are the positions at which each term occurs in the document.
If you want to reconstruct the document using the position information, you can do this:
doc = []
for term, positions in term_positions.items():
    for p in positions:
        doc.append((term, p))
doc = ' '.join([t for t, p in sorted(doc, key=lambda x: x[1])])
print(doc)
The reconstructed document contains analyzed terms while doc.contents() contains unanalyzed terms.
Building on the instructions above, to compute the tf-idf representation of a document, do something like this:
tf = index_reader.get_document_vector('FBIS4-67701')
df = {term: (index_reader.get_term_counts(term, analyzer=None))[0] for term in tf.keys()}
The two dictionaries hold the tf and df statistics, from which it is easy to assemble the tf-idf representation. However, BM25 scores often work better than tf-idf. To compute the BM25 score for a particular term in a document:
# Note that the keys of get_document_vector() are already analyzed, so we set analyzer to None.
bm25_score = index_reader.compute_bm25_term_weight('FBIS4-67701', 'citi', analyzer=None)
print(bm25_score)
# Alternatively, we pass in the unanalyzed term:
bm25_score = index_reader.compute_bm25_term_weight('FBIS4-67701', 'city')
print(bm25_score)
And so, to compute the BM25 vector of a document:
tf = index_reader.get_document_vector('FBIS4-67701')
bm25_vector = {term: index_reader.compute_bm25_term_weight('FBIS4-67701', term, analyzer=None) for term in tf.keys()}
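For completeness, the tf and df dictionaries built earlier can be combined in the same per-term fashion. The sketch below is one common tf-idf variant, `tf * log(N / df)`; the guide does not prescribe a specific formula, so treat the weighting as an assumption. The helper name `tfidf_vector` and the toy numbers are hypothetical.

```python
import math


def tfidf_vector(tf, df, num_docs):
    """Combine tf and df dictionaries into tf-idf weights.

    Uses weight = tf * log(num_docs / df), one common variant among many.
    Terms with a missing or zero df are skipped to avoid division by zero.
    """
    return {
        term: freq * math.log(num_docs / df[term])
        for term, freq in tf.items()
        if df.get(term, 0) > 0
    }


# Made-up statistics so the sketch runs without an index.
toy_tf = {"citi": 3, "plan": 1}
toy_doc_freqs = {"citi": 10, "plan": 100}
toy_vector = tfidf_vector(toy_tf, toy_doc_freqs, num_docs=1000)
print(toy_vector)
```

With a real index, the document count could come from `index_reader.stats()['documents']`, giving `tfidf_vector(tf, df, index_reader.stats()['documents'])`.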
Another useful feature is to compute the score of a specific document with respect to a query, with the compute_query_document_score method.
For example:
query = 'hubble space telescope'
docids = ['LA071090-0047', 'FT934-5418', 'FT921-7107', 'LA052890-0021', 'LA070990-0052']
for i, docid in enumerate(docids, start=1):
    score = index_reader.compute_query_document_score(docid, query)
    print(f'{i:2} {docid:15} {score:.5f}')
The scores should be very close (agreeing when rounded at the 4th decimal place) to the results above, but not exactly the same because search performs additional score manipulation to break ties during ranking.
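If the docids are not already in rank order, the per-document scores can be sorted to produce a ranking. This is a minimal sketch with a hypothetical helper, `rank_by_score`, that accepts any scoring callable; the toy scores below are made up and stand in for calls to compute_query_document_score.

```python
def rank_by_score(docids, score_fn):
    """Score each docid with score_fn and return (docid, score) pairs, best first."""
    scored = [(docid, score_fn(docid)) for docid in docids]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up scores so the sketch runs without an index.
toy_scores = {"D1": 4.2, "D2": 7.9, "D3": 0.5}
ranking = rank_by_score(list(toy_scores), toy_scores.get)
for rank, (docid, score) in enumerate(ranking, start=1):
    print(f'{rank:2} {docid:15} {score:.5f}')
```

With a real index, the callable would be something like `lambda d: index_reader.compute_query_document_score(d, query)`.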
Fetching index statistics is simple:
index_reader.stats()
Output is something like this:
{'total_terms': 174540872,
'documents': 528030,
'non_empty_documents': 528030,
'unique_terms': 923436}
Note that unless the underlying index was built with the -optimize option (i.e., merging all index segments into a single segment), unique_terms will show -1.
Nope, that's not a bug.
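The statistics dictionary is easy to post-process. The sketch below derives the average number of indexed terms per document and turns the -1 sentinel into `None` so it is not mistaken for a real count. The helper name `summarize_stats` is hypothetical; the input is the example output shown above.

```python
def summarize_stats(stats):
    """Derive average document length and flag an unavailable unique_terms count."""
    docs = stats["documents"]
    avg_len = stats["total_terms"] / docs if docs else 0.0
    unique = stats["unique_terms"]
    return {
        "avg_doc_length": avg_len,
        # -1 means the count is unavailable, not that there are no terms.
        "unique_terms": None if unique == -1 else unique,
    }


example_stats = {
    "total_terms": 174540872,
    "documents": 528030,
    "non_empty_documents": 528030,
    "unique_terms": 923436,
}
summary = summarize_stats(example_stats)
print(f"{summary['avg_doc_length']:.2f} terms per document on average")
```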
Here's how to dump out all the document vectors with BM25 weights in Pyserini's JSONL vector format:
# You must specify a file path for the .jsonl file
index_reader.dump_documents_BM25('collections/cacm_documents_bm25_dump.jsonl')
Output in the .jsonl file is something like this:
{"id": "CACM-0001", "vector": {"22": 1.2635996341705322, "perli": 2.813838481903076, "28": 1.4853038787841797, "ca581203": 3.889439582824707, "languag": 1.0462608337402344, "algebra": 1.9220843315124512, "preliminari": 2.5628812313079834, "3184": 1.5415544509887695, "196": 2.208385944366455, "210": 1.8753266334533691, "398": 1.9435224533081055, "410": 1.9893245697021484, "214": 2.6431477069854736, "91": 2.813838481903076, "decemb": 1.0904579162597656, "1958": 2.217474937438965, "1978": 0.03820383548736572, "53": 2.0858259201049805, "intern": 2.1584203243255615, "cacm": 0.00023746490478515625, "samelson": 3.230319023132324, "1273": 2.6431477069854736, "j": 0.6906000375747681, "k": 1.413696527481079, "march": 0.3245110511779785, "164": 2.774796485900879, "165": 3.0729784965515137, "1": 1.9030036926269531, "100": 2.3613317012786865, "123": 1.7414944171905518, "642": 2.0955820083618164, "1883": 2.7047135829925537, "1982": 2.1930222511291504, "324": 2.614957094192505, "5": 0.00014519691467285156, "6": 0.8225016593933105, "205": 1.8882520198822021, "8": 1.1494452953338623, "jb": 0.033278584480285645, "report": 1.7513933181762695, "669": 1.8384160995483398, "pm": 0.18731093406677246, "43": 1.9893245697021484}}
{"id": "CACM-0002", "vector": {"22": 1.5182371139526367, "cacm": 0.0002853870391845703, "sugai": 4.673230171203613, "29": 2.147885799407959, "subtract": 3.3808765411376953, "ca581202": 4.673230171203613, "i": 1.7500755786895752, "march": 0.3899056911468506, "comput": 0.7604131698608398, "2": 1.6285443305969238, "extract": 3.0503158569335938, "5": 0.0001285076141357422, "repeat": 3.487149238586426, "root": 2.429866313934326, "8": 1.3810787200927734, "jb": 0.039984822273254395, "decemb": 1.310204029083252, "1958": 2.664334774017334, "pm": 0.22505736351013184, "1978": 0.04590260982513428, "digit": 1.9418766498565674}}
...
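Since each line of the dump is a standalone JSON object, it can be read back one line at a time with the standard library. The sketch below lists the highest-weighted terms per document; the helper name `top_weighted_terms` and the shortened sample record are hypothetical, and the only assumption about the file is the `id` / `vector` shape shown above.

```python
import io
import json


def top_weighted_terms(lines, k=3):
    """Yield (doc id, top-k (term, weight) pairs) for each JSONL line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        ranked = sorted(record["vector"].items(), key=lambda kv: kv[1], reverse=True)
        yield record["id"], ranked[:k]


# A made-up record in the same shape as the dump, so the sketch runs without a file.
sample = io.StringIO('{"id": "DOC-1", "vector": {"alpha": 0.5, "beta": 2.25, "gamma": 1.0}}\n')
top = list(top_weighted_terms(sample, k=2))
print(top)
```

For the real dump, pass an open file handle instead of the `StringIO` object, e.g. `with open('collections/cacm_documents_bm25_dump.jsonl') as f: ...`.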
Given vectors of weights in Pyserini's JSONL vector format, the weights can be quantized as below:
dump_file_path = 'collections/cacm_documents_bm25_dump.jsonl'
quantized_file_path = 'collections/cacm_documents_bm25_dump_quantized.jsonl'
index_reader.dump_documents_BM25(dump_file_path)
index_reader.quantize_weights(dump_file_path, quantized_file_path)
Output in the .jsonl file for the quantized weight vectors is something like this:
{"id": "CACM-0001", "vector": {"22": 47, "perli": 104, "28": 55, "ca581203": 143, "languag": 39, "algebra": 71, "preliminari": 95, "3184": 57, "196": 82, "210": 69, "398": 72, "410": 74, "214": 98, "91": 104, "decemb": 41, "1958": 82, "1978": 2, "53": 77, "intern": 80, "cacm": 1, "samelson": 119, "1273": 98, "j": 26, "k": 52, "march": 12, "164": 102, "165": 113, "1": 70, "100": 87, "123": 64, "642": 77, "1883": 100, "1982": 81, "324": 96, "5": 1, "6": 31, "205": 70, "8": 43, "jb": 2, "report": 65, "669": 68, "pm": 7, "43": 74}}
{"id": "CACM-0002", "vector": {"22": 56, "cacm": 1, "sugai": 172, "29": 79, "subtract": 125, "ca581202": 172, "i": 65, "march": 15, "comput": 28, "2": 60, "extract": 112, "5": 1, "repeat": 129, "root": 90, "8": 51, "jb": 2, "decemb": 49, "1958": 98, "pm": 9, "1978": 2, "digit": 72}}
...
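This guide does not spell out the formula quantize_weights uses, so the following is only an illustrative sketch of one linear scheme that maps float weights onto positive integers. The function name `quantize_vector_weights`, the `bits` parameter, and the exact rounding are assumptions for illustration and may not match the library's behavior.

```python
import math


def quantize_vector_weights(vectors, bits=8):
    """Linearly map float weights onto integers in [1, 2**bits - 1].

    The global minimum maps to 1 and the global maximum to 2**bits - 1.
    Illustrative only; not necessarily the formula quantize_weights applies.
    """
    all_weights = [w for vec in vectors.values() for w in vec.values()]
    lo, hi = min(all_weights), max(all_weights)
    span = (hi - lo) or 1.0  # avoid division by zero when all weights are equal
    top = 2 ** bits - 1
    return {
        docid: {t: 1 + math.floor((top - 1) * (w - lo) / span) for t, w in vec.items()}
        for docid, vec in vectors.items()
    }


# Made-up vectors so the sketch runs without a dump file.
toy_vectors = {"D1": {"a": 0.0, "b": 2.0}, "D2": {"c": 4.0}}
quantized = quantize_vector_weights(toy_vectors)
print(quantized)
```

Starting the range at 1 rather than 0 keeps every term that appears in a document at a nonzero impact, which matches the sample output above where even the smallest weights become 1.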