retriv: A Blazing-Fast Python Search Engine.
⚡️ Introduction
retriv is a fast search engine implemented in Python, leveraging Numba for high-speed vector operations and automatic parallelization. It offers a user-friendly interface to index and search your document collection and allows you to automatically tune the underlying retrieval model, BM25.
✨ Features
Retrieval Models
retriv implements BM25 as a retrieval model. Alternatives will probably be added in the future.
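For readers unfamiliar with BM25, the sketch below scores tokenized documents against a query using one common formulation of the model with a smoothed, always-positive IDF. It is an illustration only: retriv's own implementation is Numba-based and may use a different variant, and the function name `bm25_scores` and the defaults `k1=1.2` and `b=0.75` are assumptions chosen for this example.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms`.

    k1 controls term-frequency saturation; b controls length normalization.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF (assumed variant), positive for every term.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "generals gathered in their masses".split(),
    "just like witches at black masses".split(),
    "evil minds that plot destruction".split(),
]
scores = bm25_scores(["witches", "masses"], docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

The second document matches both query terms and ranks first; the third matches none and scores zero. The absolute scores will not necessarily match retriv's output.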
Multi-search & Batch-search
In addition to the standard search functionality, retriv provides two additional search methods: msearch and bsearch.
- msearch allows computing the results for multiple queries at once, leveraging the automatic parallelization features offered by Numba.
- bsearch is similar to msearch but automatically generates batches of queries to evaluate and allows dynamic writing of the search results to disk in JSONl format. bsearch is very useful for pre-computing BM25 results for hundreds of thousands or even millions of queries without hogging your RAM. Pre-computed results can be leveraged for negative sampling during the training of Neural Models for Information Retrieval.
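The batching idea described for bsearch can be sketched in plain Python as follows. This is not retriv's code: `batched`, `batched_search_to_jsonl`, `fake_search`, the record layout, and the output file name are all hypothetical, and the search call is replaced by a stand-in so the example runs on its own.

```python
import json
import os
import tempfile
from itertools import islice


def batched(iterable, batch_size):
    """Yield successive lists of at most `batch_size` items."""
    it = iter(iterable)
    while batch := list(islice(it, batch_size)):
        yield batch


def batched_search_to_jsonl(queries, search_fn, path, batch_size=2):
    """Run `search_fn` on each batch and write one JSON line per query."""
    written = 0
    with open(path, "w", encoding="utf-8") as f:
        for batch in batched(queries, batch_size):
            results = search_fn(batch)  # {query_id: {doc_id: score}}
            for q_id, doc_scores in results.items():
                f.write(json.dumps({"id": q_id, "results": doc_scores}) + "\n")
                written += 1
    return written


def fake_search(batch):
    # Stand-in for a real multi-query search call.
    return {q["id"]: {"doc_1": 1.0} for q in batch}


queries = [{"id": f"q_{i}", "text": "witches masses"} for i in range(5)]
out_path = os.path.join(tempfile.mkdtemp(), "results.jsonl")
n_written = batched_search_to_jsonl(queries, fake_search, out_path)
```

Because each batch is written out before the next one is computed, only one batch of results is held in memory at a time.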
AutoTune
retriv offers an automatic tuning functionality that allows you to tune BM25's parameters with a single function call. Under the hood, retriv leverages Optuna, a hyperparameter optimization framework, and ranx, an Information Retrieval evaluation library, to test several parameter configurations for BM25 and choose the best one.
Stemmers
Stemmers reduce words to their word stem, base or root form.
retriv supports the following stemmers:
- snowball (default)
The following languages are supported by Snowball Stemmer: Arabic, Basque, Catalan, Danish, Dutch, English, Finnish, French, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Lithuanian, Nepali, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Tamil, Turkish.
To select your preferred language, simply use <language>.
- arlstem (Arabic)
- arlstem2 (Arabic)
- cistem (German)
- isri (Arabic)
- krovetz (English)
- lancaster (English)
- porter (English)
- rslp (Portuguese)
Tokenizers
Tokenizers divide a string into smaller units, such as words.
retriv supports several tokenizers; whitespace is the default.
Stop-word Lists
retriv supports stop-word lists for the following languages: Arabic, Azerbaijani, Basque, Bengali, Catalan, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hinglish, Hungarian, Indonesian, Italian, Kazakh, Nepali, Norwegian, Portuguese, Romanian, Russian, Slovene, Spanish, Swedish, Tajik, and Turkish.
Automatic Spell Correction
retriv provides automatic spell correction through Hunspell for 92 languages. Choose your preferred language from the available Hunspell dictionaries and pass its code (e.g., Italian → "dictionary-it" → use "it"). For some languages you can pass the name directly: Danish, Dutch, English, Finnish, French, German, Greek, Hungarian, Italian, Portuguese, Romanian, Russian, Spanish, and Swedish.
NOTE: Automatic spell correction is disabled by default. It can introduce artifacts that degrade retrieval performance when documents are free from misspellings. If possible, check whether it improves retrieval performance for your specific document collection.
🔌 Installation
pip install retriv
💡 Usage
Minimal Working Example
from retriv import SearchEngine
collection = [
{"id": "doc_1", "text": "Generals gathered in their masses"},
{"id": "doc_2", "text": "Just like witches at black masses"},
{"id": "doc_3", "text": "Evil minds that plot destruction"},
{"id": "doc_4", "text": "Sorcerer of death's construction"},
]
se = SearchEngine("new-index")
se.index(collection)
se.search("witches masses")
Output:
[
{
"id": "doc_2",
"text": "Just like witches at black masses",
"score": 1.7536403
},
{
"id": "doc_1",
"text": "Generals gathered in their masses",
"score": 0.6931472
}
]
Create index from file
You can index a document collection from a JSONl, CSV, or TSV file.
CSV and TSV files must have a header.
File kind is automatically inferred.
Use the callback parameter to pass a function that converts your documents, on the fly, into the format supported by retriv.
Indexes are automatically saved.
This is the preferred way of creating indexes as it has a low memory footprint.
from retriv import SearchEngine
se = SearchEngine("new-index")
se.index_file(
    path="path/to/collection",  # File kind is automatically inferred
    show_progress=True,         # Default value
    callback=lambda doc: {      # Callback defaults to None
        "id": doc["id"],
        "text": doc["title"] + "\n" + doc["body"],
    },
)
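A named function can stand in for the lambda when the conversion grows more complex. The sketch below writes a tiny JSONl file and applies such a function to each record, which is roughly what a callback does during indexing. The function name `to_retriv_doc`, the sample titles, and the file name are hypothetical; the `title` and `body` fields follow the example above.

```python
import json
import os
import tempfile


def to_retriv_doc(doc):
    """Map a raw record with `title` and `body` to the id/text layout."""
    return {"id": doc["id"], "text": doc["title"] + "\n" + doc["body"]}


raw_docs = [
    {"id": "doc_1", "title": "Line 1", "body": "Generals gathered in their masses"},
    {"id": "doc_2", "title": "Line 2", "body": "Just like witches at black masses"},
]

# Write one JSON object per line (JSONl).
path = os.path.join(tempfile.mkdtemp(), "collection.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for doc in raw_docs:
        f.write(json.dumps(doc) + "\n")

# Read the file back line by line and convert each record.
with open(path, encoding="utf-8") as f:
    converted = [to_retriv_doc(json.loads(line)) for line in f]

# With retriv installed, the same function could be passed as the callback:
# se.index_file(path=path, callback=to_retriv_doc)
```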
se = SearchEngine("new-index") is equivalent to:
se = SearchEngine(
index_name="new-index", # Default value
min_df=1, # Min doc-frequency. Defaults to 1.
tokenizer="whitespace", # Default value
stemmer="english", # Default value (Snowball English)
stopwords="english", # Default value
spell_corrector=None, # Default value
do_lowercasing=True, # Default value
do_ampersand_normalization=True, # Default value
do_special_chars_normalization=True, # Default value
do_acronyms_normalization=True, # Default value
do_punctuation_removal=True, # Default value
)
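To give a feel for what options such as do_lowercasing, do_punctuation_removal, tokenizer="whitespace", and stopwords control, here is a rough sketch of a comparable text pipeline. The `preprocess` function, its parameter names, and the tiny stop-word set are assumptions made for illustration; retriv's real pipeline also applies stemming and the other normalizations listed above, and its exact behavior may differ.

```python
import string

# Tiny illustrative stop-word set; real lists are much longer.
STOPWORDS = {"in", "their", "at", "just", "like", "that", "of"}
# Map every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text, lowercase=True, remove_punctuation=True, stopwords=STOPWORDS):
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stop-words."""
    if lowercase:
        text = text.lower()
    if remove_punctuation:
        text = text.translate(PUNCT_TABLE)
    tokens = text.split()
    return [t for t in tokens if t not in stopwords]


tokens = preprocess("Just like witches at black masses!")
# tokens == ["witches", "black", "masses"]
```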
Create index from list
collection = [
{"id": "doc_1", "title": "...", "body": "..."},
{"id": "doc_2", "title": "...", "body": "..."},
{"id": "doc_3", "title": "...", "body": "..."},
{"id": "doc_4", "title": "...", "body": "..."},
]
se = SearchEngine(...)
se.index(
    collection,
    show_progress=True,     # Default value
    callback=lambda doc: {  # Callback defaults to None
        "id": doc["id"],
        "text": doc["title"] + "\n" + doc["body"],
    },
)
Load / Delete index
from retriv import SearchEngine
se = SearchEngine.load("index-name")
SearchEngine.delete("index-name")
Search
se.search(
query="witches masses",
return_docs=True, # Default value
cutoff=100, # Default value, number of results to return
)
Output:
[
{
"id": "doc_2",
"text": "Just like witches at black masses",
"score": 1.7536403
},
{
"id": "doc_1",
"text": "Generals gathered in their masses",
"score": 0.6931472
}
]
Multi-Search
se.msearch(
queries=[{"id": "q_1", "text": "witches masses"}, ...],
cutoff=100, # Default value, number of results
)
Output:
{
"q_1": {
"doc_2": 1.7536403,
"doc_1": 0.6931472
},
...
}
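The multi-search output shown above is a nested dictionary that maps each query id to a {doc_id: score} dictionary. A small helper, here called `top_k` (a hypothetical name, not part of retriv), can turn it into ranked lists of document ids:

```python
def top_k(results, k=1):
    """Return the k highest-scoring doc ids for each query id."""
    ranked = {}
    for q_id, doc_scores in results.items():
        # Sort (doc_id, score) pairs by score, highest first.
        ordered = sorted(doc_scores.items(), key=lambda kv: kv[1], reverse=True)
        ranked[q_id] = [doc_id for doc_id, _ in ordered[:k]]
    return ranked


results = {"q_1": {"doc_2": 1.7536403, "doc_1": 0.6931472}}
best = top_k(results, k=1)
# best == {"q_1": ["doc_2"]}
```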
AutoTune
Use the AutoTune function to tune BM25 parameters w.r.t. your document collection and queries.
All metrics supported by ranx are supported by the autotune function.
se.autotune(
queries=[{ "q_id": "q_1", "text": "...", ... }], # Train queries
qrels=[{ "q_1": { "doc_1": 1, ... }, ... }], # Train qrels
metric="ndcg", # Default value, metric to maximize
n_trials=100, # Default value, number of trials
cutoff=100, # Default value, number of results
)
At the end of the process, the best parameter configuration is automatically applied to the SearchEngine instance and saved to disk.
You can see what the configuration is by printing se.hyperparams.
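Conceptually, tuning repeats three steps: propose a parameter configuration, evaluate it on the training queries and qrels, and keep the best one. The sketch below shows that loop with plain random search swapped in for Optuna, which uses more sophisticated samplers. Everything here is an assumption made for illustration: the names `tune` and `toy_metric`, the parameter names `k1` and `b` (the conventional BM25 parameters), their search ranges, and the toy objective that replaces a real metric computation.

```python
import random


def tune(evaluate, n_trials=100, seed=0):
    """Random search over (k1, b); returns the best config and its score."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {"k1": rng.uniform(0.0, 3.0), "b": rng.uniform(0.0, 1.0)}
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_metric(params):
    # Stand-in for "run the queries, score the rankings against qrels".
    # Peaks at k1=1.2, b=0.75 purely for demonstration.
    return -((params["k1"] - 1.2) ** 2) - (params["b"] - 0.75) ** 2


best_params, best_score = tune(toy_metric, n_trials=200)
```

In a real setting, `evaluate` would run the search engine with the proposed parameters and return the chosen metric, such as NDCG, over the training queries.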
Speed Comparison
We performed a speed test, comparing retriv to rank_bm25, a popular BM25 implementation in Python, and pyserini, a Python binding to the Lucene search engine.
We relied on the MSMARCO Passage dataset to collect documents and queries. Specifically, we used the original document collection and three sub-samples of it, accounting for 1k, 100k, and 1M documents, respectively, and sampled 1k queries from the original ones. We computed the top-100 results with each library (if possible). Results are reported below. Best results are highlighted in boldface.
Library | Collection Size | Elapsed Time | Avg. Query Time | Throughput (q/s) |
---|---|---|---|---|
rank_bm25 | 1,000 | 646ms | 0.6ms | 1548/s |
pyserini | 1,000 | 1,438ms | 1.4ms | 695/s |
retriv | 1,000 | 140ms | 0.1ms | 7143/s |
retriv (multi-search) | 1,000 | 134ms | 0.1ms | 7463/s |
rank_bm25 | 100,000 | 106,000ms | 1060ms | 1/s |
pyserini | 100,000 | 2,532ms | 2.5ms | 395/s |
retriv | 100,000 | 314ms | 0.3ms | 3185/s |
retriv (multi-search) | 100,000 | 256ms | 0.3ms | 3906/s |
rank_bm25 | 1,000,000 | N/A | N/A | N/A |
pyserini | 1,000,000 | 4,060ms | 4.1ms | 246/s |
retriv | 1,000,000 | 1,018ms | 1.0ms | 982/s |
retriv (multi-search) | 1,000,000 | 503ms | 0.5ms | 1988/s |
rank_bm25 | 8,841,823 | N/A | N/A | N/A |
pyserini | 8,841,823 | 12,245ms | 12.2ms | 82/s |
retriv | 8,841,823 | 10,763ms | 10.8ms | 93/s |
retriv (multi-search) | 8,841,823 | 5,644ms | 5.6ms | 177/s |
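The three timing columns are related: with 1k queries, an elapsed time of 1,018ms corresponds to about 1.0ms per query and about 982 queries per second. The sketch below shows one way such numbers could be measured. It is not the authors' benchmark script; `benchmark` is a hypothetical helper, and the search function is a trivial stand-in.

```python
import time


def benchmark(search_fn, queries):
    """Time `search_fn` over `queries` and derive the table's three columns."""
    start = time.perf_counter()
    for q in queries:
        search_fn(q)
    elapsed_s = time.perf_counter() - start
    n = len(queries)
    return {
        "elapsed_ms": elapsed_s * 1000,
        "avg_query_ms": elapsed_s * 1000 / n,
        # Guard against a zero reading from the timer.
        "throughput_qps": n / elapsed_s if elapsed_s > 0 else float("inf"),
    }


# Trivial stand-in for a real search call.
stats = benchmark(lambda q: sorted(q.split()), ["witches masses"] * 1000)
```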
🎁 Feature Requests
Would you like to see other features implemented? Please open a feature request.
🤘 Want to contribute?
Would you like to contribute? Please drop me an e-mail.
📄 License
retriv is open-source software licensed under the MIT license.