retriv: A Blazing-Fast Python Search Engine.
⚡️ Introduction
retriv is a fast search engine implemented in Python, leveraging Numba for high-speed vector operations and automatic parallelization. It offers a user-friendly interface to index and search your document collection and lets you automatically tune the underlying retrieval model, BM25.
✨ Features
Stemmers
Stemmers reduce words to their word stem, base or root form.
retriv supports the following stemmers:
- snowball (default)
The following languages are supported by Snowball Stemmer: Arabic, Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish. To select your preferred language, simply use <language>.
- arlstem (Arabic)
- arlstem2 (Arabic)
- cistem (German)
- isri (Arabic)
- krovetz (English)
- lancaster (English)
- porter (English)
- rslp (Portuguese)
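To make the idea of stemming concrete, here is a minimal sketch of a crude suffix-stripping function. It is not the Snowball algorithm or any other stemmer listed above; the function name `toy_stem` and its suffix rules are assumptions chosen only for illustration.

```python
# Toy suffix stripper: NOT one of the stemmers retriv supports,
# just an illustration of mapping inflected words to a shared stem.
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical rule set


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


stems = [toy_stem(w) for w in ["Generals", "gathered", "masses", "witches"]]
print(stems)
```

Real stemmers apply many more rules and exceptions, which is why picking a stemmer that matches the language of your collection matters.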
Tokenizers
Tokenizers divide a string into smaller units, such as words.
retriv supports the following tokenizers:
Stop-word Lists
retriv supports stop-word lists for the following languages: Arabic, Azerbaijani, Basque, Bengali, Catalan, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hinglish, Hungarian, Indonesian, Italian, Kazakh, Nepali, Norwegian, Portuguese, Romanian, Russian, Slovene, Spanish, Swedish, Tajik, and Turkish.
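As a rough illustration of what tokenization and stop-word removal do before indexing, the sketch below splits a string on whitespace and drops common words. The tiny `STOP_WORDS` set, the lowercasing step, and both function names are assumptions for this example, not retriv's actual lists or internals.

```python
# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"in", "their", "at", "that", "of"}


def whitespace_tokenize(text: str) -> list[str]:
    """Lowercase the text and split it on runs of whitespace."""
    return text.lower().split()


def remove_stop_words(tokens: list[str]) -> list[str]:
    """Keep only the tokens that are not in the stop-word list."""
    return [token for token in tokens if token not in STOP_WORDS]


tokens = whitespace_tokenize("Generals gathered in their masses")
kept = remove_stop_words(tokens)
print(tokens)
print(kept)
```

Removing very frequent words shrinks the index and keeps them from influencing the ranking.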
Retrieval Models
- BM25
- More coming soon...
AutoTune
retriv supports an automatic tuning functionality that allows you to tune BM25's parameters with a single function call. Under the hood, retriv leverages Optuna, a hyperparameter optimization framework, and ranx, an Information Retrieval evaluation library, to test several parameter configurations for BM25 and choose the best one.
🔌 Installation
pip install retriv
💡 Usage
Create index
from retriv import SearchEngine

collection = [
    {"id": "doc_1", "contents": "Generals gathered in their masses"},
    {"id": "doc_2", "contents": "Just like witches at black masses"},
    {"id": "doc_3", "contents": "Evil minds that plot destruction"},
    {"id": "doc_4", "contents": "Sorcerer of death's construction"},
]

se = SearchEngine(
    index_name="new-index",  # Default value
    min_term_freq=1,  # Default value
    tokenizer="whitespace",  # Default value
    stemmer="english",  # Default value (Snowball English stemmer)
    sw_list="english",  # Default value
)

se.index(
    collection=collection,
    show_progress=True,  # Default value
)
Alternatively, you can index a document collection from a JSONL, CSV, or TSV file.
CSV and TSV files must have a header.
Use the callback parameter to pass a function that converts your documents into the format supported by retriv.
se = SearchEngine("index-from-file")
se.index_file(
    path="path/to/collection",
    show_progress=True,  # Default value
    callback=None,  # Default value
)
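A callback might look like the sketch below. It assumes the callback receives one parsed record as a dict and returns a dict with the "id" and "contents" keys used in the collection example above. The function name `to_retriv_format` and the input field names `doc_id`, `title`, and `body` are hypothetical.

```python
import json


def to_retriv_format(record: dict) -> dict:
    """Map a hypothetical raw record onto the {"id", "contents"} shape."""
    return {
        "id": str(record["doc_id"]),
        "contents": f'{record["title"]}. {record["body"]}',
    }


# One line of a hypothetical JSONL file whose fields do not match retriv's format.
raw_line = '{"doc_id": 7, "title": "Example title", "body": "Example body text"}'
converted = to_retriv_format(json.loads(raw_line))
print(converted)
```

Under that assumption, the function would be passed as `callback=to_retriv_format` in the `index_file` call.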
Search
se.search(
    query="witches masses",
    return_docs=True,  # Default value
    b=0.75,  # Default value, BM25 parameter
    k1=1.2,  # Default value, BM25 parameter
    n_res=100,  # Default value, number of results
)
Output:
[
    {
        "id": "doc_2",
        "contents": "Just like witches at black masses",
        "score": 1.7536403
    },
    {
        "id": "doc_1",
        "contents": "Generals gathered in their masses",
        "score": 0.6931472
    }
]
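To show what the `b` and `k1` parameters control, here is a minimal pure-Python sketch of one common BM25 formulation. It is not retriv's Numba implementation and is not meant to reproduce the scores above. The function name `bm25_scores` and the pre-tokenized toy documents are assumptions for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score every tokenized document in `docs` against `query_terms`."""
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Document frequency: the number of documents containing each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        # b controls how strongly long documents are penalized.
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # k1 controls how quickly repeated occurrences saturate.
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


# Hypothetical documents, already tokenized and stripped of stop words.
docs = [
    ["generals", "gathered", "masses"],
    ["witches", "black", "masses"],
    ["evil", "minds", "plot", "destruction"],
    ["sorcerer", "deaths", "construction"],
]
scores = bm25_scores(["witches", "masses"], docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking, scores)
```

In this sketch the document matching both query terms ranks first, and documents matching neither term score zero.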
AutoTune
Use the AutoTune function to tune BM25 parameters with respect to your document collection and queries.
All metrics supported by ranx are supported by the autotune function.
best_params = se.autotune(
    queries=[{ "q_id": "q_1", "text": "...", ... }],  # Train queries
    qrels=[{ "q_1": { "doc_1": 1, ... }, ... }],  # Train qrels
    metric="ndcg@100",  # Default value, metric to maximize
    n_trials=100,  # Default value, number of trials
    n_res=100,  # Default value, number of results
)
Search using the best parameter configuration:
results = se.search(query, **best_params)
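For intuition about the kind of metric being maximized, the sketch below computes NDCG@k for a single query using one common formulation (linear gain, log2 discount). It is not ranx's implementation; the function names and the relevance judgments are hypothetical.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of a list of graded relevances."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )


def ndcg_at_k(ranked_doc_ids, judgments, k):
    """NDCG@k for one query, given its ranked results and judgments."""
    gains = [judgments.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal_gains = sorted(judgments.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal_gains)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical judgments for one query: both documents are relevant.
judgments = {"doc_2": 1, "doc_1": 1}
perfect = ndcg_at_k(["doc_2", "doc_1", "doc_3"], judgments, k=100)
partial = ndcg_at_k(["doc_3", "doc_2"], judgments, k=100)
print(perfect, partial)
```

A ranking that places all relevant documents first scores 1.0, while rankings that bury or miss them score lower, which gives the tuner a signal for comparing parameter configurations.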
🎁 Feature Requests
Would you like to see other features implemented? Please open a feature request.
🤘 Want to contribute?
Would you like to contribute? Please drop me an e-mail.
📄 License
retriv is open-source software licensed under the MIT license.