
retriv: A Blazing-Fast Python Search Engine.


⚡️ Introduction

retriv is a fast search engine implemented in Python, leveraging Numba for high-speed vector operations and automatic parallelization. It offers a user-friendly interface to index and search your document collection and allows you to automatically tune the underlying retrieval model, BM25.

✨ Features

Stemmers

Stemmers reduce words to their word stem, base or root form.
retriv supports the following stemmers:

  • snowball (default)
    The Snowball stemmer supports the following languages: Arabic, Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish. To select your preferred language, simply use <language> (for example, stemmer="english", as in the usage example).
  • arlstem (Arabic)
  • arlstem2 (Arabic)
  • cistem (German)
  • isri (Arabic)
  • krovetz (English)
  • lancaster (English)
  • porter (English)
  • rslp (Portuguese)

Tokenizers

Tokenizers divide a string into smaller units, such as words.
retriv supports the following tokenizers:

Stop-word Lists

retriv supports stop-word lists for the following languages: Arabic, Azerbaijani, Basque, Bengali, Catalan, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hinglish, Hungarian, Indonesian, Italian, Kazakh, Nepali, Norwegian, Portuguese, Romanian, Russian, Slovene, Spanish, Swedish, Tajik, and Turkish.
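
To make the three preprocessing steps above concrete, here is a minimal, self-contained sketch that tokenizes on whitespace, drops stop words, and stems the remaining tokens. It does not use retriv itself. The stop-word set is a tiny hypothetical sample, and `toy_stem` is a crude plural stripper standing in for a real stemmer such as Snowball, so the tokens retriv actually produces may differ.

```python
# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"in", "their", "at", "just", "that", "of"}


def tokenize(text):
    """Whitespace tokenization: lowercase, then split on runs of whitespace."""
    return text.lower().split()


def remove_stop_words(tokens):
    """Drop every token that appears in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def toy_stem(token):
    """Crude plural stripping for illustration; NOT the Snowball algorithm."""
    if token.endswith(("ches", "shes", "sses", "xes")):
        return token[:-2]
    if token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def preprocess(text):
    """Tokenize, remove stop words, then stem what is left."""
    return [toy_stem(t) for t in remove_stop_words(tokenize(text))]


print(preprocess("Just like witches at black masses"))
# ['like', 'witch', 'black', 'mass']
```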

Retrieval Models

  • BM25
  • More coming soon...
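
For readers unfamiliar with BM25, the sketch below implements a common textbook formulation in plain Python, using the same `b` and `k1` defaults that appear in the usage section. It is an illustration only: the IDF variant and the `bm25_scores` helper are assumptions, and retriv's Numba-based implementation may use a different formula and produce different scores.

```python
import math
from collections import Counter


def bm25_scores(query, docs, b=0.75, k1=1.2):
    """Score every tokenized document in `docs` against a tokenized `query`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for doc in docs:
        tf = Counter(doc)
        # Length normalization: b controls how strongly long documents are penalized.
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Assumed IDF variant (always non-negative); other variants exist.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # k1 controls how quickly repeated occurrences of a term saturate.
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


# Hand-preprocessed toy documents (hypothetical tokens, one list per document).
docs = [
    ["general", "gather", "mass"],
    ["like", "witch", "black", "mass"],
    ["evil", "mind", "plot", "destruct"],
    ["sorcerer", "death", "construct"],
]
scores = bm25_scores(["witch", "mass"], docs)
```

With this toy data, the second document matches both query terms and ranks first, the first document matches one term, and the remaining two score zero.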

AutoTune

retriv supports an automatic tuning functionality that allows you to tune BM25's parameters with a single function call. Under the hood, retriv leverages Optuna, a hyperparameter optimization framework, and ranx, an Information Retrieval evaluation library, to test several parameter configurations for BM25 and choose the best one.
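
The idea of testing several parameter configurations and keeping the best one can be sketched with a plain random search. This is not how Optuna works internally (it offers more sophisticated samplers), and the `random_search` helper, the search ranges, and the `toy_metric` objective are all hypothetical. In a real setting, `evaluate` would run the training queries with the given `b` and `k1` and return a metric such as NDCG computed against the qrels.

```python
import random


def random_search(evaluate, n_trials=100, seed=0):
    """Try random (b, k1) pairs and keep the one with the highest metric value."""
    rng = random.Random(seed)
    best_params, best_value = None, float("-inf")
    for _ in range(n_trials):
        # Assumed search ranges; choose ranges that suit your collection.
        params = {"b": rng.uniform(0.0, 1.0), "k1": rng.uniform(0.0, 10.0)}
        value = evaluate(**params)
        if value > best_value:
            best_params, best_value = params, value
    return best_params, best_value


def toy_metric(b, k1):
    # Stand-in objective peaking at b=0.6, k1=1.5 (made up for this demo).
    return -((b - 0.6) ** 2) - ((k1 - 1.5) ** 2)


best_params, best_value = random_search(toy_metric, n_trials=200)
print(best_params, best_value)
```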

🔌 Installation

pip install retriv

💡 Usage

Create index

from retriv import SearchEngine

collection = [
  {"id": "doc_1", "contents": "Generals gathered in their masses"},
  {"id": "doc_2", "contents": "Just like witches at black masses"},
  {"id": "doc_3", "contents": "Evil minds that plot destruction"},
  {"id": "doc_4", "contents": "Sorcerer of death's construction"},
]

se = SearchEngine(
  index_name="new-index",  # Default value
  min_term_freq=1,         # Default value
  tokenizer="whitespace",  # Default value
  stemmer="english",       # Default value (Snowball English stemmer)
  sw_list="english",       # Default value
)

se.index(
  collection=collection,
  show_progress=True,     # Default value
)

Alternatively, you can index a document collection from a JSONL, CSV, or TSV file. CSV and TSV files must have a header. Use the callback parameter to pass a function that converts your documents into the format supported by retriv.

se = SearchEngine("index-from-file")
se.index_file(
  path="path/to/collection",
  show_progress=True,     # Default value
  callback=None,          # Default value
)
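
As an illustration of the callback parameter, the sketch below converts a raw record into the "id" / "contents" shape used in the collection example above. It assumes the callback is called once per parsed record and receives it as a dict. The function name and the source field names ("doc_id", "title", "body") are hypothetical.

```python
def to_retriv_format(record):
    """Map a raw record to the {"id", "contents"} shape shown in this README."""
    return {
        "id": record["doc_id"],
        # Concatenate title and body so both become searchable text.
        "contents": f'{record["title"]}. {record["body"]}',
    }


raw = {"doc_id": "doc_1", "title": "War Pigs", "body": "Generals gathered in their masses"}
print(to_retriv_format(raw))
```

Under that assumption, the function would be passed as callback=to_retriv_format in the se.index_file call.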

Search

se.search(
  query="witches masses",
  return_docs=True,  # Default value
  b=0.75,            # Default value, BM25 parameter
  k1=1.2,            # Default value, BM25 parameter
  n_res=100,         # Default value, number of results
)

Output:

[
  {
    "id": "doc_2",
    "contents": "Just like witches at black masses",
    "score": 1.7536403
  },
  {
    "id": "doc_1",
    "contents": "Generals gathered in their masses",
    "score": 0.6931472
  }
]
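
Because the results are a plain list of dictionaries, they can be post-processed with ordinary Python. The short sketch below copies the output above into a `results` variable and extracts the ranked ids plus an id-to-score mapping. The `min_score` threshold is a hypothetical value chosen for the demo.

```python
# Results copied from the output shown above.
results = [
    {"id": "doc_2", "contents": "Just like witches at black masses", "score": 1.7536403},
    {"id": "doc_1", "contents": "Generals gathered in their masses", "score": 0.6931472},
]

# Ranked document ids, best match first.
ranked_ids = [r["id"] for r in results]

# Lookup table from document id to score.
score_by_id = {r["id"]: r["score"] for r in results}

# Keep only results above a (hypothetical) score threshold.
min_score = 1.0
strong_matches = [r["id"] for r in results if r["score"] >= min_score]

print(ranked_ids, strong_matches)
# ['doc_2', 'doc_1'] ['doc_2']
```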

AutoTune

Use the AutoTune function to tune BM25's parameters with respect to your document collection and queries. The autotune function supports every metric that ranx supports.

best_params = se.autotune(
    queries=[{ "q_id": "q_1", "text": "...", ... }],  # Train queries
    qrels=[{ "q_1": { "doc_1": 1, ... }, ... }],      # Train qrels
    metric="ndcg@100",  # Default value, metric to maximize
    n_trials=100,       # Default value, number of trials
    n_res=100,          # Default value, number of results
)

Search using the best parameter configuration:

results = se.search(query, **best_params)

🎁 Feature Requests

Would you like to see other features implemented? Please open a feature request.

🤘 Want to contribute?

Would you like to contribute? Please drop me an e-mail.

📄 License

retriv is open-source software licensed under the MIT license.
