retriv

retriv: A Blazing-Fast Python Search Engine.

Intended Audience
- Science/Research
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3
Topic
- Text Processing :: General

Project description

⚡️ Introduction

retriv is a fast search engine implemented in Python, leveraging Numba for high-speed vector operations and automatic parallelization. It offers a user-friendly interface to index and search your document collection and allows you to automatically tune the underlying retrieval model, BM25.

✨ Features

Stemmers

Stemmers reduce words to their word stem, base or root form.
retriv supports the following stemmers:

snowball (default)
The Snowball stemmer supports the following languages: Arabic, Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish. To select your preferred language, simply use <language> as the stemmer value (for example, stemmer="english").
arlstem (Arabic)
arlstem2 (Arabic)
cistem (German)
isri (Arabic)
krovetz (English)
lancaster (English)
porter (English)
rslp (Portuguese)

Tokenizers

Tokenizers divide a string into smaller units, such as words.
retriv supports several tokenizers; the default is whitespace (see the tokenizer parameter under Usage).

Stop-word Lists

retriv supports stop-word lists for the following languages: Arabic, Azerbaijani, Basque, Bengali, Catalan, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hinglish, Hungarian, Indonesian, Italian, Kazakh, Nepali, Norwegian, Portuguese, Romanian, Russian, Slovene, Spanish, Swedish, Tajik, and Turkish.
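To make the tokenizer and stop-word steps concrete, here is a minimal pure-Python sketch of whitespace tokenization followed by stop-word filtering. It is not retriv's implementation: the `preprocess` helper and the tiny `STOP_WORDS` set are hypothetical and exist only for illustration, and stemming is left out.

```python
# Hypothetical, deliberately tiny stop-word set; real lists are much longer.
STOP_WORDS = {"in", "their", "at", "that", "of", "just"}


def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it on whitespace, and drop stop words."""
    tokens = text.lower().split()
    return [tok for tok in tokens if tok not in stop_words]


tokens = preprocess("Generals gathered in their masses")
print(tokens)  # ['generals', 'gathered', 'masses']
```

A stemmer would then map each remaining token to its stem before indexing.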

Retrieval Models

BM25
More coming soon...
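For readers unfamiliar with BM25, the sketch below scores documents with one common formulation of it, exposing the same b and k1 parameters that appear under Usage. It is an illustration only and makes several assumptions: plain lowercase whitespace tokens, no stemming or stop-word removal, and a hypothetical `bm25_scores` helper that is not part of retriv's API.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, b=0.75, k1=1.2):
    """Score every tokenized document against the query (one common BM25 variant)."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for d in docs_tokens for term in set(d))

    scores = []
    for doc in docs_tokens:
        term_freq = Counter(doc)
        # Length normalization: longer-than-average documents are penalized.
        norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query_tokens:
            tf = term_freq[term]
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * norm)
        scores.append(score)
    return scores


docs = [
    "Generals gathered in their masses",
    "Just like witches at black masses",
    "Evil minds that plot destruction",
    "Sorcerer of death's construction",
]
docs_tokens = [d.lower().split() for d in docs]
scores = bm25_scores("witches masses".split(), docs_tokens)
print([round(s, 4) for s in scores])  # [0.6931, 1.7536, 0.0, 0.0]
```

Under these assumptions the two non-zero scores agree with the example output shown in the Search section, although that alone does not confirm how retriv computes them internally.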

AutoTune

retriv supports an automatic tuning functionality that allows you to tune BM25's parameters with a single function call. Under the hood, retriv leverages Optuna, a hyperparameter optimization framework, and ranx, an Information Retrieval evaluation library, to test several parameter configurations for BM25 and choose the best one.
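The idea of testing several parameter configurations and keeping the best one can be sketched without any dependencies. The example below swaps in a plain grid search and a hand-written reciprocal-rank metric in place of Optuna and ranx, so it only illustrates the principle; `bm25_rank`, `reciprocal_rank`, and `grid_search` are hypothetical helpers, not retriv functions.

```python
import itertools
import math
from collections import Counter


def bm25_rank(query, docs, b, k1):
    """Return doc ids sorted by a common BM25 variant (lowercase whitespace tokens)."""
    tokens = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    avg_len = sum(len(t) for t in tokens.values()) / len(tokens)
    doc_freq = Counter(term for t in tokens.values() for term in set(t))
    scores = {}
    for doc_id, toks in tokens.items():
        tf = Counter(toks)
        norm = 1 - b + b * len(toks) / avg_len
        scores[doc_id] = sum(
            math.log(1 + (len(tokens) - doc_freq[q] + 0.5) / (doc_freq[q] + 0.5))
            * tf[q] * (k1 + 1) / (tf[q] + k1 * norm)
            for q in query.lower().split()
            if tf[q] > 0
        )
    return sorted(scores, key=scores.get, reverse=True)


def reciprocal_rank(ranking, relevant):
    """1 / position of the first relevant document, or 0.0 if none is found."""
    for position, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


def grid_search(queries, qrels, docs, b_values, k1_values):
    """Try every (b, k1) pair and keep the one with the best mean reciprocal rank."""
    best_params, best_score = None, -1.0
    for b, k1 in itertools.product(b_values, k1_values):
        score = sum(
            reciprocal_rank(bm25_rank(text, docs, b, k1), qrels[q_id])
            for q_id, text in queries.items()
        ) / len(queries)
        if score > best_score:  # Ties keep the earlier configuration.
            best_params, best_score = {"b": b, "k1": k1}, score
    return best_params, best_score


docs = {
    "doc_1": "Generals gathered in their masses",
    "doc_2": "Just like witches at black masses",
    "doc_3": "Evil minds that plot destruction",
    "doc_4": "Sorcerer of death's construction",
}
queries = {"q_1": "witches masses"}
qrels = {"q_1": {"doc_2": 1}}

best_params, best_score = grid_search(
    queries, qrels, docs, b_values=[0.25, 0.5, 0.75], k1_values=[0.8, 1.2, 1.6]
)
print(best_params, best_score)
```

On a collection this small every configuration ranks the relevant document first, so the first pair tried is kept; differences between configurations only show up on realistic collections and query sets.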

🔌 Installation

pip install retriv

💡 Usage

Create index

from retriv import SearchEngine

collection = [
  {"id": "doc_1", "contents": "Generals gathered in their masses"},
  {"id": "doc_2", "contents": "Just like witches at black masses"},
  {"id": "doc_3", "contents": "Evil minds that plot destruction"},
  {"id": "doc_4", "contents": "Sorcerer of death's construction"},
]

se = SearchEngine(
  index_name="new-index",  # Default value
  min_term_freq=1,         # Default value
  tokenizer="whitespace",  # Default value
  stemmer="english",       # Default value (Snowball English stemmer)
  sw_list="english",       # Default value
)

se.index(
  collection=collection,
  show_progress=True,     # Default value
)

Alternatively, you can index a document collection from a JSONL, CSV, or TSV file. CSV and TSV files must have a header. Use the callback parameter to pass a function that converts your documents into the format supported by retriv.

se = SearchEngine("index-from-file")
se.index_file(
  path="path/to/collection",
  show_progress=True,     # Default value
  callback=None,          # Default value
)
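As an illustration of what such a callback might look like, the sketch below converts records with different field names into the {"id", "contents"} shape used in the collection above. It assumes the callback receives one parsed record as a dictionary and returns the converted dictionary; the function name `to_retriv_format` and the raw field names (doc_id, title, body) are hypothetical.

```python
import json


def to_retriv_format(record):
    """Map a raw record to the {"id", "contents"} shape used in this document."""
    return {
        "id": record["doc_id"],
        "contents": f'{record["title"]}. {record["body"]}',
    }


# Two JSONL lines with hypothetical field names (doc_id, title, body).
raw_lines = [
    '{"doc_id": "doc_1", "title": "Line 1", "body": "Generals gathered in their masses"}',
    '{"doc_id": "doc_2", "title": "Line 2", "body": "Just like witches at black masses"}',
]

collection = [to_retriv_format(json.loads(line)) for line in raw_lines]
print(collection[0])
# {'id': 'doc_1', 'contents': 'Line 1. Generals gathered in their masses'}
```

Under the assumption stated above, the same function could be passed as callback=to_retriv_format when indexing from a file.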

Search

se.search(
  query="witches masses",
  return_docs=True,  # Default value
  b=0.75,            # Default value, BM25 parameter
  k1=1.2,            # Default value, BM25 parameter
  n_res=100,         # Default value, number of results
)

Output:

[
  {
    "id": "doc_2",
    "contents": "Just like witches at black masses",
    "score": 1.7536403
  },
  {
    "id": "doc_1",
    "contents": "Generals gathered in their masses",
    "score": 0.6931472
  }
]
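Results in this shape are easy to post-process. As a small sketch, the hypothetical helper `to_run` below turns the list shown above into a nested {query id: {document id: score}} dictionary, mirroring the qrels layout used in the AutoTune section; the query id "q_1" is an assumption made for the example.

```python
results = [
    {"id": "doc_2", "contents": "Just like witches at black masses", "score": 1.7536403},
    {"id": "doc_1", "contents": "Generals gathered in their masses", "score": 0.6931472},
]


def to_run(q_id, results):
    """Collect the scores of one query as {q_id: {doc_id: score}}."""
    return {q_id: {res["id"]: res["score"] for res in results}}


run = to_run("q_1", results)
print(run)  # {'q_1': {'doc_2': 1.7536403, 'doc_1': 0.6931472}}
```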

AutoTune

Use the autotune function to tune BM25's parameters with respect to your document collection and queries. All metrics supported by ranx are supported by autotune.

best_params = se.autotune(
    queries=[{ "q_id": "q_1", "text": "...", ... }],  # Train queries
    qrels=[{ "q_1": { "doc_1": 1, ... }, ... }],      # Train qrels
    metric="ndcg@100",  # Default value, metric to maximize
    n_trials=100,       # Default value, number of trials
    n_res=100,          # Default value, number of results
)

Search using the best parameter configuration:

results = se.search(query, **best_params)

🎁 Feature Requests

Would you like to see other features implemented? Please open a feature request.

🤘 Want to contribute?

Would you like to contribute? Please drop me an e-mail.

📄 License

retriv is open-source software licensed under the MIT license.

Release history

- 0.2.3: Aug 24, 2023
- 0.2.2: Aug 23, 2023
- 0.2.1: May 16, 2023
- 0.2.0: Feb 19, 2023
- 0.1.5: Jan 26, 2023
- 0.1.4: Dec 3, 2022
- 0.1.3: Dec 3, 2022
- 0.1.2: Dec 3, 2022
- 0.1.1: Nov 16, 2022 (this version)
- 0.1.0: Nov 16, 2022
