retriv: A Python Search Engine for Humans.

🔥 News

  • [August 23, 2023] retriv 0.2.2 is out!
    This release adds experimental support for multi-field documents and filters. Please refer to the Advanced Retriever documentation.

  • [February 18, 2023] retriv 0.2.0 is out!
    This release adds support for Dense and Hybrid Retrieval. Dense Retrieval leverages the semantic similarity of the queries' and documents' vector representations, which can be computed directly by retriv or imported from other sources. Hybrid Retrieval mixes the results of traditional retrieval, informally called Sparse Retrieval, with those of Dense Retrieval to further improve retrieval effectiveness. As the library was almost completely rewritten, indices built with previous versions are no longer supported.

⚡️ Introduction

retriv is a user-friendly and efficient search engine implemented in Python supporting Sparse (traditional search with BM25, TF-IDF), Dense (semantic search) and Hybrid retrieval (a mix of Sparse and Dense Retrieval). It allows you to build a search engine in a single line of code.

retriv is built upon Numba for high-speed vector operations and automatic parallelization, PyTorch and Transformers for easy access and usage of Transformer-based Language Models, and Faiss for approximate nearest neighbor search. In addition, it provides automatic tuning functionalities to allow you to tune its internal components with minimal intervention.

✨ Main Features

Retrievers

Unified Search Interface

All the supported retrievers share the same search interface:

  • search: standard search functionality, what you expect from a search engine.
  • msearch: computes the results for multiple queries at once. It leverages automatic parallelization whenever possible.
  • bsearch: similar to msearch, but it automatically splits the queries into batches and can write the search results to disk incrementally in JSONL format. bsearch is handy for computing results for hundreds of thousands or even millions of queries without hogging your RAM. Pre-computed results can be leveraged for negative sampling during the training of Neural Models for Information Retrieval.
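
The batching idea behind bsearch can be sketched in plain Python. This is only an illustration of the pattern (split the queries into batches, search each batch, append one JSON line per query), not retriv's implementation. The names `batched_search_to_jsonl` and `toy_search_many`, the word-overlap scoring, and the output record layout are all assumptions made for this example.

```python
import json
import os
import tempfile
from typing import Callable, Dict, Iterator, List

Query = Dict[str, str]
Results = Dict[str, Dict[str, float]]

# Hypothetical two-document collection, used only by the toy search function below.
DOCS = {
    "doc_1": "generals gathered in their masses",
    "doc_2": "just like witches at black masses",
}


def toy_search_many(batch: List[Query]) -> Results:
    """Stand-in for a multi-query search: scores documents by word overlap."""
    results: Results = {}
    for query in batch:
        terms = set(query["text"].lower().split())
        scores = {}
        for doc_id, text in DOCS.items():
            overlap = len(terms & set(text.split()))
            if overlap > 0:
                scores[doc_id] = float(overlap)
        results[query["id"]] = scores
    return results


def make_batches(queries: List[Query], batch_size: int) -> Iterator[List[Query]]:
    """Yield consecutive slices of at most batch_size queries."""
    for start in range(0, len(queries), batch_size):
        yield queries[start:start + batch_size]


def batched_search_to_jsonl(
    search_many: Callable[[List[Query]], Results],
    queries: List[Query],
    path: str,
    batch_size: int = 2,
) -> int:
    """Search batch by batch and append each query's results as one JSON line.

    Only one batch of results is held in memory at a time.
    Returns the number of lines written.
    """
    written = 0
    with open(path, "w", encoding="utf-8") as f:
        for batch in make_batches(queries, batch_size):
            for query_id, doc_scores in search_many(batch).items():
                f.write(json.dumps({"id": query_id, "results": doc_scores}) + "\n")
                written += 1
    return written


queries = [
    {"id": "q_1", "text": "witches masses"},
    {"id": "q_2", "text": "generals"},
    {"id": "q_3", "text": "destruction"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "results.jsonl")
    n_written = batched_search_to_jsonl(toy_search_many, queries, path, batch_size=2)
    with open(path, encoding="utf-8") as f:
        lines = [json.loads(line) for line in f]
```

Because results are flushed to disk after every batch, memory use depends on the batch size rather than on the total number of queries.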

AutoTune

retriv automatically tunes the Faiss configuration for approximate nearest neighbor search by leveraging AutoFaiss to guarantee a 10 ms response time based on your available hardware. Moreover, it offers an automatic tuning functionality for BM25's parameters, which requires minimal user intervention. Under the hood, retriv leverages Optuna, a hyperparameter optimization framework, and ranx, an Information Retrieval evaluation library, to test several parameter configurations for BM25 and choose the best one. Finally, it can automatically balance the importance of the lexical and semantic relevance scores computed by the Hybrid Retriever to maximize retrieval effectiveness.
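
To make "balancing lexical and semantic relevance scores" concrete, here is a minimal sketch of one common fusion scheme: min-max normalize each score list, then take a weighted sum. Whether retriv normalizes and merges scores this way is an assumption. The function names, the `alpha` weight, and the score values below are made up for illustration.

```python
from typing import Dict, List, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; if all scores are equal, map them to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse(
    sparse: Dict[str, float],
    dense: Dict[str, float],
    alpha: float = 0.5,
) -> List[Tuple[str, float]]:
    """Weighted sum of normalized scores.

    alpha weights the lexical (sparse) side, 1 - alpha the semantic (dense) side.
    A document missing from one list contributes 0 for that side.
    """
    s, d = min_max(sparse), min_max(dense)
    fused = {
        doc_id: alpha * s.get(doc_id, 0.0) + (1 - alpha) * d.get(doc_id, 0.0)
        for doc_id in set(s) | set(d)
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


# Illustrative (made-up) scores from a sparse and a dense retriever.
sparse_scores = {"doc_2": 1.75, "doc_1": 0.69}
dense_scores = {"doc_1": 0.9, "doc_2": 0.8, "doc_3": 0.4}

ranking = fuse(sparse_scores, dense_scores, alpha=0.5)
```

Tuning then amounts to searching for the `alpha` that maximizes an evaluation metric on queries with known relevance judgments.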

📚 Documentation

🔌 Requirements

python>=3.8

💾 Installation

pip install retriv

💡 Minimal Working Example

# Note: SearchEngine is an alias for the SparseRetriever
from retriv import SearchEngine

collection = [
  {"id": "doc_1", "text": "Generals gathered in their masses"},
  {"id": "doc_2", "text": "Just like witches at black masses"},
  {"id": "doc_3", "text": "Evil minds that plot destruction"},
  {"id": "doc_4", "text": "Sorcerer of death's construction"},
]

se = SearchEngine("new-index").index(collection)

se.search("witches masses")

Output:

[
  {
    "id": "doc_2",
    "text": "Just like witches at black masses",
    "score": 1.7536403
  },
  {
    "id": "doc_1",
    "text": "Generals gathered in their masses",
    "score": 0.6931472
  }
]
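
To see why doc_2 outranks doc_1 for this query, the following from-scratch BM25 sketch scores the same collection. It is a simplified illustration, not retriv's code: it assumes lowercase whitespace tokenization (no stemming or stop-word removal), the IDF variant ln(1 + (N - df + 0.5) / (df + 0.5)), and the common defaults k1 = 1.2 and b = 0.75. The name `bm25_rank` is hypothetical, and the scores it produces are not guaranteed to match the output above.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def bm25_rank(
    collection: List[Dict[str, str]],
    query: str,
    k1: float = 1.2,
    b: float = 0.75,
) -> List[Tuple[str, float]]:
    """Return (doc_id, score) pairs for matching documents, best first."""
    docs = {doc["id"]: tokenize(doc["text"]) for doc in collection}
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs

    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in docs.values() for term in set(tokens))

    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates with k1; b penalizes long documents.
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


collection = [
    {"id": "doc_1", "text": "Generals gathered in their masses"},
    {"id": "doc_2", "text": "Just like witches at black masses"},
    {"id": "doc_3", "text": "Evil minds that plot destruction"},
    {"id": "doc_4", "text": "Sorcerer of death's construction"},
]

ranking = bm25_rank(collection, "witches masses")
```

doc_2 matches both query terms, including the rarer "witches", while doc_1 matches only "masses", so doc_2 ranks first; doc_3 and doc_4 match neither term and are not returned.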

🎁 Feature Requests

Would you like to see other features implemented? Please open a feature request.

🤘 Want to contribute?

Would you like to contribute? Please drop me an e-mail.

📄 License

retriv is open-source software licensed under the MIT license.
