
Haystack 2.x In-memory Document Store with Enhanced Efficiency

Better BM25 In-Memory Document Store

An in-memory document store is a great starting point for prototyping and debugging before migrating to production-grade stores like Elasticsearch. However, the original implementation of BM25 retrieval rebuilds the inverted index for the entire document store on every search. Furthermore, its tokenization is primitive: it only permits splitters based on regular expressions, which makes localization and domain adaptation challenging. This implementation is therefore a slight upgrade to the default BM25 in-memory document store: it updates the index incrementally and incorporates SentencePiece statistical sub-word tokenization.
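To illustrate what an incremental index update means, here is a minimal sketch of an inverted index that is maintained on every write and delete, so a search never has to rebuild it. The class `IncrementalIndex` and its attributes are hypothetical names chosen for this example and are not taken from the package, and whitespace splitting stands in for the real tokenizer.

```python
from collections import Counter, defaultdict


class IncrementalIndex:
    """Toy inverted index updated on write instead of rebuilt on search.

    Hypothetical illustration only; not the package's actual data structure.
    """

    def __init__(self):
        self.postings = defaultdict(dict)  # token -> {doc_id: term frequency}
        self.doc_len = {}                  # doc_id -> number of tokens
        self.total_len = 0                 # running sum of document lengths

    def add(self, doc_id, tokens):
        if doc_id in self.doc_len:
            self.remove(doc_id)  # overwrite: drop the stale entry first
        for token, freq in Counter(tokens).items():
            self.postings[token][doc_id] = freq
        self.doc_len[doc_id] = len(tokens)
        self.total_len += len(tokens)

    def remove(self, doc_id):
        self.total_len -= self.doc_len.pop(doc_id)
        # Collect affected tokens first so the dict is not mutated mid-iteration.
        for token in [t for t, docs in self.postings.items() if doc_id in docs]:
            del self.postings[token][doc_id]
            if not self.postings[token]:
                del self.postings[token]

    @property
    def avg_doc_len(self):
        return self.total_len / len(self.doc_len) if self.doc_len else 0.0


index = IncrementalIndex()
index.add("d1", "over 7,000 languages are spoken".split())
index.add("d2", "elephants recognize themselves in mirrors".split())
index.remove("d1")
```

Because the document frequencies and the average document length are kept up to date as running statistics, the cost of a write is proportional to the size of the written document rather than to the size of the whole store.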

Installation

This package has not yet been published to PyPI. Please install the package directly from the main branch using:

pip install git+https://github.com/Guest400123064/bbm25-haystack.git@main

Usage

The initializer takes three BM25+ hyperparameters (k1, b, and delta) and a path to a trained SentencePiece tokenizer .model file. All parameters are optional. The default tokenizer is copied directly from a SentencePiece test tokenizer with a vocabulary size of 1000.
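As a rough guide to what the three hyperparameters control, the sketch below computes the contribution of a single query term to a single document under the textbook BM25+ formula. The function name `bm25_plus_term_score`, the IDF variant `log((N + 1) / df)`, and the default values shown are assumptions made for this example; the package may use a different IDF form and different defaults.

```python
import math


def bm25_plus_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq,
                         k1=1.5, b=0.75, delta=1.0):
    """Contribution of one query term to one document under BM25+.

    Hypothetical helper; defaults are illustrative, not the package's.
    """
    if tf == 0 or doc_freq == 0:
        return 0.0  # a term absent from the document contributes nothing
    idf = math.log((n_docs + 1) / doc_freq)             # assumed IDF variant
    length_norm = 1.0 - b + b * doc_len / avg_doc_len   # b: length penalty
    saturated_tf = tf * (k1 + 1.0) / (tf + k1 * length_norm)  # k1: saturation
    return idf * (saturated_tf + delta)                 # delta: lower bound
```

In this form, k1 controls how quickly repeated occurrences of a term stop adding to the score, b controls how strongly long documents are penalized, and delta guarantees that any matching term contributes at least `idf * delta`, no matter how long the document is.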

from haystack import Document
from bbm25_haystack import BetterBM25DocumentStore, BetterBM25Retriever


document_store = BetterBM25DocumentStore()
document_store.write_documents([
   Document(content="There are over 7,000 languages spoken around the world today."),
   Document(content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors."),
   Document(content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves.")
])

retriever = BetterBM25Retriever(document_store)
retriever.run(query="How many languages are spoken around the world today?")

Filtering Logic and Caveats

The filtering logic differs slightly from the default implementation shipped with Haystack. It may still change, and I am open to suggestions. Comments and implementation details are in filters.py. TL;DR:

  • Any comparison involving None, i.e., a missing value, always returns False, regardless of the document attribute value or the filter value.
  • Comparison with DataFrame is always prohibited to reduce surprises.
  • No implicit datetime conversion from string values.

These differences lead to a few caveats. First, some test cases are overridden to account for the different expectations. However, this means that non-overridden tests that pass may not be faithful: a filter may behave the same way as the old implementation in a case where different behavior is expected. Second, the negation logic needs to be reconsidered because False can now come from either the input check or the actual comparison. Still, I think separating input processing from comparison makes the filtering behavior more transparent.
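The negation caveat can be made concrete with a small sketch. All four functions below are hypothetical and are not taken from filters.py; they only model the rule that a comparison involving None returns False, and show two ways a NOT could then be interpreted.

```python
import operator


def is_comparable(doc_value, filter_value):
    """Input check: a missing value on either side is never comparable."""
    return doc_value is not None and filter_value is not None


def compare(op, doc_value, filter_value):
    """False if the input check fails, otherwise the actual comparison."""
    return is_comparable(doc_value, filter_value) and op(doc_value, filter_value)


def naive_not(op, doc_value, filter_value):
    """Negate the final result: a failed input check becomes a match."""
    return not compare(op, doc_value, filter_value)


def strict_not(op, doc_value, filter_value):
    """Negate only the comparison: a failed input check stays False."""
    return is_comparable(doc_value, filter_value) and not op(doc_value, filter_value)


# A document with a missing attribute, filtered with NOT (value > 3):
print(naive_not(operator.gt, None, 3))   # matches the document
print(strict_not(operator.gt, None, 3))  # does not match the document
```

Keeping the input check in its own function is what makes the two interpretations distinguishable at all; with a single function that returns False for both reasons, only the naive negation is possible.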

License

bbm25-haystack is distributed under the terms of the Apache-2.0 license.

