Haystack 2.x In-memory Document Store with Enhanced Efficiency

Better BM25 In-Memory Document Store

An in-memory document store is a great starting point for prototyping and debugging before migrating to production-grade stores like Elasticsearch. However, the original implementation of BM25 retrieval recreates an inverted index for the entire document store on every new search. Furthermore, the tokenization method is primitive, permitting only regular-expression-based splitting, which makes localization and domain adaptation challenging. This implementation is therefore a slight upgrade over the default BM25 in-memory document store: it adds incremental index updates and SentencePiece statistical sub-word tokenization.
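For orientation, the BM25+ scoring function behind the hyperparameters discussed below can be sketched in plain Python. This is an illustrative simplification with made-up names, not the actual implementation:

```python
import math
from collections import Counter

def bm25_plus_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75, delta=1.0):
    """Score one tokenized document against a query with BM25+.

    `corpus` is a list of tokenized documents, used for document
    frequencies and the average document length.
    """
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(term in d for d in corpus)  # document frequency
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        # Term-frequency component with length normalization.
        norm = tf[term] * (k1 + 1) / (
            tf[term] + k1 * (1 - b + b * len(doc_terms) / avgdl)
        )
        score += idf * (norm + delta)  # delta is the BM25+ lower bound
    return score
```

The key difference from plain BM25 is the `delta` term, which guarantees a lower bound on the contribution of any query term that occurs in the document at all.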

Installation

$ pip install bbm25-haystack

Alternatively, you can clone the repository and install it in editable mode so that changes to the source code are reflected immediately:

$ git clone https://github.com/Guest400123064/bbm25-haystack.git
$ cd bbm25-haystack
$ pip install -e .

Usage

The initializer takes three BM25+ hyperparameters, namely k1, b, and delta; a path to a trained SentencePiece tokenizer .model file; and a filtering-logic flag (see below). All parameters are optional. The default tokenizer is taken directly from the LLaMA-2-7B-32K tokenizer, with a vocabulary size of 32,000.

from haystack import Document
from bbm25_haystack import BetterBM25DocumentStore, BetterBM25Retriever


document_store = BetterBM25DocumentStore()
document_store.write_documents([
   Document(content="There are over 7,000 languages spoken around the world today."),
   Document(content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors."),
   Document(content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves.")
])

retriever = BetterBM25Retriever(document_store)
retriever.run(query="How many languages are spoken around the world today?")

Filtering Logic

By default, the current document store uses the document_matches_filter function shipped with Haystack to perform filtering, which behaves the same as InMemoryDocumentStore except that it does not support legacy operator names.

However, this implementation also ships an alternative filtering logic that is more conservative (and unstable at this point). To use it, initialize the document store with haystack_filter_logic=False. Please find comments and implementation details in filters.py. TL;DR:

  • Any comparison involving None (i.e., a missing value) always returns False, regardless of the document attribute value or the filter value.
  • Comparisons with DataFrame values are always prohibited, to reduce surprises.
  • No implicit datetime conversion from string values.
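The first and third rules can be sketched roughly as follows. This is a hypothetical re-implementation for illustration, not the actual code in filters.py:

```python
def safe_eq(doc_value, filter_value):
    # Missing values never match: any comparison involving None is False.
    if doc_value is None or filter_value is None:
        return False
    return doc_value == filter_value

def safe_gt(doc_value, filter_value):
    # Missing values never match, and there is no implicit conversion:
    # a date supplied as a string never compares against a real datetime.
    if doc_value is None or filter_value is None:
        return False
    if type(doc_value) is not type(filter_value):
        return False
    return doc_value > filter_value
```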

These differences lead to a few caveats. First, some test cases are overridden to account for the different expectations. This also means that passing, non-overridden tests may not be faithful: the filters could behave the same way as the old implementation in cases where different behavior is expected. Further, the negation logic needs to be reconsidered, because False can now result from either the input check or the actual comparison. That said, separating input processing from comparisons makes the filtering behavior more transparent.
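To see why negation needs care, consider a toy version of the conservative equality check (hypothetical names, for illustration only):

```python
def conservative_eq(doc_value, filter_value):
    # Input check: a missing value never matches anything.
    if doc_value is None or filter_value is None:
        return False
    return doc_value == filter_value

# Naively negating the combined result conflates "failed the input
# check" with "compared and differed":
doc_meta = {"lang": None}
naive_not_match = not conservative_eq(doc_meta["lang"], "en")
# naive_not_match ends up True, so a document with a *missing* field
# would satisfy a NOT filter -- likely not what the user intended.
# Negation therefore has to apply after the input check, not around it.
```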

Search Quality Evaluation

This repo ships a simple script to help evaluate search quality on the BEIR benchmark. You need to clone the repository (or manually download the script and place it under a folder named scripts), and you have to install an additional dependency to run it:

$ pip install beir

To run the script, you may want to specify the dataset name and BM25 hyperparameters. For example:

$ python scripts/benchmark_beir.py --datasets scifact arguana --bm25-k1 1.2 --output eval.csv

It automatically downloads the benchmarking datasets to benchmarks/beir, where benchmarks sits at the same level as scripts. You may also check the help page for more information:

$ python scripts/benchmark_beir.py --help

New benchmarking scripts are expected to be added in the future.

License

bbm25-haystack is distributed under the terms of the Apache-2.0 license.
