
LanceDB-backed datastore and retrievers for Haystack 2.X


LanceDB Haystack Document store

LanceDB-Haystack is an embedded, LanceDB-backed Document Store for Haystack 2.X.

Installation

Currently, the simplest way to get LanceDB-Haystack is to install it from PyPI via pip:

pip install lancedb-haystack

Usage

import pyarrow as pa
from lancedb_haystack import LanceDBDocumentStore
from lancedb_haystack import LanceDBEmbeddingRetriever, LanceDBFTSRetriever

# Declare the schema for the metadata fields; this lets us filter on them.
# See: https://arrow.apache.org/docs/python/api/datatypes.html
metadata_schema = pa.struct([
  ('title', pa.string()),    
  ('publication_date', pa.timestamp('s')),
  ('page_number', pa.int32()),
  ('topics', pa.list_(pa.string()))
])

# Create the DocumentStore
document_store = LanceDBDocumentStore(
  database='my_database', 
  table_name="documents", 
  metadata_schema=metadata_schema, 
  embedding_dims=384
)

# Create an embedding retriever
embedding_retriever = LanceDBEmbeddingRetriever(document_store)

# Create a Full Text Search retriever
fts_retriever = LanceDBFTSRetriever(document_store)

See also examples/pipeline-usage.ipynb for a full worked example.

Development

Test

You can use hatch to run the linters:

~$ hatch run lint:all
cmd [1] | ruff .
cmd [2] | black --check --diff .
All done! ✨ 🍰 ✨
6 files would be left unchanged.
cmd [3] | mypy --install-types --non-interactive src/lancedb_haystack tests
Success: no issues found in 6 source files

Similarly, to run the tests with coverage:

~$ hatch run cov
cmd [1] | coverage run -m pytest tests
...

Build

To build the package you can use hatch:

~$ hatch build
[sdist]
dist/lancedb_haystack-0.1.0.tar.gz

[wheel]
dist/lancedb_haystack-0.1.0-py3-none-any.whl

Document

To build the API docs, run the following:

~$ cd docs
~$ make clean
~$ make build

Roadmap

In no particular order:

  • Figure out if it's possible to have LanceDB work with dynamic metadata

    Currently, this implementation only supports metadata fields that are defined in the metadata_schema. It would be nice to infer a schema from the first document added or, even better, to accept arbitrary metadata without having to specify it all up front.

  • Expand the supported metadata types

    As noted under Limitations, the metadata requires a pyarrow schema. Not all pyarrow types have been tested, and some may not be supported. It would be good to try a few more and add support for those that don't work yet.
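To make the first roadmap idea more concrete, here is a rough, standard-library-only sketch of what inferring field types from the first document's metadata might look like. It is not part of LanceDB-Haystack: the helpers infer_type_name and infer_metadata_fields are hypothetical, and the type names they return are only labels chosen to resemble Arrow's, not pyarrow objects.

```python
from datetime import datetime
from typing import Any


def infer_type_name(value: Any) -> str:
    """Return an Arrow-style type label for a Python value (hypothetical helper)."""
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "bool"
    if isinstance(value, int):
        return "int64"
    if isinstance(value, float):
        return "float64"
    if isinstance(value, str):
        return "string"
    if isinstance(value, datetime):
        return "timestamp"
    if isinstance(value, list):
        # An empty list gives no hint about its element type.
        if not value:
            raise ValueError("cannot infer the element type of an empty list")
        return f"list<{infer_type_name(value[0])}>"
    if isinstance(value, dict):
        fields = ", ".join(f"{k}: {infer_type_name(v)}" for k, v in value.items())
        return f"struct<{fields}>"
    raise TypeError(f"unsupported metadata value: {value!r}")


def infer_metadata_fields(meta: dict) -> dict:
    """Map each metadata key to the type label inferred from its value."""
    return {name: infer_type_name(value) for name, value in meta.items()}


# Metadata of a made-up first document, using the same fields as the Usage example.
first_meta = {
    "title": "Intro to LanceDB",
    "publication_date": datetime(2024, 1, 15),
    "page_number": 3,
    "topics": ["search", "vectors"],
}
inferred = infer_metadata_fields(first_meta)
```

The sketch also shows why inference is fragile: empty lists and None values carry no type information, and a later document may contain fields the first one lacks.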

Limitations

The DocumentStore requires a pyarrow StructType to be specified as the schema for the metadata dict. The schema should cover every metadata field that may appear in any of the documents you want to store.

Currently, the system supports the basic datatypes (ints, floats, bools, strings, etc.) as well as structs and lists.
Others may work, but haven't been tested.
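Because every metadata field has to be declared up front, it can help to check documents against the declared field names before writing them. The following is a minimal sketch under stated assumptions: DECLARED_FIELDS is a hand-maintained mirror of the names in metadata_schema, and undeclared_fields is a hypothetical helper, not part of the library. How the store itself reacts to undeclared fields is not covered here.

```python
# Hypothetical mirror of the field names declared in metadata_schema (see Usage).
DECLARED_FIELDS = {"title", "publication_date", "page_number", "topics"}


def undeclared_fields(meta: dict, declared: set) -> list:
    """Return the metadata keys that are not in the declared schema, sorted."""
    return sorted(set(meta) - declared)


# Metadata that only uses declared fields; missing fields are not flagged.
ok_meta = {"title": "Intro", "page_number": 1}

# Metadata with two fields the schema does not know about.
bad_meta = {"title": "Intro", "author": "Ada", "isbn": "123"}

for meta in (ok_meta, bad_meta):
    extra = undeclared_fields(meta, DECLARED_FIELDS)
    if extra:
        print(f"not in metadata_schema: {extra}")
```

Running a check like this before writing documents surfaces schema gaps early, at which point you can either extend metadata_schema or drop the extra fields.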
