Baguetter

Baguetter is a flexible, efficient, and hackable search engine library implemented in Python. It's designed for quickly benchmarking, implementing, and testing new search methods. Baguetter supports sparse (traditional), dense (semantic), and hybrid retrieval methods.

Note: Baguetter is not built for production use-cases or scale. For such use-cases, please check out other search engine projects.

Paper: https://arxiv.org/abs/2408.06643

Features

  • Sparse retrieval using BM25 and BMX algorithms
  • Dense retrieval using embeddings
  • Hybrid retrieval combining sparse and dense methods
  • Customizable text preprocessing pipeline
  • Multi-threaded indexing and searching
  • Evaluation tools for benchmarking
  • Easy integration with HuggingFace datasets and models for sharing
  • Hackable interface to quickly implement new methods
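To make the hybrid retrieval item above concrete, here is a minimal sketch of one common way to merge a sparse and a dense ranking: reciprocal rank fusion (RRF). This is plain Python for illustration only. It is not Baguetter's API, and the fusion strategy Baguetter actually uses may differ. The function name `reciprocal_rank_fusion`, the constant `k=60`, and the example rankings are all assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one list of (doc_id, score).

    Each document receives 1 / (k + rank) from every ranking it appears in.
    `k` dampens the influence of top ranks; 60 is a commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by doc id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical rankings, best match first (not produced by Baguetter).
sparse_ranking = ["2", "4", "1"]  # e.g. from a lexical index
dense_ranking = ["4", "3", "2"]   # e.g. from an embedding index

fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
for doc_id, score in fused:
    print(doc_id, round(score, 4))
```

Documents that appear near the top of both lists end up first, which is the intuition behind combining lexical and semantic signals.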

Installation

pip install baguetter

Quick Start

from baguetter.indices import BMXSparseIndex

# Create an index
idx = BMXSparseIndex()

# Add documents
docs = [
  "We all love baguette and cheese",
  "Baguette is a great bread",
  "Cheese is a great source of protein",
  "Baguette is a great source of carbs",
]
doc_ids = ["1", "2", "3", "4"]

idx.add_many(doc_ids, docs, show_progress=True)

# Search
results = idx.search("quick fox")
print(results)

# Search many
results = idx.search_many(["quick fox", "baguette is great"])
print(results)
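For intuition about what a sparse index computes when it answers such a query, the following is a rough, self-contained sketch of textbook Okapi BM25 scoring over the same four documents. It is not Baguetter's implementation and it is not BMX. The whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, and the helper names are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase + whitespace split; real pipelines usually do more.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Smoothed IDF variant that is never negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturation with document length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "We all love baguette and cheese",
    "Baguette is a great bread",
    "Cheese is a great source of protein",
    "Baguette is a great source of carbs",
]

for query in ["quick fox", "baguette is great"]:
    print(query, [round(s, 3) for s in bm25_scores(query, docs)])
```

Under this sketch, a query such as "quick fox" shares no terms with the corpus, so every document scores zero, while "baguette is great" favors the short document that contains all three query terms.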

Evaluation

Baguetter includes tools for evaluating search performance on standard benchmarks:

from baguetter.evaluation import datasets, evaluate_retrievers
from baguetter.indices import BM25SparseIndex, BMXSparseIndex

results = evaluate_retrievers(datasets.mteb_datasets_small, {"bm25": BM25SparseIndex, "bmx": BMXSparseIndex})
results.save("eval_results")
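Retrieval benchmarks of this kind are typically summarized with ranking metrics such as nDCG@10. Which metrics `evaluate_retrievers` reports is not stated here, so the following is only a plain-Python sketch of how nDCG@k is usually defined, with a hypothetical relevance mapping `qrels` and hypothetical rankings. It does not use Baguetter or ranx.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevance grades."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(ranked_ids, qrels, k=10):
    """nDCG@k for one query.

    ranked_ids: doc ids in the order the retriever returned them.
    qrels: mapping of doc id -> graded relevance (missing ids count as 0).
    """
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_ids]
    ideal = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical judgments: documents "2" and "4" are relevant to the query.
qrels = {"2": 1, "4": 1}

perfect = ndcg_at_k(["2", "4", "1"], qrels)   # relevant docs ranked first
degraded = ndcg_at_k(["1", "2", "4"], qrels)  # an irrelevant doc on top
print(perfect, degraded)
```

A ranking that places all relevant documents first scores 1.0, and the score drops as relevant documents are pushed further down the list.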

Documentation

For more detailed usage instructions and API documentation, please refer to the full documentation.

Contributing

Contributions are welcome! We use the GitHub pull request workflow: either open an issue first and then create a PR, or include a comprehensive commit message when opening the PR.

To get started, please create a clone of the repo (or a fork). We recommend working in a virtual environment.

python -m pip install -e ".[dev]"

pre-commit install

To test your changes, run:

pytest

License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.

Acknowledgements

Baguetter builds upon the work of several open-source projects:

  1. retriv by AmenRa: Baguetter is a fork of retriv, adapted to our needs.

  2. bm25s by xhluca: Our BM25 implementation is based on this project, which provides an efficient and effective implementation of the BM25 algorithm with different scoring functions.

  3. USearch by unum-cloud for dense retrieval.

  4. ranx by AmenRa for evaluation.

Please check out the respective repositories and show some appreciation to the authors.

Citing

@article{li2024bmx,
      title={BMX: Entropy-weighted Similarity and Semantic-enhanced Lexical Search},
      author={Xianming Li and Julius Lipp and Aamir Shakir and Rui Huang and Jing Li},
      year={2024},
      eprint={2408.06643},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2408.06643},
}
