Baguetter
Baguetter is a flexible, efficient, and hackable search engine library implemented in Python. It's designed for quickly benchmarking, implementing, and testing new search methods. Baguetter supports sparse (traditional), dense (semantic), and hybrid retrieval methods.
Note: Baguetter is not built for production use-cases or scale. For such use-cases, please check out other search engine projects.
Paper: https://arxiv.org/abs/2408.06643
Features
- Sparse retrieval using BM25 and BMX algorithms
- Dense retrieval using embeddings
- Hybrid retrieval combining sparse and dense methods
- Customizable text preprocessing pipeline
- Multi-threaded indexing and searching
- Evaluation tools for benchmarking
- Easy integration with HuggingFace datasets and models for sharing
- Hackable interface to quickly implement new methods
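To make the hybrid retrieval idea concrete, here is a minimal sketch of one common way to merge a sparse and a dense ranking: reciprocal rank fusion (RRF). This is an illustration only. The function name `reciprocal_rank_fusion`, the constant `k=60`, and the example rankings are assumptions for this sketch, and nothing here is taken from Baguetter's own hybrid implementation, which may combine results differently.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Hypothetical helper for illustration, not part of Baguetter's API.
    Each document receives 1 / (k + rank) from every list it appears in.
    """
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Made-up rankings standing in for a sparse and a dense retriever
sparse_ranking = ["2", "4", "1"]
dense_ranking = ["4", "3", "2"]

fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
fused_ids = [doc_id for doc_id, _ in fused]
print(fused_ids)
```

Rank-based fusion sidesteps the fact that sparse and dense scores live on different scales, which is why it is a frequent baseline for hybrid search.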
Installation
pip install baguetter
Quick Start
from baguetter.indices import BMXSparseIndex
# Create an index
idx = BMXSparseIndex()
# Add documents
docs = [
"We all love baguette and cheese",
"Baguette is a great bread",
"Cheese is a great source of protein",
"Baguette is a great source of carbs",
]
doc_ids = ["1", "2", "3", "4"]
idx.add_many(doc_ids, docs, show_progress=True)
# Search
results = idx.search("quick fox")
print(results)
# Search many
results = idx.search_many(["quick fox", "baguette is great"])
print(results)
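For intuition about what a sparse index does with these documents, the following is a small, self-contained BM25 sketch in plain Python. It is not Baguetter's implementation: the whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, the IDF variant, and the function names `tokenize` and `bm25_scores` are all assumptions chosen for illustration.

```python
import math
from collections import Counter

docs = {
    "1": "We all love baguette and cheese",
    "2": "Baguette is a great bread",
    "3": "Cheese is a great source of protein",
    "4": "Baguette is a great source of carbs",
}


def tokenize(text):
    # Naive lowercase + whitespace split (assumption; real pipelines do more)
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with a basic BM25 variant."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs
    # Document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores[doc_id] = score
    return scores


no_match = bm25_scores("quick fox", docs)
match = bm25_scores("baguette is great", docs)
ranking = sorted(match, key=match.get, reverse=True)
print(no_match)
print(ranking)
```

In this sketch, "quick fox" shares no terms with any document, so every score is zero, while "baguette is great" favors the shortest document that contains all three query terms.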
Evaluation
Baguetter includes tools for evaluating search performance on standard benchmarks:
from baguetter.evaluation import datasets, evaluate_retrievers
from baguetter.indices import BM25SparseIndex, BMXSparseIndex
results = evaluate_retrievers(datasets.mteb_datasets_small, {"bm25": BM25SparseIndex, "bmx": BMXSparseIndex})
results.save("eval_results")
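As background on what retrieval benchmarks typically measure, here is a minimal sketch of two common metrics, recall@k and nDCG@k with binary relevance. Which metrics `evaluate_retrievers` actually reports is not stated here, so treat this purely as an illustration; the function names and the example data are assumptions.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents found in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Normalized discounted cumulative gain with binary relevance."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    # Ideal ranking: all relevant documents placed at the top
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Made-up search result and relevance judgments
ranked = ["2", "4", "3", "1"]
relevant = {"2", "3"}

recall_2 = recall_at_k(ranked, relevant, k=2)
ndcg_4 = ndcg_at_k(ranked, relevant, k=4)
print(recall_2, ndcg_4)
```

Recall@k only checks whether relevant documents appear in the top k, whereas nDCG@k also rewards placing them earlier in the list.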
Documentation
For more detailed usage instructions and API documentation, please refer to the full documentation.
Contributing
Contributions are welcome! We use the GitHub pull request workflow. Either open an issue first and then create a PR, or include a comprehensive commit message when opening the PR.
To get started, please create a clone of the repo (or a fork). We recommend working in a virtual environment.
python -m pip install -e ".[dev]"
pre-commit install
To test your changes, run:
pytest
License
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
Acknowledgements
Baguetter builds upon the work of several open-source projects:
- retriv by AmenRa: Baguetter is a fork of retriv, adjusted to our needs.
- bm25s by xhluca: Our BM25 implementation is based on this project, which provides an efficient and effective implementation of the BM25 algorithm with different scoring functions.
- USearch by unum-cloud: used for dense retrieval.
Please check out the respective repositories and show some appreciation to the authors.
Citing
@article{li2024bmx,
title={BMX: Entropy-weighted Similarity and Semantic-enhanced Lexical Search},
author={Xianming Li and Julius Lipp and Aamir Shakir and Rui Huang and Jing Li},
year={2024},
eprint={2408.06643},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2408.06643},
}