Simple dense retrieval using SciPy, spaCy, and Sentence-Transformers
Project description
FASDR: Fast and Simple Dense Retrieval
🚧 WORK IN PROGRESS 🚧
FASDR is a simple, lightweight library for fast and efficient document retrieval. It is designed to be easy to set up and use, and is built on popular, trusted components (scipy, spacy, transformers) so that it integrates seamlessly into existing projects and "just works". It is especially well suited to small-to-medium corpora, such as retrieval-augmented prompting over FOSS documentation.
Features
- Fast and efficient dense retrieval using KDTree data structures.
- Simple interface for indexing and searching documents and sentences.
- Support for various file formats and customizable indexing options.
- Integration with spaCy and Sentence-BERT for natural language processing and sentence embeddings.
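As a rough illustration of the KDTree-based retrieval mentioned in the feature list, the sketch below builds a scipy.spatial.KDTree over a handful of toy vectors and asks for the two nearest neighbours of a query. The three-dimensional vectors stand in for real sentence embeddings, and the normalization step is an assumption made for the example; none of this is taken from FASDR's internals.

```python
import numpy as np
from scipy.spatial import KDTree

# Toy 3-dimensional "embeddings" (illustrative values, not real model output).
embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.9, 0.1, 0.0],
])
# Scale every row to unit length (an assumption for this sketch).
embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

# Build the tree once; it can then answer many nearest-neighbour queries.
tree = KDTree(embeddings)

query = np.array([1.0, 0.05, 0.0])
query = query / np.linalg.norm(query)

# distances and indices are arrays of length k, ordered nearest first.
distances, indices = tree.query(query, k=2)
print(indices, distances)
```

The indices returned by the tree are row positions in the embedding matrix, so a separate list mapping rows to documents or sentences is needed to turn them back into text.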
Installation
First, install fasdr via pip:
pip install fasdr
Next, download the spaCy language model used for sentence tokenization:
python -m spacy download en_core_web_trf
Quick Start
Indexing documents
To get started with FASDR, you can create a DocumentIndex object by passing in the root directory containing the documents you want to index:
from fasdr import DocumentIndex
index = DocumentIndex("/path/to/documents")
Once you have created the DocumentIndex object, you can search for documents or sentences using the search_documents and search_sentences methods, or restrict a sentence search to the top-ranked documents with search_sentences_targeted:
# Find the top five documents relevant to the query "climate change"
results = index.search_documents("climate change", k=5)
# Find the top 10 sentences after filtering on the top 5 documents
results = index.search_sentences_targeted("climate change", n_docs=5, n_sents=10)
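The two-stage idea behind the targeted search (rank documents first, then rank sentences only within the best documents) can be sketched in plain Python. Everything here is hypothetical: the function name search_sentences_two_stage, the toy two-dimensional vectors, and the choice to score a document by the mean of its sentence vectors are assumptions for illustration, not FASDR's actual implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_vector(vectors):
    """Component-wise mean, used here as a stand-in document embedding."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def search_sentences_two_stage(query_vec, corpus, n_docs, n_sents):
    """corpus maps a document name to a list of (sentence, vector) pairs."""
    # Stage 1: keep the n_docs documents whose mean vector is closest to the query.
    top_docs = sorted(
        corpus,
        key=lambda name: cosine(query_vec, mean_vector([v for _, v in corpus[name]])),
        reverse=True,
    )[:n_docs]
    # Stage 2: rank only the sentences that belong to those documents.
    candidates = [
        (cosine(query_vec, vec), name, sentence)
        for name in top_docs
        for sentence, vec in corpus[name]
    ]
    candidates.sort(key=lambda item: item[0], reverse=True)
    return [(name, sentence) for _, name, sentence in candidates[:n_sents]]

corpus = {
    "climate.txt": [("Sea levels are rising.", [0.9, 0.1]),
                    ("Emissions warm the planet.", [0.8, 0.3])],
    "cooking.txt": [("Simmer the sauce.", [0.1, 0.9]),
                    ("Heat changes flavor.", [0.6, 0.6])],
    "sports.txt": [("The match ended in a draw.", [0.0, 1.0])],
}
hits = search_sentences_two_stage([1.0, 0.0], corpus, n_docs=1, n_sents=2)
print(hits)
```

Filtering by document first keeps loosely related sentences from off-topic documents out of the results, at the cost of missing a relevant sentence buried in a document that scores poorly overall.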
You can customize the behavior of the DocumentIndex object by specifying options such as the model name and the file extensions to include in the index:
index = DocumentIndex(
    "/path/to/documents",
    model_name="all-MiniLM-L6-v2",
    extensions=[".txt", ".md", ".pdf"],
)
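To make the extensions option more concrete, here is one way a directory tree could be filtered by file extension using only the standard library. The helper collect_files is a hypothetical name introduced for this sketch; how FASDR itself walks the root directory is not documented here.

```python
from pathlib import Path

def collect_files(root, extensions):
    """Return every file under root whose suffix is in extensions.

    Matching is case-insensitive, so ".MD" and ".md" are treated alike
    (a choice made for this sketch).
    """
    wanted = {ext.lower() for ext in extensions}
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() in wanted
    )

# Example call (the path is a placeholder):
# collect_files("/path/to/documents", [".txt", ".md", ".pdf"])
```

Note that finding a file and reading it are separate problems: plain-text formats can be read directly, while a format such as PDF needs a dedicated text extractor.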
Design
FASDR is designed to be fast and simple, with a focus on ease of use and minimal setup. It uses KDTree data structures from SciPy for nearest-neighbor search over dense vectors, and spaCy with a Sentence-BERT component for embedding text. The library is built around two main classes:
- Document: Represents a single document and its embeddings.
- DocumentIndex: Represents an index of documents and their embeddings.
Document objects are created by passing in the path to the document file, and can be used to search for similar sentences within the document. DocumentIndex objects are created by passing in the root directory containing the documents to index, and can be used to search for similar documents or sentences across all the indexed documents.
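The design above pairs sentence embeddings with a nearest-neighbor search. Tree-based indexes typically rank by Euclidean distance, while embedding models are usually compared by cosine similarity. For unit-length vectors the two agree, because ||a - b||^2 = 2 - 2(a · b). The pure-Python sketch below checks this on toy vectors; whether FASDR normalizes its embeddings is an assumption, not something the description states.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy vectors standing in for sentence embeddings.
vectors = [normalize(v) for v in ([3.0, 1.0], [1.0, 2.0], [-1.0, 0.5])]
query = normalize([2.0, 1.0])

# Rank by increasing Euclidean distance and by decreasing cosine similarity.
by_distance = sorted(range(len(vectors)), key=lambda i: euclidean(query, vectors[i]))
by_cosine = sorted(range(len(vectors)), key=lambda i: dot(query, vectors[i]), reverse=True)

print(by_distance == by_cosine)
```

If the vectors are not normalized, the two rankings can differ, so normalizing before building the index is the simplest way to keep them consistent.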
Project details
File details
Details for the file fasdr-0.0.6.tar.gz.
File metadata
- Download URL: fasdr-0.0.6.tar.gz
- Upload date:
- Size: 8.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.16
File hashes
Algorithm | Hash digest
---|---
SHA256 | 7c2ab5a8d30f50fdd5985b80e3adf1aacd8d557cdc1c8a43b4a4490c27d4a5c4
MD5 | ea645b55c49c28d330342493a7f3e842
BLAKE2b-256 | b62795301e68e83908396f226b99cf0ef05e10749669032dbdb13d952f0eef6e
File details
Details for the file fasdr-0.0.6-py3-none-any.whl.
File metadata
- Download URL: fasdr-0.0.6-py3-none-any.whl
- Upload date:
- Size: 7.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.16
File hashes
Algorithm | Hash digest
---|---
SHA256 | bcd64be806df8ae45ecda80755d1f9b38e383437830c373775940f080acda624
MD5 | c3187771900f8124edea08442bd7a581
BLAKE2b-256 | 3b560ae5fe891bc5f8b14b628077454731f50520c49ff08a2bd31c91fe787fe9
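The SHA256 digests listed for each file can be used to check a manually downloaded archive. A minimal sketch using the standard hashlib module follows; the helper name sha256_of_file and the chunk size are choices made for this example.

```python
import hashlib

def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Example (the file name is a placeholder for a locally downloaded file):
# sha256_of_file("fasdr-0.0.6.tar.gz") should equal the SHA256 value listed for that file.
```

Installing with pip normally makes this step unnecessary, since pip can enforce hashes itself when they are pinned in a requirements file.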