Simple dense retrieval using SciPy, spaCy, and Sentence-Transformers

FASDR: Fast and Simple Dense Retrieval

🚧 WORK IN PROGRESS 🚧

FASDR is a simple, lightweight library for fast document retrieval. It is designed to be easy to set up and use, and it builds on popular, trusted components (SciPy, spaCy, Sentence-Transformers) so that it integrates into existing projects and "just works". It is especially well suited to small-to-medium corpora, such as retrieval-augmented prompting over FOSS documentation.

Features

  • Fast and efficient dense retrieval using KDTree data structures.
  • Simple interface for indexing and searching documents and sentences.
  • Support for various file formats and customizable indexing options.
  • Integration with the spaCy and Sentence-BERT libraries for natural language processing and sentence embeddings.
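
As a rough illustration of the first feature, the sketch below builds a KD-tree over a handful of made-up 3-dimensional vectors with `scipy.spatial.KDTree` and asks for the two nearest neighbours of a query vector. The vectors and the `normalize` helper are placeholders invented for this example; in practice the rows would be sentence embeddings, and this is not necessarily how FASDR builds its index internally. Scaling every vector to unit length is an assumption made here so that Euclidean nearest neighbours rank the same way as cosine similarity.

```python
import numpy as np
from scipy.spatial import KDTree


def normalize(vectors):
    """Scale each row to unit length so Euclidean distance ranks like cosine similarity."""
    vectors = np.asarray(vectors, dtype=float)
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)


# Toy "embeddings" (hypothetical values, one row per sentence).
embeddings = normalize([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

# Build the tree once; queries afterwards do not rescan every row.
tree = KDTree(embeddings)

# Look up the two rows closest to a (normalized) query vector.
query = normalize([[1.0, 0.05, 0.0]])[0]
distances, indices = tree.query(query, k=2)
```

`indices` holds the row numbers of the closest vectors, nearest first, and `distances` holds the matching Euclidean distances.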

Installation

First, install fasdr via pip:

pip install fasdr

Next, download the spaCy language model used for sentence tokenization:

python -m spacy download en_core_web_trf

Quick Start

Indexing documents

To get started with FASDR, you can create a DocumentIndex object by passing in the root directory containing the documents you want to index:

from fasdr import DocumentIndex

index = DocumentIndex("/path/to/documents")

Once you have created the DocumentIndex object, you can search for documents with search_documents, or for sentences within the best-matching documents with search_sentences_targeted:

# Find the top five documents relevant to the query "climate change"
results = index.search_documents("climate change", k=5)

# Find the top 10 sentences after filtering on the top 5 documents
results = index.search_sentences_targeted("climate change", n_docs=5, n_sents=10)
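
To make the idea of a targeted sentence search concrete, here is a minimal two-stage sketch in plain Python: rank documents first, then rank only the sentences inside the surviving documents. The `targeted_search` function, the `docs` layout, and the toy 2-dimensional vectors are all hypothetical stand-ins written for this example; they are not FASDR's internals, and the real method may score or filter differently.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def targeted_search(query_vec, docs, n_docs, n_sents):
    """Two-stage search: pick the best documents, then the best sentences inside them.

    `docs` maps a document name to (doc_vector, [(sentence, sentence_vector), ...]).
    This layout is a hypothetical stand-in, not FASDR's internal format.
    """
    # Stage 1: rank documents by similarity to the query and keep the top n_docs.
    top_docs = sorted(
        docs, key=lambda name: cosine(query_vec, docs[name][0]), reverse=True
    )[:n_docs]
    # Stage 2: score only the sentences that belong to the surviving documents.
    candidates = [
        (cosine(query_vec, vec), name, sent)
        for name in top_docs
        for sent, vec in docs[name][1]
    ]
    candidates.sort(key=lambda item: item[0], reverse=True)
    return [(name, sent) for _, name, sent in candidates[:n_sents]]


docs = {
    "climate.txt": (
        [1.0, 0.1],
        [("Sea levels are rising.", [1.0, 0.0]), ("Emissions grew last year.", [0.9, 0.3])],
    ),
    "cooking.txt": ([0.1, 1.0], [("Simmer the sauce gently.", [0.0, 1.0])]),
}
hits = targeted_search([1.0, 0.05], docs, n_docs=1, n_sents=2)
```

With `n_docs=1`, sentences from `cooking.txt` are never scored, which is the point of filtering on documents before ranking sentences.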

You can customize the behavior of the DocumentIndex object by specifying options such as the model name and the file extensions to include in the index:

index = DocumentIndex(
    "/path/to/documents",
    model_name="all-MiniLM-L6-v2",
    extensions=[".txt", ".md", ".pdf"]
)
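
For intuition about what an extension filter of this kind does, the following sketch walks a directory tree with `pathlib` and keeps only files whose suffix is in the allowed list. `collect_files` is a hypothetical helper written for this example, and the temporary files exist only to exercise it; FASDR's own file discovery may work differently.

```python
import tempfile
from pathlib import Path


def collect_files(root, extensions):
    """Return files under `root` whose suffix is in `extensions`, sorted by path."""
    wanted = {ext.lower() for ext in extensions}
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in wanted
    )


# Build a small throwaway tree to demonstrate the filter.
with tempfile.TemporaryDirectory() as root:
    base = Path(root)
    (base / "notes").mkdir()
    (base / "readme.md").write_text("# Title")
    (base / "notes" / "todo.txt").write_text("write docs")
    (base / "image.png").write_bytes(b"")  # not in the extension list, so skipped
    found = [p.relative_to(base).as_posix() for p in collect_files(base, [".txt", ".md"])]
```

Matching on the lower-cased suffix means `README.MD` and `readme.md` are treated alike, which is a choice made in this sketch rather than documented behavior.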

Design

FASDR is designed to be fast and simple, with a focus on ease of use and minimal setup. It uses SciPy's KDTree for nearest-neighbor search over dense vectors, and spaCy with a Sentence-BERT component for embedding text. The library is built around two main classes:

  • Document: Represents a single document and its embeddings.
  • DocumentIndex: Represents an index of documents and their embeddings.

Document objects are created by passing in the path to the document file, and can be used to search for similar sentences within the document. DocumentIndex objects are created by passing in the root directory containing the documents to index, and can be used to search for similar documents or sentences across all the indexed documents.
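
The relationship between the two classes can be pictured with the toy sketch below. `ToyDocument`, `ToyDocumentIndex`, `toy_embed`, and the four-word vocabulary are all invented for illustration and deliberately named differently from the real classes. The sketch takes sentences directly instead of reading files, replaces the embedding model with simple word counts, and scores a document by its best-matching sentence, which is an assumption of this example rather than a description of FASDR.

```python
from dataclasses import dataclass, field

# Toy vocabulary that stands in for a real sentence-embedding model.
VOCAB = ["climate", "ocean", "recipe", "sauce"]


def toy_embed(text):
    """Count vocabulary words in `text`; a placeholder for a real embedding."""
    words = text.lower().replace(".", "").split()
    return [float(words.count(term)) for term in VOCAB]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


@dataclass
class ToyDocument:
    """One document: its sentences plus one embedding per sentence."""
    name: str
    sentences: list
    embeddings: list = field(init=False)

    def __post_init__(self):
        self.embeddings = [toy_embed(s) for s in self.sentences]

    def search_sentences(self, query, k=1):
        """Return the k sentences in this document most similar to the query."""
        q = toy_embed(query)
        ranked = sorted(
            zip(self.sentences, self.embeddings),
            key=lambda pair: dot(q, pair[1]),
            reverse=True,
        )
        return [sentence for sentence, _ in ranked[:k]]


@dataclass
class ToyDocumentIndex:
    """A collection of documents that can be ranked against a query."""
    documents: list

    def search_documents(self, query, k=1):
        """Return the names of the k best documents for the query."""
        q = toy_embed(query)
        # Score each document by its best-matching sentence (sketch assumption).
        ranked = sorted(
            self.documents,
            key=lambda d: max(dot(q, e) for e in d.embeddings),
            reverse=True,
        )
        return [d.name for d in ranked[:k]]


index = ToyDocumentIndex([
    ToyDocument("climate.txt", ["The climate is warming.", "The ocean absorbs heat."]),
    ToyDocument("cooking.txt", ["This recipe needs a sauce."]),
])
best_doc = index.search_documents("climate ocean")
best_sentence = index.documents[0].search_sentences("ocean")
```

The split mirrors the description above: sentence-level search lives on the document, while the index only decides which documents are worth looking at.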
