Simple dense retrieval using SciPy, spaCy, and Sentence-Transformers

Development Status
- 3 - Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License

Project description

FASDR: Fast and Simple Dense Retrieval

🚧 WORK IN PROGRESS 🚧

FASDR is a simple, lightweight library for fast document retrieval. It is designed to be easy to set up and use, and is built on popular, trusted components (SciPy, spaCy, Sentence-Transformers) so that it integrates into existing projects and "just works". It is especially well suited to small-to-medium corpora, such as retrieval-augmented prompting over FOSS documentation.

Features

Fast and efficient dense retrieval using KDTree data structures.
Simple interface for indexing and searching documents and sentences.
Support for various file formats and customizable indexing options.
Integration with the spaCy and Sentence-BERT libraries for natural language processing and sentence embeddings.
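
To illustrate the KDTree-based retrieval listed above, here is a minimal sketch that uses scipy.spatial.KDTree directly. It is not FASDR's internal code: the toy 3-dimensional vectors stand in for real sentence embeddings, and the helper names normalize and top_k are hypothetical. The vectors are scaled to unit length so that Euclidean nearest neighbors follow the same order as cosine similarity.

```python
import numpy as np
from scipy.spatial import KDTree


def normalize(vectors):
    """Scale each row to unit length so Euclidean order matches cosine order."""
    vectors = np.asarray(vectors, dtype=float)
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)


# Toy "embeddings" (assumption: real ones would come from a sentence encoder).
corpus = ["solar power", "wind energy", "pasta recipes"]
embeddings = normalize([
    [0.9, 0.1, 0.0],
    [0.8, 0.3, 0.1],
    [0.0, 0.1, 0.9],
])

tree = KDTree(embeddings)


def top_k(query_vector, k=2):
    """Return (text, distance) pairs for the k nearest corpus entries."""
    query = normalize([query_vector])[0]
    distances, indices = tree.query(query, k=k)
    # With k=1 SciPy returns scalars, so promote both results to 1-D arrays.
    distances = np.atleast_1d(distances)
    indices = np.atleast_1d(indices)
    return [(corpus[i], float(d)) for i, d in zip(indices, distances)]


print(top_k([1.0, 0.2, 0.0], k=2))
```

A KD-tree answers nearest-neighbor queries without comparing the query against every vector, which tends to work well for corpora of modest size.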

Installation

First, install fasdr via pip:

pip install fasdr

Next, download the spaCy language model used for sentence tokenization:

python -m spacy download en_core_web_trf

Quick Start

Indexing documents

To get started with FASDR, you can create a DocumentIndex object by passing in the root directory containing the documents you want to index:

from fasdr import DocumentIndex

index = DocumentIndex("/path/to/documents")

Once you have created the DocumentIndex object, you can search for documents or sentences using methods such as search_documents and search_sentences_targeted:

# Find the top 5 documents relevant to the query "climate change"
doc_results = index.search_documents("climate change", k=5)

# Find the top 10 sentences, searching only within the top 5 documents
sent_results = index.search_sentences_targeted("climate change", n_docs=5, n_sents=10)
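
The targeted search above can be read as a two-stage ranking: shortlist documents first, then rank sentences drawn only from that shortlist. The following is a minimal sketch of that idea under stated assumptions, not FASDR's implementation. The function names cosine and search_targeted, the dictionary layout, and the 2-dimensional toy vectors are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search_targeted(query_vec, docs, n_docs=2, n_sents=3):
    """Rank documents first, then rank sentences from the top documents only.

    `docs` maps a name to {"vector": [...], "sentences": [(text, vector), ...]}.
    """
    # Stage 1: keep the n_docs documents most similar to the query.
    ranked = sorted(
        docs,
        key=lambda name: cosine(query_vec, docs[name]["vector"]),
        reverse=True,
    )
    shortlisted = ranked[:n_docs]

    # Stage 2: score every sentence in the shortlisted documents.
    candidates = [
        (cosine(query_vec, vec), name, text)
        for name in shortlisted
        for text, vec in docs[name]["sentences"]
    ]
    candidates.sort(key=lambda item: item[0], reverse=True)
    return [(name, text) for _, name, text in candidates[:n_sents]]


# Toy data (assumption: vectors would normally come from an embedding model).
docs = {
    "climate.md": {
        "vector": [1.0, 0.0],
        "sentences": [
            ("Warming is accelerating.", [0.9, 0.1]),
            ("Sea levels are rising.", [0.8, 0.2]),
        ],
    },
    "cooking.md": {
        "vector": [0.0, 1.0],
        "sentences": [("Boil the pasta.", [0.1, 0.9])],
    },
    "energy.md": {
        "vector": [0.7, 0.3],
        "sentences": [("Solar output grew.", [0.7, 0.3])],
    },
}

hits = search_targeted([1.0, 0.1], docs, n_docs=2, n_sents=3)
for name, text in hits:
    print(name, "|", text)
```

Restricting the sentence stage to a few documents keeps the candidate set small, at the cost of missing relevant sentences in documents that did not make the shortlist.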

You can customize the behavior of the DocumentIndex object by specifying options such as the model name and the file extensions to include in the index:

index = DocumentIndex(
    "/path/to/documents",
    model_name="all-MiniLM-L6-v2",
    extensions=[".txt", ".md", ".pdf"]
)
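
For intuition about the extensions option, the sketch below shows one way a directory walk could select files by suffix using only the standard library. It is an illustration under assumptions, not FASDR's code: the helper find_documents and its default extensions are hypothetical, and it only lists files without parsing them.

```python
import tempfile
from pathlib import Path


def find_documents(root, extensions=(".txt", ".md")):
    """Recursively collect files under `root` whose suffix is in `extensions`."""
    wanted = {ext.lower() for ext in extensions}
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() in wanted
    )


# Demo on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "a.txt").write_text("plain text")
    (root / "sub" / "b.MD").write_text("markdown")
    (root / "c.png").write_text("not a document")
    found_names = [path.name for path in find_documents(root)]

print(found_names)
```

Matching on the lowercased suffix means that files such as b.MD are picked up alongside b.md.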

Design

FASDR is designed to be fast and simple, with a focus on ease of use and minimal setup. It uses SciPy's KDTree for nearest-neighbor search over dense vectors, and spaCy with the Sentence-BERT component for embedding text. The library is built around two main classes:

Document: Represents a single document and its embeddings.
DocumentIndex: Represents an index of documents and their embeddings.

Document objects are created by passing in the path to the document file, and can be used to search for similar sentences within the document. DocumentIndex objects are created by passing in the root directory containing the documents to index, and can be used to search for similar documents or sentences across all the indexed documents.

Release history

0.0.6 (this version): Apr 20, 2023
0.0.5: Apr 20, 2023
0.0.4: Apr 19, 2023
0.0.2: Apr 19, 2023
0.0.1: Apr 19, 2023

Download files

Source Distribution

fasdr-0.0.6.tar.gz (8.8 kB)

Uploaded Apr 20, 2023 Source

Built Distribution

fasdr-0.0.6-py3-none-any.whl (7.7 kB)

Uploaded Apr 20, 2023 Python 3

File details

Details for the file fasdr-0.0.6.tar.gz.

File metadata

Download URL: fasdr-0.0.6.tar.gz
Upload date: Apr 20, 2023
Size: 8.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.9.16

File hashes

Hashes for fasdr-0.0.6.tar.gz
Algorithm	Hash digest
SHA256	`7c2ab5a8d30f50fdd5985b80e3adf1aacd8d557cdc1c8a43b4a4490c27d4a5c4`
MD5	`ea645b55c49c28d330342493a7f3e842`
BLAKE2b-256	`b62795301e68e83908396f226b99cf0ef05e10749669032dbdb13d952f0eef6e`

File details

Details for the file fasdr-0.0.6-py3-none-any.whl.

File metadata

Download URL: fasdr-0.0.6-py3-none-any.whl
Upload date: Apr 20, 2023
Size: 7.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.9.16

File hashes

Hashes for fasdr-0.0.6-py3-none-any.whl
Algorithm	Hash digest
SHA256	`bcd64be806df8ae45ecda80755d1f9b38e383437830c373775940f080acda624`
MD5	`c3187771900f8124edea08442bd7a581`
BLAKE2b-256	`3b560ae5fe891bc5f8b14b628077454731f50520c49ff08a2bd31c91fe787fe9`
