
A retrieval package built on top of open-source models

Project description

open_retrieval

  • A Python package built on langchain that abstracts RAG (Retrieval-Augmented Generation) utilities, providing unified document loaders, text splitters, embedding models, vector databases, and retrievers built around open-source models.
  • No API keys needed.

BENEFITS

  • Unified Interface: Simplifies complex processes by providing a unified interface for various RAG utilities.
  • Flexibility: Supports multiple document sources, splitting methods, embedding providers, vector databases, and retrievers, adapting to diverse use cases.
  • Scalability: Designed to accommodate future functionalities, ensuring long-term viability and relevance.

INSTALL AND RUN

pip install open_retrieval

Document Loaders

The DocumentLoader class is used to load documents from different sources and return them as a list of strings. Supported sources include CSV, JSON, PDF, HTML, Markdown, Word, and PowerPoint files.

The purpose of the DocumentLoader class is to provide a single interface for loading documents from various sources and to ensure that the process of loading documents is consistent across different sources. This allows the code that uses the DocumentLoader class to be more flexible and easier to maintain.

By using the DocumentLoader class, the code can be written in a way that is independent of the specific source of the document. This makes it easier to modify or extend the code in the future, as new sources of documents can be added without affecting the rest of the code.

Example usage:

import os
from open_retrieval.document_loaders import DocumentLoader

loader = DocumentLoader()

# Load a local file
file_path = os.path.join("path/to/file.csv")
data = loader.load(file_path)

# Load an HTML page from a URL
url = "https://example.com/document.html"
data = loader.load(url)
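
The other supported formats are loaded through the same call; a minimal sketch, assuming load() infers the source type from the file extension (the paths below are placeholders):

# Sketch: other supported formats through the same interface.
# Assumes load() dispatches on the file extension; paths are placeholders.
pdf_data = loader.load("path/to/report.pdf")
docx_data = loader.load("path/to/notes.docx")
pptx_data = loader.load("path/to/slides.pptx")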

Text Splitters

The TextSplitter class is used to split text into a list of document objects. It can be used to preprocess text data before indexing it in a vector database.

The TextSplitter class provides a number of different splitters, including splitters that split on HTML headers, characters, Markdown headers, paragraphs (recursive), or tokens. The splitters can be configured with arguments that control the splitting process, such as the maximum chunk length or the set of headers to split on, and they can also attach extra metadata to the resulting chunks.

The TextSplitter class is designed to be flexible and can be used with a wide range of text data, including HTML documents, Markdown documents, and plain text. It is also designed to accommodate additional splitters in the future.

Example usage:

import os
from open_retrieval.document_loaders import DocumentLoader
from open_retrieval.text_splitters import TextSplitter

document_loader = DocumentLoader()
splitter = TextSplitter(splitter="recursive")

file_path = os.path.join("path/to/file.csv")
data = document_loader.load(file_path)

# Optional metadata attached to every resulting chunk
extra_metadata = {"file_name": "file"}
documents = splitter.split(data, chunk_size=800, chunk_overlap=0, extra_metadata=extra_metadata)
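
The other splitters are selected the same way through the splitter argument; below is a minimal sketch of token-based splitting, assuming the splitter name "tokens" and the same split() signature (neither is verified against the package):

# Sketch: token-based splitting. The splitter name "tokens" and the reuse of
# chunk_size / chunk_overlap are assumptions based on the description above.
token_splitter = TextSplitter(splitter="tokens")
token_documents = token_splitter.split(data, chunk_size=256, chunk_overlap=32)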

Embedding Providers

The EmbeddingProvider class is responsible for providing different embedding functions based on the embedding_provider specified. It is initialized with the chosen embedding_provider and exposes a get_embedding_function method that returns the embedding function for the model_name.

Embedding models from huggingface, fastembed, and ollama are currently supported.

Example usage:

import os
from open_retrieval.document_loaders import DocumentLoader
from open_retrieval.text_splitters import TextSplitter
from open_retrieval.embedding_providers import EmbeddingProvider

# Load the documents (from a local file or a URL)
loader = DocumentLoader()
file_path = os.path.join("path/to/file.csv")
data = loader.load(file_path)
url = "https://example.com/document.html"
data = loader.load(url)

# Split the text
splitter = TextSplitter(splitter="recursive")
extra_metadata = {"file_name": "example"}
documents = splitter.split(data, chunk_size=800, chunk_overlap=0, extra_metadata=extra_metadata)

# Get the embedding function
embedding_type = 'huggingface'  # or 'fastembed' / 'ollama'
embedding_provider = EmbeddingProvider(embedding_provider=embedding_type)
embedding_function = embedding_provider.get_embedding_function()
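
The description above suggests the embedding model is resolved by name; here is a minimal sketch with ollama, where the model_name argument and the model identifier are assumptions rather than verified API:

# Sketch: a different provider with an explicit model. The model_name argument
# and the "nomic-embed-text" identifier are assumptions; check the package docs.
ollama_provider = EmbeddingProvider(embedding_provider="ollama")
ollama_embedding_function = ollama_provider.get_embedding_function(model_name="nomic-embed-text")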

Vector Databases

The purpose of the VectorDatabase class is to manage different vector databases, such as chroma, milvus, qdrant, faiss, or array. It provides a consistent interface for creating and managing indexes across these databases.

Example usage:

import os
from open_retrieval.document_loaders import DocumentLoader
from open_retrieval.text_splitters import TextSplitter
from open_retrieval.embedding_providers import EmbeddingProvider
from open_retrieval.vector_databases import VectorDatabase

# Load the documents (from a local file or a URL)
loader = DocumentLoader()
file_path = os.path.join("path/to/file.csv")
data = loader.load(file_path)
url = "https://example.com/document.html"
data = loader.load(url)

# Split the text
splitter = TextSplitter(splitter="recursive")
extra_metadata = {"file_name": "example"}
all_documents = splitter.split(data, chunk_size=800, chunk_overlap=0, extra_metadata=extra_metadata)

# Get the embedding function
embedding_type = 'huggingface'  # or 'fastembed' / 'ollama'
embedding_provider = EmbeddingProvider(embedding_provider=embedding_type)
embedding_function = embedding_provider.get_embedding_function()

# Embed the documents and build the index
database = 'chroma'  # type of vector database
index_dir = 'tests/index/'  # folder where the indexes are stored
index_name = f"{database}_index_{embedding_type}"  # name of the index
vector_database = VectorDatabase(vector_store=database)
vector_index = vector_database.create_index(embedding_function=embedding_function, docs=all_documents, index_name=index_name, index_dir=index_dir)
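
The same call pattern should carry over to the other back ends; a minimal sketch with faiss, assuming only the vector_store name needs to change:

# Sketch: building the same index with a different back end. Whether faiss
# needs any extra configuration is not verified here.
faiss_database = VectorDatabase(vector_store='faiss')
faiss_index = faiss_database.create_index(embedding_function=embedding_function, docs=all_documents, index_name=f"faiss_index_{embedding_type}", index_dir=index_dir)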

Retrievers

The purpose of the Retriever class is to manage different retrieval techniques, such as naive_retrieval and ranked_retrieval, behind a consistent interface. Reranking uses the unified rerankers API by answerdotai: https://github.com/AnswerDotAI/rerankers

Example usage:

import os
from open_retrieval.document_loaders import DocumentLoader
from open_retrieval.text_splitters import TextSplitter
from open_retrieval.embedding_providers import EmbeddingProvider
from open_retrieval.vector_databases import VectorDatabase
from open_retrieval.retrievers import Retriever
from rerankers import Reranker

# Load the documents (from a local file or a URL)
loader = DocumentLoader()
file_path = os.path.join("path/to/file.csv")
data = loader.load(file_path)
url = "https://example.com/document.html"
data = loader.load(url)

# Split the text
splitter = TextSplitter(splitter="recursive")
extra_metadata = {"file_name": "who_guidelines"}  # metadata attached to every chunk (used by the filter below)
all_documents = splitter.split(data, chunk_size=800, chunk_overlap=0, extra_metadata=extra_metadata)

# Get the embedding function
embedding_type = 'huggingface'  # or 'fastembed' / 'ollama'
embedding_provider = EmbeddingProvider(embedding_provider=embedding_type)
embedding_function = embedding_provider.get_embedding_function()

# Embed the documents and build the index
database = 'chroma'  # type of vector database
index_dir = 'tests/index/'  # folder where the indexes are stored
index_name = f"{database}_index_{embedding_type}"  # name of the index
vector_database = VectorDatabase(vector_store=database)
vector_index = vector_database.create_index(embedding_function=embedding_function, docs=all_documents, index_name=index_name, index_dir=index_dir)

# Retrieve the top-k contexts
retrieval_type = 'ranked'  # type of retrieval: naive or ranked
ranking_model = "colbert"  # reranking model to use
filter_params = {'file_name': 'who_guidelines'}  # metadata filter
query = "What do the guidelines recommend?"  # example query
ranker = Reranker(ranking_model, verbose=0)
retriever = Retriever(vector_index=vector_index, ranker=ranker)
results = retriever.ranked_retrieval(query=query, top_k=15, filter=filter_params)
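
For comparison, naive retrieval (mentioned above) would skip the reranking step; a minimal sketch, assuming naive_retrieval mirrors the ranked_retrieval arguments and that the ranker is optional, neither of which is verified:

# Sketch: naive (non-reranked) retrieval. The method name comes from the
# description above; the arguments mirror ranked_retrieval and are assumptions.
naive_retriever = Retriever(vector_index=vector_index)
naive_results = naive_retriever.naive_retrieval(query=query, top_k=15, filter=filter_params)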

CONTRIBUTE

Feel free to contribute to open_retrieval by submitting bug reports, feature requests, or pull requests on GitHub.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

open_retrieval-0.0.2.tar.gz (21.9 kB)

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

open_retrieval-0.0.2-py3-none-any.whl (19.4 kB)

File details

Details for the file open_retrieval-0.0.2.tar.gz.

File metadata

  • Download URL: open_retrieval-0.0.2.tar.gz
  • Upload date:
  • Size: 21.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.0 CPython/3.10.14

File hashes

Hashes for open_retrieval-0.0.2.tar.gz

  • SHA256: 3aeba37df6811901c20015c4272a6c34bcdfa8c1f34ced1cb015efa5715bd489
  • MD5: 5681363b105fa3a22b1f932bff2923e5
  • BLAKE2b-256: fc2545c945cc44b6e58fe51d9135f974ae8827c34fe4032111794f46226b71aa

File details

Details for the file open_retrieval-0.0.2-py3-none-any.whl.

File metadata

  • Download URL: open_retrieval-0.0.2-py3-none-any.whl
  • Upload date:
  • Size: 19.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.0 CPython/3.10.14

File hashes

Hashes for open_retrieval-0.0.2-py3-none-any.whl

  • SHA256: 6177034ead657a2e7ddf384d26129969efb0bf297145746e103a722233d0f3b9
  • MD5: 02f45b556f1c465f67f893b8fd231d71
  • BLAKE2b-256: 11dfed1629042072b275e9a455a394d77e72013845f2b3d9aa90cfc4202a18f0

