A package that abstracts most of the utilities used in RAG applications

Project description

ragutils

  • A Python package that abstracts RAG (Retrieval-Augmented Generation) utilities, providing unified document loaders, embedding models, text splitters, and vector databases in one package.
  • Retrievers and rerankers are coming soon.
  • Built on LangChain.

BENEFITS

  • Unified Interface: Simplifies complex processes by providing a unified interface for various RAG utilities.
  • Flexibility: Supports multiple document sources, splitting methods, embedding providers, and vector databases, enhancing adaptability to diverse use cases.
  • Scalability: Designed to accommodate future functionality, such as the planned retrievers and rerankers.

INSTALL AND RUN

pip install ragutils

Document Loaders

The DocumentLoader class loads documents from different sources and returns them as a list of strings. The supported sources include CSV, JSON, PDF, HTML, Markdown, PowerPoint, and Word documents.

DocumentLoader provides a single, consistent interface for loading documents, so the calling code does not depend on the specific source of a document. New document sources can be added without affecting the rest of the code, which makes that code more flexible and easier to maintain and extend.

Example usage:

import os
from ragutils.document_loader import DocumentLoader

loader = DocumentLoader()

# Load a local file
file_path = os.path.join("path", "to", "file.csv")
data = loader.load_and_split(file_path)

# Load a document from a URL
url = "https://example.com/document.html"
data = loader.load(url)
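
To illustrate the single-interface idea described in this section, here is a minimal sketch of extension-based dispatch using only the Python standard library. It is not ragutils' implementation: the names `LOADERS`, `load_any`, `load_csv`, `load_json`, and `load_text` are hypothetical, and only a few simple formats are covered.

```python
import csv
import json
from pathlib import Path
from typing import Callable, Dict, List


def load_csv(path: Path) -> List[str]:
    """Return each CSV row as one comma-joined string."""
    with path.open(newline="", encoding="utf-8") as handle:
        return [", ".join(row) for row in csv.reader(handle)]


def load_json(path: Path) -> List[str]:
    """Return each top-level JSON item serialized as a string."""
    with path.open(encoding="utf-8") as handle:
        payload = json.load(handle)
    items = payload if isinstance(payload, list) else [payload]
    return [json.dumps(item, sort_keys=True) for item in items]


def load_text(path: Path) -> List[str]:
    """Return the whole file as a single string."""
    return [path.read_text(encoding="utf-8")]


# Hypothetical registry: maps a file extension to its loader function.
LOADERS: Dict[str, Callable[[Path], List[str]]] = {
    ".csv": load_csv,
    ".json": load_json,
    ".md": load_text,
    ".txt": load_text,
}


def load_any(file_path: str) -> List[str]:
    """Pick a loader from the file extension and return a list of strings."""
    path = Path(file_path)
    try:
        loader = LOADERS[path.suffix.lower()]
    except KeyError:
        raise ValueError(f"Unsupported file type: {path.suffix!r}") from None
    return loader(path)
```

With this layout, supporting another format only means registering one more function in `LOADERS`; code that calls `load_any` stays unchanged.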

Text Splitters

The TextSplitter class is used to split text into a list of document objects. It can be used to preprocess the text data before indexing it to a vector database.

The TextSplitter class provides a number of different splitters, including splitters that split based on HTML headers, Markdown headers, character boundaries, recursive character boundaries, and semantic chunks. The splitters can be configured with arguments to control the splitting process, such as the maximum length of the chunks or the set of headers to split on.

The TextSplitter class works with a wide range of text data, including HTML documents, Markdown documents, and plain text, and it is designed so that more splitters can be added in the future.

Example usage:

import os

from ragutils.document_loader import DocumentLoader
from ragutils.text_splitter import TextSplitter

loader = DocumentLoader()
splitter = TextSplitter(splitter="recursive")

file_path = os.path.join("path", "to", "file.csv")
data = loader.load_and_split(file_path)
documents = splitter.split_to_documents(data=data, chunk_size=1000, chunk_overlap=20)
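
To make the roles of chunk_size and chunk_overlap concrete, the sketch below splits a string into fixed-size character windows that share a few characters with their neighbours. This is a deliberately simple stand-in, not the recursive splitter used above and not ragutils' code; the function name `split_with_overlap` is hypothetical.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, which is the effect chunk_overlap is meant to have: context at a chunk boundary is not lost entirely.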

Embedding Providers

The EmbeddingProvider class provides embedding functions based on the embedding_provider specified in the settings file. It is initialized with that embedding_provider, and its get_embedding_function method returns the embedding function for the given model_name.

Example usage:

import os

from ragutils.document_loader import DocumentLoader
from ragutils.text_splitter import TextSplitter
from ragutils.embedding_provider import EmbeddingProvider

# Load the data (the second call replaces the result of the first)
loader = DocumentLoader()
file_path = os.path.join("path", "to", "file.csv")
data = loader.load_and_split(file_path)
url = "https://example.com/document.html"
data = loader.load(url)

# Split the text
splitter = TextSplitter(splitter="recursive")
documents = splitter.split_to_documents(data=data, chunk_size=1000, chunk_overlap=20)

# Get the embedding function
embedding_provider = EmbeddingProvider(embedding_provider="openai")
embedding_function = embedding_provider.get_embedding_function()
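
The provider-lookup pattern described in this section can be sketched as a small registry that maps a provider name to a factory. Everything here is an assumption made for illustration: `PROVIDERS`, `make_hash_embedding`, and the "toy-hash" provider are hypothetical, and the hash-based vectors carry no semantic meaning; they only stand in for a real embedding model.

```python
import hashlib
from typing import Callable, Dict, List

EmbeddingFunction = Callable[[str], List[float]]


def make_hash_embedding(dimensions: int = 8) -> EmbeddingFunction:
    """Build a toy, deterministic embedding function (stand-in for a real model)."""
    if not 1 <= dimensions <= 32:
        raise ValueError("dimensions must be between 1 and 32 (SHA-256 yields 32 bytes)")

    def embed(text: str) -> List[float]:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        # Scale the first `dimensions` bytes of the digest into [0, 1].
        return [byte / 255 for byte in digest[:dimensions]]

    return embed


# Hypothetical registry: provider name -> factory that builds an embedding function.
PROVIDERS: Dict[str, Callable[[], EmbeddingFunction]] = {
    "toy-hash": make_hash_embedding,
}


def get_embedding_function(provider: str) -> EmbeddingFunction:
    """Look up the provider by name and return its embedding function."""
    try:
        factory = PROVIDERS[provider]
    except KeyError:
        raise ValueError(f"Unknown embedding provider: {provider!r}") from None
    return factory()


embed = get_embedding_function("toy-hash")
vector = embed("hello world")
print(len(vector))  # 8
```

Because callers only ever receive a function from text to a vector, swapping one provider for another does not change the code that consumes the embeddings.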

Vector Databases

The VectorDatabase class manages different vector databases, such as Chroma, Pinecone, Milvus, Qdrant, DocArrayInMemorySearch, and FAISS. It provides a consistent interface for creating and managing indexes across them.

Example usage:

import os

from ragutils.document_loader import DocumentLoader
from ragutils.text_splitter import TextSplitter
from ragutils.embedding_provider import EmbeddingProvider
from ragutils.vector_databases import VectorDatabase

# Load the data (the second call replaces the result of the first)
loader = DocumentLoader()
file_path = os.path.join("path", "to", "file.csv")
data = loader.load_and_split(file_path)
url = "https://example.com/document.html"
data = loader.load(url)

# Split the text
splitter = TextSplitter(splitter="recursive")
documents = splitter.split_to_documents(data=data, chunk_size=1000, chunk_overlap=20)

# Get the embedding function
embedding_provider = EmbeddingProvider(embedding_provider="openai")
embedding_function = embedding_provider.get_embedding_function()

# Embed the documents and create the index
index_dir = 'tests/index/'
vector_database = VectorDatabase(vector_store='chroma', index_name='test_index')
vector_database.create_index(embedding_function, documents, index_dir)
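
As a rough picture of what a vector index does with an embedding function and a set of documents, the sketch below keeps vectors in a Python list and ranks them by cosine similarity. It is an illustration under stated assumptions, not how ragutils or backends such as Chroma work internally: `InMemoryIndex` and `letter_counts` are hypothetical names, and the letter-count "embedding" only stands in for a real model.

```python
import math
from typing import Callable, Iterable, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryIndex:
    """Hypothetical minimal vector store: keeps (vector, text) pairs in a list."""

    def __init__(self, embedding_function: Callable[[str], List[float]]) -> None:
        self._embed = embedding_function
        self._entries: List[Tuple[List[float], str]] = []

    def add(self, texts: Iterable[str]) -> None:
        """Embed each text and store it next to its vector."""
        for text in texts:
            self._entries.append((self._embed(text), text))

    def search(self, query: str, k: int = 1) -> List[str]:
        """Return the k stored texts most similar to the query."""
        query_vector = self._embed(query)
        ranked = sorted(
            self._entries,
            key=lambda entry: cosine_similarity(query_vector, entry[0]),
            reverse=True,
        )
        return [text for _, text in ranked[:k]]


def letter_counts(text: str) -> List[float]:
    """Toy embedding: counts of the letters a-z, standing in for a real model."""
    lowered = text.lower()
    return [float(lowered.count(chr(ord("a") + i))) for i in range(26)]


index = InMemoryIndex(letter_counts)
index.add(["apple banana", "quartz quiz", "hello world"])
print(index.search("banana", k=1))  # ['apple banana']
```

A real vector database replaces the linear scan with an approximate nearest-neighbour structure and persists the index to disk, but the add-then-search shape of the interface is the same.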

Retrievers (Coming Soon)

Stay tuned for retriever functionalities.

Rerankers (Coming Soon)

Stay tuned for reranker functionalities.

CONTRIBUTE

Feel free to contribute to ragutils by submitting bug reports, feature requests, or pull requests on GitHub.

Download files

Source Distribution

ragutils-0.1.0.tar.gz (19.8 kB)

Uploaded Source

Built Distribution

ragutils-0.1.0-py3-none-any.whl (16.2 kB)

Uploaded Python 3

File details

Details for the file ragutils-0.1.0.tar.gz.

File metadata

  • Download URL: ragutils-0.1.0.tar.gz
  • Upload date:
  • Size: 19.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.12.3

File hashes

Hashes for ragutils-0.1.0.tar.gz

  • SHA256: 69f01f15f335f17c1271720076dfb39ea98b329382985b9e4082641212e25793
  • MD5: 71ac9e82056d86934756909648804e2b
  • BLAKE2b-256: be27938d7d8b81780ec83821a0afc2114a7bceb5043d166010a5f44642cc6d06

File details

Details for the file ragutils-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: ragutils-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 16.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.12.3

File hashes

Hashes for ragutils-0.1.0-py3-none-any.whl

  • SHA256: b8ca745ef9287ddcd46e240e4a867df1290073528e316b900b5bb88f87ac363a
  • MD5: 11fa3fcf3a0ad14efd8f44ecefa0d42e
  • BLAKE2b-256: 93ed4882a02d9f84b7901ce24c8607a57c000a32d44541936975a08a71d7f28d
