An open-source Python library that speeds up RAG data ingestion from Dell PowerScale storage by skipping files that have already been processed.

PowerScale RAG Connector

The PowerScale RAG Connector is an open-source Python library designed to improve RAG application performance during data ingestion by skipping files that have already been processed. It relies on PowerScale's MetadataIQ capability, which identifies changed files within the OneFS filesystem and publishes that information to Elasticsearch, where it can be queried easily.
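
The idea of skipping already-processed files can be illustrated with a minimal sketch. It is not the connector's implementation: the `FileRecord` class, its `change_time` field, and the `files_changed_since` helper are hypothetical names chosen for illustration, and the real metadata published by MetadataIQ may look different.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class FileRecord:
    """Hypothetical stand-in for one metadata record describing a file."""
    path: str
    change_time: float  # seconds since the epoch (assumed field)


def files_changed_since(records: Iterable[FileRecord], last_run: float) -> List[str]:
    """Return the paths whose change time is later than the previous ingestion run."""
    return [r.path for r in records if r.change_time > last_run]


records = [
    FileRecord("/ifs/data/report.pdf", change_time=100.0),
    FileRecord("/ifs/data/notes.txt", change_time=250.0),
    FileRecord("/ifs/data/slides.pptx", change_time=400.0),
]

# Only files changed after the previous run (t=200) need to be chunked and embedded again.
to_process = files_changed_since(records, last_run=200.0)
print(to_process)
```

Files that have not changed since the last run never reach the chunking and embedding stages, which is where most of the ingestion cost usually lies.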

Developers can integrate the PowerScale RAG Connector directly within a LangChain RAG application as a supported document loader or use it independently as a generic Python class.

Workflow

Figure 1: How the PowerScale RAG Connector integrates with LangChain and NVIDIA AI Enterprise software.

Audience

This document is intended for software developers, machine learning scientists, and AI developers who will use files from PowerScale when developing a RAG application.

Overview

This guide is divided into two sections: setting up the environment and using the connector. Note that system administration privileges are required for the initial configuration on PowerScale, which may need to be performed by PowerScale administrators.

Terminology

Term: Definition
RAG: Retrieval-Augmented Generation. A technique that gives an off-the-shelf large language model (LLM) context from data it has no knowledge of.
LangChain: An open-source Python and JavaScript framework that helps developers create RAG applications.
NVIDIA NIM Services: Part of NVIDIA AI Enterprise, a set of microservices that can optionally be used to chunk and embed files efficiently on GPUs. Their output can be stored in a vector database for a RAG framework to use.
NV-Ingest: An NVIDIA NIM microservice that ingests complex office documents containing tables and figures and produces chunks and embeddings to be stored in a vector database.
Chunking: The process of splitting a source file into smaller, context-aware pieces that can be searched and converted into vectors. Example: a chunk could be each paragraph within a large office document.
Embedding: Turning a chunk of data into a vector on which vector operations, such as similarity, can be performed.
MetadataIQ: A new feature in PowerScale OneFS 9.10 that periodically saves filesystem metadata to an external database such as Elasticsearch.
PowerScale RAG Connector: An open-source connector that integrates with LangChain to improve data ingestion when data resides on PowerScale.
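
To make the chunking and embedding terms concrete, here is a small self-contained sketch. It splits text into paragraph chunks and compares vectors with cosine similarity. The `toy_embed` function is a deliberately simple word-count stand-in, not how NV-Ingest or any real embedding model works; all function names here are illustrative.

```python
import math
from typing import List, Sequence


def chunk_by_paragraph(text: str) -> List[str]:
    """Split text on blank lines and drop empty pieces."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def toy_embed(chunk: str, vocabulary: Sequence[str]) -> List[float]:
    """Toy embedding: count how often each vocabulary word occurs in the chunk."""
    words = chunk.lower().split()
    return [float(words.count(term)) for term in vocabulary]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


text = "Storage clusters scale out.\n\nVector search finds similar chunks.\n\n"
vocabulary = ["storage", "vector", "search"]

chunks = chunk_by_paragraph(text)
query_vector = toy_embed("vector search", vocabulary)

# Score every chunk against the query; the highest score is the best match.
scores = [cosine_similarity(query_vector, toy_embed(c, vocabulary)) for c in chunks]
best_chunk = chunks[scores.index(max(scores))]
print(best_chunk)
```

In a real pipeline the vectors come from an embedding model and are stored in a vector database, but the retrieval step is the same similarity comparison.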

Installation

pip install powerscale-rag-connector

Installing NVIDIA Ingest Client

To use the NVIDIA Ingest client with the PowerScale RAG Connector, you'll need to install the NVIDIA Ingest client library. This code has been tested with nv-ingest v24.12.1.

For more detailed information about the NVIDIA Ingest client library, refer to the official NVIDIA NV-Ingest client documentation.

Usage

The PowerScale RAG Connector can be used in two ways:

  1. As a LangChain document loader
  2. As a standalone Python class

Using as a LangChain Document Loader

from powerscale_rag_connector import PowerScaleDocumentLoader

# Initialize the loader
loader = PowerScaleDocumentLoader(
    es_host_url="http://elasticsearch:9200",
    es_index_name="metadataiq",
    es_api_key="your-api-key",
    folder_path="/ifs/data"
)

# Load documents
documents = loader.load()
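
Rather than hard-coding the API key, the loader arguments can be read from the environment. The sketch below builds the same four keyword arguments used above. The environment variable names (`ES_HOST_URL`, `ES_INDEX_NAME`, `ES_API_KEY`, `POWERSCALE_FOLDER_PATH`) and the `loader_settings` helper are assumptions for illustration; they are not defined by the connector.

```python
import os
from typing import Dict, Mapping


def loader_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect loader keyword arguments from environment variables.

    The variable names are hypothetical; rename them to fit your deployment.
    """
    return {
        "es_host_url": env.get("ES_HOST_URL", "http://elasticsearch:9200"),
        "es_index_name": env.get("ES_INDEX_NAME", "metadataiq"),
        # No default for the secret: a missing key raises KeyError immediately.
        "es_api_key": env["ES_API_KEY"],
        "folder_path": env.get("POWERSCALE_FOLDER_PATH", "/ifs/data"),
    }


# A plain dict stands in for os.environ so the example runs anywhere.
settings = loader_settings({"ES_API_KEY": "example-key"})
print(sorted(settings))

# The result can be unpacked into the loader, for example:
# loader = PowerScaleDocumentLoader(**settings)
```

Keeping the key out of source code also keeps it out of version control.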

Using as a Standalone Path Loader

from powerscale_rag_connector import PowerScalePathLoader

# Initialize the loader
loader = PowerScalePathLoader(
    es_host_url="http://elasticsearch:9200",
    es_index_name="metadataiq",
    es_api_key="your-api-key",
    folder_path="/ifs/data"
)

# Get changed files
changed_files = loader.lazy_load()
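
If `lazy_load()` yields results one at a time, as the LangChain loader convention suggests (an assumption here, not something verified against the connector), the results can be processed in fixed-size batches without holding the full list in memory. The `batched` helper and the `fake_changed_files` generator are illustrative stand-ins.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def fake_changed_files() -> Iterator[str]:
    """Stand-in for loader.lazy_load(); yields five hypothetical paths."""
    for i in range(5):
        yield f"/ifs/data/file{i}.txt"


batches = list(batched(fake_changed_files(), size=2))
for batch in batches:
    # Replace this print with the chunking and embedding step for the batch.
    print(batch)
```

Batching keeps memory use bounded when a large number of files changed since the previous run.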

Examples

Check out the examples directory for complete usage examples.

Components

The connector consists of several modules:

Requirements

  • Python 3.8+
  • Elasticsearch client
  • PowerScale OneFS 9.10+ with MetadataIQ configured
  • LangChain (optional, for LangChain integration)
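
The Python-side requirements can be checked with a short script before running the connector. This is a sketch: the module names `elasticsearch` and `langchain_core` are assumptions about which import names satisfy the requirements above, and the OneFS version and MetadataIQ configuration cannot be verified from Python this way.

```python
import sys
from importlib.util import find_spec
from typing import Dict


def check_environment() -> Dict[str, bool]:
    """Report whether the interpreter and client libraries look usable."""
    return {
        "python>=3.8": sys.version_info >= (3, 8),
        # Import names below are assumptions; adjust if your setup differs.
        "elasticsearch": find_spec("elasticsearch") is not None,
        "langchain_core (optional)": find_spec("langchain_core") is not None,
    }


report = check_environment()
for name, ok in report.items():
    print(f"{name}: {'ok' if ok else 'missing'}")
```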

License

MIT

