An open-source Python library for Dell PowerScale storage that improves RAG application performance during data ingestion by skipping files that have already been processed.
PowerScale RAG Connector
The PowerScale RAG Connector is an open-source Python library designed to enhance RAG application performance during data ingestion by skipping files that have already been processed. It leverages PowerScale's MetadataIQ capability to identify changed files within the OneFS filesystem and publish this information in an easily consumable format via Elasticsearch.
Developers can integrate the PowerScale RAG Connector directly within a LangChain RAG application as a supported document loader or use it independently as a generic Python class.
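The idea of skipping already-processed files can be illustrated with a minimal sketch. It does not use the connector or Elasticsearch; `FileRecord`, `select_changed`, and `mark_processed` are hypothetical names invented for this illustration, and the change-time comparison is an assumption about how "changed" might be decided, not the connector's actual logic.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class FileRecord:
    """Hypothetical stand-in for one metadata record describing a file."""
    path: str
    change_time: float  # seconds since the epoch


def select_changed(records: Iterable[FileRecord],
                   processed: Dict[str, float]) -> List[FileRecord]:
    """Return records that are new or have changed since they were last processed."""
    changed = []
    for record in records:
        last_seen = processed.get(record.path)
        if last_seen is None or record.change_time > last_seen:
            changed.append(record)
    return changed


def mark_processed(records: Iterable[FileRecord],
                   processed: Dict[str, float]) -> None:
    """Remember the change time of each record that has been ingested."""
    for record in records:
        processed[record.path] = record.change_time


# First ingestion run: nothing has been processed yet, so both files are selected.
processed: Dict[str, float] = {}
first_scan = [FileRecord("/ifs/data/a.pdf", 100.0),
              FileRecord("/ifs/data/b.docx", 100.0)]
first_batch = select_changed(first_scan, processed)
mark_processed(first_batch, processed)

# Second run: only b.docx has a newer change time, so a.pdf is skipped.
second_scan = [FileRecord("/ifs/data/a.pdf", 100.0),
               FileRecord("/ifs/data/b.docx", 250.0)]
second_batch = select_changed(second_scan, processed)
```

In this sketch the second run would pass only `/ifs/data/b.docx` on to chunking and embedding, which is where the ingestion savings come from.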
Workflow
Figure 1: How the PowerScale RAG Connector integrates with LangChain and NVIDIA AI Enterprise software.
Audience
The intended audience for this document includes software developers, machine learning scientists, and AI developers who will utilize files from PowerScale in the development of a RAG application.
Overview
This guide is divided into two sections: setting up the environment and using the connector. Note that system administration privileges are required for the initial configuration on PowerScale, which may need to be performed by PowerScale administrators.
Terminology
Term | Definition |
---|---|
RAG | Retrieval-Augmented Generation. A technique that gives an off-the-shelf large language model (LLM) context from data it has no knowledge of. |
LangChain | An open-source Python and JavaScript framework that helps developers create RAG applications. |
NVIDIA NIM Services | Part of NVIDIA AI Enterprise: a set of microservices that can optionally be used to chunk and embed files efficiently on GPUs. Their output can be stored in a vector database for a RAG framework to use. |
NV-Ingest | An NVIDIA NIM microservice that ingests complex office documents containing tables and figures and produces chunks and embeddings to be stored in a vector database. |
Chunking | The process of splitting a source file into smaller, context-aware pieces that can be searched and converted into vectors. For example, each paragraph in a large office document could be one chunk. |
Embedding | Turning a chunk of data into a vector on which vector operations, such as similarity, can be performed. |
MetadataIQ | A feature introduced in PowerScale OneFS 9.10 that periodically saves filesystem metadata to an external database such as Elasticsearch. |
PowerScale RAG Connector | An open-source connector that can integrate with LangChain to improve data ingestion when data resides on PowerScale. |
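To make the chunking and embedding terms concrete, here is a toy sketch. It splits text on blank lines and uses word counts as a stand-in "embedding"; a real pipeline would use an embedding model (for example through NV-Ingest), so the functions `chunk_by_paragraph`, `embed`, and `cosine_similarity` are illustrative assumptions only.

```python
import math
from collections import Counter
from typing import Dict, List


def chunk_by_paragraph(text: str) -> List[str]:
    """Split text into chunks on blank lines, dropping empty pieces."""
    return [part.strip() for part in text.split("\n\n") if part.strip()]


def embed(chunk: str) -> Dict[str, int]:
    """Toy embedding: a sparse word-count vector (not a real embedding model)."""
    return Counter(chunk.lower().split())


def cosine_similarity(a: Dict[str, int], b: Dict[str, int]) -> float:
    """Cosine similarity between two sparse vectors stored as dictionaries."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


document = (
    "PowerScale stores files on the OneFS filesystem.\n\n"
    "Embedding turns a chunk into a vector."
)
chunks = chunk_by_paragraph(document)
vectors = [embed(chunk) for chunk in chunks]

# Score every chunk against a query and keep the most similar one.
query = embed("what is the onefs filesystem")
scores = [cosine_similarity(query, vector) for vector in vectors]
best_chunk = chunks[scores.index(max(scores))]
```

Here the query shares words only with the first paragraph, so that chunk is selected. A vector database performs the same kind of nearest-vector lookup at scale.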
Installation
pip install powerscale-rag-connector
Installing NVIDIA Ingest Client
To use the NVIDIA Ingest client with the PowerScale RAG Connector, you'll need to install the NVIDIA Ingest client library. This code has been tested with nv-ingest v24.12.1.
For more detailed information about the NVIDIA Ingest client library, refer to the official NVIDIA NV-Ingest client documentation.
Usage
The PowerScale RAG Connector can be used in two ways:
- As a LangChain document loader
- As a standalone Python class
Using as a LangChain Document Loader
from powerscale_rag_connector import PowerScaleDocumentLoader
# Initialize the loader
loader = PowerScaleDocumentLoader(
    es_host_url="http://elasticsearch:9200",
    es_index_name="metadataiq",
    es_api_key="your-api-key",
    folder_path="/ifs/data"
)
# Load documents
documents = loader.load()
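Once documents are loaded, a quick summary can help confirm what was picked up. The sketch below is self-contained and does not call the connector: `StubDocument` is a hypothetical stand-in exposing the `page_content` and `metadata` attributes of a LangChain `Document`, and the `"source"` metadata key is an assumption (a common LangChain convention) rather than something this README confirms for `PowerScaleDocumentLoader`.

```python
from collections import Counter
from dataclasses import dataclass, field
from pathlib import PurePosixPath


@dataclass
class StubDocument:
    """Stand-in with the same two attributes as a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def count_by_suffix(documents, source_key="source"):
    """Count documents per file extension, read from an assumed metadata key."""
    counts = Counter()
    for doc in documents:
        source = doc.metadata.get(source_key, "")
        suffix = PurePosixPath(source).suffix.lower() or "(none)"
        counts[suffix] += 1
    return counts


# Hypothetical sample data standing in for the result of loader.load().
sample_documents = [
    StubDocument("quarterly numbers", {"source": "/ifs/data/report.PDF"}),
    StubDocument("design notes", {"source": "/ifs/data/notes.pdf"}),
    StubDocument("meeting minutes", {"source": "/ifs/data/minutes.docx"}),
    StubDocument("no source recorded"),
]
suffix_counts = count_by_suffix(sample_documents)
```

If the loader stores the file path under a different metadata key, pass that key as `source_key`.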
Using as a Standalone Path Loader
from powerscale_rag_connector import PowerScalePathLoader
# Initialize the loader
loader = PowerScalePathLoader(
    es_host_url="http://elasticsearch:9200",
    es_index_name="metadataiq",
    es_api_key="your-api-key",
    folder_path="/ifs/data"
)
# Get changed files
changed_files = loader.lazy_load()
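If `lazy_load()` returns an iterator, as the name and the LangChain convention suggest (an assumption, since the return type is not documented here), the results can be consumed in fixed-size batches without holding the full list in memory. The helper `batched` and the generator `fake_changed_files` below are hypothetical and only illustrate the pattern.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of up to `size` items, pulling lazily from `items`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def fake_changed_files() -> Iterator[str]:
    """Hypothetical stand-in for an iterator of changed file paths."""
    for index in range(5):
        yield f"/ifs/data/file_{index}.txt"


# Each batch could be handed to a chunking and embedding step in turn.
batches = list(batched(fake_changed_files(), 2))
```

With five paths and a batch size of two, this produces two full batches and a final batch containing the one remaining path.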
Examples
Check out the examples directory for complete usage examples.
Components
The connector consists of several modules:
- PowerScalePathLoader: Core module for identifying changed files
- PowerScaleDocumentLoader: Custom DocumentLoader for LangChain integration
- PowerScaleUnstructuredLoader: Custom Loader returning Documents processed by LangChain's UnstructuredFileLoader
Requirements
- Python 3.8+
- Elasticsearch client
- PowerScale OneFS 9.10+ with MetadataIQ configured
- LangChain (optional, for LangChain integration)
License