A package for fast indexing of multiple documents and their metadata on Pinecone.

DocIndex: Fast Persistent Document Embeddings Storage for RAG

Efficiently store multiple document embeddings and their metadata, whether the documents are stored locally or online, in a persistent Pinecone vector database optimized for Retrieval Augmented Generation (RAG) applications.

Features

  • ⚡️ Rapid Indexing: Quickly index multiple documents along with their metadata, including source, page details, and content, into Pinecone DB.
  • 📚 Document Flexibility: Index documents from your local storage or online sources with ease.
  • 📂 Format Support: Seamlessly handle various document formats, including PDF, docx (in development), etc.
  • 🔁 Embedding Services Integration: Supports multiple embedding services such as OpenAI Embeddings and Google Generative AI Embeddings, with more in development.
  • 🛠️ Configurable Vectorstore: Configure a vectorstore directly from the index to facilitate RAG pipelines effortlessly.

Setup

pip install docindex

Getting Started

  • Sign up to Pinecone and get an API key.
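
The examples that follow hardcode API keys as string literals for brevity. A minimal sketch of reading them from environment variables instead is given here; the helper name `read_key` and the variable names mentioned afterwards are assumptions made for illustration, not part of DocIndex.

```python
import os


def read_key(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

With this in place, a line such as `pinecone_api_key = read_key("PINECONE_API_KEY")` could stand in for the literal assignments in the examples, assuming that variable has been exported in your shell.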

Using OpenAI

from _openai.docindex import OpenaiPineconeIndexer

# Replace these values with your actual Pinecone API key, index name, OpenAI API key
pinecone_api_key = "pinecone-api-key"
index_name = "index-name"               # e.g. index-1
openai_api_key = "openai-api-key"
batch_limit = 20                        # Batch limit for upserting documents
chunk_size = 256                        # Optional: size of texts per chunk

# List of documents to index: local file paths or online URLs
urls = [
 "your-document-1.pdf",
 "your-document-2.md",
 "your-document-3.html",
 "your-document-4.docx",
]

# Initialize the Pinecone indexer
pinecone_indexer = OpenaiPineconeIndexer(index_name, pinecone_api_key, openai_api_key)

# To create a new Index
pinecone_indexer.create_index()

# Store the document embeddings with the specified URLs, batch limit and chunk size
pinecone_indexer.index_documents(urls, batch_limit, chunk_size)

# Initialize the Vectorstore
vectorstore = pinecone_indexer.initialize_vectorstore(index_name)

# To delete the created Index
pinecone_indexer.delete_index()
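
To make the roles of `chunk_size` and `batch_limit` concrete, here is a small illustrative sketch of splitting text into chunks and grouping the chunks into upsert batches. It is not DocIndex's implementation: the helper names are hypothetical, and it measures chunks in characters as an assumption, whereas the library may measure them differently (for example in tokens).

```python
from typing import Iterator, List


def split_into_chunks(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def batched(items: List[str], batch_limit: int) -> Iterator[List[str]]:
    """Yield successive groups of at most batch_limit items."""
    if batch_limit <= 0:
        raise ValueError("batch_limit must be positive")
    for start in range(0, len(items), batch_limit):
        yield items[start:start + batch_limit]


# A 1000-character document with chunk_size=256 gives 4 chunks;
# with batch_limit=3 those chunks are sent in two batches.
chunks = split_into_chunks("x" * 1000, chunk_size=256)
batches = list(batched(chunks, batch_limit=3))
print(len(chunks), [len(b) for b in batches])  # 4 [3, 1]
```

Under these assumptions, a smaller `chunk_size` produces more vectors per document, while `batch_limit` only changes how many of them are sent per upsert request.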

Using Google Generative AI

from _google.docindex import GooglePineconeIndexer

# Replace these values with your actual Pinecone API key, index name, Google API key
pinecone_api_key = "pinecone-api-key"
index_name = "index-name"                # e.g. index-1
google_api_key = "google-api-key"
batch_limit = 20                         # Batch limit for upserting documents
chunk_size = 256                         # Optional: size of texts per chunk

# List of documents to index: local file paths or online URLs
urls = [
 "your-document-1.pdf",
 "your-document-2.pdf"
]

pinecone_indexer = GooglePineconeIndexer(index_name, pinecone_api_key, google_api_key)

# To create a new Index
pinecone_indexer.create_index()

# Store the document embeddings with the specified URLs, batch limit and chunk size
pinecone_indexer.index_documents(urls, batch_limit, chunk_size)

# Initialize the Vectorstore
vectorstore = pinecone_indexer.initialize_vectorstore(index_name)

# To delete the created Index
pinecone_indexer.delete_index()
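
The `urls` list in both examples may mix local file paths and online locations. If you want to check which entries are which before indexing (for instance, to verify that local files exist), one possible sketch is shown here. The function names are hypothetical, and how DocIndex itself tells the two apart is not documented in this README.

```python
from urllib.parse import urlparse


def is_online(location: str) -> bool:
    """Return True for http(s) URLs, False for anything treated as a local path."""
    return urlparse(location).scheme in ("http", "https")


def partition_sources(locations):
    """Split a list of document locations into (local_paths, online_urls)."""
    local = [loc for loc in locations if not is_online(loc)]
    online = [loc for loc in locations if is_online(loc)]
    return local, online
```

Calling `partition_sources(urls)` returns two lists; the original `urls` list can still be passed to `index_documents` unchanged.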

Using the CLI

  • Clone the Repository: Clone or download the application code to your local machine.
git clone https://github.com/KevKibe/docindex.git
  • Create a virtual environment for the project and activate it.
# Navigate to project repository
cd docindex

# create virtual environment
python -m venv venv

# activate virtual environment
source venv/bin/activate
  • Install dependencies by running this command
pip install -r requirements.txt
  • Navigate to src
cd src
  • To create an index
# Using OpenAI
python -m _openai.create_index --pinecone_api_key "your_pinecone_api_key" --index_name "your_index_name" --openai_api_key "your_openai_api_key"

# Using Google Generative AI
python -m _google.create_index --pinecone_api_key "your_pinecone_api_key" --index_name "your_index_name" --google_api_key "your_google_api_key"
  • Run the command to start indexing the documents
# Using OpenAI
python -m _openai.doc_index --pinecone_api_key "your_pinecone_api_key" --index_name "your_index_name" --openai_api_key "your_openai_api_key" --batch_limit 10 --docs "doc-1.pdf" "doc-2.pdf" --chunk_size 256

# Using Google Generative AI
python -m _google.doc_index --pinecone_api_key "your_pinecone_api_key" --index_name "your_index_name" --google_api_key "your_google_api_key" --batch_limit 10 --docs "doc-1.pdf" "doc-2.pdf" --chunk_size 256
  • To delete an index
# Using OpenAI
python -m _openai.delete_index --pinecone_api_key "your_pinecone_api_key" --index_name "your_index_name" --openai_api_key "your_openai_api_key"

# Using Google Generative AI
python -m _google.delete_index --pinecone_api_key "your_pinecone_api_key" --index_name "your_index_name" --google_api_key "your_google_api_key"
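
Long commands with many quoted arguments are easy to mistype, especially when document names contain spaces. A hedged sketch of assembling the indexing command as an argument list from Python follows. The function `build_index_command` is hypothetical; only the module names and flags are taken from the commands above.

```python
import shlex
import sys
from typing import List, Sequence


def build_index_command(
    module: str,
    pinecone_api_key: str,
    index_name: str,
    key_flag: str,
    key_value: str,
    docs: Sequence[str],
    batch_limit: int = 10,
    chunk_size: int = 256,
) -> List[str]:
    """Build the argument list for the indexing command shown above."""
    return [
        sys.executable, "-m", module,
        "--pinecone_api_key", pinecone_api_key,
        "--index_name", index_name,
        key_flag, key_value,
        "--batch_limit", str(batch_limit),
        "--docs", *docs,
        "--chunk_size", str(chunk_size),
    ]


cmd = build_index_command(
    "_openai.doc_index",
    "your_pinecone_api_key",
    "your_index_name",
    "--openai_api_key",
    "your_openai_api_key",
    ["doc-1.pdf", "my report.pdf"],
)
print(shlex.join(cmd))  # shell-safe rendering, with quotes added where needed
```

To actually run it, pass the list to `subprocess.run(cmd, check=True)` from inside the `src` directory. Because each argument is a separate list element, no manual quoting is needed.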

Contributing

🌟 First consider giving it a star at the top right. It means a lot! Contributions are welcome and encouraged. Before contributing, please take a moment to review our Contribution Guidelines for important information on how to contribute to this project.

If you're unsure about anything or need assistance, don't hesitate to reach out to us or open an issue to discuss your ideas.

We look forward to your contributions!

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

For any enquiries, please reach out to me through keviinkibe@gmail.com
