A package for fast indexing of multiple documents and their metadata on Pinecone.
Project description
DocIndex: Fast Persistent Document Embeddings Storage for RAG
Efficiently store multiple document embeddings and their metadata, whether they're offline or online, in a persistent Pinecone Vector Database optimized for Retrieval Augmented Generation (RAG) applications fast
Features
- ⚡️ Rapid Indexing: Quickly index multiple documents along with their metadata, including source, page details, and content, into Pinecone DB.
- 📚 Document Flexibility: Index documents from your local storage or online sources with ease.
- 📂 Format Support: Seamlessly handle various document formats, including PDF, docx(in-development), etc.
- 🔁 Embedding Services Integration: Enjoy support for multiple embedding services such as OpenAI Embeddings, Google Generative AI Embeddings and more in development.
- 🛠️ Configurable Vectorstore: Configure a vectorstore directly from the index to facilitate RAG pipelines effortlessly.
Setup
pip install docindex
Getting Started
- Sign up to Pinecone and get an API key.
Using OpenAI
from _openai.docindex import OpenaiPineconeIndexer
# Replace these values with your actual Pinecone API key, index name, OpenAI API key
pinecone_api_key = "pinecone-api-key"
index_name = "index-name" # e.g index-1
openai_api_key = "openai-api-key"
batch_limit = 20 # Batch limit for upserting documents
chunk_size = 256 # Optional: size of texts per chunk.
# List of URLs of the documents to be indexed. (offline on your computer or online)
urls = [
"your-document-1.pdf",
"your-document-2.md",
"your-document-3.html",
"your-document-4.docx",
]
# Initialize the Pinecone indexer
pinecone_indexer = OpenaiPineconeIndexer(index_name, pinecone_api_key, openai_api_key)
# To create a new Index
pinecone_indexer.create_index()
# Store the document embeddings with the specified URLs and batch limit
pinecone_indexer.index_documents(urls,batch_limit,chunk_size)
# Initialize the Vectorstore
vectorstore = pinecone_indexer.initialize_vectorstore(index_name)
# To delete the created Index
pinecone_indexer.delete_index()
Using Google Generative AI
from _google.docindex import GooglePineconeIndexer
# Replace these values with your actual Pinecone API key, index name, Google API key
pinecone_api_key = "pinecone-api-key"
index_name = "index-name" # e.g index-1
google_api_key = "google-api-key"
batch_limit = 20 # Batch limit for upserting documents
chunk_size = 256 # Optional: size of texts per chunk.
# List of URLs of the documents to be indexed. (offline on your computer or an online)
urls = [
"your-document-1.pdf",
"your-document-2.pdf"
]
pinecone_indexer = GooglePineconeIndexer(index_name, pinecone_api_key, google_api_key)
# To create a new Index
pinecone_indexer.create_index()
# Store the document embeddings with the specified URLs and batch limit
pinecone_indexer.index_documents(urls,batch_limit,chunk_size)
# Initialize the Vectorstore
vectorstore = pinecone_indexer.initialize_vectorstore(index_name)
# To delete the created Index
pinecone_indexer.delete_index()
Using the CLI
- Clone the Repository: Clone or download the application code to your local machine.
git clone https://github.com/KevKibe/docindex.git
- Create a virtual environment for the project and activate it.
# Navigate to project repository
cd docindex
# create virtual environment
python -m venv venv
# activate virtual environment
source venv/bin/activate
- Install dependencies by running this command
pip install -r requirements.txt
- Navigate to src
cd src
- To create an index
# Using OpenAI
python -m _openai.create_index --pinecone_api_key "your_pinecone_api_key" --index_name "your_index_name" --openai_api_key "your_openai_api_key"
# Using Google Generative AI
python -m _google.create_index --pinecone_api_key "your_pinecone_api_key" --index_name "your_index_name" --google_api_key "your_google_api_key"
- Run the command to start indexing the documents
# Using OpenAI
python -m _openai.doc_index --pinecone_api_key "your_pinecone_api_key" --index_name "your_index_name" --openai_api_key "your_openai_api_key" --batch_limit 10 --docs "doc-1.pdf" "doc-2.pdf' --chunk_size 256
# Using Google Generative AI
python -m _google.doc_index --pinecone_api_key "your_pinecone_api_key" --index_name "your_index_name" --google_api_key "your_google_api_key" --batch_limit 10 --docs "doc-1.pdf" "doc-2.pdf' --chunk_size 256
- To delete an index
# Using OpenAI
python -m _openai.delete_index --pinecone_api_key "your_pinecone_api_key" --index_name "your_index_name" --openai_api_key "your_openai_api_key"
# Using Google Generative AI
python -m _google.delete_index --pinecone_api_key "your_pinecone_api_key" --index_name "your_index_name" --google_api_key "your_google_api_key"
Contributing
🌟 First consider giving it a star at the top right. It means a lot! Contributions are welcome and encouraged. Before contributing, please take a moment to review our Contribution Guidelines for important information on how to contribute to this project.
If you're unsure about anything or need assistance, don't hesitate to reach out to us or open an issue to discuss your ideas.
We look forward to your contributions!
License
This project is licensed under the MIT License - see the LICENSE file for details.
Contact
For any enquiries, please reach out to me through keviinkibe@gmail.com
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.