A package for fast indexing of multiple documents and their metadata on Pinecone.

Project description

DocIndex: Fast Persistent Document Embeddings Storage for RAG

Efficiently store multiple document embeddings and their metadata, whether the documents are offline or online, in a persistent Pinecone vector database optimized for Retrieval Augmented Generation (RAG) applications.

Features

⚡️ Rapid Indexing: Quickly index multiple documents along with their metadata, including source, page details, and content, into Pinecone DB.
📚 Document Flexibility: Index documents from your local storage or online sources with ease.
📂 Format Support: Seamlessly handle various document formats, including PDF and docx (in development).
🔁 Embedding Services Integration: Enjoy support for multiple embedding services such as OpenAI Embeddings and Google Generative AI Embeddings, with more in development.
🛠️ Configurable Vectorstore: Configure a vectorstore directly from the index to facilitate RAG pipelines effortlessly.
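
To make the chunk_size and batch_limit settings used later in this README more concrete, here is a minimal sketch of splitting text into fixed-size chunks and grouping the chunks into batches. It is an illustration only: docindex's own splitter is not shown in this README, so it may count tokens rather than characters or split on different boundaries, and the helper names chunk_text and batched are hypothetical.

```python
from typing import Iterator, List


def chunk_text(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def batched(items: List[str], batch_limit: int) -> Iterator[List[str]]:
    """Yield successive groups of at most batch_limit items."""
    if batch_limit <= 0:
        raise ValueError("batch_limit must be positive")
    for start in range(0, len(items), batch_limit):
        yield items[start:start + batch_limit]


# 1000 characters at 256 per chunk gives 4 chunks (the last one is shorter).
chunks = chunk_text("a" * 1000, chunk_size=256)
# Grouping 4 chunks with a limit of 2 gives two batches of 2.
batches = list(batched(chunks, batch_limit=2))
print(len(chunks), [len(batch) for batch in batches])  # prints: 4 [2, 2]
```

A smaller chunk_size produces more, shorter chunks per document, and a smaller batch_limit means more, smaller upsert requests.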

Setup

pip install docindex

Getting Started

Using OpenAI

from _openai.docindex import OpenaiPineconeIndexer

# Replace these values with your actual Pinecone API key, index name, and OpenAI API key
pinecone_api_key = "pinecone-api-key"
index_name = "index-name"               # e.g. index-1
openai_api_key = "openai-api-key"
batch_limit = 20                        # Batch limit for upserting documents
chunk_size = 256                        # Optional: size of each text chunk

# Documents to index: local file paths or online URLs
urls = [
 "your-document-1.pdf",
 "your-document-2.md",
 "your-document-3.html",
 "your-document-4.docx",
]

# Initialize the Pinecone indexer
pinecone_indexer = OpenaiPineconeIndexer(index_name, pinecone_api_key, openai_api_key)

# Create a new index
pinecone_indexer.create_index()

# Store the document embeddings for the given URLs, using the batch limit and chunk size
pinecone_indexer.index_documents(urls, batch_limit, chunk_size)

# Initialize the vectorstore
vectorstore = pinecone_indexer.initialize_vectorstore(index_name)

# Delete the index once it is no longer needed
pinecone_indexer.delete_index()
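
The example above hard-codes the API keys as string literals. A common alternative is to read them from environment variables. The following is a minimal sketch of that idea; the variable names PINECONE_API_KEY, OPENAI_API_KEY and PINECONE_INDEX_NAME and the helper load_settings are assumptions made for illustration, not something docindex reads on its own.

```python
import os
from typing import Dict, Mapping


def load_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect the values the OpenAI example needs from environment variables."""
    # These variable names are an assumption of this sketch, not part of docindex.
    required = {
        "pinecone_api_key": "PINECONE_API_KEY",
        "openai_api_key": "OPENAI_API_KEY",
        "index_name": "PINECONE_INDEX_NAME",
    }
    missing = [var for var in required.values() if not env.get(var)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in required.items()}


# Demonstration with a plain dict so the sketch runs without real credentials.
settings = load_settings({
    "PINECONE_API_KEY": "pinecone-api-key",
    "OPENAI_API_KEY": "openai-api-key",
    "PINECONE_INDEX_NAME": "index-1",
})
print(sorted(settings))
```

The returned values can then be passed to OpenaiPineconeIndexer(index_name, pinecone_api_key, openai_api_key) in place of the literals, which keeps secrets out of source control.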

Using Google Generative AI

from _google.docindex import GooglePineconeIndexer

# Replace these values with your actual Pinecone API key, index name, and Google API key
pinecone_api_key = "pinecone-api-key"
index_name = "index-name"                # e.g. index-1
google_api_key = "google-api-key"
batch_limit = 20                         # Batch limit for upserting documents
chunk_size = 256                         # Optional: size of each text chunk

# Documents to index: local file paths or online URLs
urls = [
 "your-document-1.pdf",
 "your-document-2.pdf"
]

# Initialize the Pinecone indexer
pinecone_indexer = GooglePineconeIndexer(index_name, pinecone_api_key, google_api_key)

# Create a new index
pinecone_indexer.create_index()

# Store the document embeddings for the given URLs, using the batch limit and chunk size
pinecone_indexer.index_documents(urls, batch_limit, chunk_size)

# Initialize the vectorstore
vectorstore = pinecone_indexer.initialize_vectorstore(index_name)

# Delete the index once it is no longer needed
pinecone_indexer.delete_index()
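
Both examples accept a urls list that can mix local files and online documents. Before starting a long indexing run it can help to check that list up front. The sketch below is one way to do that with the standard library; how docindex itself tells local paths from online URLs is not shown in this README, so the helpers is_online and check_sources are hypothetical and only illustrate the idea.

```python
from pathlib import Path
from typing import List, Sequence, Tuple
from urllib.parse import urlparse


def is_online(source: str) -> bool:
    """Return True when source looks like an http(s) URL rather than a local path."""
    return urlparse(source).scheme in ("http", "https")


def check_sources(sources: Sequence[str]) -> Tuple[List[str], List[str], List[str]]:
    """Split sources into online URLs, existing local files, and missing local files."""
    online, local, missing = [], [], []
    for source in sources:
        if is_online(source):
            online.append(source)
        elif Path(source).is_file():
            local.append(source)
        else:
            missing.append(source)
    return online, local, missing


online, local, missing = check_sources([
    "https://example.com/your-document-1.pdf",  # placeholder URL, never fetched here
    "no-such-file-for-this-sketch.pdf",         # a local path that should not exist
])
print("online:", online)
print("missing:", missing)
```

Any entries reported as missing can be fixed before calling index_documents, rather than failing partway through.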

Using the CLI

Clone the Repository: Clone or download the application code to your local machine.

git clone https://github.com/KevKibe/docindex.git

Create a virtual environment for the project and activate it.

# Navigate to project repository
cd docindex

# create virtual environment
python -m venv venv

# activate virtual environment
source venv/bin/activate

Install dependencies by running this command

pip install -r requirements.txt

Navigate to src

cd src

To create an index

# Using OpenAI
python -m _openai.create_index --pinecone_api_key "your_pinecone_api_key" --index_name "your_index_name" --openai_api_key "your_openai_api_key"

# Using Google Generative AI
python -m _google.create_index --pinecone_api_key "your_pinecone_api_key" --index_name "your_index_name" --google_api_key "your_google_api_key"

Run the command to start indexing the documents

# Using OpenAI
python -m _openai.doc_index --pinecone_api_key "your_pinecone_api_key" --index_name "your_index_name" --openai_api_key "your_openai_api_key" --batch_limit 10 --docs "doc-1.pdf" "doc-2.pdf" --chunk_size 256

# Using Google Generative AI
python -m _google.doc_index --pinecone_api_key "your_pinecone_api_key" --index_name "your_index_name" --google_api_key "your_google_api_key" --batch_limit 10 --docs "doc-1.pdf" "doc-2.pdf" --chunk_size 256
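
Long commands like these are easy to mistype, especially the quoting around document names. As a hedged sketch, the same command can be assembled in Python as an argument list, which avoids shell quoting altogether. The flags are the ones shown above; the helper build_index_command is hypothetical and not part of docindex.

```python
import shlex
import sys
from typing import List, Sequence


def build_index_command(
    provider: str,
    pinecone_api_key: str,
    index_name: str,
    provider_api_key: str,
    docs: Sequence[str],
    batch_limit: int = 10,
    chunk_size: int = 256,
) -> List[str]:
    """Build the argument list for the doc_index command of the given provider."""
    if provider not in ("openai", "google"):
        raise ValueError("provider must be 'openai' or 'google'")
    return [
        sys.executable, "-m", f"_{provider}.doc_index",
        "--pinecone_api_key", pinecone_api_key,
        "--index_name", index_name,
        f"--{provider}_api_key", provider_api_key,
        "--batch_limit", str(batch_limit),
        "--docs", *docs,
        "--chunk_size", str(chunk_size),
    ]


command = build_index_command(
    "openai",
    "your_pinecone_api_key",
    "your_index_name",
    "your_openai_api_key",
    ["doc-1.pdf", "doc-2.pdf"],
)
# Print a correctly quoted, copy-pasteable version of the command.
print(shlex.join(command))
# To run it from the src directory: subprocess.run(command, check=True)
```

Because each argument is a separate list element, document names containing spaces or quotes are passed through unchanged.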

To delete an index

# Using OpenAI
python -m _openai.delete_index --pinecone_api_key "your_pinecone_api_key" --index_name "your_index_name" --openai_api_key "your_openai_api_key"

# Using Google Generative AI
python -m _google.delete_index --pinecone_api_key "your_pinecone_api_key" --index_name "your_index_name" --google_api_key "your_google_api_key"

Contributing

🌟 First consider giving it a star at the top right. It means a lot! Contributions are welcome and encouraged. Before contributing, please take a moment to review our Contribution Guidelines for important information on how to contribute to this project.

If you're unsure about anything or need assistance, don't hesitate to reach out to us or open an issue to discuss your ideas.

We look forward to your contributions!

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

For any enquiries, please reach out to me through keviinkibe@gmail.com

Release history

0.7.0 (May 16, 2024)
0.6.0 (Apr 30, 2024)
0.5.0 (Apr 22, 2024), this version
0.4.0 (Apr 9, 2024)
0.3.0 (Apr 9, 2024)
0.2.0 (Apr 8, 2024)
0.1.0 (Apr 8, 2024)
0.0.1 (Apr 6, 2024)

Download files

Source Distribution

docindex-0.5.0.tar.gz (10.7 kB)

Uploaded Apr 22, 2024 Source

Built Distribution

docindex-0.5.0-py3-none-any.whl (14.9 kB)

Uploaded Apr 22, 2024 Python 3

Hashes for docindex-0.5.0.tar.gz
Algorithm	Hash digest
SHA256	`47b2937c0fdcdb2a1dbdbd655f58aae605360883bd5c597664459b0a95c5653d`
MD5	`73cca55be7934c63f89df16ed04a81e9`
BLAKE2b-256	`32b24dda62dc0ebb82fef48b2f9ba861a3a50f2c2ee38aaf8121137255d4911a`

Hashes for docindex-0.5.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c71e663924627ad479bcd14bf3d51b9860cb616a0c805dbbd882747608bf10d8`
MD5	`80eb35e865d34ecb9035f85e12c76558`
BLAKE2b-256	`3e9716eb385820ebb2d2738d5b0b91e4c7c49ed94292243575e48cd91fc2fadc`