A package for fast persistent storage of multiple document embeddings and their metadata into Pinecone for production-level RAG.

DocIndex: Fast Persistent Document Embeddings Storage for Production-Level RAG

Quickly and efficiently store multiple document embeddings and their metadata, whether the documents are offline or online, in a persistent Pinecone vector database optimized for production-level Retrieval-Augmented Generation (RAG) applications.

Features

  • ⚡️ Rapid Indexing: Quickly index multiple documents along with their metadata, including source, page details, and content, into Pinecone DB.
  • 📚 Document Flexibility: Index documents from your local storage or online sources with ease.
  • 📂 Format Support: Store various document formats, including PDF and DOCX (in development).
  • 🔁 LLM Provider Integration: Supports multiple LLM providers, including OpenAI, Google Generative AI, and Cohere, with more in development.
  • 🛠️ Configurable Vectorstore: Configure a vectorstore directly from the index to facilitate RAG pipelines effortlessly.
  • 🔍 Initialize a RAG Retriever: Spin up a RAG retriever using your vectorstore.

Setup

pip install docindex

Getting Started

  • Sign up for Pinecone and get an API key.

Import Dependencies

# Import the appropriate indexer class based on the language model you want to use
from _openai.doc_index import OpenaiPineconeIndexer  # Using OpenAI
from _google.doc_index import GooglePineconeIndexer  # Using GoogleGenerativeAI
from _cohere.doc_index import CoherePineconeIndexer  # Using Cohere

API Keys and Configuration

# Replace these values with your actual API keys and index name, or load them from environment variables or a secrets manager.
pinecone_api_key = "your_pinecone_api_key"  # Your Pinecone API key
index_name = "your_index_name"  # Your Pinecone index name

openai_api_key = "your_openai_api_key"     # Using OpenAI
# google_api_key = "your_google_api_key"   # Using GoogleGenerativeAI
# cohere_api_key = "your_cohere_api_key"   # Using Cohere

# Configure batch limit and chunk size
batch_limit = 32   # Maximum batch size for upserting documents (optional; default is 32)
chunk_size = 256   # Size of text per chunk (optional; default is 256)
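As a minimal sketch of the environment-variable approach mentioned in the comment above, the snippet below reads the same settings from the process environment. The helper `read_setting` and the variable names (`PINECONE_API_KEY`, `PINECONE_INDEX_NAME`, `DOCINDEX_BATCH_LIMIT`, `DOCINDEX_CHUNK_SIZE`) are assumptions made for illustration; they are not defined by docindex.

```python
import os


def read_setting(name, default=None):
    """Return the environment variable `name`, or `default`; raise if neither is set."""
    value = os.environ.get(name, default)
    if value is None:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


# Hypothetical variable names; placeholders are used as defaults so the sketch runs as-is.
pinecone_api_key = read_setting("PINECONE_API_KEY", default="your_pinecone_api_key")
index_name = read_setting("PINECONE_INDEX_NAME", default="your_index_name")

# Numeric settings arrive as strings, so convert them explicitly.
batch_limit = int(read_setting("DOCINDEX_BATCH_LIMIT", default="32"))
chunk_size = int(read_setting("DOCINDEX_CHUNK_SIZE", default="256"))
```

In a real deployment you would likely drop the placeholder defaults for the API keys so that a missing key fails at startup rather than at the first request.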

List of Documents:

# Paths or URLs of the documents to index (local files on your computer or online sources)
urls = [
 "your-document-1.pdf",
 "your-document-2.md",
 "your-document-3.html",
 "your-document-4.docx",
]
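Because the list can mix local files and online sources, it may help to check it before indexing. The following is a small sketch of such a check; `split_sources` is a hypothetical helper written for this example and is not part of docindex.

```python
from pathlib import Path
from urllib.parse import urlparse


def split_sources(sources):
    """Separate online URLs from local paths and collect local paths that do not exist."""
    online, local, missing = [], [], []
    for source in sources:
        if urlparse(source).scheme in ("http", "https"):
            online.append(source)
        elif Path(source).is_file():
            local.append(source)
        else:
            missing.append(source)
    return online, local, missing


online, local, missing = split_sources([
    "https://arxiv.org/pdf/1706.03762.pdf",
    "your-document-1.pdf",
])
if missing:
    print("These local files were not found:", missing)
```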

Index, Store Documents and Initialize VectorStore

# Initialize the Pinecone indexer with the desired model

pinecone_indexer = OpenaiPineconeIndexer(index_name, pinecone_api_key, openai_api_key)    # Using OpenAI
# pinecone_indexer = GooglePineconeIndexer(index_name, pinecone_api_key, google_api_key)  # Using GoogleGenerativeAI
# pinecone_indexer = CoherePineconeIndexer(index_name, pinecone_api_key, cohere_api_key)  # Using Cohere


# To create a new Index
pinecone_indexer.create_index()

# Store the document embeddings using the given URLs, batch limit, and chunk size
pinecone_indexer.index_documents(urls, batch_limit, chunk_size)

# Initialize the Vectorstore
vectorstore = pinecone_indexer.initialize_vectorstore(index_name)
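To make the roles of `chunk_size` and `batch_limit` concrete, here is a simplified sketch of splitting text into chunks and grouping the chunks into upsert batches. It is not the library's implementation: it counts characters for simplicity, whereas docindex may well measure chunks differently (for example in tokens), and the function names are assumptions.

```python
def chunk_text(text, chunk_size=256):
    """Split text into consecutive pieces of at most `chunk_size` characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def batched(items, batch_limit=32):
    """Yield consecutive groups of at most `batch_limit` items."""
    for start in range(0, len(items), batch_limit):
        yield items[start:start + batch_limit]


chunks = chunk_text("x" * 1000, chunk_size=256)
batches = list(batched(chunks, batch_limit=2))
print(len(chunks), [len(batch) for batch in batches])
```

Smaller chunks give more precise retrieval hits but more vectors to store, and a larger batch limit means fewer upsert requests at the cost of larger payloads per request.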

Query and Retrieve Information

query = "what is the transformers architecture"
response = pinecone_indexer.retrieve_and_generate(
    vector_store=vectorstore,
    query=query,
    top_k=3,                    # Number of sources to retrieve (default is 3)
    pydantic_parser=True,       # Whether to use Pydantic parsing for the generated response (default is True)
    rerank_model="flashrank",   # Reranking model (default is 'flashrank'); other models: https://github.com/AnswerDotAI/rerankers
)
response

Response

query='what is the transformers architecture' result='The Transformer follows this overall architecture using stacked self-attention and point-wise, fully-connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.' page=1 source_documents=[Document(page_content='Figure 1: The Transformer - model architecture.\nThe Transformer follows this overall architecture using stacked self-attention and point-wise, fully\nconnected layers for both the encoder and decoder, shown in the left and right halves of Figure 1,\nrespectively.\n3.1 Encoder and Decoder Stacks\nEncoder: The encoder is composed of a stack of N= 6 identical layers. Each layer has two\nsub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-\nwise fully connected feed-forward network. We employ a residual connection [ 11] around each of\nthe two sub-layers, followed by layer normalization [ 1]. That is, the output of each sub-layer is\nLayerNorm( x+ Sublayer( x)), where Sublayer( x)is the function implemented by the sub-layer\nitself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding\nlayers, produce outputs of dimension dmodel = 512 .\nDecoder: The decoder is also composed of a stack of N= 6identical layers. In addition to the two', source=2.0, title='https://arxiv.org/pdf/1706.03762.pdf')]

Query Response Attributes:

response.query                 # The query that was submitted
response.result                # The result of the query, including any retrieved information.
response.source_documents      # A list of source documents related to the query.
response.sources               # A list of the sources (page numbers) from the source documents.
response.titles                # A list of the titles from the source documents.
response.page_contents         # A list of the page contents from the source documents.
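As an illustration of how these attributes might be used together, the sketch below formats an answer with one citation line per source. `StandInResponse` is a hypothetical stand-in that only mirrors the attribute names listed above so the example runs without API keys; with the real package you would pass the object returned by `retrieve_and_generate` instead.

```python
from dataclasses import dataclass, field


@dataclass
class StandInResponse:
    """Hypothetical stand-in mirroring the attribute names above; not the library's class."""
    query: str
    result: str
    sources: list = field(default_factory=list)
    titles: list = field(default_factory=list)


def format_answer(response):
    """Render the result followed by one citation line per (title, page) pair."""
    lines = [response.result, "", "Sources:"]
    for title, page in zip(response.titles, response.sources):
        lines.append(f"- {title} (page {page})")
    return "\n".join(lines)


demo = StandInResponse(
    query="what is the transformers architecture",
    result="The Transformer follows this overall architecture using stacked self-attention.",
    sources=[2.0],
    titles=["https://arxiv.org/pdf/1706.03762.pdf"],
)
print(format_answer(demo))
```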

Delete Index:

# To delete the created Index
pinecone_indexer.delete_index()

Using the CLI

  • Clone the Repository: Clone or download the application code to your local machine.
git clone https://github.com/KevKibe/docindex.git
  • Create a virtual environment for the project and activate it.
# Navigate to project repository
cd docindex

# create virtual environment
python -m venv venv

# activate virtual environment
source venv/bin/activate
  • Install dependencies by running this command
pip install -r requirements.txt
  • Navigate to src
cd src
  • To create an index
# Using OpenAI 
python -m  utils.create_index --pinecone_api_key "your_pinecone_api_key" --index_name "your_index_name" 
  • Run the command to start indexing the documents
# Using OpenAI

python -m _openai.index_documents --pinecone_api_key "your_pinecone_api_key" --index_name "your_index_name" --openai_api_key "your_openai_api_key" --batch_limit "batch-limit" --docs "doc-1.pdf" "doc-2.pdf" --chunk_size "chunk-size"

# Using Google Generative AI

python -m _google.index_documents --pinecone_api_key "your_pinecone_api_key" --index_name "your_index_name" --google_api_key "your_google_api_key" --batch_limit "batch-limit" --docs "doc-1.pdf" "doc-2.pdf" --chunk_size "chunk-size"

# Using Cohere

python -m _cohere.index_documents --pinecone_api_key "your_pinecone_api_key" --index_name "your_index_name" --cohere_api_key "your_cohere_api_key" --batch_limit "batch-limit" --docs "doc-1.pdf" "doc-2.pdf" --chunk_size "chunk-size"
  • To delete an index
# Using OpenAI 
python -m  utils.delete_index --pinecone_api_key "your_pinecone_api_key" --index_name "your_index_name" 
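
Long commands like the indexing ones above are easy to mistype, especially the quoting around document names. As one possible approach, the sketch below assembles the same command as an argument list in Python, so no shell quoting is involved. The helper `build_index_command` is hypothetical; the module names and flags are the ones shown in the commands above.

```python
import shlex
import sys


def build_index_command(provider, pinecone_api_key, index_name, provider_api_key,
                        docs, batch_limit=32, chunk_size=256):
    """Build the argument list for `python -m _<provider>.index_documents`."""
    if provider not in ("openai", "google", "cohere"):
        raise ValueError(f"Unsupported provider: {provider}")
    return [
        sys.executable, "-m", f"_{provider}.index_documents",
        "--pinecone_api_key", pinecone_api_key,
        "--index_name", index_name,
        f"--{provider}_api_key", provider_api_key,
        "--batch_limit", str(batch_limit),
        "--docs", *docs,
        "--chunk_size", str(chunk_size),
    ]


command = build_index_command(
    "cohere", "your_pinecone_api_key", "your_index_name", "your_cohere_api_key",
    docs=["doc-1.pdf", "doc-2.pdf"],
)
print(shlex.join(command))  # Shell-safe rendering, useful for logging
# To execute it from the src directory: subprocess.run(command, check=True)
```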

Contributing

🌟 First consider giving it a star at the top right. It means a lot! Contributions are welcome and encouraged. Before contributing, please take a moment to review our Contribution Guidelines for important information on how to contribute to this project.

If you're unsure about anything or need assistance, don't hesitate to reach out to us or open an issue to discuss your ideas.

We look forward to your contributions!

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

For any enquiries, please reach out to me through keviinkibe@gmail.com
