A Python module that converts a document into chunks for insertion into a Pinecone vector database

📚 PreVectorChunks

A lightweight utility for document chunking and vector database upserts — designed for developers building RAG (Retrieval-Augmented Generation) solutions.


✨ Who Needs This Module?

Any developer working with:

  • RAG pipelines
  • Vector Databases (like Pinecone, Weaviate, etc.)
  • AI applications that retrieve semantically similar content

🎯 What Does This Module Do?

This module helps you:

  • Chunk documents into smaller fragments
  • Insert (upsert) fragments into a vector database
  • Fetch & update existing chunks from a vector database

📦 Installation

pip install prevectorchunks-core

Import the module:

from prevectorchunks_core.services import chunk_documents_crud_vdb

Store API keys in a .env file:

PINECONE_API_KEY=YOUR_API_KEY
OPENAI_API_KEY=YOUR_API_KEY
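
Before calling the module, it can help to confirm that both keys are actually set. The following is a minimal sketch using only the standard library; load_env_file and missing_keys are hypothetical helpers written for illustration, not part of prevectorchunks-core, and the parser handles only simple KEY=VALUE lines.

```python
import os
from pathlib import Path

# The two keys named in the .env example above.
REQUIRED_KEYS = ("PINECONE_API_KEY", "OPENAI_API_KEY")


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines into a dict, skipping blanks and comments."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(env):
    """Return the required keys that are absent or empty in the mapping."""
    return [key for key in REQUIRED_KEYS if not env.get(key)]


# Real environment variables take precedence over values from the file.
env = {**load_env_file(), **os.environ}
print("Missing keys:", missing_keys(env))
```

If the list printed at the end is not empty, add the missing entries to .env before running any of the functions below.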

📄 Functions

1. chunk_documents

chunk_documents(instructions, file_path="content_playground/content.json", splitter_config=SplitterConfig())

Splits the content of a document into smaller, manageable chunks.

Parameters

  • instructions (dict or str): Additional rules or guidance for how the document should be split.
    • Example: "split my content by biggest headings"
  • file_path (str): Path to the input JSON/text file, or the content of the file itself. Default: "content_playground/content.json".
  • splitter_config (SplitterConfig, optional): Object that defines chunking behavior, e.g., chunk_size, chunk_overlap, separator, split_type. If none is provided, a standard split takes place.
    • Example: splitter_config = SplitterConfig(chunk_size=300, chunk_overlap=0, separators=["\n"], split_type="RecursiveCharacterTextSplitter")
    • Example: splitter_config = SplitterConfig(chunk_size=300, chunk_overlap=0, separators=["\n"], split_type="CharacterTextSplitter")
    • Example: splitter_config = SplitterConfig(chunk_size=300, chunk_overlap=0, separators=["\n"], split_type="standard")

Returns

  • A list of chunks, each with a unique id, a meaningful title, and the chunked text.

Use Cases

  • Preparing text for LLM ingestion
  • Splitting text by structure (headings, paragraphs)
  • Vector database indexing
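
To make chunk_size and chunk_overlap concrete, here is a small standalone sketch of a fixed-size character splitter. It is not the module's implementation: sliding_window_chunks is a hypothetical function, and the "id", "title", and "text" keys are assumed names chosen to mirror the return description above.

```python
import uuid


def sliding_window_chunks(text, chunk_size=300, chunk_overlap=0):
    """Split text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        chunks.append({
            "id": str(uuid.uuid4()),       # unique id per chunk
            "title": piece[:30].strip(),   # placeholder title: first 30 characters
            "text": piece,
        })
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


for chunk in sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1):
    print(chunk["text"])
```

With chunk_size=4 and chunk_overlap=1, each window starts three characters after the previous one, so the last character of one chunk is repeated as the first character of the next.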

2. chunk_and_upsert_to_vdb

chunk_and_upsert_to_vdb(index_n, instructions, file_path="content_playground/content.json", splitter_config=SplitterConfig())

Splits a document into chunks (via chunk_documents) and inserts them into a Vector Database.

Parameters

  • index_n (str): The name of the VDB index where chunks should be stored.
  • instructions (dict or str): Rules for splitting content (same as chunk_documents).
  • file_path (str): Path to the document file or content of the file. Default: "content_playground/content.json".
  • splitter_config (SplitterConfig): Object that defines chunking behavior.

Returns

  • Confirmation of successful insert into the VDB.

Use Cases

  • Automated document preprocessing and storage for vector search
  • Preparing embeddings for semantic search
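
An upsert inserts a record when its id is new and overwrites it when the id already exists. The sketch below illustrates that behavior with a plain dictionary standing in for the vector database; upsert_chunks and the record fields are hypothetical, and no embedding or Pinecone call is involved.

```python
def upsert_chunks(store, chunks):
    """Insert new chunks and overwrite existing ones, keyed by chunk id."""
    inserted, updated = 0, 0
    for chunk in chunks:
        if chunk["id"] in store:
            updated += 1
        else:
            inserted += 1
        store[chunk["id"]] = chunk
    return {"inserted": inserted, "updated": updated}


store = {}  # in-memory stand-in for a VDB index
first = upsert_chunks(store, [{"id": "a", "text": "v1"}, {"id": "b", "text": "v1"}])
second = upsert_chunks(store, [{"id": "a", "text": "v2"}])
print(first, second)
```

Running the same upsert twice is therefore safe: the second call overwrites chunk "a" instead of creating a duplicate.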

3. fetch_vdb_chunks_grouped_by_document_name

fetch_vdb_chunks_grouped_by_document_name(index_n)

Fetches existing chunks stored in the Vector Database, grouped by document name.

Parameters

  • index_n (str): The name of the VDB index.

Returns

  • A dictionary or list of chunks grouped by document name.

Use Cases

  • Retrieving all chunks of a specific document
  • Verifying what content has been ingested into the VDB
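
Grouping by document name can be pictured as follows. This is a sketch on plain dictionaries, assuming each stored chunk carries a "document_name" field; that field name and the group_by_document_name helper are illustrative assumptions, and the actual return shape of the module may differ.

```python
from collections import defaultdict


def group_by_document_name(records):
    """Group flat chunk records into {document_name: [records...]}."""
    grouped = defaultdict(list)
    for record in records:
        # Records without a name are collected under "unknown".
        grouped[record.get("document_name", "unknown")].append(record)
    return dict(grouped)


records = [
    {"id": "c1", "document_name": "guide.json", "text": "Intro"},
    {"id": "c2", "document_name": "faq.json", "text": "Q1"},
    {"id": "c3", "document_name": "guide.json", "text": "Setup"},
]
grouped = group_by_document_name(records)
print({name: len(chunks) for name, chunks in grouped.items()})
```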

4. update_vdb_chunks_grouped_by_document_name

update_vdb_chunks_grouped_by_document_name(index_n, dataset)

Updates existing chunks in the Vector Database by document name.

Parameters

  • index_n (str): The name of the VDB index.
  • dataset (dict or list): The new data (chunks) to update existing entries.

Returns

  • Confirmation of update status.

Use Cases

  • Keeping VDB chunks up to date when documents change
  • Re-ingesting revised or corrected content
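
One way to prepare the dataset argument is to apply revised text to the grouped chunks before sending them back. The sketch below assumes the grouped shape {document_name: [chunk, ...]} with "id" and "text" keys; apply_revisions is a hypothetical helper, not a function of the module.

```python
def apply_revisions(grouped, revisions):
    """Return a copy of the grouped chunks with revised text applied by chunk id.

    revisions maps chunk id -> new text. The input is left unmodified.
    """
    updated = {}
    changed = 0
    for name, chunks in grouped.items():
        new_chunks = []
        for chunk in chunks:
            new_text = revisions.get(chunk["id"])
            if new_text is not None and new_text != chunk["text"]:
                chunk = {**chunk, "text": new_text}  # copy with the revised text
                changed += 1
            new_chunks.append(chunk)
        updated[name] = new_chunks
    return updated, changed


original = {"guide.json": [{"id": "c1", "text": "old"}, {"id": "c2", "text": "same"}]}
dataset, changed = apply_revisions(original, {"c1": "new", "c2": "same"})
print(changed, dataset["guide.json"][0]["text"])
```

The resulting dataset could then be passed as the second argument of update_vdb_chunks_grouped_by_document_name, provided it matches the shape that function expects.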

🚀 Example Workflow

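from prevectorchunks_core.config import SplitterConfig
# Assumption: the four functions used below are exposed by the chunk_documents_crud_vdb module.
from prevectorchunks_core.services.chunk_documents_crud_vdb import (
    chunk_documents,
    chunk_and_upsert_to_vdb,
    fetch_vdb_chunks_grouped_by_document_name,
    update_vdb_chunks_grouped_by_document_name,
)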

splitter_config = SplitterConfig(chunk_size=150, chunk_overlap=0, separator=["\n"], split_type="RecursiveCharacterTextSplitter")

# Step 1: Chunk a document
chunks = chunk_documents(
    instructions="split my content by biggest headings",
    file_path="content_playground/content.json",
    splitter_config=splitter_config
)

# Step 2: Chunk the document (via chunk_documents) and upsert the chunks into the VDB
# Pinecone index names may contain only lowercase letters, digits, and hyphens.
chunk_and_upsert_to_vdb("my-index", instructions="split by headings", splitter_config=splitter_config)

# Step 3: Fetch stored chunks
docs = fetch_vdb_chunks_grouped_by_document_name("my-index")

# Step 4: Update chunks if needed
update_vdb_chunks_grouped_by_document_name("my-index", dataset=docs)

🛠 Use Cases

  • Preprocessing documents for LLM ingestion
  • Semantic search and Q&A systems
  • Vector database indexing and retrieval
  • Maintaining versioned document chunks
