A Python module that converts a document into chunks for insertion into a Pinecone vector database.

📚 PreVectorChunks

A lightweight utility for document chunking and vector database upserts, designed for developers building RAG (Retrieval-Augmented Generation) solutions.

✨ Who Needs This Module?

Any developer working with:

RAG pipelines
Vector Databases (like Pinecone, Weaviate, etc.)
AI applications that require similarity-based content retrieval

🎯 What Does This Module Do?

This module helps you:

Chunk documents into smaller fragments using:
- a pretrained Reinforcement Learning based model, or
- a pretrained Reinforcement Learning based model with proposition indexing, or
- standard word based chunking, or
- recursive character based chunking, or
- character based chunking

Insert (upsert) fragments into a vector database
Fetch & update existing chunks from a vector database

📦 Installation

pip install prevectorchunks-core

How to import in a file:

from prevectorchunks_core.services import chunk_documents_crud_vdb

Use a .env file for API keys. Important: provide at least OPENAI_API_KEY in the .env file, and add any other keys your setup requires.

PINECONE_API_KEY=YOUR_API_KEY
OPENAI_API_KEY=YOUR_API_KEY
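
A missing key is easier to diagnose at startup than in the middle of a chunking run. The following is a minimal sketch of such a check, not part of this module: the helper names `missing_keys` and `check_environment` are hypothetical, and the sketch assumes the values from the .env file have already been loaded into the process environment.

```python
import os

# Key names come from the .env example above; treating PINECONE_API_KEY as
# optional is an assumption based on OPENAI_API_KEY being the stated minimum.
REQUIRED_KEYS = ("OPENAI_API_KEY",)
OPTIONAL_KEYS = ("PINECONE_API_KEY",)


def missing_keys(env, keys):
    """Return the keys that are absent or blank in the given mapping."""
    return [key for key in keys if not env.get(key, "").strip()]


def check_environment(env=None):
    """Raise if a required key is missing; return the optional keys that are not set."""
    env = os.environ if env is None else env
    missing = missing_keys(env, REQUIRED_KEYS)
    if missing:
        raise RuntimeError("Missing required API keys: " + ", ".join(missing))
    return missing_keys(env, OPTIONAL_KEYS)
```

Calling `check_environment()` before any chunking call fails fast with a readable message instead of an authentication error later on.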

📄 Functions

1. `chunk_documents`

chunk_documents(instructions, file_path="content_playground/content.json", splitter_config=SplitterConfig())

Splits the content of a document into smaller, manageable chunks. Five types of document chunking are supported:

Chunking using a Reinforcement Learning based pretrained model
Chunking using a Reinforcement Learning based pretrained model and proposition indexing
Recursive character based chunking
Standard word based chunking
Simple character based chunking

For every type, an LLM can additionally structure the chunked text. This is enabled by default and can be disabled (see enableLLMTouchUp in the examples below).
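
To make the idea behind recursive character based chunking concrete, here is a small self-contained sketch. It is an illustration of the general technique only, not this module's implementation: the function name `recursive_split` is hypothetical, and the sketch ignores overlap and the optional LLM step.

```python
def recursive_split(text, chunk_size, separators):
    """Split text into pieces of at most chunk_size characters.

    Separators are tried in order; when none is left, the text is cut
    at fixed character offsets.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    head, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(head):
        candidate = piece if not current else current + head + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: recurse with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks
```

For example, `recursive_split("alpha beta\ngamma delta epsilon\nzeta", 12, ["\n", " "])` first splits on newlines and only falls back to spaces for the middle line, which is too long on its own.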

Parameters

instructions (dict or str): Additional rules or guidance for how the document should be split.
- Example: "split my content by biggest headings"
file_path (str): A binary file, a path to the input file, or the content of the file. Default: "content_playground/content.json".
splitter_config (SplitterConfig, optional): Object that defines chunking behavior, e.g., chunk_size, chunk_overlap, separators, split_type. If none is provided, the standard split is used.
e.g. splitter_config = SplitterConfig(chunk_size=300, chunk_overlap=0, separators=["\n"], split_type=SplitType.RECURSIVE.value)
- (chunk_size is measured in characters, e.g. 100 characters, when RECURSIVE is used)
e.g. splitter_config = SplitterConfig(chunk_size=300, chunk_overlap=0, separators=["\n"], split_type=SplitType.CHARACTER.value)
- (chunk_size is measured in characters, e.g. 100 characters, when CHARACTER is used)
e.g. splitter_config = SplitterConfig(chunk_size=300, chunk_overlap=0, separators=["\n"], split_type=SplitType.STANDARD.value)
- (chunk_size is measured in words, e.g. 100 words, when STANDARD is used)
e.g. splitter_config = SplitterConfig(separators=["\n"], split_type=SplitType.R_PRETRAINED.value, min_rl_chunk_size=5, max_rl_chunk_size=50, enableLLMTouchUp=False)
- (min_rl_chunk_size and max_rl_chunk_size are measured in sentences, e.g. 100 sentences, when R_PRETRAINED is used)
e.g. splitter_config = SplitterConfig(separators=["\n"], split_type=SplitType.R_PRETRAINED_PROPOSITION.value, min_rl_chunk_size=5, max_rl_chunk_size=50, enableLLMTouchUp=False)
- (min_rl_chunk_size and max_rl_chunk_size are measured in sentences, e.g. 100 sentences, when R_PRETRAINED_PROPOSITION is used)

Returns

A list of chunks, each including a unique id, a meaningful title, and the chunked text.
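
The sketch below illustrates how chunk_size and chunk_overlap interact when sizes are counted in words, and what a chunk record with an id, a title, and text could look like. It is a simplified stand-in rather than the module's code: the names `word_chunks` and `to_records`, the field names, and the rule of using the first five words as a title are all assumptions made for illustration.

```python
import uuid


def word_chunks(text, chunk_size, chunk_overlap=0):
    """Split text into windows of chunk_size words.

    Consecutive windows share chunk_overlap words.
    """
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


def to_records(chunks):
    """Wrap each chunk in a dict; field names and the title rule are placeholders."""
    return [
        {
            "id": str(uuid.uuid4()),
            "title": " ".join(chunk.split()[:5]),
            "chunk_text": chunk,
        }
        for chunk in chunks
    ]
```

With `chunk_size=3` and `chunk_overlap=1`, the text `"a b c d e f g"` yields the windows `"a b c"`, `"c d e"`, and `"e f g"`: each window starts two words after the previous one.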

Use Cases

Preparing text for LLM ingestion
Splitting text by structure (headings, paragraphs)
Vector database indexing

2. `chunk_and_upsert_to_vdb`

chunk_and_upsert_to_vdb(index_n, instructions, file_path="content_playground/content.json", splitter_config=SplitterConfig())

Splits a document into chunks (via chunk_documents) and inserts them into a Vector Database.

Parameters

index_n (str): The name of the VDB index where chunks should be stored.
instructions (dict or str): Rules for splitting content (same as chunk_documents).
file_path (str): Path to the document file or content of the file. Default: "content_playground/content.json".
splitter_config (SplitterConfig): Object that defines chunking behavior.

Returns

Confirmation of successful insert into the VDB.
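
Conceptually, an upsert needs each chunk as a record with an id, an embedding vector, and metadata, usually sent in batches. The sketch below shows only that data shaping and makes no network calls. Everything in it is assumed for illustration: `fake_embed` stands in for a real embedding model, and the helper names and metadata fields (`document_name`, `title`, `chunk_text`) are hypothetical rather than taken from this module.

```python
def fake_embed(text, dims=4):
    """Stand-in for a real embedding model: returns a deterministic toy vector."""
    return [((len(text) + i) % 7) / 7.0 for i in range(dims)]


def build_vectors(chunks, document_name, embed=fake_embed):
    """Turn chunk dicts into id/values/metadata records."""
    return [
        {
            "id": chunk["id"],
            "values": embed(chunk["chunk_text"]),
            "metadata": {
                "document_name": document_name,
                "title": chunk["title"],
                "chunk_text": chunk["chunk_text"],
            },
        }
        for chunk in chunks
    ]


def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

Storing the document name in the metadata is what later makes it possible to group or update chunks per document.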

Use Cases

Automated document preprocessing and storage for vector search
Preparing embeddings for semantic search

3. `fetch_vdb_chunks_grouped_by_document_name`

fetch_vdb_chunks_grouped_by_document_name(index_n)

Fetches existing chunks stored in the Vector Database, grouped by document name.

Parameters

index_n (str): The name of the VDB index.

Returns

A dictionary or list of chunks grouped by document name.
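
As an illustration of what "grouped by document name" means, the sketch below groups flat chunk records by a metadata field. The exact structure returned by the module is not documented here, so the record layout, the `document_name` metadata key, and the function name are assumptions.

```python
from collections import defaultdict


def group_by_document_name(records, key="document_name"):
    """Group chunk records by a metadata field; records without it go under 'unknown'."""
    grouped = defaultdict(list)
    for record in records:
        name = record.get("metadata", {}).get(key, "unknown")
        grouped[name].append(record)
    return dict(grouped)
```

The result maps each document name to the list of its chunks, in the order the records were seen.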

Use Cases

Retrieving all chunks of a specific document
Verifying what content has been ingested into the VDB

4. `update_vdb_chunks_grouped_by_document_name`

update_vdb_chunks_grouped_by_document_name(index_n, dataset)

Updates existing chunks in the Vector Database by document name.

Parameters

index_n (str): The name of the VDB index.
dataset (dict or list): The new data (chunks) to update existing entries.

Returns

Confirmation of update status.
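
A typical update starts from fetched chunks, revises some of them, and passes the result back as dataset. The sketch below covers only the revision step on an in-memory structure. The grouped layout (document name mapped to a list of records with `id` and `metadata`) and the helper name `revise_chunks` are assumptions, so adapt them to what the fetch call actually returns.

```python
import copy


def revise_chunks(grouped, document_name, revisions):
    """Return a copy of grouped in which chunks of one document get new text.

    revisions maps chunk id -> replacement text; the input is left untouched.
    """
    updated = copy.deepcopy(grouped)
    for record in updated.get(document_name, []):
        if record["id"] in revisions:
            record["metadata"]["chunk_text"] = revisions[record["id"]]
    return updated
```

Working on a deep copy keeps the fetched data available for comparison if the update needs to be checked or rolled back.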

Use Cases

Keeping VDB chunks up to date when documents change
Re-ingesting revised or corrected content

5. `markdown_and_chunk_documents`

from prevectorchunks_core.services.markdown_and_chunk_documents import MarkdownAndChunkDocuments

markdown_processor = MarkdownAndChunkDocuments()
mapped_chunks = markdown_processor.markdown_and_chunk_documents("example.pdf")

Description

This method automatically:

Converts a document (PDF, DOCX, etc.) into images using DocuToImageConverter.
Extracts Markdown and text content from those images using DocuToMarkdownExtractor (powered by GPT).
Converts the extracted markdown text into RL-based chunks using ChunkMapper and chunk_documents.
Merges unmatched markdown segments into the final structured output.
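
The last two steps can be pictured as pairing each chunk with the markdown segment it came from, then appending the segments that no chunk matched. The sketch below is a deliberately naive illustration of that idea, not the logic of ChunkMapper: the function names, the substring matching rule, and the output field names are all assumptions.

```python
import re


def strip_markdown(md):
    """Rough markdown-to-text: drop common markers, collapse whitespace, lowercase."""
    text = re.sub(r"[#*_`>-]", " ", md)
    return " ".join(text.split()).lower()


def map_chunks_to_markdown(markdown_segments, chunks):
    """Pair each chunk with the first segment containing it, then add unmatched segments."""
    plain = [strip_markdown(segment) for segment in markdown_segments]
    mapped, used = [], set()
    for chunk in chunks:
        needle = strip_markdown(chunk)
        match = next((i for i, text in enumerate(plain) if needle in text), None)
        if match is not None:
            used.add(match)
        mapped.append({
            "chunk_text": chunk,
            "markdown": markdown_segments[match] if match is not None else None,
        })
    # Segments that no chunk matched are kept so no extracted content is lost.
    for i, segment in enumerate(markdown_segments):
        if i not in used:
            mapped.append({"chunk_text": None, "markdown": segment})
    return mapped
```

Keeping unmatched segments in the output means the final list still covers the whole document, even where chunk text and markdown could not be aligned.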

Parameters

file_path (str): Path to the document (PDF, DOCX, or image) you want to process.

Returns

mapped_chunks (list[dict]): A list of markdown-based chunks with both markdown and chunked text content.

Example

if __name__ == "__main__":
    markdown_processor = MarkdownAndChunkDocuments()
    mapped_chunks = markdown_processor.markdown_and_chunk_documents("421307-nz-au-top-loading-washer-guide-shorter.pdf")
    print(mapped_chunks)

Use Cases

End-to-end document-to-markdown-to-chunks pipeline
Automating preprocessing for RAG/LLM ingestion
Extracting structured markdown for semantic search or content indexing

🚀 Example Workflow

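from prevectorchunks_core.config import SplitterConfig
# Assumption: SplitType is importable from the same place as SplitterConfig; adjust if your installed version differs
from prevectorchunks_core.config import SplitType
# Assumption: the functions used below are exposed by the chunk_documents_crud_vdb module
from prevectorchunks_core.services import chunk_documents_crud_vdb

splitter_config = SplitterConfig(chunk_size=150, chunk_overlap=0, separators=["\n"], split_type=SplitType.R_PRETRAINED_PROPOSITION.value)

# Step 1: Chunk a document
chunks = chunk_documents_crud_vdb.chunk_documents(
    instructions="split my content by biggest headings",
    file_path="content_playground/content.json",
    splitter_config=splitter_config
)

# Alternative for Step 1: set the RL chunk sizes (in sentences) and disable the LLM touch-up
splitter_config = SplitterConfig(chunk_size=300, chunk_overlap=0, separators=["\n"],
                                 split_type=SplitType.R_PRETRAINED_PROPOSITION.value, min_rl_chunk_size=5,
                                 max_rl_chunk_size=50, enableLLMTouchUp=False)

chunks = chunk_documents_crud_vdb.chunk_documents("extract", file_name=None, file_path="content.txt", splitter_config=splitter_config)

# Step 2: Insert chunks into VDB
chunk_documents_crud_vdb.chunk_and_upsert_to_vdb("my_index", instructions="split by headings", splitter_config=splitter_config)

# Step 3: Fetch stored chunks
docs = chunk_documents_crud_vdb.fetch_vdb_chunks_grouped_by_document_name("my_index")

# Step 4: Update chunks if needed
chunk_documents_crud_vdb.update_vdb_chunks_grouped_by_document_name("my_index", dataset=docs)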

🛠 Use Cases

Preprocessing documents for LLM ingestion
Semantic search and Q&A systems
Vector database indexing and retrieval
Maintaining versioned document chunks

Project details

This version: 0.1.39 (released Dec 22, 2025)
