A modular text embedding and vector database pipeline for local and cloud vector stores.

vectorDBpipe

PyPI version Python 3.10+ License: MIT

Pinecone ChromaDB HuggingFace FAISS

A Modular, End-to-End RAG Pipeline for Production-Ready Vector Search.

vectorDBpipe is a robust framework designed to automate the heavy lifting of building RAG (Retrieval-Augmented Generation) systems. It seamlessly handles data ingestion, text cleaning, semantic embedding, and storage in modern vector databases.


🎯 Project Objectives

Building a vector search system often involves writing the same "glue code" over and over again:

  1. Parsing PDFs, Word docs, and Text files.
  2. Cleaning stray characters and normalizing whitespace.
  3. Chunking long text so it fits into context windows.
  4. Batching embeddings to avoid OOM (Out-of-Memory) errors.
  5. Creating and managing database indexes.

vectorDBpipe solves this. It is a "download-and-go" solution that reduces weeks of boilerplate to a single standardized config.yaml file.

Ideal for:

  • AI Engineers building internal RAG tools.
  • Developers needing to "chat with their data" instantly.
  • Researchers testing different embedding models or databases (switch from Chroma to Pinecone in 1 line).

🛠️ Tech Stack & Architecture

This project utilizes best-in-class open-source technologies:

  • Ingestion: PyMuPDF (PDF), python-docx (DOCX), pandas (CSV), BeautifulSoup (HTML).
  • Vectorization: sentence-transformers (HuggingFace compatible).
  • Vector Database:
    • ChromaDB (Local, persistent, file-based).
    • Pinecone (Serverless, Cloud-native v3.0+).
    • FAISS (via underlying libraries or custom adapters).
  • Orchestration: Custom batch-processing Pipeline.

🏗️ Architecture Flow

graph LR
    A[Raw Data Folder] --> B(DataLoader);
    B --> C{Cleaner & Chunker};
    C --Batching--> D[Embedder Model];
    D --> E[(Vector Database)];
    E --> F[Semantic Search API];
    F --> G[RAG Application / Chatbot];

💡 Use Cases

1. Enterprise Knowledge Base

Company wikis, PDFs, and policy documents are scattered.

  • Solution: Point vectorDBpipe to the shared drive. It indexes 10,000+ docs into Pinecone.
  • Result: Employees get instant, semantic answers ("What is the travel policy?") instead of keyword search.

2. Legal / Medical Document Search

Long documents need to be split intelligently.

  • Solution: Use the standardized chunker (e.g., 512 tokens with overlap).
  • Result: Retrieval finds the exact paragraph containing the clause or diagnosis.
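To illustrate this style of splitting, here is a minimal sliding-window chunker. This is a generic sketch, not the library's internal implementation: the `chunk_text` name matches the utility mentioned later in this README, but the word-based (rather than token-based) splitting and the `overlap` parameter are assumptions for the example.

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into overlapping sliding-window chunks.

    chunk_size and overlap are measured in words here for simplicity;
    a production chunker would count model tokens instead.
    """
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

# Example: a 1200-word document split into 512-word chunks with 50-word overlap
doc = " ".join(f"w{i}" for i in range(1200))
pieces = chunk_text(doc)
print(len(pieces))             # 3 chunks
print(len(pieces[0].split()))  # 512 words in the first chunk
```

The overlap means a clause that straddles a chunk boundary still appears whole in at least one chunk, which is exactly what makes paragraph-level retrieval reliable.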

3. Rapid Prototype for RAG

You have a hackathon idea but don't want to spend 4 hours setting up FAISS.

  • Solution: pip install vectordbpipe -> pipeline.process().
  • Result: Working MVP in 5 minutes.

📦 Installation

Install the package directly from PyPI:

pip install vectordbpipe

🔧 Windows Users (DLL Errors)

If you encounter WinError 1114 or DLL initialization errors with Torch, install the CPU-optimized binaries:

pip install -r requirements-cpu.txt

(This forces intel-openmp and CPU-only libraries to ensure stability on non-CUDA machines).


⚙️ Configuration

Control your entire pipeline via config.yaml. No need to touch the code.

# vectorDBpipe/config/config.yaml

paths:
  data_dir: "data/"  # Folder containing your .pdf, .txt, .docx files

model:
  name: "sentence-transformers/all-MiniLM-L6-v2" # Any HF model
  batch_size: 32

vector_db:
  type: "pinecone"   # Options: "chroma" or "pinecone"
  index_name: "my-knowledge-base"
  # For Chroma, use:
  # persist_directory: "data/chroma_store"
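Since the file is standard YAML, you can sanity-check it with PyYAML before handing it to the pipeline. This loading snippet is illustrative only; the library reads its config internally.

```python
import yaml  # PyYAML

# The same structure as config.yaml above, inlined for the example.
config_text = """
paths:
  data_dir: "data/"

model:
  name: "sentence-transformers/all-MiniLM-L6-v2"
  batch_size: 32

vector_db:
  type: "pinecone"
  index_name: "my-knowledge-base"
"""

config = yaml.safe_load(config_text)
print(config["model"]["batch_size"])  # 32
print(config["vector_db"]["type"])    # pinecone
```

Switching stores really is a one-key change: set vector_db.type to "chroma" and add persist_directory.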

🔐 Credentials

Do NOT hardcode API keys. The system looks for environment variables:

Linux/Mac:

export PINECONE_API_KEY="your-secret-key"

Windows PowerShell:

$env:PINECONE_API_KEY="your-secret-key"
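In your own scripts you can verify the key is present before the pipeline runs, so a missing variable fails fast with a readable message instead of a deep stack trace later. A small sketch; `require_env` is a hypothetical helper, while `PINECONE_API_KEY` is the variable named above.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set. Export it before running the pipeline.")
    return value

# Demo only: set the variable in-process so the check passes.
os.environ["PINECONE_API_KEY"] = "your-secret-key"
api_key = require_env("PINECONE_API_KEY")
print(api_key)  # prints the configured key
```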

🚀 Step-by-Step Demo: The "10-Line" RAG Pipeline

This script detects all files in your data/ folder, processes them in memory-safe batches, and makes them searchable.

from vectorDBpipe.pipeline.text_pipeline import TextPipeline

# ---------------------------------------------------------
# STEP 1: Initialize Pipeline
# ---------------------------------------------------------
# Reads config.yaml, sets up logging, connects to DB (Pinecone/Chroma)
pipeline = TextPipeline()

# ---------------------------------------------------------
# STEP 2: Ingest Data (The "Magic" Step)
# ---------------------------------------------------------
# Loops through all files in 'data_dir', cleans text, splits into 
# 512-token chunks, embeds them using HuggingFace model, 
# and uploads to DB in batches of 100 to save RAM.
pipeline.process(batch_size=100)

# ---------------------------------------------------------
# STEP 3: Semantic Search
# ---------------------------------------------------------
query = "How does vectorDBpipe reduce workload?"
results = pipeline.search(query, top_k=3)

print("--- Search Results ---")
for match in results:
    # Metadata contains the original text chunk and source file name
    print(f"Source: {match.get('metadata', {}).get('source', 'unknown')}")
    print(f"Content: {match.get('metadata', {}).get('text', '')[:200]}...\n")

🧠 Deep Dive: How It Reduces Work

Before vectorDBpipe vs. After

| Feature | The "Hard Way" (Manual) | The vectorDBpipe Way |
| --- | --- | --- |
| PDF Parsing | Write fitz loops, handle exceptions, merge pages. | loader.load_data() handles PDF, DOCX, TXT, HTML auto-magically. |
| Chunking | Write regex wrappers, handle overlaps, off-by-one errors. | chunk_text(text, chunk_size=512) built-in utility. |
| Embeddings | Manually loop model.encode(), manage tensors. | Embedder class abstracts this away (Mock fallback included). |
| Scalability | "Out of Memory" when loading 1000 PDFs. | Batch processing built in; flushes data every 100 chunks. |
| DB Switching | Rewrite insert logic for Pinecone vs. Chroma. | Change type: pinecone in YAML. Done. |
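The one-line DB switch works because the store backend is chosen from config rather than hard-coded. Here is a hypothetical sketch of that adapter pattern; the class and method names are illustrative, not the package's actual API.

```python
class ChromaStore:
    """Stand-in for a local, file-based Chroma adapter."""
    def upsert(self, vectors):
        print(f"chroma: stored {len(vectors)} vectors locally")

class PineconeStore:
    """Stand-in for a serverless Pinecone adapter."""
    def upsert(self, vectors):
        print(f"pinecone: upserted {len(vectors)} vectors to cloud index")

STORES = {"chroma": ChromaStore, "pinecone": PineconeStore}

def make_store(config: dict):
    """Pick the vector-store class from the YAML-derived config dict."""
    store_type = config["vector_db"]["type"]
    try:
        return STORES[store_type]()
    except KeyError:
        raise ValueError(f"Unsupported vector_db.type: {store_type!r}")

store = make_store({"vector_db": {"type": "chroma"}})
store.upsert([[0.1, 0.2], [0.3, 0.4]])
```

Because every adapter exposes the same upsert interface, the rest of the pipeline never needs to know which backend is active.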

Code Snippet: Scalable Batch Processing

Use the new process(batch_size=N) method introduced in v0.1.3 to handle massive datasets.

# Even if you have 10GB of text files, this won't crash your RAM.
pipeline.process(batch_size=50) 
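The memory safety comes from never holding all chunks in RAM at once. In spirit, the batching loop looks like this generic sketch (not the package's internals):

```python
from typing import Iterable, Iterator, List

def batched(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield items in lists of at most batch_size, holding only one batch in memory."""
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch   # embed + upload this batch, then discard it
            batch = []
    if batch:
        yield batch       # final partial batch

# 230 chunks flushed in batches of 100; the generator never loads them all up front
chunks = (f"chunk-{i}" for i in range(230))
sizes = [len(b) for b in batched(chunks, 100)]
print(sizes)  # [100, 100, 30]
```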

📁 Project Structure

vectorDBpipe/
├── config/             # YAML configuration
├── data/               # Drop your raw files here
├── vectorDBpipe/
│   ├── data/           # Loader logic (PDF/DOCX/TXT parsers)
│   ├── embeddings/     # SentenceTransformer wrapper
│   ├── pipeline/       # The "Brain" (Process & Search flow)
│   └── vectordb/       # Store adapters (Chroma/Pinecone)
└── requirements.txt    # Production deps

🤝 Contributing

We welcome issues and PRs!

  • Report Bugs: Create an issue on GitHub.
  • Updates: We are working on adding Qdrant and Weaviate support in v0.2.0.

Author: Yash Desai
License: MIT
