A modular text embedding and vector database pipeline for local and cloud vector stores.
vectorDBpipe
A Modular, End-to-End RAG Pipeline for Production-Ready Vector Search.
vectorDBpipe is a robust framework designed to automate the heavy lifting of building RAG (Retrieval-Augmented Generation) systems. It seamlessly handles data ingestion, text cleaning, semantic embedding, and storage in modern vector databases.
🎯 Project Objectives
Building a vector search system often involves writing the same "glue code" over and over again:
- Parsing PDFs, Word docs, and Text files.
- Stripping stray characters and normalizing whitespace.
- Chunking long text so it fits into context windows.
- Batching embeddings to avoid OOM (Out-of-Memory) errors.
- Creating and managing database indexes.
vectorDBpipe solves this. It is a "download-and-go" solution that reduces weeks of boilerplate work to a single standardized config.yaml file.
Ideal for:
- AI Engineers building internal RAG tools.
- Developers needing to "chat with their data" instantly.
- Researchers testing different embedding models or databases (switch from Chroma to Pinecone in 1 line).
🛠️ Tech Stack & Architecture
This project utilizes best-in-class open-source technologies:
- Ingestion: PyMuPDF (PDF), python-docx (DOCX), pandas (CSV), BeautifulSoup (HTML).
- Vectorization: sentence-transformers (Hugging Face compatible).
- Vector Database:
- ChromaDB (Local, persistent, file-based).
- Pinecone (Serverless, Cloud-native v3.0+).
- FAISS (Via underlying libraries or custom adapters).
- Orchestration: Custom batch-processing Pipeline.
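The store adapters above share a small common surface, which is what makes Chroma and Pinecone interchangeable. As an illustration only (the names here are assumptions, not the library's actual API), the contract could be sketched with a toy in-memory adapter standing in for a real backend:

```python
from typing import Protocol, Sequence

class VectorStore(Protocol):
    """Illustrative adapter contract; method names are assumptions."""
    def upsert(self, ids: Sequence[str], vectors, metadata) -> None: ...
    def query(self, vector, top_k: int = 3) -> list[dict]: ...

class InMemoryStore:
    """Toy adapter: brute-force dot-product search in a Python list."""
    def __init__(self) -> None:
        self._rows: list[tuple] = []  # (id, vector, metadata)

    def upsert(self, ids, vectors, metadata) -> None:
        self._rows.extend(zip(ids, vectors, metadata))

    def query(self, vector, top_k: int = 3) -> list[dict]:
        # Rank rows by dot product with the query vector, highest first.
        scored = sorted(
            self._rows,
            key=lambda row: -sum(a * b for a, b in zip(vector, row[1])),
        )
        return [{"id": i, "metadata": m} for i, _, m in scored[:top_k]]
```

Any class with these two methods can be dropped behind the pipeline, which is the essence of the one-line database switch described below.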
🏗️ Architecture Flow
```mermaid
graph LR
    A[Raw Data Folder] --> B(DataLoader);
    B --> C{Cleaner & Chunker};
    C --Batching--> D[Embedder Model];
    D --> E[(Vector Database)];
    E --> F[Semantic Search API];
    F --> G[RAG Application / Chatbot];
```
💡 Use Cases
1. Enterprise Knowledge Base
Company wikis, PDFs, and policy documents are scattered.
- Solution: Point vectorDBpipe to the shared drive. It indexes 10,000+ docs into Pinecone.
- Result: Employees get instant, semantic answers ("What is the travel policy?") instead of keyword search.
2. Legal / Medical Document Search
Long documents need to be split intelligently.
- Solution: Use the standardized chunker (e.g., 512 tokens with overlap).
- Result: Retrieval finds the exact paragraph containing the clause or diagnosis.
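The overlap trick is simple enough to sketch. The real pipeline chunks by tokens; the `chunk_text` below is a word-based stand-in written for this README, not the package's implementation:

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into overlapping windows (word-based stand-in for
    token-based chunking). Overlap keeps a clause that straddles a
    boundary fully inside at least one chunk."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

Because each chunk repeats the last `overlap` words of the previous one, a sentence split at a chunk boundary is still retrievable as a whole.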
3. Rapid Prototype for RAG
You have a hackathon idea but don't want to spend 4 hours setting up FAISS.
- Solution: `pip install vectordbpipe` -> `pipeline.run()`.
- Result: Working MVP in 5 minutes.
📦 Installation
Install the package directly from PyPI:
```bash
pip install vectordbpipe
```
🔧 Windows Users (DLL Error Constraints)
If you encounter WinError 1114 or DLL initialization errors with Torch, install the CPU-optimized binaries:
```bash
pip install -r requirements-cpu.txt
```
(This forces intel-openmp and CPU-only libraries to ensure stability on non-CUDA machines).
⚙️ Configuration
Control your entire pipeline via config.yaml. No need to touch the code.
```yaml
# vectorDBpipe/config/config.yaml
paths:
  data_dir: "data/"          # Folder containing your .pdf, .txt, .docx files

model:
  name: "sentence-transformers/all-MiniLM-L6-v2"  # Any HF model
  batch_size: 32

vector_db:
  type: "pinecone"           # Options: "chroma" or "pinecone"
  index_name: "my-knowledge-base"
  # For Chroma, use:
  # persist_directory: "data/chroma_store"
```
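How the library parses this file internally is not shown here, but the one-line database switch is easy to see with plain PyYAML (the snippet below is an assumption about the mechanism, not vectorDBpipe's code):

```python
import yaml  # PyYAML

raw = """
model:
  name: "sentence-transformers/all-MiniLM-L6-v2"
  batch_size: 32
vector_db:
  type: "pinecone"
  index_name: "my-knowledge-base"
"""

# safe_load turns the YAML document into nested dicts.
cfg = yaml.safe_load(raw)

# The promised one-line switch: change the backend, nothing else moves.
cfg["vector_db"]["type"] = "chroma"
```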
🔐 Credentials
Do NOT hardcode API keys. The system looks for environment variables:
Linux/Mac:
```bash
export PINECONE_API_KEY="your-secret-key"
```
Windows PowerShell:
```powershell
$env:PINECONE_API_KEY="your-secret-key"
```
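A fail-fast lookup keeps a missing key from surfacing later as a cryptic connection error. A minimal sketch (the helper name is hypothetical, not part of the package):

```python
import os

def get_pinecone_key() -> str:
    """Read the API key from the environment, failing with a clear
    message instead of a deep stack trace at connect time."""
    key = os.environ.get("PINECONE_API_KEY")
    if not key:
        raise RuntimeError(
            "PINECONE_API_KEY is not set; export it before running the pipeline."
        )
    return key
```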
🚀 Step-by-Step Demo: The "10-Line" RAG Pipeline
This script detects all files in your data/ folder, processes them in memory-safe batches, and makes them searchable.
```python
from vectorDBpipe.pipeline.text_pipeline import TextPipeline

# ---------------------------------------------------------
# STEP 1: Initialize Pipeline
# ---------------------------------------------------------
# Reads config.yaml, sets up logging, connects to DB (Pinecone/Chroma)
pipeline = TextPipeline()

# ---------------------------------------------------------
# STEP 2: Ingest Data (The "Magic" Step)
# ---------------------------------------------------------
# Loops through all files in 'data_dir', cleans text, splits into
# 512-token chunks, embeds them using the HuggingFace model,
# and uploads to the DB in batches of 100 to save RAM.
pipeline.process(batch_size=100)

# ---------------------------------------------------------
# STEP 3: Semantic Search
# ---------------------------------------------------------
query = "How does vectorDBpipe reduce workload?"
results = pipeline.search(query, top_k=3)

print("--- Search Results ---")
for match in results:
    # Metadata contains the original text chunk and source file name
    print(f"Source: {match.get('metadata', {}).get('source', 'unknown')}")
    print(f"Content: {match.get('metadata', {}).get('text', '')[:200]}...\n")
```
🧠 Deep Dive: How It Reduces Work
Before vectorDBpipe vs. After
| Feature | The "Hard Way" (Manual) | The vectorDBpipe Way |
|---|---|---|
| PDF Parsing | Write `fitz` loops, handle exceptions, merge pages. | `loader.load_data()` handles PDF, DOCX, TXT, HTML auto-magically. |
| Chunking | Write regex wrappers, handle overlaps, off-by-one errors. | `chunk_text(text, chunk_size=512)` built-in utility. |
| Embeddings | Manually loop `model.encode()`, manage tensors. | `Embedder` class abstracts this away (Mock fallback included). |
| Scalability | "Out of Memory" when loading 1000 PDFs. | Batch processing built-in. Flushes data every 100 chunks. |
| DB Switching | Rewrite insert logic for Pinecone vs. Chroma. | Change `type: pinecone` in YAML. Done. |
Code Snippet: Scalable Batch Processing
Use the `process(batch_size=N)` method introduced in v0.1.3 to handle massive datasets:

```python
# Even if you have 10GB of text files, this won't crash your RAM.
pipeline.process(batch_size=50)
```
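Under the hood, memory-safe batching needs nothing more than a generator that never materializes the full dataset. A sketch of the idea (written for this README, not the package's internal code):

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield fixed-size batches so embedding and upserting only ever
    hold `batch_size` items in memory at once."""
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch
```

Each yielded batch can be embedded and upserted, then discarded before the next one is built, which is why even very large corpora stay within a fixed RAM budget.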
📁 Project Structure
```
vectorDBpipe/
├── config/              # YAML configuration
├── data/                # Drop your raw files here
├── vectorDBpipe/
│   ├── data/            # Loader logic (PDF/DOCX/TXT parsers)
│   ├── embeddings/      # SentenceTransformer wrapper
│   ├── pipeline/        # The "Brain" (Process & Search flow)
│   └── vectordb/        # Store adapters (Chroma/Pinecone)
└── requirements.txt     # Production deps
```
🤝 Contributing
We welcome issues and PRs!
- Report Bugs: Create an issue on GitHub.
- Updates: We are working on adding Qdrant and Weaviate support in v0.2.0.
Author: Yash Desai
License: MIT
File details
Details for the file vectordbpipe-0.1.3.tar.gz:
- Size: 19.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 77f44f1cbd8d0ee63f18bf969f71f3320303dd2d3d33ae138012fd6b5f37dfb9 |
| MD5 | 52fd64d7cde8b9d78e850d4532d09c42 |
| BLAKE2b-256 | cd59b05965a6ba32a1fcc71f1f4e44d3e1f8a3541818f0688915cae76487ee73 |
Provenance
The following attestation bundles were made for vectordbpipe-0.1.3.tar.gz:
- Publisher: publish-to-pypi.yml on yashdesai023/vectorDBpipe
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: vectordbpipe-0.1.3.tar.gz
- Subject digest: 77f44f1cbd8d0ee63f18bf969f71f3320303dd2d3d33ae138012fd6b5f37dfb9
- Sigstore transparency entry: 799227960
- Permalink: yashdesai023/vectorDBpipe@11b1cb304c8bda90fdf43d23125b092394449a7d
- Branch / Tag: refs/tags/v0.1.3
- Owner: https://github.com/yashdesai023
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-to-pypi.yml@11b1cb304c8bda90fdf43d23125b092394449a7d
- Trigger Event: release
File details
Details for the file vectordbpipe-0.1.3-py3-none-any.whl:
- Size: 19.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6af6e63346ade03bd55c0e6880123d84d85adb3d8c787ac4bcfb105709e51e8e |
| MD5 | d10aedf651436fcda937265159797d81 |
| BLAKE2b-256 | ed23ea08ffd3a083f18a84271b70f1adde6392e9f2228eab4f922eba15eef83d |
Provenance
The following attestation bundles were made for vectordbpipe-0.1.3-py3-none-any.whl:
- Publisher: publish-to-pypi.yml on yashdesai023/vectorDBpipe
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: vectordbpipe-0.1.3-py3-none-any.whl
- Subject digest: 6af6e63346ade03bd55c0e6880123d84d85adb3d8c787ac4bcfb105709e51e8e
- Sigstore transparency entry: 799227962
- Permalink: yashdesai023/vectorDBpipe@11b1cb304c8bda90fdf43d23125b092394449a7d
- Branch / Tag: refs/tags/v0.1.3
- Owner: https://github.com/yashdesai023
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-to-pypi.yml@11b1cb304c8bda90fdf43d23125b092394449a7d
- Trigger Event: release