A modular text embedding and vector database pipeline for local and cloud vector stores.
vectorDBpipe
A Modular, End-to-End RAG Pipeline for Production-Ready Vector Search.
vectorDBpipe is a robust framework designed to automate the heavy lifting of building RAG (Retrieval-Augmented Generation) systems. It seamlessly handles data ingestion, text cleaning, semantic embedding, and storage in modern vector databases.
🎯 Project Objectives
Building a vector search system often involves writing the same "glue code" over and over again:
- Parsing PDFs, Word docs, and Text files.
- Cleaning funny characters and whitespace.
- Chunking long text so it fits into context windows.
- Batching embeddings to avoid OOM (Out-of-Memory) errors.
- Creating and managing database indexes.
vectorDBpipe solves this. It is a download-and-go solution that reduces weeks of boilerplate work to a single standardized config.yaml file.
Ideal for:
- AI Engineers building internal RAG tools.
- Developers needing to "chat with their data" instantly.
- Researchers testing different embedding models or databases (switch from Chroma to Pinecone in 1 line).
🖥️ Terminal UI (New!)
Prefer a visual interface? We now have a futuristic Terminal User Interface (TUI) to manage your pipelines interactively.
Installation
The TUI is a separate Node.js package that controls this Python backend.
```shell
npm install -g vectordbpipe-tui
```
Features
- Interactive Setup Wizard: `vdb setup`
- Visual Dashboard: `vdb start`
- Connector Manager: `vdb connectors` (manage S3, Notion, etc.)
🚀 Performance Benchmarks
Tested on: Python 3.11 | Dataset: 10,000 Paragraphs | Embedding Model: all-MiniLM-L6-v2
| Backend | Ingestion Rate (docs/sec) | Avg. Search Latency (ms) | Persistence |
|---|---|---|---|
| FAISS | ~240 | 12 | In-Memory / Disk |
| ChromaDB | ~180 | 35 | SQLite / Local |
| Pinecone | ~110 (network latency) | 120 | Cloud-Native |
Analysis:
vectorDBpipe uses asynchronous batch processing to keep ingestion throughput steady, and search latency grows only logarithmically even as your knowledge base passes 100k chunks.
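The batching pattern behind this is simple. Here is a minimal sketch of memory-safe batched ingestion; the `embed` and `upsert` callables are placeholders for illustration, not the package's actual API:

```python
from typing import Callable, List

def ingest_in_batches(chunks: List[str],
                      embed: Callable[[List[str]], List[List[float]]],
                      upsert: Callable[[List[List[float]]], None],
                      batch_size: int = 32) -> int:
    """Embed and upload chunks in fixed-size batches so memory stays flat."""
    total = 0
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        vectors = embed(batch)   # only batch_size texts in memory at once
        upsert(vectors)          # flush to the vector store immediately
        total += len(batch)
    return total
```

Because each batch is flushed before the next one is embedded, peak memory is bounded by `batch_size` regardless of corpus size.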
🏗️ Production-Ready Features
- Scalable Batch Ingestion: Memory-safe processing that handles GBs of text without RAM spikes.
- Enterprise Error Handling: Graceful failover and retry logic for cloud vector store connections.
- Unified Adapter Pattern: Switch between local (FAISS) and cloud (Pinecone) backends by changing one line in `config.yaml`.
- Pre-Processor Suite: Built-in normalization, semantic chunking, and metadata injection for higher retrieval precision.
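The adapter idea can be sketched as follows. The class and registry names here are illustrative, not the package's real internals:

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List

class VectorStoreAdapter(ABC):
    """Every backend (FAISS, Chroma, Pinecone) implements the same interface."""

    @abstractmethod
    def upsert(self, ids: List[str], vectors: List[List[float]]) -> None: ...

    @abstractmethod
    def search(self, query: List[float], top_k: int) -> List[str]: ...

class InMemoryAdapter(VectorStoreAdapter):
    """Toy stand-in for a local backend; brute-force dot-product search."""

    def __init__(self) -> None:
        self._store: Dict[str, List[float]] = {}

    def upsert(self, ids: List[str], vectors: List[List[float]]) -> None:
        self._store.update(zip(ids, vectors))

    def search(self, query: List[float], top_k: int) -> List[str]:
        def score(v: List[float]) -> float:
            return sum(q * x for q, x in zip(query, v))
        ranked = sorted(self._store, key=lambda i: score(self._store[i]),
                        reverse=True)
        return ranked[:top_k]

def make_adapter(cfg: Dict[str, Any]) -> VectorStoreAdapter:
    """Swapping backends is just a different 'type' value in the config."""
    backends = {"memory": InMemoryAdapter}
    return backends[cfg["type"]]()
```

Because callers only see `VectorStoreAdapter`, adding a new backend means writing one new class and registering it, with no changes to pipeline code.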
💡 Use Cases
1. Enterprise Knowledge Base
Company wikis, PDFs, and policy documents are scattered.
- Solution: Point vectorDBpipe at the shared drive; it indexes 10,000+ docs into Pinecone.
- Result: Employees get instant, semantic answers ("What is the travel policy?") instead of keyword search.
2. Legal / Medical Document Search
Long documents need to be split intelligently.
- Solution: Use the standardized chunker (e.g., 512 tokens with overlap).
- Result: Retrieval finds the exact paragraph containing the clause or diagnosis.
3. Rapid Prototype for RAG
You have a hackathon idea but don't want to spend 4 hours setting up FAISS.
- Solution: `pip install vectordbpipe`, then `pipeline.run()`.
- Result: Working MVP in 5 minutes.
📦 Installation
Standard Installation
Install the package directly from PyPI:

```shell
pip install vectordbpipe
```
🔧 CPU-Optimized Installation (Windows/No-CUDA)
If you encounter WinError 1114 or DLL initialization errors with Torch, or if you run on a machine without a GPU, use the CPU-specific requirements:
1. Download `requirements-cpu.txt` from the repo (or create one pinning `torch` to `--index-url https://download.pytorch.org/whl/cpu`).
2. Run:

```shell
pip install -r requirements-cpu.txt
pip install vectordbpipe --no-deps
```
⚙️ Configuration Guide (config.yaml)
Control your entire pipeline via a config.yaml file. You can place this in your project root or pass the path explicitly.
```yaml
# ---------------------------------------------------------
# 1. CORE PATHS
# ---------------------------------------------------------
paths:
  data_dir: "data/"   # Folder containing your .pdf, .txt, .docx, .html files
  logs_dir: "logs/"   # Where to save execution logs

# ---------------------------------------------------------
# 2. EMBEDDING MODEL
# ---------------------------------------------------------
model:
  # HuggingFace Model ID (or OpenAI model name if provider is set)
  name: "sentence-transformers/all-MiniLM-L6-v2"
  batch_size: 32      # Chunks to embed at once (higher = faster, more RAM)

# ---------------------------------------------------------
# 3. VECTOR DATABASE
# ---------------------------------------------------------
vector_db:
  type: "pinecone"    # Options: "chroma", "pinecone", "faiss"

  # For Pinecone:
  index_name: "my-knowledge-base"
  environment: "us-east-1"   # (Optional for serverless)

  # For ChromaDB (local):
  # type: "chroma"
  # persist_directory: "data/chroma_store"

# ---------------------------------------------------------
# 4. LLM CONFIGURATION (optional - for RAG generation)
# ---------------------------------------------------------
llm:
  provider: "OpenAI"  # Options: "OpenAI", "Gemini", "Groq", "Anthropic"
  model_name: "gpt-4-turbo"
```
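The file parses into plain nested dictionaries, so any section can be inspected directly. A quick sanity check with PyYAML (assuming `pyyaml` is installed; this snippet is not part of the package):

```python
import yaml  # requires: pip install pyyaml

CONFIG = """
model:
  name: "sentence-transformers/all-MiniLM-L6-v2"
  batch_size: 32
vector_db:
  type: "chroma"
  persist_directory: "data/chroma_store"
"""

cfg = yaml.safe_load(CONFIG)

# Nested YAML keys map directly onto nested dict lookups.
print(cfg["model"]["name"])       # sentence-transformers/all-MiniLM-L6-v2
print(cfg["vector_db"]["type"])   # chroma
```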
🔐 Authentication & Security
Do NOT hardcode API keys in config.yaml or your code. vectorDBpipe automatically detects environment variables.
Supported Environment Variables:
| Provider | Variable Name | Description |
|---|---|---|
| Pinecone | `PINECONE_API_KEY` | Required if `vector_db.type` is `pinecone`. |
| OpenAI | `OPENAI_API_KEY` | Required for OpenAI embeddings or LLM. |
| Gemini | `GOOGLE_API_KEY` | Required for Google Gemini models. |
| Groq | `GROQ_API_KEY` | Required for Llama 3 via Groq. |
| HuggingFace | `HF_TOKEN` | (Optional) For gated models. |
Setting Keys (Terminal):

Linux/macOS:

```shell
export PINECONE_API_KEY="pc-sk-..."
```

Windows PowerShell:

```powershell
$env:PINECONE_API_KEY="pc-sk-..."
```

Python (`.env` file):

Create a `.env` file in your project root and load it with `python-dotenv`:

```python
from dotenv import load_dotenv

load_dotenv()  # reads variables from .env into the process environment
```
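Regardless of how keys are set, a fail-fast check at startup saves debugging time later. The `check_required_keys` helper below is illustrative, not part of vectorDBpipe:

```python
import os
from typing import List

def check_required_keys(required: List[str]) -> List[str]:
    """Return names of required environment variables that are unset or empty."""
    return [key for key in required if not os.environ.get(key)]

# Example: abort early if the Pinecone key is missing.
missing = check_required_keys(["PINECONE_API_KEY"])
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```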
🚀 Usage
1. Ingest Data (The "Magic" Step)
This script detects all files in your data/ folder, cleans them, chunks them, embeds them, and uploads them to your DB.
```python
from vectorDBpipe.pipeline.text_pipeline import TextPipeline

# Initialize (automatically loads config.yaml if present)
pipeline = TextPipeline()

# Run the ETL process.
# batch_size=100 uploads to the DB every 100 chunks so progress is checkpointed.
pipeline.process(batch_size=100)

print("✅ Ingestion Complete!")
```
2. Semantic Search
Query your database to find relevant context.
```python
from vectorDBpipe.pipeline.text_pipeline import TextPipeline

pipeline = TextPipeline()

query = "What is the refund policy?"
results = pipeline.search(query, top_k=3)

print("--- Search Results ---")
for match in results:
    print(f"📄 Source: {match.get('metadata', {}).get('source', 'Unknown')}")
    print(f"📝 Text: {match.get('metadata', {}).get('text', '')[:200]}...")
    print(f"⭐ Score: {match.get('score', 0):.4f}\n")
```
🧠 Features & Architecture
Supported File Types
- PDF (`.pdf`): Extracts text using `PyMuPDF` (`fitz`).
- Word (`.docx`): Parsing via `python-docx`.
- Text (`.txt`, `.md`): Raw text ingestion.
- HTML (`.html`): Strips tags using `BeautifulSoup`.
Smart Chunking
Instead of naive splitting, vectorDBpipe uses a Recursive Character Text Splitter:
- Chunk Size: 512 tokens (default, configurable).
- Overlap: 50 tokens (preserves context between chunks).
- Separators: Splits by paragraph (`\n\n`), then line (`\n`), then sentence (`.`), ensuring chunks stay semantically complete.
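For intuition, the recursive strategy can be sketched in a few lines. Character counts stand in for tokens here and the overlap step is omitted for brevity; this is a sketch, not the package's actual implementation:

```python
from typing import List, Sequence

def recursive_split(text: str, chunk_size: int = 512,
                    separators: Sequence[str] = ("\n\n", "\n", ". ")) -> List[str]:
    """Split on the coarsest separator first; recurse with finer separators
    only for pieces that are still too long."""
    if len(text) <= chunk_size:
        return [text]
    if not separators:
        # No separators left: fall back to a hard cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks: List[str] = []
    for piece in text.split(sep):
        if len(piece) <= chunk_size:
            chunks.append(piece)
        else:
            chunks.extend(recursive_split(piece, chunk_size, rest))
    return [c for c in chunks if c.strip()]
```

Paragraphs that already fit are kept whole, so only oversized pieces pay the cost of finer-grained splitting.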
Architecture Flow
```mermaid
graph LR
    A[Raw Data Folder] --> B(DataLoader);
    B --> C{Cleaner & Chunker};
    C -- Batching --> D[Embedder Model];
    D --> E[(Vector Database)];
    E --> F[Semantic Search API];
    F --> G[RAG Application];
```
🔧 Troubleshooting
WinError 1114: A dynamic link library (DLL) initialization routine failed
- Cause: This usually happens on Windows when PyTorch (bundled with `sentence-transformers`) runs on a machine without CUDA libraries, or with conflicting `intel-openmp` versions.
- Fix:
  1. Uninstall torch: `pip uninstall torch`
  2. Install the CPU version: `pip install torch --index-url https://download.pytorch.org/whl/cpu`
ModuleNotFoundError: No module named 'vectorDBpipe'
- Cause: You might be running the script outside the virtual environment or the package isn't installed.
- Fix: Activate the virtual environment where you installed the package and ensure `pip install vectordbpipe` succeeded.
Project Structure
```
vectorDBpipe/
├── benchmarks/          # Automated performance & precision tests
├── config/              # YAML configuration
├── data/                # Drop your raw files here
├── vectorDBpipe/
│   ├── data/            # Loader logic (PDF/DOCX/TXT parsers)
│   ├── embeddings/      # SentenceTransformer wrapper
│   ├── pipeline/        # The "brain" (process & search flow)
│   └── vectordb/        # Store adapters (Chroma/Pinecone)
└── requirements.txt     # Production deps
```
🤝 Contributing & Roadmap
We welcome issues and PRs!
- Report Bugs: Create an issue on GitHub.
- Roadmap:
- Pinecone v3.0 Support
- Next: Qdrant & Weaviate Integration (v0.2.0)
- Next: Reranker Layer (Cross-Encoder Support)
File details

Details for the file vectordbpipe-0.1.9.tar.gz.

- Size: 25.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | `e3495d2eeb11acacdd7b3e8f5fe47c183a7a105196da6578e9c62b1402604e1a` |
| MD5 | `e2a1993f4244f6a8db425fa8ca792b12` |
| BLAKE2b-256 | `304e1b9f1d211edcfc154d6989ae0fc716503f169e55ed0876c5726c11b8aacd` |
Provenance

The following attestation bundles were made for vectordbpipe-0.1.9.tar.gz:

Publisher: `publish-to-pypi.yml` on yashdesai023/vectorDBpipe

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: vectordbpipe-0.1.9.tar.gz
- Subject digest: `e3495d2eeb11acacdd7b3e8f5fe47c183a7a105196da6578e9c62b1402604e1a`
- Sigstore transparency entry: 962898794
- Permalink: yashdesai023/vectorDBpipe@16fd8e9217b1b7119df7abee30e77b19200fc483
- Branch / Tag: refs/tags/v0.1.9
- Owner: https://github.com/yashdesai023
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-to-pypi.yml@16fd8e9217b1b7119df7abee30e77b19200fc483
- Trigger Event: release
File details

Details for the file vectordbpipe-0.1.9-py3-none-any.whl.

- Size: 24.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | `48fbc7c680cd453a67b196c786a9382910ae887032e69628ea87198142ff77f8` |
| MD5 | `f8af1b0f53e4715573b4bcc839cfb50c` |
| BLAKE2b-256 | `01f45abdfe75312005493ce3ee8af1ec05ec6740736c396925abe319677f2210` |
Provenance

The following attestation bundles were made for vectordbpipe-0.1.9-py3-none-any.whl:

Publisher: `publish-to-pypi.yml` on yashdesai023/vectorDBpipe

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: vectordbpipe-0.1.9-py3-none-any.whl
- Subject digest: `48fbc7c680cd453a67b196c786a9382910ae887032e69628ea87198142ff77f8`
- Sigstore transparency entry: 962898803
- Permalink: yashdesai023/vectorDBpipe@16fd8e9217b1b7119df7abee30e77b19200fc483
- Branch / Tag: refs/tags/v0.1.9
- Owner: https://github.com/yashdesai023
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-to-pypi.yml@16fd8e9217b1b7119df7abee30e77b19200fc483
- Trigger Event: release