
A modular text embedding and vector database pipeline for local and cloud vector stores.


vectorDBpipe

Python 3.10+ · MIT License

Integrations: Pinecone · ChromaDB · HuggingFace · FAISS

A Modular, End-to-End RAG Pipeline for Production-Ready Vector Search.

vectorDBpipe is a robust framework designed to automate the heavy lifting of building RAG (Retrieval-Augmented Generation) systems. It seamlessly handles data ingestion, text cleaning, semantic embedding, and storage in modern vector databases.


🎯 Project Objectives

Building a vector search system often involves writing the same "glue code" over and over again:

  1. Parsing PDFs, Word docs, and Text files.
  2. Cleaning stray characters and normalizing whitespace.
  3. Chunking long text so it fits into context windows.
  4. Batching embeddings to avoid OOM (Out-of-Memory) errors.
  5. Creating and managing database indexes.

vectorDBpipe solves this. It is a "download-and-go" solution that collapses weeks of boilerplate into a single, standardized config.yaml file.

Ideal for:

  • AI Engineers building internal RAG tools.
  • Developers needing to "chat with their data" instantly.
  • Researchers testing different embedding models or databases (switch from Chroma to Pinecone in 1 line).

🖥️ Terminal UI (New!)

Prefer a visual interface? We now have a futuristic Terminal User Interface (TUI) to manage your pipelines interactively.

TUI Demo

Installation

The TUI is a separate Node.js package that controls this Python backend.

npm install -g vectordbpipe-tui

Features

  • Interactive Setup Wizard: vdb setup
  • Visual Dashboard: vdb start
  • Connector Manager: vdb connectors (Manage S3, Notion, etc.)

🚀 Performance Benchmarks

Tested on: Python 3.11 | Dataset: 10,000 Paragraphs | Embedding Model: all-MiniLM-L6-v2

| Backend  | Ingestion Rate (docs/sec) | Avg. Search Latency (ms) | Persistence      |
|----------|---------------------------|--------------------------|------------------|
| FAISS    | ~240                      | 12 ms                    | In-Memory / Disk |
| ChromaDB | ~180                      | 35 ms                    | SQLite / Local   |
| Pinecone | ~110 (network latency)    | 120 ms                   | Cloud-Native     |

Analysis: vectorDBpipe uses asynchronous batch processing for ingestion, and the backing indexes keep search latency close to O(log n), so queries stay fast even as your knowledge base grows beyond 100k chunks.
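The batching idea is easy to sketch. The helper below is illustrative only (not the package's actual API): it embeds a corpus slice by slice so that only one batch of chunks is held in memory at a time.

```python
from typing import Callable, Iterator, List

def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield fixed-size slices so only one batch is in memory at a time."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(
    chunks: List[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 32,
) -> List[List[float]]:
    """Embed chunks batch by batch to avoid OOM on large corpora."""
    vectors: List[List[float]] = []
    for batch in batched(chunks, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors
```

Any embedding function with the shape `List[str] -> List[List[float]]` (e.g. a SentenceTransformer `encode` wrapper) can be dropped in as `embed_fn`.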


🏗️ Production-Ready Features

  • Scalable Batch Ingestion: Memory-safe processing that handles GBs of text without RAM spikes.
  • Enterprise Error Handling: Graceful failover and retry logic for cloud vector store connections.
  • Unified Adapter Pattern: Switch between local (FAISS) and cloud (Pinecone) by changing one line in config.yaml.
  • Pre-Processor Suite: Built-in normalization, semantic chunking, and metadata injection for higher retrieval precision.
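The adapter pattern behind the one-line backend switch can be sketched as a small registry keyed by the config's `type` string. All class and function names below are hypothetical, not the package's internals:

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Type

class VectorStore(ABC):
    """Interface shared by every backend adapter."""
    @abstractmethod
    def upsert(self, ids: List[str], vectors: List[List[float]]) -> None: ...

_REGISTRY: Dict[str, Type[VectorStore]] = {}

def register(name: str):
    """Class decorator that maps a config 'type' string to an adapter."""
    def deco(cls: Type[VectorStore]) -> Type[VectorStore]:
        _REGISTRY[name] = cls
        return cls
    return deco

@register("faiss")
class InMemoryFaissStore(VectorStore):
    """Toy stand-in for a FAISS-backed adapter."""
    def __init__(self) -> None:
        self.rows: Dict[str, List[float]] = {}

    def upsert(self, ids: List[str], vectors: List[List[float]]) -> None:
        self.rows.update(zip(ids, vectors))

def store_from_config(cfg: dict) -> VectorStore:
    """Changing vector_db.type in the config swaps the backend."""
    return _REGISTRY[cfg["vector_db"]["type"]]()
```

Registering a Pinecone or Chroma adapter under its own name is then all it takes to make `type: "pinecone"` or `type: "chroma"` work without touching caller code.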

💡 Use Cases

1. Enterprise Knowledge Base

Company wikis, PDFs, and policy documents are scattered.

  • Solution: Point vectorDBpipe to the shared drive. It indexes 10,000+ docs into Pinecone.
  • Result: Employees get instant, semantic answers ("What is the travel policy?") instead of keyword search.

2. Legal / Medical Document Search

Long documents need to be split intelligently.

  • Solution: Use the standardized chunker (e.g., 512 tokens with overlap).
  • Result: Retrieval finds the exact paragraph containing the clause or diagnosis.

3. Rapid Prototype for RAG

You have a hackathon idea but don't want to spend 4 hours setting up FAISS.

  • Solution: pip install vectordbpipe -> pipeline.run().
  • Result: Working MVP in 5 minutes.

📦 Installation

Standard Installation

Install the package directly from PyPI:

pip install vectordbpipe

🔧 CPU-Optimized Installation (Windows/No-CUDA)

If you encounter WinError 1114 or DLL initialization errors with Torch, or if you run on a machine without a GPU, use the CPU-specific requirements:

  1. Download the requirements-cpu.txt from the repo (or create one with torch --index-url https://download.pytorch.org/whl/cpu).
  2. Run:
    pip install -r requirements-cpu.txt
    pip install vectordbpipe --no-deps
    

⚙️ Configuration Guide (config.yaml)

Control your entire pipeline via a config.yaml file. You can place this in your project root or pass the path explicitly.

# ---------------------------------------------------------
# 1. CORE PATHS
# ---------------------------------------------------------
paths:
  data_dir: "data/"             # Folder containing your .pdf, .txt, .docx, .html files
  logs_dir: "logs/"             # Where to save execution logs

# ---------------------------------------------------------
# 2. EMBEDDING MODEL
# ---------------------------------------------------------
model:
  # HuggingFace Model ID (or OpenAI model name if provider is set)
  name: "sentence-transformers/all-MiniLM-L6-v2" 
  batch_size: 32                # Number of chunks to embed at once (Higher = Faster, more RAM)

# ---------------------------------------------------------
# 3. VECTOR DATABASE
# ---------------------------------------------------------
vector_db:
  type: "pinecone"              # Options: "chroma", "pinecone", "faiss"
  
  # For Pinecone:
  index_name: "my-knowledge-base"
  environment: "us-east-1"      # (Optional for serverless)
  
  # For ChromaDB (Local):
  # type: "chroma"
  # persist_directory: "data/chroma_store"

# ---------------------------------------------------------
# 4. LLM CONFIGURATION (Optional - for RAG generation)
# ---------------------------------------------------------
llm:
  provider: "OpenAI"            # Options: "OpenAI", "Gemini", "Groq", "Anthropic"
  model_name: "gpt-4-turbo"
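For illustration, a minimal fail-fast validator for this file's shape could look like the sketch below. The helper is hypothetical (not part of the package); it only checks the fields shown in the config above:

```python
from typing import Any, Dict

ALLOWED_BACKENDS = {"chroma", "pinecone", "faiss"}

def validate_config(cfg: Dict[str, Any]) -> None:
    """Fail fast on the most common config mistakes."""
    backend = cfg.get("vector_db", {}).get("type")
    if backend not in ALLOWED_BACKENDS:
        raise ValueError(
            f"vector_db.type must be one of {sorted(ALLOWED_BACKENDS)}, got {backend!r}"
        )
    # Pinecone needs an index to write into.
    if backend == "pinecone" and not cfg["vector_db"].get("index_name"):
        raise ValueError("Pinecone requires vector_db.index_name")
    # batch_size must be a positive integer.
    if cfg.get("model", {}).get("batch_size", 32) <= 0:
        raise ValueError("model.batch_size must be positive")
```

Running this right after parsing the YAML surfaces misconfiguration before any embedding work starts.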

🔐 Authentication & Security

Do NOT hardcode API keys in config.yaml or your code. vectorDBpipe automatically detects environment variables.

Supported Environment Variables:

| Provider    | Variable Name    | Description                              |
|-------------|------------------|------------------------------------------|
| Pinecone    | PINECONE_API_KEY | Required if vector_db.type is pinecone.  |
| OpenAI      | OPENAI_API_KEY   | Required for OpenAI embeddings or LLM.   |
| Gemini      | GOOGLE_API_KEY   | Required for Google Gemini models.       |
| Groq        | GROQ_API_KEY     | Required for Llama 3 via Groq.           |
| HuggingFace | HF_TOKEN         | (Optional) For gated models.             |

Setting Keys (Terminal):

Linux/Mac:

export PINECONE_API_KEY="pc-sk-..."

Windows PowerShell:

$env:PINECONE_API_KEY="pc-sk-..."

Python (.env file): Create a .env file in your root and use python-dotenv:

from dotenv import load_dotenv

load_dotenv()  # loads variables from .env into the process environment
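The automatic detection can be approximated with a small lookup. The mapping below is illustrative (the variable names follow the table above; the function is not the library's actual API):

```python
import os

# Provider -> required environment variable, matching the table above.
REQUIRED_KEYS = {
    "pinecone": "PINECONE_API_KEY",
    "openai": "OPENAI_API_KEY",
    "gemini": "GOOGLE_API_KEY",
    "groq": "GROQ_API_KEY",
}

def require_key(provider: str) -> str:
    """Return the provider's API key, or fail with an actionable message."""
    var = REQUIRED_KEYS[provider]
    value = os.environ.get(var)
    if not value:
        raise EnvironmentError(f"Set {var} before using the {provider} backend")
    return value
```

Because the lookup goes through `os.environ`, keys exported in the shell and keys loaded via `load_dotenv()` are picked up the same way.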

🚀 Usage

1. Ingest Data (The "Magic" Step)

This script detects all files in your data/ folder, cleans them, chunks them, embeds them, and uploads them to your DB.

from vectorDBpipe.pipeline.text_pipeline import TextPipeline

# Initialize (Automatically loads config.yaml if present)
pipeline = TextPipeline()

# Run the ETL process
# batch_size=100 means it uploads to DB every 100 chunks to verify progress
pipeline.process(batch_size=100)

print("✅ Ingestion Complete!")

2. Semantic Search

Query your database to find relevant context.

from vectorDBpipe.pipeline.text_pipeline import TextPipeline

pipeline = TextPipeline()

query = "What is the refund policy?"
results = pipeline.search(query, top_k=3)

print("--- Search Results ---")
for match in results:
    print(f"📄 Source: {match.get('metadata', {}).get('source', 'Unknown')}")
    print(f"📝 Text: {match.get('metadata', {}).get('text', '')[:200]}...")
    print(f"⭐ Score: {match.get('score', 0):.4f}\n")

🧠 Features & Architecture

Supported File Types

  • PDF (.pdf): Extracts text using PyMuPDF (fitz).
  • Word (.docx): Parsing via python-docx.
  • Text (.txt, .md): Raw text ingestion.
  • HTML (.html): Strips tags using BeautifulSoup.
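Loader selection boils down to dispatch on the file suffix. A sketch of that routing (stub loaders only; in the real pipeline the PDF/DOCX/HTML entries would call PyMuPDF, python-docx, and BeautifulSoup):

```python
from pathlib import Path
from typing import Callable, Dict

def load_text(path: Path) -> str:
    """Raw text ingestion (.txt / .md)."""
    return path.read_text(encoding="utf-8")

# Suffix -> loader dispatch table.
LOADERS: Dict[str, Callable[[Path], str]] = {
    ".txt": load_text,
    ".md": load_text,
}

def load_any(path: Path) -> str:
    """Route a file to its loader, failing clearly on unknown types."""
    loader = LOADERS.get(path.suffix.lower())
    if loader is None:
        raise ValueError(f"Unsupported file type: {path.suffix}")
    return loader(path)
```

Adding a new format is then a matter of registering one more suffix in the table.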

Smart Chunking

Instead of naive splitting, vectorDBpipe uses a Recursive Character Text Splitter:

  • Chunk Size: 512 tokens (default, configurable).
  • Overlap: 50 tokens (preserves context between chunks).
  • Separators: Splits by paragraph (\n\n), then line (\n), then sentence (". "), ensuring chunks are semantically complete.
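A simplified, character-based sketch of the overlap behavior (the actual splitter works on tokens and the separator hierarchy above, so treat this as a mental model only):

```python
from typing import List

def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> List[str]:
    """Sliding-window chunker: each chunk repeats the last `overlap`
    characters of the previous one so boundary context is preserved."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

With the defaults, consecutive chunks share a 50-character window, so a sentence cut at a chunk boundary still appears whole in one of the two chunks.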

Architecture Flow

graph LR
    A[Raw Data Folder] --> B(DataLoader);
    B --> C{Cleaner & Chunker};
    C --Batching--> D[Embedder Model];
    D --> E[(Vector Database)];
    E --> F[Semantic Search API];
    F --> G[RAG Application];

🔧 Troubleshooting

WinError 1114: A dynamic link library (DLL) initialization routine failed

  • Cause: This usually happens on Windows when running PyTorch (bundled with sentence-transformers) on a machine without the required CUDA libraries, or with conflicting intel-openmp versions.
  • Fix:
    1. Uninstall torch: pip uninstall torch
    2. Install CPU version: pip install torch --index-url https://download.pytorch.org/whl/cpu

ModuleNotFoundError: No module named 'vectorDBpipe'

  • Cause: You might be running the script outside the virtual environment or the package isn't installed.
  • Fix: Ensure pip install vectordbpipe succeeded.

Project Structure

vectorDBpipe/
├── benchmarks/         # Automated performance & precision tests
├── config/             # YAML configuration
├── data/               # Drop your raw files here
├── vectorDBpipe/
│   ├── data/           # Loader logic (PDF/DOCX/TXT parsers)
│   ├── embeddings/     # SentenceTransformer wrapper
│   ├── pipeline/       # The "Brain" (process & search flow)
│   └── vectordb/       # Store adapters (Chroma/Pinecone)
└── requirements.txt    # Production deps

🤝 Contributing & Roadmap

We welcome issues and PRs!

  • Report Bugs: Create an issue on GitHub.
  • Roadmap:
    • Pinecone v3.0 Support
    • Next: Qdrant & Weaviate Integration (v0.2.0)
    • Next: Reranker Layer (Cross-Encoder Support)

