
A modular text embedding and vector database pipeline for local and cloud vector stores.


vectorDBpipe

Python 3.10+ · MIT License

Integrations: Pinecone · ChromaDB · HuggingFace · FAISS

A Modular, End-to-End RAG Pipeline for Production-Ready Vector Search.

vectorDBpipe is a robust framework designed to automate the heavy lifting of building RAG (Retrieval-Augmented Generation) systems. It seamlessly handles data ingestion, text cleaning, semantic embedding, and storage in modern vector databases.


🎯 Project Objectives

Building a vector search system often involves writing the same "glue code" over and over again:

  1. Parsing PDFs, Word docs, and Text files.
  2. Cleaning stray characters and normalizing whitespace.
  3. Chunking long text so it fits into context windows.
  4. Batching embeddings to avoid OOM (Out-of-Memory) errors.
  5. Creating and managing database indexes.

vectorDBpipe solves this. It is a "download-and-go" solution that collapses weeks of boilerplate into a single, standardized config.yaml file.

Ideal for:

  • AI Engineers building internal RAG tools.
  • Developers needing to "chat with their data" instantly.
  • Researchers testing different embedding models or databases (switch from Chroma to Pinecone in 1 line).

🖥️ Terminal UI (New!)

Prefer a visual interface? We now have a futuristic Terminal User Interface (TUI) to manage your pipelines interactively.

TUI Demo

Installation

The TUI is a separate Node.js package that controls this Python backend.

npm install -g vectordbpipe-tui

Features

  • Interactive Setup Wizard: vdb setup
  • Visual Dashboard: vdb start
  • Connector Manager: vdb connectors (Manage S3, Notion, etc.)

🚀 Performance Benchmarks

Tested on: Python 3.11 | Dataset: 10,000 Paragraphs | Embedding Model: all-MiniLM-L6-v2

| Backend  | Ingestion Rate (docs/sec) | Avg. Search Latency (ms) | Persistence      |
|----------|---------------------------|--------------------------|------------------|
| FAISS    | ~240                      | 12 ms                    | In-Memory / Disk |
| ChromaDB | ~180                      | 35 ms                    | SQLite / Local   |
| Pinecone | ~110 (network latency)    | 120 ms                   | Cloud-Native     |

Analysis: vectorDBpipe uses asynchronous batch processing for ingestion, and the backing indexes keep search latency close to O(log n), so queries stay fast even as your knowledge base grows beyond 100k chunks.
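The batching idea is easy to sketch. The helper below is illustrative only (not the package's actual API): it embeds a corpus slice by slice so that only one batch of chunks is held in memory at a time.

```python
from typing import Callable, Iterator, List

def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield fixed-size slices so only one batch is in memory at a time."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(
    chunks: List[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 32,
) -> List[List[float]]:
    """Embed chunks batch by batch to avoid OOM on large corpora."""
    vectors: List[List[float]] = []
    for batch in batched(chunks, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors
```

Any embedding function with the shape `List[str] -> List[List[float]]` (e.g. a SentenceTransformer `encode` wrapper) can be dropped in as `embed_fn`.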


🏗️ Production-Ready Features

  • Scalable Batch Ingestion: Memory-safe processing that handles GBs of text without RAM spikes.
  • Enterprise Error Handling: Graceful failover and retry logic for cloud vector store connections.
  • Unified Adapter Pattern: Switch between local (FAISS) and cloud (Pinecone) by changing one line in config.yaml.
  • Pre-Processor Suite: Built-in normalization, semantic chunking, and metadata injection for higher retrieval precision.
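The adapter pattern behind the one-line backend switch can be sketched as a small registry keyed by the config's `type` string. All class and function names below are hypothetical, not the package's internals:

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Type

class VectorStore(ABC):
    """Interface shared by every backend adapter."""
    @abstractmethod
    def upsert(self, ids: List[str], vectors: List[List[float]]) -> None: ...

_REGISTRY: Dict[str, Type[VectorStore]] = {}

def register(name: str):
    """Class decorator that maps a config 'type' string to an adapter."""
    def deco(cls: Type[VectorStore]) -> Type[VectorStore]:
        _REGISTRY[name] = cls
        return cls
    return deco

@register("faiss")
class InMemoryFaissStore(VectorStore):
    """Toy stand-in for a FAISS-backed adapter."""
    def __init__(self) -> None:
        self.rows: Dict[str, List[float]] = {}

    def upsert(self, ids: List[str], vectors: List[List[float]]) -> None:
        self.rows.update(zip(ids, vectors))

def store_from_config(cfg: dict) -> VectorStore:
    """Changing vector_db.type in the config swaps the backend."""
    return _REGISTRY[cfg["vector_db"]["type"]]()
```

Registering a Pinecone or Chroma adapter under its own name is then all it takes to make `type: "pinecone"` or `type: "chroma"` work without touching caller code.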

💡 Use Cases

1. Enterprise Knowledge Base

Company wikis, PDFs, and policy documents are scattered.

  • Solution: Point vectorDBpipe to the shared drive. It indexes 10,000+ docs into Pinecone.
  • Result: Employees get instant, semantic answers ("What is the travel policy?") instead of keyword search.

2. Legal / Medical Document Search

Long documents need to be split intelligently.

  • Solution: Use the standardized chunker (e.g., 512 tokens with overlap).
  • Result: Retrieval finds the exact paragraph containing the clause or diagnosis.

3. Rapid Prototype for RAG

You have a hackathon idea but don't want to spend 4 hours setting up FAISS.

  • Solution: pip install vectordbpipe -> pipeline.run().
  • Result: Working MVP in 5 minutes.

📦 Installation

Standard Installation

Install the package directly from PyPI:

pip install vectordbpipe

🔧 CPU-Optimized Installation (Windows/No-CUDA)

If you encounter WinError 1114 or DLL initialization errors with Torch, or if you run on a machine without a GPU, use the CPU-specific requirements:

  1. Download the requirements-cpu.txt from the repo (or create one with torch --index-url https://download.pytorch.org/whl/cpu).
  2. Run:
    pip install -r requirements-cpu.txt
    pip install vectordbpipe --no-deps
    

⚙️ Configuration Guide (config.yaml)

Control your entire pipeline via a config.yaml file. You can place this in your project root or pass the path explicitly.

# ---------------------------------------------------------
# 1. CORE PATHS
# ---------------------------------------------------------
paths:
  data_dir: "data/"             # Folder containing your .pdf, .txt, .docx, .html files
  logs_dir: "logs/"             # Where to save execution logs

# ---------------------------------------------------------
# 2. EMBEDDING MODEL
# ---------------------------------------------------------
model:
  # HuggingFace Model ID (or OpenAI model name if provider is set)
  name: "sentence-transformers/all-MiniLM-L6-v2" 
  batch_size: 32                # Number of chunks to embed at once (Higher = Faster, more RAM)

# ---------------------------------------------------------
# 3. VECTOR DATABASE
# ---------------------------------------------------------
vector_db:
  type: "pinecone"              # Options: "chroma", "pinecone", "faiss"
  
  # For Pinecone:
  index_name: "my-knowledge-base"
  environment: "us-east-1"      # (Optional for serverless)
  
  # For ChromaDB (Local):
  # type: "chroma"
  # persist_directory: "data/chroma_store"

# ---------------------------------------------------------
# 4. LLM CONFIGURATION (Optional - for RAG generation)
# ---------------------------------------------------------
llm:
  provider: "OpenAI"            # Options: "OpenAI", "Gemini", "Groq", "Anthropic"
  model_name: "gpt-4-turbo"
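For illustration, a minimal fail-fast validator for this file's shape could look like the sketch below. The helper is hypothetical (not part of the package); it only checks the fields shown in the config above:

```python
from typing import Any, Dict

ALLOWED_BACKENDS = {"chroma", "pinecone", "faiss"}

def validate_config(cfg: Dict[str, Any]) -> None:
    """Fail fast on the most common config mistakes."""
    backend = cfg.get("vector_db", {}).get("type")
    if backend not in ALLOWED_BACKENDS:
        raise ValueError(
            f"vector_db.type must be one of {sorted(ALLOWED_BACKENDS)}, got {backend!r}"
        )
    # Pinecone needs an index to write into.
    if backend == "pinecone" and not cfg["vector_db"].get("index_name"):
        raise ValueError("Pinecone requires vector_db.index_name")
    # batch_size must be a positive integer.
    if cfg.get("model", {}).get("batch_size", 32) <= 0:
        raise ValueError("model.batch_size must be positive")
```

Running this right after parsing the YAML surfaces misconfiguration before any embedding work starts.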

🔐 Authentication & Security

Do NOT hardcode API keys in config.yaml or your code. vectorDBpipe automatically detects environment variables.

Supported Environment Variables:

| Provider    | Variable Name    | Description                              |
|-------------|------------------|------------------------------------------|
| Pinecone    | PINECONE_API_KEY | Required if vector_db.type is pinecone.  |
| OpenAI      | OPENAI_API_KEY   | Required for OpenAI embeddings or LLM.   |
| Gemini      | GOOGLE_API_KEY   | Required for Google Gemini models.       |
| Groq        | GROQ_API_KEY     | Required for Llama 3 via Groq.           |
| HuggingFace | HF_TOKEN         | (Optional) For gated models.             |

Setting Keys (Terminal):

Linux/Mac:

export PINECONE_API_KEY="pc-sk-..."

Windows PowerShell:

$env:PINECONE_API_KEY="pc-sk-..."

Python (.env file): Create a .env file in your root and use python-dotenv:

from dotenv import load_dotenv

load_dotenv()  # loads variables from .env into the process environment
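The automatic detection can be approximated with a small lookup. The mapping below is illustrative (the variable names follow the table above; the function is not the library's actual API):

```python
import os

# Provider -> required environment variable, matching the table above.
REQUIRED_KEYS = {
    "pinecone": "PINECONE_API_KEY",
    "openai": "OPENAI_API_KEY",
    "gemini": "GOOGLE_API_KEY",
    "groq": "GROQ_API_KEY",
}

def require_key(provider: str) -> str:
    """Return the provider's API key, or fail with an actionable message."""
    var = REQUIRED_KEYS[provider]
    value = os.environ.get(var)
    if not value:
        raise EnvironmentError(f"Set {var} before using the {provider} backend")
    return value
```

Because the lookup goes through `os.environ`, keys exported in the shell and keys loaded via `load_dotenv()` are picked up the same way.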

🚀 Usage

1. Ingest Data (The "Magic" Step)

This script detects all files in your data/ folder, cleans them, chunks them, embeds them, and uploads them to your DB.

from vectorDBpipe.pipeline.text_pipeline import TextPipeline

# Initialize (Automatically loads config.yaml if present)
pipeline = TextPipeline()

# Run the ETL process
# batch_size=100 means it uploads to DB every 100 chunks to verify progress
pipeline.process(batch_size=100)

print("✅ Ingestion Complete!")

2. Semantic Search

Query your database to find relevant context.

from vectorDBpipe.pipeline.text_pipeline import TextPipeline

pipeline = TextPipeline()

query = "What is the refund policy?"
results = pipeline.search(query, top_k=3)

print("--- Search Results ---")
for match in results:
    print(f"📄 Source: {match.get('metadata', {}).get('source', 'Unknown')}")
    print(f"📝 Text: {match.get('metadata', {}).get('text', '')[:200]}...")
    print(f"⭐ Score: {match.get('score', 0):.4f}\n")

🧠 Features & Architecture

Supported File Types

  • PDF (.pdf): Extracts text using PyMuPDF (fitz).
  • Word (.docx): Parsing via python-docx.
  • Text (.txt, .md): Raw text ingestion.
  • HTML (.html): Strips tags using BeautifulSoup.
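Loader selection boils down to dispatch on the file suffix. A sketch of that routing (stub loaders only; in the real pipeline the PDF/DOCX/HTML entries would call PyMuPDF, python-docx, and BeautifulSoup):

```python
from pathlib import Path
from typing import Callable, Dict

def load_text(path: Path) -> str:
    """Raw text ingestion (.txt / .md)."""
    return path.read_text(encoding="utf-8")

# Suffix -> loader dispatch table.
LOADERS: Dict[str, Callable[[Path], str]] = {
    ".txt": load_text,
    ".md": load_text,
}

def load_any(path: Path) -> str:
    """Route a file to its loader, failing clearly on unknown types."""
    loader = LOADERS.get(path.suffix.lower())
    if loader is None:
        raise ValueError(f"Unsupported file type: {path.suffix}")
    return loader(path)
```

Adding a new format is then a matter of registering one more suffix in the table.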

Smart Chunking

Instead of naive splitting, vectorDBpipe uses a Recursive Character Text Splitter:

  • Chunk Size: 512 tokens (default, configurable).
  • Overlap: 50 tokens (preserves context between chunks).
  • Separators: Splits by paragraph (\n\n), then line (\n), then sentence (". "), ensuring chunks are semantically complete.
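A simplified, character-based sketch of the overlap behavior (the actual splitter works on tokens and the separator hierarchy above, so treat this as a mental model only):

```python
from typing import List

def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> List[str]:
    """Sliding-window chunker: each chunk repeats the last `overlap`
    characters of the previous one so boundary context is preserved."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

With the defaults, consecutive chunks share a 50-character window, so a sentence cut at a chunk boundary still appears whole in one of the two chunks.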

Architecture Flow

graph LR
    A[Raw Data Folder] --> B(DataLoader);
    B --> C{Cleaner & Chunker};
    C --Batching--> D[Embedder Model];
    D --> E[(Vector Database)];
    E --> F[Semantic Search API];
    F --> G[RAG Application];

🔧 Troubleshooting

WinError 1114: A dynamic link library (DLL) initialization routine failed

  • Cause: This usually happens on Windows when running PyTorch (bundled with sentence-transformers) on a machine without the required CUDA libraries, or with conflicting intel-openmp versions.
  • Fix:
    1. Uninstall torch: pip uninstall torch
    2. Install CPU version: pip install torch --index-url https://download.pytorch.org/whl/cpu

ModuleNotFoundError: No module named 'vectorDBpipe'

  • Cause: You might be running the script outside the virtual environment or the package isn't installed.
  • Fix: Ensure pip install vectordbpipe succeeded.

Project Structure

vectorDBpipe/
├── benchmarks/         # Automated performance & precision tests
├── config/             # YAML configuration
├── data/               # Drop your raw files here
├── vectorDBpipe/
│   ├── data/           # Loader logic (PDF/DOCX/TXT parsers)
│   ├── embeddings/     # SentenceTransformer wrapper
│   ├── pipeline/       # The "Brain" (process & search flow)
│   └── vectordb/       # Store adapters (Chroma/Pinecone)
└── requirements.txt    # Production deps

🤝 Contributing & Roadmap

We welcome issues and PRs!

  • Report Bugs: Create an issue on GitHub.
  • Roadmap:
    • Pinecone v3.0 Support
    • Next: Qdrant & Weaviate Integration (v0.2.0)
    • Next: Reranker Layer (Cross-Encoder Support)

