A modular text embedding and vector database pipeline for local and cloud vector stores.
vectorDBpipe
Version: 0.1.2
Author: Yash Desai
Email: desaisyash1000@gmail.com
Overview
vectorDBpipe is a modular Python framework designed to simplify the creation of text embedding and vector database pipelines.
It enables developers and researchers to efficiently process, embed, and retrieve large text datasets using modern vector databases such as FAISS, Chroma, or Pinecone.
The framework follows a layered, plug-and-play architecture, allowing easy customization of data loaders, embedding models, and storage backends.
Key Features
- Structured data ingestion, cleaning, and chunking
- Embedding generation via Sentence Transformers
- Pluggable vector storage engines: FAISS, Chroma, Pinecone
- Unified CRUD API for inserting, searching, updating, and deleting embeddings
- YAML-based configuration for quick workflow adjustments
- Integrated logging and exception handling
- End-to-end orchestration through a single pipeline interface
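The pluggable-backend idea in the feature list can be pictured as a small registry that maps a backend name to a factory. The sketch below is illustrative only: `InMemoryStore`, `BACKENDS`, and `create_store` are hypothetical names and are not part of vectorDBpipe; only the `insert_vectors` method name is borrowed from the Quick Start.

```python
from typing import Callable, Dict, List, Sequence, Tuple


class InMemoryStore:
    """Toy backend for illustration only; it keeps (text, vector) pairs in a list."""

    def __init__(self, dim: int) -> None:
        self.dim = dim
        self.items: List[Tuple[str, List[float]]] = []

    def insert_vectors(self, texts: Sequence[str], vectors: Sequence[Sequence[float]]) -> None:
        for text, vector in zip(texts, vectors):
            # Reject vectors whose length does not match the configured dimension.
            if len(vector) != self.dim:
                raise ValueError(f"expected dimension {self.dim}, got {len(vector)}")
            self.items.append((text, list(vector)))


# Hypothetical registry: backend name -> factory that takes the vector dimension.
BACKENDS: Dict[str, Callable[[int], object]] = {"memory": InMemoryStore}


def create_store(backend: str, dim: int):
    """Build a backend by name, failing with a clear message on unknown names."""
    try:
        factory = BACKENDS[backend.lower()]
    except KeyError:
        raise ValueError(
            f"unknown backend {backend!r}; choose from {sorted(BACKENDS)}"
        ) from None
    return factory(dim)


store = create_store("memory", dim=3)
store.insert_vectors(["hello"], [[0.1, 0.2, 0.3]])
print(len(store.items))
```

With a layout like this, adding a new engine means registering one more factory rather than changing the calling code.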
Installation
Install from PyPI:
pip install vectordbpipe
Or for local development:
git clone https://github.com/yashdesai023/vectorDBpipe.git
cd vectorDBpipe
pip install -e .
Configuration
The system reads settings from a YAML configuration file (config.yaml), which defines parameters for:
- Data sources (paths, formats)
- Embedding model (e.g., all-MiniLM-L6-v2)
- Vector database backend (FAISS, Chroma, or Pinecone)
- Index parameters and persistence options
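As a rough picture of what those settings might look like once loaded, the sketch below uses a plain Python dictionary and a small validation helper. The section and key names (`data`, `embedding`, `vectordb`, `persist`, and so on) are assumptions made for illustration; the actual schema of config.yaml is not documented here and may differ.

```python
SUPPORTED_BACKENDS = {"faiss", "chroma", "pinecone"}

# Hypothetical layout; the real config.yaml keys may differ.
example_config = {
    "data": {"path": "data/", "formats": ["txt"]},
    "embedding": {"model_name": "sentence-transformers/all-MiniLM-L6-v2"},
    "vectordb": {"backend": "faiss", "dim": 384, "persist": True},
}


def validate_config(config: dict) -> list:
    """Return a list of problems; an empty list means the config looks usable."""
    problems = []
    # Every top-level section assumed by this sketch must be present.
    for section in ("data", "embedding", "vectordb"):
        if section not in config:
            problems.append(f"missing section: {section}")
    # The backend, when given, must be one of the three documented engines.
    backend = config.get("vectordb", {}).get("backend")
    if backend is not None and backend not in SUPPORTED_BACKENDS:
        problems.append(f"unsupported backend: {backend}")
    return problems


print(validate_config(example_config))
```

Checking the configuration before building the pipeline surfaces a mistyped backend name early instead of partway through an embedding run.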
Pinecone Setup (Optional)
If you choose pinecone as your vector database, provide your API key as an environment variable.
macOS/Linux:
export PINECONE_API_KEY="your_api_key"
Windows PowerShell:
$env:PINECONE_API_KEY="your_api_key"
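Before selecting the pinecone backend, it can help to confirm that the variable is actually visible to Python. The helper below is a minimal sketch using only the standard library; `get_pinecone_api_key` is a hypothetical name, not a vectorDBpipe function.

```python
import os
from typing import Mapping


def get_pinecone_api_key(environ: Mapping[str, str] = os.environ) -> str:
    """Return the Pinecone API key from the environment, or fail with a clear message."""
    key = environ.get("PINECONE_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "PINECONE_API_KEY is not set; export it before selecting the pinecone backend."
        )
    return key


# Demonstration with a stand-in mapping so the example does not need a real key.
print(get_pinecone_api_key({"PINECONE_API_KEY": "your_api_key"}))
```

Passing the environment in as a parameter keeps the check easy to test without touching the real process environment.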
Quick Start
1. Data Loading and Embedding
from vectorDBpipe.data.loader import DataLoader
from vectorDBpipe.embeddings.embedder import Embedder
loader = DataLoader("data/")
documents = loader.load_all_files()
texts = [d["content"] for d in documents]
embedder = Embedder(model_name="sentence-transformers/all-MiniLM-L6-v2")
embeddings = embedder.encode(texts)
print(f"Generated {len(embeddings)} embeddings with dimension {len(embeddings[0])}.")
2. Text Cleaning and Chunking
from vectorDBpipe.utils.common import clean_text, chunk_text
from vectorDBpipe.logger.logging import setup_logger
logger = setup_logger("Preprocess")
sample_text = "AI is transforming industries worldwide."
cleaned = clean_text(sample_text)
chunks = chunk_text(cleaned, chunk_size=50)
logger.info(f"Cleaned Text: {cleaned}")
logger.info(f"Generated {len(chunks)} chunks.")
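For readers who want a feel for what cleaning and chunking involve, here is a standalone sketch that does not use the library. It assumes chunk_size counts characters and that chunks break on whitespace; the behavior of vectorDBpipe's own clean_text and chunk_text may differ, and `simple_clean` and `simple_chunk` are hypothetical names.

```python
import re
from typing import List


def simple_clean(text: str) -> str:
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def simple_chunk(text: str, chunk_size: int = 50) -> List[str]:
    """Greedily pack whole words into chunks of up to chunk_size characters.

    A single word longer than chunk_size is kept whole in its own chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            # The next word would overflow the chunk, so start a new one.
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


cleaned = simple_clean("AI   is transforming\n industries worldwide.")
print(simple_chunk(cleaned, chunk_size=20))
```

Breaking on word boundaries keeps each chunk readable, at the cost of chunks that are slightly shorter than the limit.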
3. Vector Storage and Retrieval
from vectorDBpipe.vectordb.store import VectorStore
store = VectorStore(backend="faiss", dim=384)
store.insert_vectors(texts, embeddings)
query = "Applications of Artificial Intelligence"
results = store.search_vectors(query, top_k=3)
print("Top Similar Results:")
for r in results:
    print("-", r)
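Conceptually, a top_k search ranks the stored vectors by their similarity to the query vector. The sketch below shows that idea with cosine similarity in plain Python. It is not the library's implementation: the metric each backend uses is not stated here, and `cosine_similarity` and `top_k` are hypothetical helper names.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vector: Sequence[float],
    texts: Sequence[str],
    vectors: Sequence[Sequence[float]],
    k: int = 3,
) -> List[Tuple[float, str]]:
    """Return the k (score, text) pairs with the highest cosine similarity."""
    scored = [(cosine_similarity(query_vector, v), t) for t, v in zip(texts, vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


docs = ["a", "b", "c"]
doc_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for score, text in top_k([1.0, 0.0], docs, doc_vectors, k=2):
    print(f"{text}: {score:.3f}")
```

A brute-force scan like this is linear in the number of stored vectors, which is the cost that dedicated vector indexes are designed to reduce.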
4. Full Pipeline Execution
from vectorDBpipe.pipeline.text_pipeline import TextPipeline
from vectorDBpipe.config.config_manager import ConfigManager
config = ConfigManager().get_config()
pipeline = TextPipeline(config)
results = pipeline.run(
    ["Machine learning enables predictive analytics."],
    query="What is machine learning?",
)
print(results)
Project Structure
vectorDBpipe/
│
├── vectorDBpipe/
│ ├── config/ # Configuration management
│ ├── data/ # Data loading and preprocessing
│ ├── embeddings/ # Embedding generation
│ ├── vectordb/ # Vector database abstraction layer
│ ├── pipeline/ # End-to-end workflow orchestration
│ ├── utils/ # Common utilities (cleaning, chunking)
│ └── logger/ # Logging utilities
│
├── tests/ # Unit tests
├── demo/ # Example Jupyter notebooks
├── setup.py
└── README.md
Logging and Error Handling
Every module integrates with a centralized logging system to track operations and debug efficiently.
from vectorDBpipe.logger.logging import setup_logger
logger = setup_logger("VectorDBPipe")
logger.info("Pipeline started successfully.")
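For context, a centralized logger helper of this kind is often a thin wrapper around the standard logging module. The sketch below shows one plausible shape; `make_logger` and the format string are assumptions, and the library's setup_logger may be implemented differently.

```python
import logging


def make_logger(name: str, level: int = logging.INFO) -> logging.Logger:
    """Return a named logger with one stream handler and a consistent format."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    # Attach the handler only once so repeated calls do not duplicate output.
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s | %(name)s | %(levelname)s | %(message)s")
        )
        logger.addHandler(handler)
    return logger


demo_logger = make_logger("VectorDBPipeDemo")
demo_logger.info("Pipeline started successfully.")
```

Because logging.getLogger returns the same object for the same name, every module that asks for that name shares one handler and one format.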
Testing
Run the test suite to verify installation and functionality:
pytest -v --cov=vectorDBpipe
The --cov option comes from the pytest-cov plugin and prints a coverage summary for the vectorDBpipe package after the tests finish.
Example Notebook
A demonstration notebook vector_pipeline_demo.ipynb is included, showcasing:
- Document embedding and visualization
- Vector similarity retrieval
- PCA-based embedding visualization
You can also run it directly in Google Colab:
[Open in Colab](https://colab.research.google.com/github/yashdesai023/vectorDBpipe/blob/main/vector_pipeline_demo.ipynb)
Contributing
Contributions are welcome. Please ensure all pull requests include:
- Clear, modular code
- Type hints and docstrings
- Unit tests covering new functionality
Development Workflow
git checkout -b feature/my-feature
# Add your changes
pytest -v
git commit -m "Add new feature"
git push origin feature/my-feature
Then submit a pull request.
License
Distributed under the MIT License. See the LICENSE file for full terms.
Author & Contact
Yash Desai
Computer Science & Engineering (AI)
Email: desaisyash1000@gmail.com
GitHub: yashdesai023
File details
Details for the file vectordbpipe-0.1.2.tar.gz.
File metadata
- Download URL: vectordbpipe-0.1.2.tar.gz
- Upload date:
- Size: 15.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4efec7d36243c3d098940da4edeca16b93180a8e985c3e40c39de8de225e4047 |
| MD5 | a2bf724179df130ae8a60369f1dc43ca |
| BLAKE2b-256 | 345993061894adede20f57c57aff8fbbc76914a7a3a71b4a510242db6353fc3a |
Provenance
The following attestation bundles were made for vectordbpipe-0.1.2.tar.gz:
Publisher: publish-to-pypi.yml on yashdesai023/vectorDBpipe
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: vectordbpipe-0.1.2.tar.gz
- Subject digest: 4efec7d36243c3d098940da4edeca16b93180a8e985c3e40c39de8de225e4047
- Sigstore transparency entry: 599101267
- Sigstore integration time:
- Permalink: yashdesai023/vectorDBpipe@b1435972e81fdf1dc580f55282228342308dd54c
- Branch / Tag: refs/tags/v0.1.2.1
- Owner: https://github.com/yashdesai023
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-to-pypi.yml@b1435972e81fdf1dc580f55282228342308dd54c
- Trigger Event: release
File details
Details for the file vectordbpipe-0.1.2-py3-none-any.whl.
File metadata
- Download URL: vectordbpipe-0.1.2-py3-none-any.whl
- Upload date:
- Size: 16.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 693f037190ae0d84d707d562df8b1168c0ffdb5a4389f3311104d2dc0bb3c4f4 |
| MD5 | 029bbe81b4f2f1ca979e71d5e34010c6 |
| BLAKE2b-256 | 99c3c8033ccf2c3ced6b9d9d53eb609784e2986c69be7a42123753981dc37a19 |
Provenance
The following attestation bundles were made for vectordbpipe-0.1.2-py3-none-any.whl:
Publisher: publish-to-pypi.yml on yashdesai023/vectorDBpipe
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: vectordbpipe-0.1.2-py3-none-any.whl
- Subject digest: 693f037190ae0d84d707d562df8b1168c0ffdb5a4389f3311104d2dc0bb3c4f4
- Sigstore transparency entry: 599101314
- Sigstore integration time:
- Permalink: yashdesai023/vectorDBpipe@b1435972e81fdf1dc580f55282228342308dd54c
- Branch / Tag: refs/tags/v0.1.2.1
- Owner: https://github.com/yashdesai023
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-to-pypi.yml@b1435972e81fdf1dc580f55282228342308dd54c
- Trigger Event: release