vectorDBpipe
Version: 0.1.0
Author: Yash Desai
Email: desaisyash1000@gmail.com
A modular text embedding and vector database pipeline for local and cloud vector stores.
Designed to streamline text preprocessing, embedding generation, and semantic search with multiple backends such as FAISS, Chroma, and Pinecone.
🚀 Features
- Load text from files or directories
- Clean and preprocess text efficiently
- Chunk text for large documents
- Generate embeddings with Sentence Transformers
- Store and retrieve embeddings using local (FAISS, Chroma) or cloud (Pinecone) vector databases
- Integrated logging for pipeline operations
- Fully modular and extendable design
💻 Installation
Install vectorDBpipe directly from PyPI:
pip install vectorDBpipe
⚙️ Configuration
vectorDBpipe uses a config.yaml file for configuration. You can customize paths, models, and vector database settings.
Pinecone API Key
If you use the Pinecone vector database, you must provide your API key via the PINECONE_API_KEY environment variable. The library loads it automatically.
Linux/macOS:
export PINECONE_API_KEY="YOUR_API_KEY"
Windows (PowerShell):
$env:PINECONE_API_KEY="YOUR_API_KEY"
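Before running the pipeline, it can help to confirm that the variable is actually visible to Python. The following is a minimal sketch using only the standard library; the helper name get_pinecone_api_key is hypothetical and not part of vectorDBpipe, which loads the key on its own.

```python
import os


def get_pinecone_api_key(env=None):
    """Return the Pinecone API key from the environment or raise a clear error.

    Hypothetical helper for illustration only; not part of vectorDBpipe.
    """
    # Allow a custom mapping to be passed in, which makes the check easy to test.
    env = os.environ if env is None else env
    key = env.get("PINECONE_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "PINECONE_API_KEY is not set; export it before using the Pinecone backend."
        )
    return key
```

Calling get_pinecone_api_key() at startup fails fast with a readable message instead of a later authentication error.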
⚙️ Basic Usage
1️⃣ Load Data and Generate Embeddings
from vectorDBpipe.data.loader import DataLoader
from vectorDBpipe.embeddings.embedder import Embedder
# Load all text files from a directory
loader = DataLoader("data/")
data = loader.load_all_files()
# Extract text contents
texts = [d["content"] for d in data]
# Create embeddings
embedder = Embedder()
vectors = embedder.encode(texts)
print("Vectors shape:", vectors.shape)
2️⃣ Text Cleaning and Chunking
from vectorDBpipe.logger.logging import setup_logger
from vectorDBpipe.utils.common import clean_text, chunk_text
logger = setup_logger("TextPipeline")
text = "AI is transforming the world!"
cleaned = clean_text(text)
chunks = chunk_text(cleaned, chunk_size=50)
logger.info(f"Cleaned text: {cleaned}")
logger.info(f"Generated {len(chunks)} chunks.")
Output Example:
INFO:TextPipeline: Cleaned text: ai is transforming the world!
INFO:TextPipeline: Generated 1 chunks.
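To make the two steps concrete, here is a rough standalone sketch of what cleaning and chunking could look like. It is not the library's implementation: simple_clean and simple_chunk are hypothetical names, and the sketch assumes that cleaning lowercases text and collapses whitespace (consistent with the output above) and that chunk_size counts characters, which may differ from how chunk_text measures size.

```python
import re


def simple_clean(text):
    """Collapse runs of whitespace, trim the ends, and lowercase (assumed behaviour)."""
    return re.sub(r"\s+", " ", text).strip().lower()


def simple_chunk(text, chunk_size=50):
    """Split text into consecutive pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


cleaned = simple_clean("AI is   transforming\nthe world!")
chunks = simple_chunk(cleaned, chunk_size=50)
print(cleaned)
print(len(chunks))
```

Because the cleaned sentence is shorter than 50 characters, it fits in a single chunk; longer documents produce one chunk per 50-character window.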
3️⃣ Modular Vector Storage & Retrieval
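from vectorDBpipe.vectorstore.faiss_store import FAISSVectorStore
# This snippet reuses `texts` and `vectors` from step 1
# Initialize vector store; dim must match the embedding size (vectors.shape[1])
vector_store = FAISSVectorStore(dim=384)
# Add embeddings and metadata
metadata = [{"text": t} for t in texts]
vector_store.add(vectors, metadata)
# Search similar text
# If search() expects an embedding rather than raw text, pass embedder.encode([query]) instead
query = "Artificial Intelligence"
results = vector_store.search(query, top_k=3)
print("Search results:", results)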
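Conceptually, a vector store search ranks stored vectors by similarity to a query vector and returns the metadata of the best matches. The sketch below shows that idea with a brute-force cosine-similarity search in pure Python. It is an illustration only: brute_force_search and the toy data are hypothetical, and the similarity metric used by the library's FAISS store is not stated in this README.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def brute_force_search(query_vector, vectors, metadata, top_k=3):
    """Score every stored vector against the query and return the top_k (score, metadata) pairs."""
    scored = [
        (cosine_similarity(query_vector, vec), meta)
        for vec, meta in zip(vectors, metadata)
    ]
    # Sort by score only, highest first, so metadata dicts are never compared.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy 2-dimensional "embeddings" standing in for real model output.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
toy_metadata = [{"text": "a"}, {"text": "b"}, {"text": "c"}]
results = brute_force_search([1.0, 0.1], toy_vectors, toy_metadata, top_k=2)
print([meta["text"] for _, meta in results])
```

Index libraries such as FAISS exist to avoid this linear scan on large collections, but the inputs and outputs follow the same shape: a query vector in, ranked matches with metadata out.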
📝 Project Structure
vectorDBpipe/
├── data/ # Example dataset
├── vectorDBpipe/
│ ├── data/loader.py # Data loading module
│ ├── embeddings/embedder.py # Embedding generation
│ ├── vectorstore/ # Vector DB modules (FAISS, Chroma, Pinecone)
│ ├── logger/ # Logging setup
│ └── utils/ # Helper functions (cleaning, chunking, etc.)
├── tests/ # Unit tests
├── demo/ # Demo Jupyter notebooks
├── setup.py
└── README.md
📒 Logging & Debugging
- Use setup_logger() to create named loggers for your pipeline.
- Logs capture preprocessing, embedding, and vector store operations for easier debugging.
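from vectorDBpipe.logger.logging import setup_logger
# Create a named logger and emit a first message
logger = setup_logger("TextPipeline")
logger.info("Pipeline started...")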
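For readers unfamiliar with named loggers, the following sketch shows roughly what such a setup function might do with Python's standard logging module. make_logger is a hypothetical stand-in, and the format string is an assumption modelled on the output example in the usage section; the actual setup_logger() may be configured differently.

```python
import logging


def make_logger(name, level=logging.INFO):
    """Create (or reuse) a named logger that writes to the console.

    Hypothetical stand-in for illustration; setup_logger() may differ.
    """
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(levelname)s:%(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger


logger = make_logger("TextPipeline")
logger.info("Pipeline started...")
```

Using one named logger per pipeline stage makes it easy to tell from the log prefix which component emitted a message.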
✅ Contribution Guide
- Fork the repository
- Create a branch: git checkout -b feature/my-feature
- Add or modify code with proper docstrings and type hints
- Add tests under tests/
- Submit a Pull Request with a detailed description
📖 Demo Notebooks
- demo/TextPipeline_demo.ipynb: Step-by-step demonstration of data loading, preprocessing, embedding, storage, and search.
- Visualize similarity search results using pandas or matplotlib.
📜 License
This project is licensed under the MIT License.
See LICENSE for more details.
🔗 Contact
Author: Yash Desai
Email: desaisyash1000@gmail.com
GitHub: https://github.com/yashdesai023/vectorDBpipe
Contributions and feedback are welcome.