# Chunk Embed Store - Chunk & Embed Knowledge Base
A tool for chunking, embedding, and storing documents in a vector database for efficient retrieval and semantic search.
## 🔍 Overview
Chunk Embed Store is a Python tool designed to process, chunk, and embed text documents into a vector database. It analyzes your code base or documentation, breaks it into manageable chunks, generates semantic embeddings, and stores them in a ChromaDB vector database for later retrieval.
## ✨ Features
- Intelligent Document Processing: Supports a wide variety of file types (code files, documentation, text files, etc.)
- Smart Chunking: Breaks documents into optimal chunks while preserving context
- High-Quality Embeddings: Uses Sentence Transformers to generate semantic embeddings
- Efficient Storage: Stores document chunks and embeddings in ChromaDB for fast retrieval
- Memory-Aware Processing: Dynamically adjusts batch size based on available memory
- Resilient Operation: Includes retry mechanisms and error handling
- Progress Tracking: Shows real-time progress with helpful statistics
## 📋 Prerequisites
- Python 3.10 or higher
- A pre-trained Sentence Transformer model
- Sufficient disk space for the vector database
- Sufficient RAM for processing (8GB recommended)
## 🔧 Installation

### From PyPI

```bash
pip install chunk-embed
```

### From Source

```bash
git clone https://github.com/yourusername/chunk-embed-store.git
cd chunk-embed-store
pip install -e .
```
## 💻 Usage

### Basic Usage

```bash
chunk-embed \
  --base_dir "/path/to/your/documents" \
  --collection_name "your-collection" \
  --embedding_model_name "/path/to/embedding/model" \
  --persist_dir "/path/to/store/database"
```
### With uvx

You can also run the tool without installing it, using `uvx` (the tool runner from the `uv` package manager):

```bash
uvx chunk-embed \
  --base_dir "/path/to/your/documents" \
  --collection_name "your-collection" \
  --embedding_model_name "/path/to/embedding/model" \
  --persist_dir "/path/to/store/database"
```
### All Options

| Parameter | Required | Default | Description |
|---|---|---|---|
| `--base_dir` | Yes | - | Directory containing documents to process |
| `--collection_name` | Yes | - | Name for the ChromaDB collection |
| `--embedding_model_name` | Yes | - | Path to the Sentence Transformer model |
| `--persist_dir` | Yes | - | Directory to store the vector database |
| `--chunk_size` | No | 500 | Size of text chunks |
| `--chunk_overlap` | No | 50 | Overlap between consecutive chunks |
| `--batch_size` | No | 1000 | Batch size for processing |
| `--memory_threshold_mb` | No | 8000 | Memory threshold in MB |
| `--max_retries` | No | 3 | Maximum retries for embedding generation |
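The interplay of `--chunk_size` and `--chunk_overlap` can be sketched in plain Python. This is a simplified character-based splitter for illustration only, not the tool's actual implementation (which also tries to respect natural boundaries such as line breaks):

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text into chunks of at most chunk_size characters,
    where consecutive chunks share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - chunk_overlap  # how far the window advances each time
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the rest of the text is already covered
    return chunks


doc = "x" * 1200
print([len(c) for c in chunk_text(doc)])  # chunk lengths: [500, 500, 300]
```

With the defaults, a 1200-character document yields three chunks, and each pair of adjacent chunks shares 50 characters of context.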
## 📊 Example

Process a code repository and store its embeddings in a local database:

```bash
chunk-embed \
  --base_dir "/Users/username/projects/my-repo" \
  --collection_name "my-repo-knowledge" \
  --embedding_model_name "/Users/username/models/all-MiniLM-L6-v2" \
  --persist_dir "/Users/username/vector-db/my-repo-db" \
  --chunk_size 600 \
  --chunk_overlap 75
```
## 📁 Supported File Types

The tool processes a wide variety of file types, including:

- Code files: `.py`, `.js`, `.java`, `.cpp`, `.go`, etc.
- Documentation: `.md`, `.txt`, `.html`, etc.
- Configuration: `.yml`, `.json`, `.toml`, etc.
- And many more (over 100 file extensions supported)
## 🔄 How It Works

1. **Loading**: The tool recursively explores the specified directory and loads files with supported extensions.
2. **Chunking**: Each document is split into smaller chunks of configurable size and overlap, trying to respect natural boundaries such as line breaks.
3. **Embedding**: The chunks are passed through a Sentence Transformer model to generate vector embeddings.
4. **Storage**: The chunks and their embeddings are stored in a ChromaDB vector database for efficient retrieval.
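The four stages above can be sketched end to end. This is an illustrative outline only: a toy hash-based embedder and an in-memory dict stand in for the real Sentence Transformer model and ChromaDB collection, and the tool's actual internals may differ:

```python
import hashlib

def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    # Stage 2 (Chunking): fixed-size character windows with overlap.
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def embed(chunks: list[str], dim: int = 8) -> list[list[float]]:
    # Stage 3 (Embedding): deterministic toy vectors standing in for a
    # Sentence Transformer model's output.
    vectors = []
    for c in chunks:
        digest = hashlib.sha256(c.encode()).digest()
        vectors.append([b / 255.0 for b in digest[:dim]])
    return vectors

def store(path: str, chunks: list[str], vectors: list[list[float]], db: dict) -> None:
    # Stage 4 (Storage): keyed records standing in for a ChromaDB collection.
    for i, (c, v) in enumerate(zip(chunks, vectors)):
        db[f"{path}#chunk{i}"] = {"document": c, "embedding": v}

# Stage 1 (Loading): the real tool recursively walks --base_dir; here we
# just supply a mapping of path -> file contents.
docs = {"README.md": "hello world " * 100}
db: dict = {}
for path, text in docs.items():
    chunks = chunk(text)
    store(path, chunks, embed(chunks), db)
print(len(db), "chunks stored")  # 3 chunks stored
```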
## 📜 License
This project is licensed under the MIT License - see the LICENSE file for details.
## 👨‍💻 Author

- Aditya Mishra (adi.mishra989@gmail.com)
## 🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## 📚 Advanced Usage

### Customizing the Embedding Model

The embedding model significantly affects the quality of your vector database. By default, the tool works well with Sentence Transformer models. To use a custom one:

1. Download or train your preferred model
2. Pass its path via `--embedding_model_name`
### Memory Optimization

For large document collections, memory usage can be a concern. The tool includes automatic memory management:

- Monitors memory usage during processing
- Dynamically reduces the batch size if the memory threshold is exceeded
- The default threshold can be overridden with `--memory_threshold_mb`
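The batch-size adjustment can be illustrated with a small pure function. This is a sketch of the general back-off technique, not the tool's actual logic; how the tool samples memory usage (e.g. via a library such as `psutil`) is an assumption and not shown here:

```python
def adjust_batch_size(current_batch: int,
                      used_mb: float,
                      threshold_mb: float = 8000,
                      min_batch: int = 10) -> int:
    """Halve the batch size whenever memory usage exceeds the threshold,
    never dropping below min_batch."""
    if used_mb > threshold_mb:
        return max(min_batch, current_batch // 2)
    return current_batch

# Simulated processing loop: memory spikes twice, so the batch size
# is halved twice (1000 -> 500 -> 250) and then holds steady.
batch = 1000
for used_mb in [4000, 9000, 9500, 5000]:
    batch = adjust_batch_size(batch, used_mb, threshold_mb=8000)
print(batch)  # 250
```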
### Integration with Retrieval Systems

The ChromaDB collections created by this tool can be integrated directly into retrieval-augmented generation (RAG) systems. Load the collection in your application:

```python
import chromadb

client = chromadb.PersistentClient(path="/path/to/your/database")
collection = client.get_collection(name="your-collection-name")

# Query for similar documents.
# Note: query_texts are embedded with the collection's embedding function
# (ChromaDB's default unless configured otherwise), so for best results
# query with the same model that was used at indexing time.
results = collection.query(
    query_texts=["Your query text here"],
    n_results=5
)
```
## 📌 Development Roadmap
- Support for more embedding models
- Parallel processing for faster execution
- Incremental updates to existing collections
- Improved chunking strategies for code files
- Web UI for database exploration