
Chunk Embed Store

A powerful tool for chunking, embedding, and storing documents in a vector database for efficient retrieval and semantic search.

🔍 Overview

Chunk Embed Store is a Python tool designed to process, chunk, and embed text documents into a vector database. It analyzes your codebase or documentation, breaks it into manageable chunks, generates semantic embeddings, and stores them in ChromaDB for later retrieval.

✨ Features

  • Intelligent Document Processing: Supports a wide variety of file types (code files, documentation, text files, etc.)
  • Smart Chunking: Breaks documents into optimal chunks while preserving context
  • High-Quality Embeddings: Uses Sentence Transformers to generate semantic embeddings
  • Efficient Storage: Stores document chunks and embeddings in ChromaDB for fast retrieval
  • Memory-Aware Processing: Dynamically adjusts batch size based on available memory
  • Resilient Operation: Includes retry mechanisms and error handling
  • Progress Tracking: Shows real-time progress with helpful statistics

📋 Prerequisites

  • Python 3.10 or higher
  • A pre-trained Sentence Transformer model
  • Sufficient disk space for the vector database
  • Sufficient RAM for processing (8GB recommended)

🔧 Installation

From PyPI

pip install chunk-embed

From Source

git clone https://github.com/yourusername/chunk-embed-store.git
cd chunk-embed-store
pip install -e .

💻 Usage

Basic Usage

chunk-embed \
  --base_dir "/path/to/your/documents" \
  --collection_name "your-collection" \
  --embedding_model_name "/path/to/embedding/model" \
  --persist_dir "/path/to/store/database"

With uvx

You can also run the tool with uvx, which executes it in an ephemeral environment without a permanent install:

uvx chunk-embed \
  --base_dir "/path/to/your/documents" \
  --collection_name "your-collection" \
  --embedding_model_name "/path/to/embedding/model" \
  --persist_dir "/path/to/store/database"

All Options

Parameter               Required  Default  Description
--base_dir              Yes       -        Directory containing documents to process
--collection_name       Yes       -        Name for the ChromaDB collection
--embedding_model_name  Yes       -        Path to Sentence Transformer model
--persist_dir           Yes       -        Directory to store the vector database
--chunk_size            No        500      Size of text chunks
--chunk_overlap         No        50       Overlap between text chunks
--batch_size            No        1000     Batch size for processing
--memory_threshold_mb   No        8000     Memory threshold in MB
--max_retries           No        3        Maximum retries for embedding generation

📊 Example

Process a code repository and store embeddings in a local database:

chunk-embed \
  --base_dir "/Users/username/projects/my-repo" \
  --collection_name "my-repo-knowledge" \
  --embedding_model_name "/Users/username/models/all-MiniLM-L6-v2" \
  --persist_dir "/Users/username/vector-db/my-repo-db" \
  --chunk_size 600 \
  --chunk_overlap 75

📁 Supported File Types

The tool processes a wide variety of file types including:

  • Code files: .py, .js, .java, .cpp, .go, etc.
  • Documentation: .md, .txt, .html, etc.
  • Configuration: .yml, .json, .toml, etc.
  • And many more (over 100 file extensions supported)
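As a sketch, the discovery step amounts to filtering a recursive directory walk by extension. The whitelist below is a small hypothetical subset; the tool's real list is much larger:

```python
from pathlib import Path

# Hypothetical subset of the supported extensions; the real list is much larger.
SUPPORTED_EXTENSIONS = {".py", ".js", ".md", ".txt", ".yml", ".json", ".toml"}

def discover_files(base_dir: str) -> list[Path]:
    """Recursively collect files whose extension is on the whitelist."""
    return sorted(
        p for p in Path(base_dir).rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```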

🔄 How It Works

  1. Loading: The tool recursively explores your specified directory and loads files with supported extensions.
  2. Chunking: Each document is split into smaller chunks with configurable size and overlap, trying to respect natural boundaries like line breaks.
  3. Embedding: The chunks are processed through a Sentence Transformer model to generate vector embeddings.
  4. Storage: The chunks and their embeddings are stored in a ChromaDB vector database for efficient retrieval.
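Step 2 can be sketched in a few lines. The exact heuristics the tool uses (window placement, how it backtracks to line breaks) are assumptions here, not its actual code:

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks, preferring to end at a line break."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Back up to the last newline in the window, if any, to keep lines whole.
            newline = text.rfind("\n", start, end)
            if newline > start:
                end = newline + 1
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = max(end - chunk_overlap, start + 1)  # overlap, but always advance
    return chunks
```

A call like chunk_text(source, 600, 75) corresponds to the --chunk_size and --chunk_overlap flags shown in the example above.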

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

👨‍💻 Author

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📚 Advanced Usage

Customizing Embedding Model

The embedding model significantly impacts the quality of your vector database. The tool works with any Sentence Transformers model. To customize:

  1. Download or train your preferred model
  2. Specify the path using --embedding_model_name

Memory Optimization

For large document collections, memory usage can be a concern. The tool includes automatic memory management:

  • Monitors memory usage during processing
  • Dynamically reduces the batch size when the memory threshold is exceeded
  • Lets you override the default threshold with --memory_threshold_mb
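The resizing policy can be illustrated with a small helper. The function name and halving policy are illustrative, not the tool's actual code; in practice the current memory figure would come from a library such as psutil:

```python
def adjust_batch_size(batch_size: int, used_memory_mb: float,
                      threshold_mb: float = 8000, floor: int = 16) -> int:
    """Halve the batch size when memory use crosses the threshold.

    Illustrative policy only; the tool's real heuristic may differ.
    """
    if used_memory_mb > threshold_mb:
        return max(floor, batch_size // 2)
    return batch_size
```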

Integration with Retrieval Systems

The ChromaDB collections created by this tool can be integrated directly with retrieval-augmented generation (RAG) systems. You can load the collection in your application:

import chromadb

client = chromadb.PersistentClient(path="/path/to/your/database")
collection = client.get_collection(name="your-collection-name")

# Query for similar documents
results = collection.query(
    query_texts=["Your query text here"],
    n_results=5
)
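query returns parallel lists, one inner list per query text. A small sketch of unpacking the first query's hits, using a hand-made results dict shaped like ChromaDB's return value (the values themselves are made up):

```python
# Only the shape here matches what collection.query returns; values are invented.
results = {
    "ids": [["chunk-12", "chunk-7"]],
    "documents": [["def connect(url): ...", "## Configuration\nSet persist_dir..."]],
    "distances": [[0.21, 0.34]],
}

def top_hits(results: dict, query_index: int = 0) -> list[tuple[str, float]]:
    """Pair each retrieved chunk with its distance for one query."""
    return list(zip(results["documents"][query_index],
                    results["distances"][query_index]))

for doc, dist in top_hits(results):
    print(f"{dist:.2f}  {doc[:40]!r}")
```

Lower distances mean closer matches, so the first entry is the best hit.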

📌 Development Roadmap

  • Support for more embedding models
  • Parallel processing for faster execution
  • Incremental updates to existing collections
  • Improved chunking strategies for code files
  • Web UI for database exploration
