
Chunk Embed Store

A powerful tool for chunking, embedding, and storing documents in a vector database for efficient retrieval and semantic search.

🔍 Overview

Chunk Embed Store is a Python tool designed to process, chunk, and embed text documents into a vector database. It analyzes your codebase or documentation, breaks it into manageable chunks, generates semantic embeddings, and stores them in ChromaDB for later retrieval.

✨ Features

  • Intelligent Document Processing: Supports a wide variety of file types (code files, documentation, text files, etc.)
  • Smart Chunking: Breaks documents into optimal chunks while preserving context
  • High-Quality Embeddings: Uses Sentence Transformers to generate semantic embeddings
  • Efficient Storage: Stores document chunks and embeddings in ChromaDB for fast retrieval
  • Memory-Aware Processing: Dynamically adjusts batch size based on available memory
  • Resilient Operation: Includes retry mechanisms and error handling
  • Progress Tracking: Shows real-time progress with helpful statistics

📋 Prerequisites

  • Python 3.10 or higher
  • A pre-trained Sentence Transformer model
  • Sufficient disk space for the vector database
  • Sufficient RAM for processing (8GB recommended)

🔧 Installation

From PyPI

pip install chunk-embed

From Source

git clone https://github.com/yourusername/chunk-embed-store.git
cd chunk-embed-store
pip install -e .

💻 Usage

Basic Usage

chunk-embed \
  --base_dir "/path/to/your/documents" \
  --collection_name "your-collection" \
  --embedding_model_name "/path/to/embedding/model" \
  --persist_dir "/path/to/store/database"

With uvx

You can also run the tool with uvx, which executes it in an ephemeral environment without a permanent install:

uvx chunk-embed \
  --base_dir "/path/to/your/documents" \
  --collection_name "your-collection" \
  --embedding_model_name "/path/to/embedding/model" \
  --persist_dir "/path/to/store/database"

All Options

Parameter               Required  Default  Description
--base_dir              Yes       -        Directory containing documents to process
--collection_name       Yes       -        Name for the ChromaDB collection
--embedding_model_name  Yes       -        Path to Sentence Transformer model
--persist_dir           Yes       -        Directory to store the vector database
--chunk_size            No        500      Size of text chunks
--chunk_overlap         No        50       Overlap between text chunks
--batch_size            No        1000     Batch size for processing
--memory_threshold_mb   No        8000     Memory threshold in MB
--max_retries           No        3        Maximum retries for embedding generation

📊 Example

Process a code repository and store embeddings in a local database:

chunk-embed \
  --base_dir "/Users/username/projects/my-repo" \
  --collection_name "my-repo-knowledge" \
  --embedding_model_name "/Users/username/models/all-MiniLM-L6-v2" \
  --persist_dir "/Users/username/vector-db/my-repo-db" \
  --chunk_size 600 \
  --chunk_overlap 75

📁 Supported File Types

The tool processes a wide variety of file types including:

  • Code files: .py, .js, .java, .cpp, .go, etc.
  • Documentation: .md, .txt, .html, etc.
  • Configuration: .yml, .json, .toml, etc.
  • And many more (over 100 file extensions supported)
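As a sketch, the discovery step amounts to filtering a recursive directory walk by extension. The whitelist below is a small hypothetical subset; the tool's real list is much larger:

```python
from pathlib import Path

# Hypothetical subset of the supported extensions; the real list is much larger.
SUPPORTED_EXTENSIONS = {".py", ".js", ".md", ".txt", ".yml", ".json", ".toml"}

def discover_files(base_dir: str) -> list[Path]:
    """Recursively collect files whose extension is on the whitelist."""
    return sorted(
        p for p in Path(base_dir).rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```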

🔄 How It Works

  1. Loading: The tool recursively explores your specified directory and loads files with supported extensions.
  2. Chunking: Each document is split into smaller chunks with configurable size and overlap, trying to respect natural boundaries like line breaks.
  3. Embedding: The chunks are processed through a Sentence Transformer model to generate vector embeddings.
  4. Storage: The chunks and their embeddings are stored in a ChromaDB vector database for efficient retrieval.
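Step 2 can be sketched in a few lines. The exact heuristics the tool uses (window placement, how it backtracks to line breaks) are assumptions here, not its actual code:

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks, preferring to end at a line break."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Back up to the last newline in the window, if any, to keep lines whole.
            newline = text.rfind("\n", start, end)
            if newline > start:
                end = newline + 1
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = max(end - chunk_overlap, start + 1)  # overlap, but always advance
    return chunks
```

A call like chunk_text(source, 600, 75) corresponds to the --chunk_size and --chunk_overlap flags shown in the example above.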

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

👨‍💻 Author

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📚 Advanced Usage

Customizing Embedding Model

The embedding model significantly impacts the quality of your vector database. The tool works with any Sentence Transformers model. To customize:

  1. Download or train your preferred model
  2. Specify the path using --embedding_model_name

Memory Optimization

For large document collections, memory usage can be a concern. The tool includes automatic memory management:

  • Monitors memory usage during processing
  • Dynamically reduces the batch size when the memory threshold is exceeded
  • Lets you override the default threshold with --memory_threshold_mb
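The resizing policy can be illustrated with a small helper. The function name and halving policy are illustrative, not the tool's actual code; in practice the current memory figure would come from a library such as psutil:

```python
def adjust_batch_size(batch_size: int, used_memory_mb: float,
                      threshold_mb: float = 8000, floor: int = 16) -> int:
    """Halve the batch size when memory use crosses the threshold.

    Illustrative policy only; the tool's real heuristic may differ.
    """
    if used_memory_mb > threshold_mb:
        return max(floor, batch_size // 2)
    return batch_size
```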

Integration with Retrieval Systems

The ChromaDB collections created by this tool can be integrated directly with retrieval-augmented generation (RAG) systems. You can load the collection in your application:

import chromadb

client = chromadb.PersistentClient(path="/path/to/your/database")
collection = client.get_collection(name="your-collection-name")

# Query for similar documents
results = collection.query(
    query_texts=["Your query text here"],
    n_results=5
)
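query returns parallel lists, one inner list per query text. A small sketch of unpacking the first query's hits, using a hand-made results dict shaped like ChromaDB's return value (the values themselves are made up):

```python
# Only the shape here matches what collection.query returns; values are invented.
results = {
    "ids": [["chunk-12", "chunk-7"]],
    "documents": [["def connect(url): ...", "## Configuration\nSet persist_dir..."]],
    "distances": [[0.21, 0.34]],
}

def top_hits(results: dict, query_index: int = 0) -> list[tuple[str, float]]:
    """Pair each retrieved chunk with its distance for one query."""
    return list(zip(results["documents"][query_index],
                    results["distances"][query_index]))

for doc, dist in top_hits(results):
    print(f"{dist:.2f}  {doc[:40]!r}")
```

Lower distances mean closer matches, so the first entry is the best hit.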

📌 Development Roadmap

  • Support for more embedding models
  • Parallel processing for faster execution
  • Incremental updates to existing collections
  • Improved chunking strategies for code files
  • Web UI for database exploration
