ELESS - Evolving Low-resource Embedding and Storage System
A resilient RAG (Retrieval-Augmented Generation) data processing pipeline with comprehensive logging, multi-database support, and an intuitive CLI interface. Built for efficiency on low-resource systems while maintaining production-grade reliability.
Features
- Multi-Database Support: ChromaDB, Qdrant, FAISS, PostgreSQL, Cassandra (install extras for full support)
- Multiple File Formats: PDF, DOCX, TXT, MD, HTML, and more (install parsers extra)
- Resumable Processing: Checkpoint-based system for interrupted workflows
- Comprehensive Logging: Structured logs with rotation and performance tracking
- Smart Caching: Content-based hashing and atomic manifest writes
- Flexible Embeddings: Support for various sentence-transformers models (install embeddings extra)
- Memory Efficient: Streaming processing for large files
- Production Ready: Graceful error handling and data safety features
- CLI Interface: Easy-to-use command-line tools
- Modular Design: Extensible architecture for custom parsers and databases
Note: ELESS gracefully handles missing optional dependencies with warnings. Install extras for full features. For Qdrant and PostgreSQL, ensure the database instances are running on the specified ports (default: Qdrant 6333, PostgreSQL 5432).
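Before pointing ELESS at Qdrant or PostgreSQL, it can help to confirm that something is listening on the ports mentioned in the note above. The following is a minimal sketch using only the standard library; the function names and the `localhost` default are assumptions for illustration and are not part of the ELESS API.

```python
import socket

# Default ports taken from the note above; the host is an assumption.
DEFAULT_PORTS = {"qdrant": 6333, "postgresql": 5432}


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_services(host: str = "localhost", ports=None) -> dict:
    """Map each service name to whether its port accepts connections."""
    ports = DEFAULT_PORTS if ports is None else ports
    return {name: is_port_open(host, port) for name, port in ports.items()}
```

Calling `check_services()` returns a dictionary such as `{"qdrant": ..., "postgresql": ...}` with a boolean per service. An open port only shows that some process is listening, not that it is the expected database.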
Project Structure
For contributors, the project is organized as follows:
- src/: Source code for the ELESS package
- tests/: Unit and integration tests
- docs/: Documentation, including user guides, API reference, and contributing guidelines
- tools/: Utility scripts for deployment, packaging, and verification
- config/: Configuration files and templates
- build/: Build configuration files (setup.py, pyproject.toml, etc.)
Quick Start
Installation
# Install from source
git clone https://github.com/Bandalaro/eless.git
cd eless
pip install -e .
# Install with all features (recommended for full functionality)
pip install -e ".[full]"
# Or install specific extras
pip install -e ".[embeddings,databases,parsers]"
Basic Usage
# Process documents with default settings
eless process /path/to/documents
# Process with specific database
eless process /path/to/documents --databases chroma
# Process with custom settings
eless process /path/to/documents --chunk-size 1000 --log-level DEBUG
# Check processing status
eless status --all
# Resume interrupted processing
eless process /path/to/documents --resume
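The `--resume` flag relies on the checkpoint and atomic manifest writes listed under Features. How ELESS implements this internally is not shown here; the sketch below illustrates the general technique under stated assumptions: a JSON manifest maps each file name to a status, and each update is written to a temporary file and then renamed over the old manifest. All function names and the manifest layout are hypothetical.

```python
import json
import os
import tempfile


def load_manifest(path: str) -> dict:
    """Load the manifest, or start empty if none exists yet."""
    if not os.path.exists(path):
        return {}
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def save_manifest_atomic(path: str, manifest: dict) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(manifest, f)
            f.flush()
            os.fsync(f.fileno())
        # os.replace swaps the file in one step (atomic on POSIX).
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def process_all(files, manifest_path, handler):
    """Process each file once; files already marked done are skipped."""
    manifest = load_manifest(manifest_path)
    for name in files:
        if manifest.get(name) == "done":
            continue  # finished in an earlier run
        handler(name)
        manifest[name] = "done"
        save_manifest_atomic(manifest_path, manifest)
    return manifest
```

If `handler` raises partway through, the manifest on disk still lists every file completed so far, so a second call to `process_all` picks up at the first unfinished file.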
Python API
from eless import ElessPipeline
import yaml
# Load configuration
with open("config/default_config.yaml") as f:
    config = yaml.safe_load(f)
# Create and run pipeline
pipeline = ElessPipeline(config)
pipeline.run_process("/path/to/documents")
# Check status
files = pipeline.state_manager.get_all_files()
for file in files:
    print(f"{file['path']}: {file['status']}")
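For a large corpus, a per-status count is often more useful than one line per file. The sketch below assumes only what the loop above shows, namely that each record is a dictionary with a `status` key; the helper name and the sample status values are hypothetical.

```python
from collections import Counter


def summarize_status(files):
    """Count files per status; expects dicts with a 'status' key."""
    return Counter(f["status"] for f in files)


# Hypothetical records shaped like the ones printed above.
sample = [
    {"path": "a.pdf", "status": "done"},
    {"path": "b.pdf", "status": "failed"},
    {"path": "c.pdf", "status": "done"},
]

for status, count in summarize_status(sample).items():
    print(f"{status}: {count}")
```

In real use, pass the list returned by `pipeline.state_manager.get_all_files()` instead of `sample`.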
Requirements
Core Dependencies
- Python 3.8+
- click >= 8.0.0
- PyYAML >= 6.0
- numpy >= 1.21.0
- psutil >= 5.8.0
Optional Dependencies
Embeddings:
pip install sentence-transformers torch
Databases:
# ChromaDB
pip install chromadb langchain-community langchain-core
# Qdrant
pip install qdrant-client
# FAISS
pip install faiss-cpu # or faiss-gpu for GPU support
# PostgreSQL
pip install psycopg2-binary
# Cassandra
pip install cassandra-driver
Document Parsers:
pip install pypdf python-docx openpyxl pandas beautifulsoup4 lxml
All Features:
pip install -e ".[full]"
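To see which optional features are usable in the current environment, you can check whether the corresponding modules are importable. This is a standalone sketch, not an ELESS command: the grouping mirrors the sections above, and the module names are the import names of the listed packages (for example, `python-docx` is imported as `docx` and `beautifulsoup4` as `bs4`).

```python
import importlib.util

# Import names for the optional packages listed above, grouped by extra.
OPTIONAL_MODULES = {
    "embeddings": ["sentence_transformers", "torch"],
    "databases": ["chromadb", "qdrant_client", "faiss", "psycopg2", "cassandra"],
    "parsers": ["pypdf", "docx", "openpyxl", "pandas", "bs4", "lxml"],
}


def missing_modules(groups=None):
    """Return {group: [module, ...]} for modules that cannot be found."""
    groups = OPTIONAL_MODULES if groups is None else groups
    missing = {}
    for group, modules in groups.items():
        # find_spec returns None for a top-level module that is not installed.
        absent = [m for m in modules if importlib.util.find_spec(m) is None]
        if absent:
            missing[group] = absent
    return missing


if __name__ == "__main__":
    for group, modules in missing_modules().items():
        print(f"{group}: missing {', '.join(modules)}")
```

An empty result means every listed module was found. It does not verify versions or that a database server is reachable.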
Architecture
┌─────────────────────────────────────────┐
│          CLI Interface (Click)          │
├─────────────────────────────────────────┤
│      ElessPipeline (Orchestrator)       │
├──────────┬──────────┬───────────────────┤
│ Scanner  │Dispatcher│  State Manager    │
├──────────┼──────────┼───────────────────┤
│ Parsers  │ Chunker  │    Archiver       │
├──────────┴──────────┼───────────────────┤
│     Embedder        │ Resource Monitor  │
├─────────────────────┼───────────────────┤
│  Database Loader    │  Logging System   │
└─────────────────────┴───────────────────┘
Key Components
- FileScanner: Discovers and hashes files using SHA-256
- Dispatcher: Routes files to appropriate parsers
- TextChunker: Intelligent text segmentation with overlap
- Embedder: Generates vector embeddings with caching
- DatabaseLoader: Multi-database coordination
- StateManager: Tracks processing state with atomic writes
- ResourceMonitor: Adaptive resource management
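The FileScanner entry above mentions SHA-256 content hashing. As an illustration of the idea (not the FileScanner source), the sketch below hashes a file in fixed-size blocks so memory use stays constant regardless of file size. The function name and the 8192-byte default are assumptions.

```python
import hashlib


def hash_file(path: str, buffer_size: int = 8192) -> str:
    """Return the SHA-256 hex digest of a file, read in fixed-size blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for block in iter(lambda: f.read(buffer_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

Because the digest depends only on the bytes, a renamed or moved file keeps the same hash, which is what makes content-based caching possible.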
Configuration
Create a config.yaml file or modify config/default_config.yaml:
# Logging
logging:
  directory: .eless_logs
  level: INFO
  enable_console: true
# Embedding
embedding:
  model_name: all-MiniLM-L6-v2
  device: cpu
  batch_size: 32
# Chunking
chunking:
  chunk_size: 500
  overlap: 50
  strategy: semantic
# Databases
databases:
  targets:
    - chroma
  connections:
    chroma:
      type: chroma
      path: .eless_chroma
      collection_name: eless_vectors
# Resource Limits
resource_limits:
  max_memory_mb: 512
  enable_adaptive_batching: true
# Streaming
streaming:
  buffer_size: 8192
  max_file_size_mb: 100
  auto_streaming_threshold: 0.7
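To make the `chunk_size` and `overlap` settings concrete, here is a minimal sliding-window sketch. It assumes both values are measured in characters and uses plain fixed-size windows; it is not the `semantic` strategy named in the configuration, and `chunk_text` is a hypothetical helper rather than the TextChunker API.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list:
    """Split text into windows of chunk_size that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

With the defaults, each chunk starts 450 characters after the previous one, so the last 50 characters of one chunk are repeated at the start of the next. The overlap keeps sentences that straddle a boundary retrievable from at least one chunk.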
Documentation
- Quick Start Guide - Get started in 5 minutes
- API Reference - Complete API documentation
- Developer Guide - Contributing and development
- Documentation Index - All documentation
Use Cases
Document Processing Pipeline
# Process research papers
eless process papers/ \
  --databases chroma \
  --chunk-size 1000 \
  --log-level INFO
RAG System Setup
# Index documentation
eless process docs/ \
  --databases qdrant \
  --databases faiss
# Query your RAG application
python query_rag.py "machine learning techniques"
Batch Processing
# Process multiple directories
for dir in dataset1 dataset2 dataset3; do
  eless process "$dir" --databases chroma --resume
done
CLI Commands
Process Documents
eless process <path> [OPTIONS]
Options:
  --databases, -db <name>   Select databases (repeatable)
  --config <file>           Custom configuration file
  --resume                  Resume interrupted processing
  --chunk-size <size>       Override chunk size
  --batch-size <size>       Override batch size
  --log-level <level>       Set log level
  --log-dir <path>          Custom log directory
Check Status
eless status [OPTIONS]
Options:
  --all                     Show all tracked files
  <file_id>                 Show specific file details
System Management
eless config-info # Display configuration
eless test # Run system tests
eless logs [--days N] # Manage log files
Testing
# Run all tests
pytest tests/
# Run specific test suite
pytest tests/test_cli.py -v
# Run with coverage
pytest tests/ --cov=src --cov-report=html
# Test results: 56/56 passing
Contributing
We welcome contributions! Please see docs/CONTRIBUTING.md for guidelines.
Development Setup
# Clone and setup
git clone https://github.com/Bandalaro/eless.git
cd eless
python3 -m venv venv
source venv/bin/activate
# Install development dependencies
pip install -e ".[dev,full]"
# Run tests
pytest tests/
# Format code
black src/ tests/
# Check linting
flake8 src/ tests/
Performance
Optimized for Low-Resource Systems
resource_limits:
  max_memory_mb: 256
  enable_adaptive_batching: true
embedding:
  batch_size: 8
streaming:
  auto_streaming_threshold: 0.5
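The `enable_adaptive_batching` option suggests that the batch size is adjusted according to memory pressure. The exact policy ELESS uses is not documented here, so the sketch below shows one plausible rule under explicit assumptions: halve the batch when memory use reaches 90% of `max_memory_mb`, double it when use is at or below 50%. The function name, both thresholds, and the size bounds are hypothetical.

```python
def next_batch_size(current: int, used_mb: float, max_mb: float,
                    min_size: int = 1, max_size: int = 128) -> int:
    """Shrink the batch near the memory limit, grow it when there is headroom."""
    usage = used_mb / max_mb
    if usage >= 0.9:   # assumed high-water mark
        return max(min_size, current // 2)
    if usage <= 0.5:   # assumed low-water mark
        return min(max_size, current * 2)
    return current     # in between: leave the batch size unchanged
```

With the low-resource settings above (`max_memory_mb: 256`, `batch_size: 8`), a reading of 250 MB would drop the batch to 4, while 100 MB would raise it to 16. The gap between the two thresholds keeps the batch size from oscillating on small fluctuations.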
High-Performance Configuration
resource_limits:
  max_memory_mb: 4096
embedding:
  batch_size: 128
  device: cuda
parallel:
  enable: true
  max_workers: 8
Troubleshooting
Common Issues
Missing Dependencies:
# Install embedding support
pip install sentence-transformers
# Install database support
pip install chromadb langchain-community
Memory Issues:
# Reduce memory usage
embedding:
  batch_size: 8
streaming:
  auto_streaming_threshold: 0.5
Slow Processing:
# Increase performance
embedding:
  batch_size: 64
parallel:
  enable: true
  max_workers: 4
See docs/QUICK_START.md for more solutions.
Project Status
- 56/56 tests passing
- Zero warnings
- Production ready
- Comprehensive documentation
- Active development
Roadmap
- PyPI publication
- Additional database connectors (Milvus, Weaviate)
- Web interface
- Docker support
- Distributed processing
- Advanced query capabilities
License
This project is licensed under the MIT License - see the docs/LICENSE file for details.
Acknowledgments
- Built with sentence-transformers
- Supports ChromaDB, Qdrant, and more
- Powered by the Python ecosystem
Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: docs/
Star History
If you find ELESS useful, please consider giving it a star on GitHub!
Made with love by Bandalaro
Status: Production Ready | Version: 1.0.0