ELESS - Evolving Low-resource Embedding and Storage System

Python 3.8+ | License: MIT | Code style: black

A resilient RAG (Retrieval-Augmented Generation) data processing pipeline with comprehensive logging, multi-database support, and an intuitive CLI. It is built to run efficiently on low-resource systems while maintaining production-grade reliability.

Features

  • Multi-Database Support: ChromaDB, Qdrant, FAISS, PostgreSQL, Cassandra (install extras for full support)
  • Multiple File Formats: PDF, DOCX, TXT, MD, HTML, and more (install parsers extra)
  • Resumable Processing: Checkpoint-based system for interrupted workflows
  • Comprehensive Logging: Structured logs with rotation and performance tracking
  • Smart Caching: Content-based hashing and atomic manifest writes
  • Flexible Embeddings: Support for various sentence-transformers models (install embeddings extra)
  • Memory Efficient: Streaming processing for large files
  • Production Ready: Graceful error handling and data safety features
  • CLI Interface: Easy-to-use command-line tools
  • Modular Design: Extensible architecture for custom parsers and databases
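
The Smart Caching bullet mentions content-based hashing and atomic manifest writes. As a rough illustration only (this is not ELESS's own code; the function names, the JSON manifest layout, and the 64 KiB block size are assumptions made for the example), both ideas can be sketched with the standard library:

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path


def sha256_of_file(path, block_size=65536):
    """Hash a file's content block by block, so large files are never fully loaded."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def write_manifest_atomically(manifest, manifest_path):
    """Write JSON to a temp file in the target directory, then swap it into place."""
    manifest_path = Path(manifest_path)
    fd, tmp_name = tempfile.mkstemp(dir=manifest_path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(manifest, tmp, indent=2)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the bytes reach the disk first
        os.replace(tmp_name, manifest_path)  # atomic when on the same filesystem
    except BaseException:
        os.unlink(tmp_name)  # never leave a half-written temp file behind
        raise
```

Because the hash depends only on file content, a renamed but unchanged file maps to the same key, and a reader of the manifest sees either the old version or the new one, never a partial write.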

Note: When an optional dependency is missing, ELESS handles it gracefully and emits a warning. Install the extras for full functionality. For Qdrant and PostgreSQL, make sure the database instances are running on the configured ports (defaults: 6333 for Qdrant, 5432 for PostgreSQL).
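
A common way to get this kind of behavior is to wrap optional imports and log a warning when they fail. The sketch below shows the general pattern, not ELESS's internal code; `optional_import` and the logger name are hypothetical:

```python
import importlib
import logging

logger = logging.getLogger("optional_deps_sketch")  # hypothetical logger name


def optional_import(module_name, extra_hint):
    """Return the imported module, or None (with a logged warning) if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        logger.warning(
            "%s is not installed; install the '%s' extra to enable it.",
            module_name,
            extra_hint,
        )
        return None


# Either a module object or None, depending on what is installed locally.
chromadb = optional_import("chromadb", "databases")
if chromadb is None:
    print("ChromaDB support is disabled in this environment")
```

Callers check for `None` and skip the feature, so one missing package does not stop the rest of the pipeline.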

Project Structure

For contributors, the project is organized as follows:

  • src/: Source code for the ELESS package
  • tests/: Unit and integration tests
  • docs/: Documentation, including user guides, API reference, and contributing guidelines
  • tools/: Utility scripts for deployment, packaging, and verification
  • config/: Configuration files and templates
  • build/: Build configuration files (setup.py, pyproject.toml, etc.)

Quick Start

Installation

# Install from source
git clone https://github.com/Bandalaro/eless.git
cd eless
pip install -e .

# Install with all features (recommended for full functionality)
pip install -e ".[full]"

# Or install specific extras
pip install -e ".[embeddings,databases,parsers]"

Basic Usage

# Process documents with default settings
eless process /path/to/documents

# Process with specific database
eless process /path/to/documents --databases chroma

# Process with custom settings
eless process /path/to/documents --chunk-size 1000 --log-level DEBUG

# Check processing status
eless status --all

# Resume interrupted processing
eless process /path/to/documents --resume

Python API

from eless import ElessPipeline
import yaml

# Load configuration
with open("config/default_config.yaml") as f:
    config = yaml.safe_load(f)

# Create and run pipeline
pipeline = ElessPipeline(config)
pipeline.run_process("/path/to/documents")

# Check status
files = pipeline.state_manager.get_all_files()
for file in files:
    print(f"{file['path']}: {file['status']}")
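
For a large corpus, a per-status count is often easier to read than one line per file. A minimal sketch, assuming each record is a dict with `path` and `status` keys as in the loop above; the helper name and the status strings in the sample data are invented for illustration:

```python
from collections import Counter


def summarize_statuses(files):
    """Count tracked files per status value."""
    return Counter(entry["status"] for entry in files)


# Illustrative records only; real status values may differ.
sample = [
    {"path": "a.pdf", "status": "done"},
    {"path": "b.txt", "status": "done"},
    {"path": "c.md", "status": "error"},
]

for status, count in sorted(summarize_statuses(sample).items()):
    print(f"{status}: {count}")
```

With a real pipeline, `sample` would be replaced by the list returned from `pipeline.state_manager.get_all_files()`.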

Requirements

Core Dependencies

  • Python 3.8+
  • click >= 8.0.0
  • PyYAML >= 6.0
  • numpy >= 1.21.0
  • psutil >= 5.8.0

Optional Dependencies

Embeddings:

pip install sentence-transformers torch

Databases:

# ChromaDB
pip install chromadb langchain-community langchain-core

# Qdrant
pip install qdrant-client

# FAISS
pip install faiss-cpu  # or faiss-gpu for GPU support

# PostgreSQL
pip install psycopg2-binary

# Cassandra
pip install cassandra-driver

Document Parsers:

pip install pypdf python-docx openpyxl pandas beautifulsoup4 lxml

All Features:

pip install -e ".[full]"

Architecture

┌─────────────────────────────────────────┐
│         CLI Interface (Click)           │
├─────────────────────────────────────────┤
│      ElessPipeline (Orchestrator)       │
├──────────┬──────────┬───────────────────┤
│ Scanner  │Dispatcher│  State Manager    │
├──────────┼──────────┼───────────────────┤
│ Parsers  │ Chunker  │  Archiver         │
├──────────┴──────────┼───────────────────┤
│      Embedder       │  Resource Monitor │
├─────────────────────┼───────────────────┤
│  Database Loader    │  Logging System   │
└─────────────────────┴───────────────────┘

Key Components

  • FileScanner: Discovers and hashes files using SHA-256
  • Dispatcher: Routes files to appropriate parsers
  • TextChunker: Intelligent text segmentation with overlap
  • Embedder: Generates vector embeddings with caching
  • DatabaseLoader: Multi-database coordination
  • StateManager: Tracks processing state with atomic writes
  • ResourceMonitor: Adaptive resource management
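
To make "segmentation with overlap" concrete, here is a deliberately simple fixed-window chunker. It is a stand-in for illustration, not the TextChunker implementation: it measures sizes in characters (the unit ELESS uses is not stated here) and does not attempt the `semantic` strategy from the configuration.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, overlap=1)
print(chunks)
```

Each chunk repeats the tail of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk when the overlap is large enough.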

Configuration

Create a config.yaml file or modify config/default_config.yaml:

# Logging
logging:
  directory: .eless_logs
  level: INFO
  enable_console: true

# Embedding
embedding:
  model_name: all-MiniLM-L6-v2
  device: cpu
  batch_size: 32

# Chunking
chunking:
  chunk_size: 500
  overlap: 50
  strategy: semantic

# Databases
databases:
  targets:
    - chroma
  connections:
    chroma:
      type: chroma
      path: .eless_chroma
      collection_name: eless_vectors

# Resource Limits
resource_limits:
  max_memory_mb: 512
  enable_adaptive_batching: true

# Streaming
streaming:
  buffer_size: 8192
  max_file_size_mb: 100
  auto_streaming_threshold: 0.7
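
The `enable_adaptive_batching` and `max_memory_mb` settings suggest that the batch size is adjusted to stay within a memory budget. How ELESS does this is not documented here, so the following is only one plausible policy. The function name, the `high_water` ratio, and the halve/double rule are all assumptions:

```python
def next_batch_size(current, used_mb, max_memory_mb,
                    min_size=1, max_size=128, high_water=0.7):
    """Pick the next batch size from current memory use (hypothetical policy)."""
    usage = used_mb / max_memory_mb
    if usage >= high_water:
        return max(min_size, current // 2)  # back off under memory pressure
    if usage < high_water / 2:
        return min(max_size, current * 2)   # plenty of headroom: grow again
    return current                          # in between: leave it alone


# With a 512 MB budget and 400 MB in use, a batch of 32 is halved.
print(next_batch_size(32, used_mb=400, max_memory_mb=512))
```

In practice `used_mb` could come from `psutil.Process().memory_info().rss / 2**20`, since psutil is already a core dependency.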

Use Cases

Document Processing Pipeline

# Process research papers
eless process papers/ \
  --databases chroma \
  --chunk-size 1000 \
  --log-level INFO

RAG System Setup

# Index documentation
eless process docs/ \
  --databases qdrant \
  --databases faiss

# Query your RAG application
python query_rag.py "machine learning techniques"

Batch Processing

# Process multiple directories
for dir in dataset1 dataset2 dataset3; do
  eless process "$dir" --databases chroma --resume
done

CLI Commands

Process Documents

eless process <path> [OPTIONS]

Options:
  --databases, -db <name>    Select databases (repeatable)
  --config <file>            Custom configuration file
  --resume                   Resume interrupted processing
  --chunk-size <size>        Override chunk size
  --batch-size <size>        Override batch size
  --log-level <level>        Set log level
  --log-dir <path>           Custom log directory
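
When driving the CLI from a script, it can help to build the argument list programmatically. The helper below is a hypothetical convenience wrapper, not part of ELESS; it uses only flags listed above and repeats `--databases` once per target:

```python
import shlex


def build_process_command(path, databases=(), resume=False,
                          chunk_size=None, log_level=None):
    """Assemble an `eless process` argument list from keyword options."""
    argv = ["eless", "process", str(path)]
    for name in databases:
        argv += ["--databases", name]  # the flag is repeatable
    if resume:
        argv.append("--resume")
    if chunk_size is not None:
        argv += ["--chunk-size", str(chunk_size)]
    if log_level:
        argv += ["--log-level", log_level]
    return argv


command = build_process_command("docs/", databases=["qdrant", "faiss"], resume=True)
print(shlex.join(command))
```

The resulting list can be passed straight to `subprocess.run(command, check=True)`, which avoids shell quoting problems with paths that contain spaces.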

Check Status

eless status [OPTIONS]

Options:
  --all                      Show all tracked files
  <file_id>                  Show specific file details

System Management

eless config-info          # Display configuration
eless test                 # Run system tests
eless logs [--days N]      # Manage log files

Testing

# Run all tests
pytest tests/

# Run specific test suite
pytest tests/test_cli.py -v

# Run with coverage
pytest tests/ --cov=src --cov-report=html

# Test results: 56/56 passing

Contributing

We welcome contributions! Please see docs/CONTRIBUTING.md for guidelines.

Development Setup

# Clone and setup
git clone https://github.com/Bandalaro/eless.git
cd eless
python3 -m venv venv
source venv/bin/activate

# Install development dependencies
pip install -e ".[dev,full]"

# Run tests
pytest tests/

# Format code
black src/ tests/

# Check linting
flake8 src/ tests/

Performance

Optimized for Low-Resource Systems

resource_limits:
  max_memory_mb: 256
  enable_adaptive_batching: true

embedding:
  batch_size: 8

streaming:
  auto_streaming_threshold: 0.5

High-Performance Configuration

resource_limits:
  max_memory_mb: 4096

embedding:
  batch_size: 128
  device: cuda

parallel:
  enable: true
  max_workers: 8
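
The two profiles above list only the keys that differ from the defaults. Whether ELESS merges such partial files itself is not stated here, so one option is to merge them in Python before constructing the pipeline. A minimal sketch with a hypothetical `merge_config` helper:

```python
import copy


def merge_config(base, overrides):
    """Return a new dict where nested keys in `overrides` replace those in `base`."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)  # recurse into sections
        else:
            merged[key] = copy.deepcopy(value)
    return merged


defaults = {
    "embedding": {"model_name": "all-MiniLM-L6-v2", "device": "cpu", "batch_size": 32},
    "resource_limits": {"max_memory_mb": 512, "enable_adaptive_batching": True},
}
low_resource = {
    "resource_limits": {"max_memory_mb": 256},
    "embedding": {"batch_size": 8},
}

config = merge_config(defaults, low_resource)
print(config["embedding"])
```

Untouched keys such as `model_name` and `device` keep their default values, and `defaults` itself is left unmodified.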

Troubleshooting

Common Issues

Missing Dependencies:

# Install embedding support
pip install sentence-transformers

# Install database support
pip install chromadb langchain-community

Memory Issues:

# Reduce memory usage
embedding:
  batch_size: 8
streaming:
  auto_streaming_threshold: 0.5

Slow Processing:

# Increase performance
embedding:
  batch_size: 64
parallel:
  enable: true
  max_workers: 4

See docs/QUICK_START.md for more solutions.

Project Status

  • 56/56 tests passing
  • Zero warnings
  • Production ready
  • Comprehensive documentation
  • Active development

Roadmap

  • PyPI publication
  • Additional database connectors (Milvus, Weaviate)
  • Web interface
  • Docker support
  • Distributed processing
  • Advanced query capabilities

License

This project is licensed under the MIT License - see the docs/LICENSE file for details.

Support

If you find ELESS useful, please consider giving it a star on GitHub!


Made with love by Bandalaro

Status: Production Ready | Version: 1.0.0
