Skip to main content

๐Ÿง  SAARA - Autonomous Document-to-LLM Data Engine with Pre-training, Cloud Runtime & AI Tokenizer

Project description

๐Ÿง  SAARA: Autonomous Document-to-LLM Data Engine

Python 3.10+ Gemini Powered Gemma Models License

๐Ÿ† Built for Google Gemini Hackathon - Showcasing the power of Gemini 2.0 Flash and Gemma 2 models in autonomous AI training pipelines.

SAARA is an end-to-end autonomous data pipeline designed to transform raw, unstructured documents (PDFs, research papers) into high-quality, instruction-tuned datasets for fine-tuning Large Language Models (LLMs).

Why this exists: Creating high-quality datasets is the bottleneck in training domain-specific AI. This tool automates the "boring stuff"โ€”OCR, chunking, labeling, and cleaningโ€”allowing you to go from PDF to fine-tuned model in hours, not weeks.


๐ŸŒŸ Gemini & Gemma Integration

Gemini 2.0 Flash - AI Teacher & Evaluator

  • Default Teacher Model: Uses Gemini 2.0 Flash for autonomous learning
  • Quality Evaluation: Scores and improves model responses
  • Data Generation: Creates high-quality training examples
  • Self-Improvement: Iterative correction loop powered by Gemini

Gemma 2 - Fine-Tuning Targets

  • Gemma 2 2B: Lightweight, CPU-trainable, perfect for domain-specific models
  • Gemma 2 9B: Production-ready with excellent performance
  • Pre-configured: Optimized LoRA settings for Gemma architecture
  • First-Class Support: Gemma models are highlighted and recommended

๐Ÿš€ Key Features

1. ๐Ÿ‘๏ธ SOTA Vision-LLM OCR

  • No more Garbled Text: Uses Moondream and Qwen2.5-VL (Vision-Language Models) to "read" PDFs visually.
  • Handles complex double-column layouts, tables, and scientific diagrams that traditional OCR (Tesseract) fails on.
  • Hybrid Fallback: Automatically switches between PyMuPDF (fast) and Vision OCR (accurate) based on page extractability.

2. ๐Ÿค– Autonomous Data Labeling (Gemini-Powered)

  • Uses Gemini 2.0 Flash as the default teacher model for:
    • Instruction Tuning: "How do I treat X using Ayurveda?"
    • Q&A Pairs: Fact-based extraction.
    • Summarization: TL;DRs of complex sections.
    • Classification: Topic tagging.

3. ๐Ÿงช Data Distillation & Hygiene

  • Self-Cleaning: The distill module removes low-quality generations, duplicates, and confabulations.
  • ShareGPT Formatting: Automatically converts raw data into the industry-standard conversation format.

4. ๐Ÿ—๏ธ Pre-training from Scratch

  • Build Your Own LLM: Create custom models from 15M to 3B parameters.
  • Custom Tokenizers: Train domain-specific BPE tokenizers on your data.
  • Full Pipeline: Pre-train โ†’ Fine-tune โ†’ Evaluate โ†’ Deploy.
  • Production-ready LLaMA-style architectures.

5. ๐ŸŽ“ Native Fine-Tuning Support (Gemma Optimized)

  • Gemma 2 First-Class Support: Pre-configured LoRA settings for optimal Gemma performance.
  • One-Command Training: Built-in training loop using SFTTrainer (QLoRA).
  • Multi-Format Support: Automatically handles ShareGPT, Alpaca, and Raw Text formats.
  • Optimized for consumer GPUs (supports 4-bit quantization).

6. ๐Ÿงช Model Evaluation & Self-Improvement (Gemini Judge)

  • Gemini 2.0 as Judge: Test your fine-tuned model with automatic quality scoring.
  • Self-Improvement Loop: Low-scoring responses are corrected by Gemini and used for next training round.
  • Iterative Enhancement: Train โ†’ Evaluate โ†’ Improve โ†’ Repeat.

7. ๐Ÿš€ Model Deployment

  • Local Chat: Interactive terminal testing with your model.
  • Ollama Export: Convert to GGUF format for Ollama usage.
  • HuggingFace Hub: Push your model to share with the community.
  • Cloud Deployment: Docker + Google Cloud Run ready.

8. โšก Neural Accelerator (NEW)

  • Automatic GPU Optimization: Detects CUDA/CPU/MPS and configures optimal settings.
  • Mixed Precision Training: FP16/BF16 for faster training with less memory.
  • Gradient Accumulation: Train with larger effective batch sizes.
  • Memory Efficient Attention: Flash Attention / Memory-Efficient SDPA.
  • Smart Recommendations: Suggests optimal batch size, sequence length based on your GPU.

9. ๐Ÿ“Š Neural Network Visualizer (NEW)

  • Architecture Visualization: Beautiful console display of model layers.
  • Live Training Dashboard: Real-time metrics, loss curves, and throughput.
  • HTML Reports: Generate stunning training reports with Chart.js.
  • Model Analysis: Inspect any PyTorch model's structure and parameters.

10. โ˜๏ธ Cloud Runtime (NEW)

  • Run on Google Colab: Full support without Ollama dependency.
  • API-Based Labeling: Use Gemini, GPT-4, DeepSeek, Groq, or HuggingFace for text processing.
  • Auto-Detection: Automatically detects Colab, Kaggle, SageMaker, etc.
  • Optimized Settings: Recommends training parameters based on cloud GPU.

11. ๐Ÿค– AI-Enhanced Tokenizer (NEW)

  • Domain-Aware Vocabulary: AI extracts medical, legal, code, or scientific terms.
  • Protected Tokens: Domain terms are never split by BPE.
  • Smart Segmentation: AI-guided subword merging for semantic coherence.
  • Multi-Domain Support: Medical, legal, code, scientific, and general domains.
  • Integrated Selection: Choose tokenizer during training/pretraining wizards.
  • Multiple Providers: Auto-detect, Ollama, Gemini, OpenAI, or rule-based.

12. ๐Ÿ” RAG Agent Builder (NEW)

  • Build Knowledge Bases: Index PDFs, text files, and JSONL datasets.
  • Semantic Search: ChromaDB-powered vector search with sentence-transformers.
  • Interactive Chat: Query your documents with natural language.
  • Multi-Step Wizard: Create RAG agents with back navigation and step indicators.
  • REST API Server: Deploy as an API endpoint for integration.
  • Citation Tracking: Responses include source references.
  • Multiple Embedding Models: all-MiniLM-L6-v2, all-mpnet-base-v2, or Ollama embeddings.

๐Ÿ› ๏ธ Architecture

graph LR
    A[Raw PDF] --> B(Vision OCR / Extractor)
    B --> C{Chunker Strategy}
    C --> D[Synthetic Labeling Agent]
    D --> E[Raw Dataset JSONL]
    E --> F(Data Distiller)
    F --> G[Clean ShareGPT Dataset]
    G --> H{Training Path}
    H -->|Pre-train| I[Build New Model]
    H -->|Fine-tune| J[Adapt Existing Model]
    I --> K[Model Evaluation]
    J --> K
    K --> L{Score < 7?}
    L -->|Yes| M[Generate Corrections]
    M --> J
    L -->|No| N((Deploy Model))

๐Ÿ“ฆ Installation

  1. Clone the repository:

    git clone https://github.com/nikhil49023/Data-engine.git
    cd Data-engine
    
  2. Install the CLI:

    pip install -e .
    
  3. Setup Ollama:

    • Install Ollama
    • The setup wizard will help you install models automatically

Quick Start

First-time setup (recommended):

saara setup

The setup wizard will:

  1. โœ… Detect your hardware (GPU, VRAM, RAM)
  2. โœ… Recommend optimal models for your system
  3. โœ… Install selected vision and analyzer models
  4. โœ… Save configuration

โšก Usage

๐ŸŽฏ Interactive Wizard (Recommended)

saara run

This launches a beautiful CLI wizard with 5 workflows:

Option Mode Description
1 ๐Ÿ“„ Dataset Creation Extract data from PDFs โ†’ Generate training datasets
2 ๐Ÿง  Model Training Fine-tune LLMs on your prepared data
3 ๐Ÿงช Model Evaluation Test & improve models with Granite 4
4 ๐Ÿš€ Model Deployment Deploy locally (Ollama) or to cloud
5 ๐Ÿ—๏ธ Pre-training Build & train a model from scratch

๐Ÿ—๏ธ Pre-training from Scratch (NEW)

Build your own language model from the ground up:

saara pretrain

Available Architectures:

Name Parameters VRAM Use Case
Nano ~15M 2GB+ Testing, learning (CPU trainable)
Micro ~50M 4GB+ Experimentation
Mini ~125M 6GB+ Domain-specific pre-training
Small ~350M 8GB+ Specialized tasks
Base ~1B 16GB+ Production models
Large ~3B 24GB+ High-capacity models

Pre-training Sub-menu:

  1. ๐Ÿ“š Create Pre-training Dataset
  2. ๐Ÿ—๏ธ Build & Train New Model
  3. ๐Ÿ”ค Train Custom Tokenizer
  4. ๐Ÿงช Test Pre-trained Model
  5. ๐Ÿ“‹ List Pre-trained Models

Pre-training Dataset Creation:

  • Extracts raw text from PDFs, markdown, and text files
  • Cleans OCR artifacts and normalizes unicode
  • Chunks text into optimal sizes for language modeling
  • LLM-Enhanced Processing (Optional):
    • Uses local LLM (Granite 4, Llama 3, Qwen) to clean and improve text
    • Fixes OCR errors and expands abbreviations
    • LLM-based quality scoring for more accurate filtering
  • Quality filtering (removes low-quality/incoherent text)
  • Deduplication (prevents model memorization)
  • Outputs in JSONL format ready for training
  • Optional train/validation split

Workflow:

Create Dataset โ†’ Train Tokenizer (optional) โ†’ Pre-train Model โ†’ Test โ†’ Fine-tune โ†’ Deploy

๐Ÿ“„ Dataset Creation Flow

  1. Select input PDF folder and output directory
  2. Choose Vision OCR model (Moondream/Qwen) - auto-detects available models
  3. Choose Analyzer model (Granite 4/Llama 3/Qwen 2.5/Mistral)
  4. Configure advanced options (chunk size, Q&A density)
  5. Pipeline automatically generates:
    • *_instruction.jsonl - Instruction tuning data
    • *_qa.jsonl - Q&A pairs
    • *_sharegpt.jsonl - Chat format (best for training)
    • *_summarization.jsonl - Summarization tasks

๐Ÿง  Model Training Flow

The training wizard now supports:

  • Gemma 2 Models: Recommended for best quality-to-cost ratio
  • Custom Pre-trained: Your own pre-trained models
  • Fine-tuned Adapters: Continue training existing adapters

Supported Base Models (Gemma First):

Model Size Best For
โญ google/gemma-2-2b 2B Recommended - Efficient, CPU-trainable
โญ google/gemma-2-9b 9B Production-ready, high quality
google/gemma-2b 2B General Purpose
google/gemma-7b 7B Higher capacity
sarvamai/sarvam-1 2B Indian Languages
TinyLlama/TinyLlama-1.1B 1.1B Fast Testing

Output: models/{model-name}-finetuned/final_adapter/


๐Ÿงช Model Evaluation Flow (Gemini-Powered)

Uses Gemini 2.0 Flash to evaluate your fine-tuned model:

  1. Runs test prompts through your model
  2. Scores each response (1-10) using Gemini
  3. Generates improved responses for low scores
  4. Creates correction data for next training round

Self-Improvement Cycle:

Train Model โ†’ Evaluate (Gemini 2.0) โ†’ Generate Corrections โ†’ Retrain โ†’ Repeat

๐Ÿš€ Model Deployment Flow

Option Platform Description
1 Local Chat Interactive terminal chat
2 Ollama Export Convert to GGUF format
3 HuggingFace Push to HF Hub
4 Cloud Deploy Docker + Google Cloud Run
5 Merge Model Merge adapter with base

๐Ÿ“Ÿ CLI Commands

Core Commands

Command Description
saara run Start interactive wizard
saara pretrain Build & train model from scratch
saara setup First-time hardware detection & model setup
saara version Show version information

Data Processing

Command Description
saara process <file> Process a single PDF file
saara batch <dir> Process all PDFs in directory
saara distill <input> Generate synthetic training data

Model Operations

Command Description
saara train Fine-tune a model (interactive)
saara deploy Deploy a trained model
saara evaluate <base> <adapter> Evaluate model quality

Model Management

Command Description
saara models list List all available models
saara models install <name> Install an Ollama model
saara models remove <name> Remove a model
saara models status Show hardware & model status
saara models info <name> Show detailed model info
saara models storage Show disk usage breakdown
saara models clear checkpoints Delete all training checkpoints
saara models clear models --yes Delete ALL trained models
saara models clear all --yes Factory reset (delete everything)
saara models retrain <name> Delete & retrain from scratch

Accelerator & Visualizer (NEW)

Command Description
saara accelerator Show GPU status & recommended settings
saara visualize Visualize neural network architecture
saara visualize --report Generate HTML training report
saara benchmark Benchmark training performance

Cloud Runtime (NEW)

Command Description
saara cloud info Show cloud environment info
saara cloud setup Configure cloud API keys
saara cloud quickstart Show Colab quickstart guide

AI Tokenizer (NEW)

Command Description
saara tokenizer train Train AI-enhanced tokenizer
saara tokenizer train --domain medical Train with medical vocabulary
saara tokenizer info -o path/to/tokenizer Show tokenizer info
saara tokenizer test -o path/to/tokenizer Test tokenization interactively

RAG Agent (NEW)

Command Description
saara rag create <name> Create a new knowledge base
saara rag add <kb> <path> Add documents to a knowledge base
saara rag chat <kb> Interactive chat with knowledge base
saara rag search <kb> "query" Search without generation
saara rag list List all knowledge bases
saara rag info <kb> Show knowledge base details
saara rag serve <kb> Start RAG API server
saara rag delete <kb> Delete a knowledge base
saara rag clear <kb> Clear documents (keep KB)

Server

Command Description
saara serve Start REST API server

๐Ÿ“ Project Structure

Data-engine/
โ”œโ”€โ”€ setup.py                # Package setup
โ”œโ”€โ”€ config.yaml             # Configuration settings
โ”œโ”€โ”€ requirements.txt        # Dependencies
โ”œโ”€โ”€ SAARA_Colab.ipynb      # Google Colab notebook (NEW)
โ”œโ”€โ”€ saara/                  # Source code
โ”‚   โ”œโ”€โ”€ cli.py             # CLI entry point
โ”‚   โ”œโ”€โ”€ pipeline.py         # Core data pipeline
โ”‚   โ”œโ”€โ”€ pretrain.py         # Pre-training module
โ”‚   โ”œโ”€โ”€ train.py            # LLM fine-tuning module
โ”‚   โ”œโ”€โ”€ evaluator.py        # Model evaluation
โ”‚   โ”œโ”€โ”€ deployer.py         # Deployment utilities
โ”‚   โ”œโ”€โ”€ distiller.py        # Data cleaning
โ”‚   โ”œโ”€โ”€ model_manager.py    # Ollama model management
โ”‚   โ”œโ”€โ”€ accelerator.py      # Neural accelerator
โ”‚   โ”œโ”€โ”€ visualizer.py       # Training visualizer
โ”‚   โ”œโ”€โ”€ cloud_runtime.py    # Cloud runtime
โ”‚   โ”œโ”€โ”€ rag_engine.py       # RAG Agent engine (NEW)
โ”‚   โ””โ”€โ”€ splash.py           # SAARA splash screen
โ”œโ”€โ”€ models/                 # Saved models (pre-trained & fine-tuned)
โ”œโ”€โ”€ datasets/               # Generated datasets
โ”œโ”€โ”€ tokenizers/             # Custom tokenizers
โ”œโ”€โ”€ knowledge_bases/        # RAG knowledge bases (NEW)
โ”œโ”€โ”€ evaluations/            # Evaluation results
โ”œโ”€โ”€ reports/                # Training reports
โ””โ”€โ”€ exports/                # Deployment artifacts

๐Ÿ”ฎ Roadmap

  • Vision-LLM OCR (Moondream, Qwen)
  • Autonomous data labeling
  • Multi-format dataset generation
  • Native fine-tuning with QLoRA
  • Model evaluation with Granite 4
  • Self-improvement training loop
  • Local & cloud deployment
  • Pre-training from scratch
  • Custom tokenizer training
  • Iterative adapter fine-tuning
  • Neural Accelerator (GPU optimization)
  • Training Visualizer (live dashboard, HTML reports)
  • Cloud Runtime (Colab/Kaggle support)
  • RAG Agent Builder (knowledge bases, semantic search, chat)
  • Multi-modal dataset generation (images + text)
  • Web UI dashboard

๐Ÿ“„ License

Proprietary License - Copyright ยฉ 2025-2026 Kilani Sai Nikhil. All Rights Reserved.

This software is provided under a proprietary license with the following terms:

โœ… Permitted:

  • Use the software for personal, educational, or commercial purposes
  • Reference in academic/educational contexts with attribution

โŒ Not Permitted:

  • Modify, alter, or create derivative works
  • Reproduce, copy, or duplicate the software
  • Distribute, sublicense, or sell the software
  • Reverse engineer or decompile the software

See the LICENSE file for full details.


๐Ÿ‘ค Author

Kilani Sai Nikhil - GitHub


Built with โค๏ธ for the AI community

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

saara_ai-1.6.3.tar.gz (203.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

saara_ai-1.6.3-py3-none-any.whl (209.0 kB view details)

Uploaded Python 3

File details

Details for the file saara_ai-1.6.3.tar.gz.

File metadata

  • Download URL: saara_ai-1.6.3.tar.gz
  • Upload date:
  • Size: 203.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.11

File hashes

Hashes for saara_ai-1.6.3.tar.gz
Algorithm Hash digest
SHA256 76c2b8552a32c355c06eed15d86d504ab3aed69dac1576519c804b6fb9ca1a15
MD5 8ce937ef4ae396792362874972128e45
BLAKE2b-256 57e44bbeb45e3215226a0d6710aea729249019e05e7f4a88c25f7f5b3b7e5ad8

See more details on using hashes here.

File details

Details for the file saara_ai-1.6.3-py3-none-any.whl.

File metadata

  • Download URL: saara_ai-1.6.3-py3-none-any.whl
  • Upload date:
  • Size: 209.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.11

File hashes

Hashes for saara_ai-1.6.3-py3-none-any.whl
Algorithm Hash digest
SHA256 b41e9576b8a2cddc2f3829d587bd69d47734ba4042d927708dff297f33b80733
MD5 a83df491074d4bc3f02195c750da40b1
BLAKE2b-256 e6753ba9e11bbb252b89ffc5ce9df41c383c0e2e8c00416758319cd386f37b03

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page