🧠 SAARA - Autonomous Document-to-LLM Data Engine. Gemini-powered training pipeline for Gemma models.

These details have not been verified by PyPI

Project links

Project description

🧠 SAARA: Autonomous Document-to-LLM Data Engine

🏆 Built for Google Gemini Hackathon - Showcasing the power of Gemini 2.0 Flash and Gemma 2 models in autonomous AI training pipelines.

SAARA is an end-to-end autonomous data pipeline designed to transform raw, unstructured documents (PDFs, research papers) into high-quality, instruction-tuned datasets for fine-tuning Large Language Models (LLMs).

Why this exists: Creating high-quality datasets is the bottleneck in training domain-specific AI. This tool automates the "boring stuff"—OCR, chunking, labeling, and cleaning—allowing you to go from PDF to fine-tuned model in hours, not weeks.

🌟 Gemini & Gemma Integration

Gemini 2.0 Flash - AI Teacher & Evaluator

Default Teacher Model: Uses Gemini 2.0 Flash for autonomous learning
Quality Evaluation: Scores and improves model responses
Data Generation: Creates high-quality training examples
Self-Improvement: Iterative correction loop powered by Gemini

Gemma 2 - Fine-Tuning Targets

Gemma 2 2B: Lightweight, CPU-trainable, perfect for domain-specific models
Gemma 2 9B: Production-ready with excellent performance
Pre-configured: Optimized LoRA settings for Gemma architecture
First-Class Support: Gemma models are highlighted and recommended

🚀 Key Features

1. 👁️ SOTA Vision-LLM OCR

No more Garbled Text: Uses Moondream and Qwen2.5-VL (Vision-Language Models) to "read" PDFs visually.
Handles complex double-column layouts, tables, and scientific diagrams that traditional OCR (Tesseract) fails on.
Hybrid Fallback: Automatically switches between PyMuPDF (fast) and Vision OCR (accurate) based on page extractability.

2. 🤖 Autonomous Data Labeling (Gemini-Powered)

Uses Gemini 2.0 Flash as the default teacher model for:
- Instruction Tuning: "How do I treat X using Ayurveda?"
- Q&A Pairs: Fact-based extraction.
- Summarization: TL;DRs of complex sections.
- Classification: Topic tagging.

3. 🧪 Data Distillation & Hygiene

Self-Cleaning: The distill module removes low-quality generations, duplicates, and confabulations.
ShareGPT Formatting: Automatically converts raw data into the industry-standard conversation format.

4. 🏗️ Pre-training from Scratch

Build Your Own LLM: Create custom models from 15M to 3B parameters.
Custom Tokenizers: Train domain-specific BPE tokenizers on your data.
Full Pipeline: Pre-train → Fine-tune → Evaluate → Deploy.
Production-ready LLaMA-style architectures.

5. 🎓 Native Fine-Tuning Support (Gemma Optimized)

Gemma 2 First-Class Support: Pre-configured LoRA settings for optimal Gemma performance.
One-Command Training: Built-in training loop using SFTTrainer (QLoRA).
Multi-Format Support: Automatically handles ShareGPT, Alpaca, and Raw Text formats.
Optimized for consumer GPUs (supports 4-bit quantization).

6. 🧪 Model Evaluation & Self-Improvement (Gemini Judge)

Gemini 2.0 as Judge: Test your fine-tuned model with automatic quality scoring.
Self-Improvement Loop: Low-scoring responses are corrected by Gemini and used for next training round.
Iterative Enhancement: Train → Evaluate → Improve → Repeat.

7. 🚀 Model Deployment

Local Chat: Interactive terminal testing with your model.
Ollama Export: Convert to GGUF format for Ollama usage.
HuggingFace Hub: Push your model to share with the community.
Cloud Deployment: Docker + Google Cloud Run ready.

🛠️ Architecture

graph LR
    A[Raw PDF] --> B(Vision OCR / Extractor)
    B --> C{Chunker Strategy}
    C --> D[Synthetic Labeling Agent]
    D --> E[Raw Dataset JSONL]
    E --> F(Data Distiller)
    F --> G[Clean ShareGPT Dataset]
    G --> H{Training Path}
    H -->|Pre-train| I[Build New Model]
    H -->|Fine-tune| J[Adapt Existing Model]
    I --> K[Model Evaluation]
    J --> K
    K --> L{Score < 7?}
    L -->|Yes| M[Generate Corrections]
    M --> J
    L -->|No| N((Deploy Model))

📦 Installation

Clone the repository:

git clone https://github.com/nikhil49023/Data-engine.git
cd Data-engine

Install the CLI:
```
pip install -e .
```
Setup Ollama:
- Install Ollama
- The setup wizard will help you install models automatically

Quick Start

First-time setup (recommended):

saara setup

The setup wizard will:

✅ Detect your hardware (GPU, VRAM, RAM)
✅ Recommend optimal models for your system
✅ Install selected vision and analyzer models
✅ Save configuration

⚡ Usage

🎯 Interactive Wizard (Recommended)

saara run

This launches a beautiful CLI wizard with 5 workflows:

Option	Mode	Description
1	📄 Dataset Creation	Extract data from PDFs → Generate training datasets
2	🧠 Model Training	Fine-tune LLMs on your prepared data
3	🧪 Model Evaluation	Test & improve models with Granite 4
4	🚀 Model Deployment	Deploy locally (Ollama) or to cloud
5	🏗️ Pre-training	Build & train a model from scratch

🏗️ Pre-training from Scratch (NEW)

Build your own language model from the ground up:

saara pretrain

Available Architectures:

Name	Parameters	VRAM	Use Case
Nano	~15M	2GB+	Testing, learning (CPU trainable)
Micro	~50M	4GB+	Experimentation
Mini	~125M	6GB+	Domain-specific pre-training
Small	~350M	8GB+	Specialized tasks
Base	~1B	16GB+	Production models
Large	~3B	24GB+	High-capacity models

Pre-training Sub-menu:

📚 Create Pre-training Dataset
🏗️ Build & Train New Model
🔤 Train Custom Tokenizer
🧪 Test Pre-trained Model
📋 List Pre-trained Models

Pre-training Dataset Creation:

Extracts raw text from PDFs, markdown, and text files
Cleans OCR artifacts and normalizes unicode
Chunks text into optimal sizes for language modeling
LLM-Enhanced Processing (Optional):
- Uses local LLM (Granite 4, Llama 3, Qwen) to clean and improve text
- Fixes OCR errors and expands abbreviations
- LLM-based quality scoring for more accurate filtering
Quality filtering (removes low-quality/incoherent text)
Deduplication (prevents model memorization)
Outputs in JSONL format ready for training
Optional train/validation split

Workflow:

Create Dataset → Train Tokenizer (optional) → Pre-train Model → Test → Fine-tune → Deploy

📄 Dataset Creation Flow

Select input PDF folder and output directory
Choose Vision OCR model (Moondream/Qwen) - auto-detects available models
Choose Analyzer model (Granite 4/Llama 3/Qwen 2.5/Mistral)
Configure advanced options (chunk size, Q&A density)
Pipeline automatically generates:
- *_instruction.jsonl - Instruction tuning data
- *_qa.jsonl - Q&A pairs
- *_sharegpt.jsonl - Chat format (best for training)
- *_summarization.jsonl - Summarization tasks

🧠 Model Training Flow

The training wizard now supports:

Gemma 2 Models: Recommended for best quality-to-cost ratio
Custom Pre-trained: Your own pre-trained models
Fine-tuned Adapters: Continue training existing adapters

Supported Base Models (Gemma First):

Model	Size	Best For
⭐ google/gemma-2-2b	2B	Recommended - Efficient, CPU-trainable
⭐ google/gemma-2-9b	9B	Production-ready, high quality
google/gemma-2b	2B	General Purpose
google/gemma-7b	7B	Higher capacity
sarvamai/sarvam-1	2B	Indian Languages
TinyLlama/TinyLlama-1.1B	1.1B	Fast Testing

Output: models/{model-name}-finetuned/final_adapter/

🧪 Model Evaluation Flow (Gemini-Powered)

Uses Gemini 2.0 Flash to evaluate your fine-tuned model:

Runs test prompts through your model
Scores each response (1-10) using Gemini
Generates improved responses for low scores
Creates correction data for next training round

Self-Improvement Cycle:

Train Model → Evaluate (Gemini 2.0) → Generate Corrections → Retrain → Repeat

🚀 Model Deployment Flow

Option	Platform	Description
1	Local Chat	Interactive terminal chat
2	Ollama Export	Convert to GGUF format
3	HuggingFace	Push to HF Hub
4	Cloud Deploy	Docker + Google Cloud Run
5	Merge Model	Merge adapter with base

📟 CLI Commands

Core Commands

Command	Description
`saara run`	Start interactive wizard
`saara pretrain`	Build & train model from scratch
`saara setup`	First-time hardware detection & model setup
`saara version`	Show version information

Data Processing

Command	Description
`saara process <file>`	Process a single PDF file
`saara batch <dir>`	Process all PDFs in directory
`saara distill <input>`	Generate synthetic training data

Model Operations

Command	Description
`saara train`	Fine-tune a model (interactive)
`saara deploy`	Deploy a trained model
`saara evaluate <base> <adapter>`	Evaluate model quality

Model Management

Command	Description
`saara models list`	List all available models
`saara models install <name>`	Install an Ollama model
`saara models remove <name>`	Remove a model
`saara models status`	Show hardware & model status

Server

Command	Description
`saara serve`	Start REST API server

📁 Project Structure

Data-engine/
├── setup.py                # Package setup
├── config.yaml             # Configuration settings
├── requirements.txt        # Dependencies
├── saara/                  # Source code
│   ├── cli.py             # CLI entry point
│   ├── pipeline.py         # Core data pipeline
│   ├── pretrain.py         # Pre-training module (NEW)
│   ├── train.py            # LLM fine-tuning module
│   ├── evaluator.py        # Model evaluation
│   ├── deployer.py         # Deployment utilities
│   ├── distiller.py        # Data cleaning
│   ├── model_manager.py    # Ollama model management
│   └── splash.py           # SAARA splash screen
├── models/                 # Saved models (pre-trained & fine-tuned)
├── datasets/               # Generated datasets
├── tokenizers/             # Custom tokenizers
├── evaluations/            # Evaluation results
└── exports/                # Deployment artifacts

🔮 Roadmap

Vision-LLM OCR (Moondream, Qwen)
Autonomous data labeling
Multi-format dataset generation
Native fine-tuning with QLoRA
Model evaluation with Granite 4
Self-improvement training loop
Local & cloud deployment
Pre-training from scratch
Custom tokenizer training
Iterative adapter fine-tuning
Multi-modal dataset generation (images + text)
RAG-based factual verification
Web UI dashboard

📄 License

This software is provided under a proprietary license with the following terms:

✅ Permitted:

Use the software for personal, educational, or commercial purposes
Reference in academic/educational contexts with attribution

❌ Not Permitted:

Modify, alter, or create derivative works
Reproduce, copy, or duplicate the software
Distribute, sublicense, or sell the software
Reverse engineer or decompile the software

See the LICENSE file for full details.

👤 Author

Kilani Sai Nikhil - GitHub

Built with ❤️ for the AI community

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

1.6.9

May 6, 2026

1.6.8

Apr 18, 2026

1.6.7

Mar 21, 2026

1.6.6

Mar 21, 2026

1.6.5

Mar 21, 2026

1.6.4

Jan 24, 2026

1.6.3

Jan 24, 2026

1.6.2

Jan 24, 2026

1.6.1

Jan 2, 2026

1.6.0

Dec 31, 2025

1.5.1

Dec 31, 2025

This version

1.5.0

Dec 31, 2025

1.3.2

Dec 30, 2025

1.3.1

Dec 30, 2025

1.3.0

Dec 29, 2025

1.2.17

Dec 28, 2025

1.2.16

Dec 28, 2025

1.2.15

Dec 28, 2025

1.2.14

Dec 28, 2025

1.2.13

Dec 28, 2025

1.2.12

Dec 28, 2025

1.2.11

Dec 28, 2025

1.2.10

Dec 28, 2025

1.2.9

Dec 28, 2025

1.2.8

Dec 28, 2025

1.2.5

Dec 28, 2025

1.2.4

Dec 28, 2025

1.2.3

Dec 28, 2025

1.2.2

Dec 28, 2025

1.2.0

Dec 28, 2025

1.0.0

Dec 28, 2025

0.1.5

Apr 17, 2026

0.1.4

Apr 17, 2026

0.1.3

Apr 17, 2026

0.1.2

Apr 17, 2026

0.1.1

Apr 17, 2026

0.1.0

Apr 17, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

saara_ai-1.5.0.tar.gz (117.2 kB view details)

Uploaded Dec 31, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

saara_ai-1.5.0-py3-none-any.whl (122.0 kB view details)

Uploaded Dec 31, 2025 Python 3

File details

Details for the file saara_ai-1.5.0.tar.gz.

File metadata

Download URL: saara_ai-1.5.0.tar.gz
Upload date: Dec 31, 2025
Size: 117.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for saara_ai-1.5.0.tar.gz
Algorithm	Hash digest
SHA256	`622781db7381341b3479d72c046b13ea7b112d4c98c4318bea34b1c1f5c8c8a8`
MD5	`b8a9d686afe684d79984653904bb63a9`
BLAKE2b-256	`baf63c2c9a8264c076a84a23d00e67cc134219c70be807dedf37635f5543b4f9`

See more details on using hashes here.

File details

Details for the file saara_ai-1.5.0-py3-none-any.whl.

File metadata

Download URL: saara_ai-1.5.0-py3-none-any.whl
Upload date: Dec 31, 2025
Size: 122.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for saara_ai-1.5.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7892426c54a6266bf55cf3e323b6b58ef85339a0bcca1bdcad7459dabce09d07`
MD5	`1aae6b80477034ddcd0d21211d871b62`
BLAKE2b-256	`084e1ff737922b5735afdbdd745af8194b5832fa5b2516fbc6c584adeacb1cd5`

See more details on using hashes here.

saara-ai 1.5.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

🧠 SAARA: Autonomous Document-to-LLM Data Engine

🌟 Gemini & Gemma Integration

Gemini 2.0 Flash - AI Teacher & Evaluator

Gemma 2 - Fine-Tuning Targets

🚀 Key Features

1. 👁️ SOTA Vision-LLM OCR

2. 🤖 Autonomous Data Labeling (Gemini-Powered)

3. 🧪 Data Distillation & Hygiene

4. 🏗️ Pre-training from Scratch

5. 🎓 Native Fine-Tuning Support (Gemma Optimized)

6. 🧪 Model Evaluation & Self-Improvement (Gemini Judge)

7. 🚀 Model Deployment

🛠️ Architecture

📦 Installation

Quick Start

⚡ Usage

🎯 Interactive Wizard (Recommended)

🏗️ Pre-training from Scratch (NEW)

📄 Dataset Creation Flow

🧠 Model Training Flow

🧪 Model Evaluation Flow (Gemini-Powered)

🚀 Model Deployment Flow

📟 CLI Commands

Core Commands

Data Processing

Model Operations

Model Management

Server

📁 Project Structure

🔮 Roadmap

📄 License

👤 Author

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes