SAARA - Autonomous Document-to-LLM Data Engine. Pre-train, fine-tune, evaluate, and deploy LLMs.
SAARA: Autonomous Document-to-LLM Data Engine
SAARA is an end-to-end autonomous data pipeline designed to transform raw, unstructured documents (PDFs, research papers) into high-quality, instruction-tuned datasets for fine-tuning Large Language Models (LLMs).
Why this exists: Creating high-quality datasets is the bottleneck in training domain-specific AI. This tool automates the "boring stuff" (OCR, chunking, labeling, and cleaning), allowing you to go from PDF to fine-tuned model in hours, not weeks.
Key Features
1. SOTA Vision-LLM OCR
- No more Garbled Text: Uses Moondream and Qwen2.5-VL (Vision-Language Models) to "read" PDFs visually.
- Handles complex double-column layouts, tables, and scientific diagrams that traditional OCR (Tesseract) fails on.
- Hybrid Fallback: Automatically switches between PyMuPDF (fast) and Vision OCR (accurate) based on page extractability.
2. Autonomous Data Labeling
- Uses local LLMs (Granite 4.0, Llama 3) to generate diverse training tasks:
- Instruction Tuning: "How do I treat X using Ayurveda?"
- Q&A Pairs: Fact-based extraction.
- Summarization: TL;DRs of complex sections.
- Classification: Topic tagging.
3. Data Distillation & Hygiene
- Self-Cleaning: The distill module removes low-quality generations, duplicates, and confabulations.
- ShareGPT Formatting: Automatically converts raw data into the industry-standard conversation format.
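For reference, a minimal sketch of what a ShareGPT-style record looks like on disk. The `conversations` / `from` / `value` field names follow the common ShareGPT convention; the helper below is illustrative, not SAARA's actual code:

```python
import json

def to_sharegpt(question: str, answer: str) -> dict:
    """Wrap a raw Q&A pair in a ShareGPT-style conversation record."""
    return {
        "conversations": [
            {"from": "human", "value": question},
            {"from": "gpt", "value": answer},
        ]
    }

record = to_sharegpt(
    "How is Vata dosha balanced in Ayurveda?",
    "Through warm, grounding foods, regular routine, and oil massage.",
)
line = json.dumps(record)  # one JSON object per line in a .jsonl file
```

Each record becomes one line of the output `.jsonl` file, which most fine-tuning frameworks can consume directly.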
4. Pre-training from Scratch (NEW)
- Build Your Own LLM: Create custom models from 15M to 3B parameters.
- Custom Tokenizers: Train domain-specific BPE tokenizers on your data.
- Full Pipeline: Pre-train → Fine-tune → Evaluate → Deploy.
- Production-ready LLaMA-style architectures.
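Conceptually, training a BPE tokenizer repeats one step: find the most frequent adjacent symbol pair in the corpus and merge it into a new vocabulary entry. A toy, self-contained illustration of that single step (SAARA presumably delegates to a tokenizer library; this is not its implementation):

```python
from collections import Counter

def most_frequent_pair(words: dict) -> tuple:
    """Count adjacent symbol pairs across a frequency-weighted corpus."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words: dict, pair: tuple) -> dict:
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Tiny corpus of pre-tokenized words with counts (made-up example data).
corpus = {tuple("vata"): 5, tuple("data"): 3, tuple("matra"): 2}
pair = most_frequent_pair(corpus)   # ('a', 't') occurs 10 times
corpus = merge_pair(corpus, pair)   # 'at' is now a single symbol
```

A real tokenizer trainer simply repeats this loop until the target vocabulary size is reached.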
5. Native Fine-Tuning Support
- One-Command Training: Built-in training loop using SFTTrainer (QLoRA) to fine-tune any HuggingFace model.
- Multi-Format Support: Automatically handles ShareGPT, Alpaca, and raw text formats.
- Checkpoint Resume: Continue training from any checkpoint.
- Iterative Fine-tuning: Fine-tune your already fine-tuned models to keep improving.
- Optimized for consumer GPUs (supports 4-bit quantization).
6. Model Evaluation & Self-Improvement
- Granite 4 as Judge: Test your fine-tuned model with automatic quality scoring.
- Self-Improvement Loop: Low-scoring responses are corrected and used for next training round.
- Iterative Enhancement: Train → Evaluate → Improve → Repeat.
7. Model Deployment
- Local Chat: Interactive terminal testing with your model.
- Ollama Export: Convert to GGUF format for use with Ollama.
- HuggingFace Hub: Push your model to share with the community.
- Cloud Deployment: Docker + Google Cloud Run ready.
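For the Ollama export path, a converted GGUF is typically registered via a small Modelfile. A minimal sketch (the file name and system prompt are placeholders, not SAARA output):

```text
# Modelfile (hypothetical): register an exported GGUF with Ollama
FROM ./model.gguf
PARAMETER temperature 0.7
SYSTEM "You are a domain assistant fine-tuned with SAARA."
```

Then `ollama create saara-model -f Modelfile` builds the local model and `ollama run saara-model` starts a chat with it.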
Architecture

```mermaid
graph LR
    A[Raw PDF] --> B(Vision OCR / Extractor)
    B --> C{Chunker Strategy}
    C --> D[Synthetic Labeling Agent]
    D --> E[Raw Dataset JSONL]
    E --> F(Data Distiller)
    F --> G[Clean ShareGPT Dataset]
    G --> H{Training Path}
    H -->|Pre-train| I[Build New Model]
    H -->|Fine-tune| J[Adapt Existing Model]
    I --> K[Model Evaluation]
    J --> K
    K --> L{Score < 7?}
    L -->|Yes| M[Generate Corrections]
    M --> J
    L -->|No| N((Deploy Model))
```
Installation

1. Clone the repository:

   git clone https://github.com/nikhil49023/Data-engine.git
   cd Data-engine

2. Install the CLI:

   pip install -e .

3. Set up Ollama:
   - Install Ollama
   - The setup wizard will help you install models automatically
Quick Start
First-time setup (recommended):
saara setup
The setup wizard will:
- ✓ Detect your hardware (GPU, VRAM, RAM)
- ✓ Recommend optimal models for your system
- ✓ Install selected vision and analyzer models
- ✓ Save configuration
Usage
Interactive Wizard (Recommended)
saara run
This launches a beautiful CLI wizard with 5 workflows:
| Option | Mode | Description |
|---|---|---|
| 1 | Dataset Creation | Extract data from PDFs → Generate training datasets |
| 2 | Model Training | Fine-tune LLMs on your prepared data |
| 3 | Model Evaluation | Test & improve models with Granite 4 |
| 4 | Model Deployment | Deploy locally (Ollama) or to cloud |
| 5 | Pre-training | Build & train a model from scratch |
Pre-training from Scratch (NEW)
Build your own language model from the ground up:
saara pretrain
Available Architectures:
| Name | Parameters | VRAM | Use Case |
|---|---|---|---|
| Nano | ~15M | 2GB+ | Testing, learning (CPU trainable) |
| Micro | ~50M | 4GB+ | Experimentation |
| Mini | ~125M | 6GB+ | Domain-specific pre-training |
| Small | ~350M | 8GB+ | Specialized tasks |
| Base | ~1B | 16GB+ | Production models |
| Large | ~3B | 24GB+ | High-capacity models |
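The parameter counts above can be sanity-checked with a back-of-envelope formula for LLaMA-style decoders. The breakdown below (4·d² attention projections, 3·d·ffn SwiGLU MLP, tied embeddings) is standard for this architecture family; the concrete dimensions are illustrative guesses, not SAARA's actual presets:

```python
def llama_params(d_model: int, n_layers: int, ffn: int, vocab: int) -> int:
    """Rough parameter count for a LLaMA-style decoder with tied embeddings."""
    attn = 4 * d_model * d_model          # Q, K, V, O projections
    mlp = 3 * d_model * ffn               # gate, up, down (SwiGLU)
    per_layer = attn + mlp + 2 * d_model  # plus two RMSNorm weight vectors
    return vocab * d_model + n_layers * per_layer + d_model  # + final norm

# A hypothetical "Mini"-class config lands around 110M parameters,
# in the ~125M ballpark once untied heads or larger vocabs are used:
approx = llama_params(d_model=768, n_layers=12, ffn=2048, vocab=32000)
```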
Pre-training Sub-menu:
- Create Pre-training Dataset
- Build & Train New Model
- Train Custom Tokenizer
- Test Pre-trained Model
- List Pre-trained Models
Pre-training Dataset Creation:
- Extracts raw text from PDFs, markdown, and text files
- Cleans OCR artifacts and normalizes unicode
- Chunks text into optimal sizes for language modeling
- LLM-Enhanced Processing (Optional):
- Uses local LLM (Granite 4, Llama 3, Qwen) to clean and improve text
- Fixes OCR errors and expands abbreviations
- LLM-based quality scoring for more accurate filtering
- Quality filtering (removes low-quality/incoherent text)
- Deduplication (prevents model memorization)
- Outputs in JSONL format ready for training
- Optional train/validation split
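The cleaning steps above (normalize, quality-filter, deduplicate) can be sketched with stdlib Python. The heuristics here (alphabetic-ratio filter, SHA-256 exact dedup) are common choices that only illustrate the idea; SAARA's actual scoring and thresholds are its own:

```python
import hashlib
import re

def clean_chunks(chunks: list[str], min_alpha_ratio: float = 0.6) -> list[str]:
    """Normalize, quality-filter, and exact-deduplicate text chunks."""
    seen: set[str] = set()
    kept = []
    for chunk in chunks:
        text = re.sub(r"\s+", " ", chunk).strip()  # normalize whitespace
        if not text:
            continue
        alpha = sum(c.isalpha() for c in text) / len(text)
        if alpha < min_alpha_ratio:                # likely OCR garbage
            continue
        digest = hashlib.sha256(text.lower().encode()).hexdigest()
        if digest in seen:                         # exact duplicate
            continue
        seen.add(digest)
        kept.append(text)
    return kept
```

Real pipelines usually add near-duplicate detection (e.g. MinHash) on top of exact hashing, since OCR noise makes duplicates rarely byte-identical.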
Workflow:
Create Dataset → Train Tokenizer (optional) → Pre-train Model → Test → Fine-tune → Deploy
Dataset Creation Flow
- Select input PDF folder and output directory
- Choose Vision OCR model (Moondream/Qwen) - auto-detects available models
- Choose Analyzer model (Granite 4/Llama 3/Qwen 2.5/Mistral)
- Configure advanced options (chunk size, Q&A density)
- Pipeline automatically generates:
- `*_instruction.jsonl` - Instruction tuning data
- `*_qa.jsonl` - Q&A pairs
- `*_sharegpt.jsonl` - Chat format (best for training)
- `*_summarization.jsonl` - Summarization tasks
Model Training Flow
The training wizard now supports:
- Base Models: HuggingFace models (Gemma, Llama, Qwen, etc.)
- Custom Pre-trained: Your own pre-trained models
- Fine-tuned Adapters: Continue training existing adapters
Supported Base Models:
| Model | Size | Best For |
|---|---|---|
| sarvamai/sarvam-1 | 2B | Indian Languages |
| google/gemma-2b | 2B | General Purpose |
| TinyLlama/TinyLlama-1.1B | 1.1B | Fast Testing |
| meta-llama/Llama-3.2-1B | 1B | English Tasks |
| Qwen/Qwen2.5-7B | 7B | Complex Reasoning |
Output: models/{model-name}-finetuned/final_adapter/
Model Evaluation Flow
Uses Granite 4 to evaluate your fine-tuned model:
- Runs test prompts through your model
- Scores each response (1-10)
- Generates improved responses for low scores
- Creates correction data for next training round
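The low-score selection step can be sketched as follows. The `Score: N/10` reply format and the threshold of 7 (matching the architecture diagram) are assumptions about the judge prompt, not confirmed SAARA behavior:

```python
import re

def parse_score(judge_reply: str) -> int | None:
    """Extract a numeric score from a judge reply like 'Score: 6/10. ...'"""
    match = re.search(r"(\d+)\s*/\s*10", judge_reply)
    return int(match.group(1)) if match else None

def needs_correction(judge_reply: str, threshold: int = 7) -> bool:
    """Flag responses below the threshold for the correction round."""
    score = parse_score(judge_reply)
    return score is not None and score < threshold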
Self-Improvement Cycle:
Train Model → Evaluate (Granite 4) → Generate Corrections → Retrain → Repeat
Model Deployment Flow
| Option | Platform | Description |
|---|---|---|
| 1 | Local Chat | Interactive terminal chat |
| 2 | Ollama Export | Convert to GGUF format |
| 3 | HuggingFace | Push to HF Hub |
| 4 | Cloud Deploy | Docker + Google Cloud Run |
| 5 | Merge Model | Merge adapter with base |
CLI Commands
Core Commands
| Command | Description |
|---|---|
| `saara run` | Start interactive wizard |
| `saara pretrain` | Build & train model from scratch |
| `saara setup` | First-time hardware detection & model setup |
| `saara version` | Show version information |
Data Processing
| Command | Description |
|---|---|
| `saara process <file>` | Process a single PDF file |
| `saara batch <dir>` | Process all PDFs in a directory |
| `saara distill <input>` | Generate synthetic training data |
Model Operations
| Command | Description |
|---|---|
| `saara train` | Fine-tune a model (interactive) |
| `saara deploy` | Deploy a trained model |
| `saara evaluate <base> <adapter>` | Evaluate model quality |
Model Management
| Command | Description |
|---|---|
| `saara models list` | List all available models |
| `saara models install <name>` | Install an Ollama model |
| `saara models remove <name>` | Remove a model |
| `saara models status` | Show hardware & model status |
Server
| Command | Description |
|---|---|
| `saara serve` | Start REST API server |
Project Structure

```
Data-engine/
├── setup.py              # Package setup
├── config.yaml           # Configuration settings
├── requirements.txt      # Dependencies
├── saara/                # Source code
│   ├── cli.py            # CLI entry point
│   ├── pipeline.py       # Core data pipeline
│   ├── pretrain.py       # Pre-training module (NEW)
│   ├── train.py          # LLM fine-tuning module
│   ├── evaluator.py      # Model evaluation
│   ├── deployer.py       # Deployment utilities
│   ├── distiller.py      # Data cleaning
│   ├── model_manager.py  # Ollama model management
│   └── splash.py         # SAARA splash screen
├── models/               # Saved models (pre-trained & fine-tuned)
├── datasets/             # Generated datasets
├── tokenizers/           # Custom tokenizers
├── evaluations/          # Evaluation results
└── exports/              # Deployment artifacts
```
Roadmap
- Vision-LLM OCR (Moondream, Qwen)
- Autonomous data labeling
- Multi-format dataset generation
- Native fine-tuning with QLoRA
- Model evaluation with Granite 4
- Self-improvement training loop
- Local & cloud deployment
- Pre-training from scratch
- Custom tokenizer training
- Iterative adapter fine-tuning
- Multi-modal dataset generation (images + text)
- RAG-based factual verification
- Web UI dashboard
License
Proprietary License - Copyright © 2024-2025 Kilani Sai Nikhil. All Rights Reserved.
This software is provided under a proprietary license with the following terms:
✓ Permitted:
- Use the software for personal, educational, or commercial purposes
- Reference in academic/educational contexts with attribution
✗ Not Permitted:
- Modify, alter, or create derivative works
- Reproduce, copy, or duplicate the software
- Distribute, sublicense, or sell the software
- Reverse engineer or decompile the software
See the LICENSE file for full details.
Author
Kilani Sai Nikhil - GitHub
Built with ❤️ for the AI community
Download files
Source Distribution
Built Distribution
File details
Details for the file saara_ai-1.3.2.tar.gz.
File metadata
- Download URL: saara_ai-1.3.2.tar.gz
- Upload date:
- Size: 103.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 94e397b402a47ed1dc83c291d3e1c7e53f361a60fb198d0182469d869b5c94ee |
| MD5 | 13b148c672967b96f69a79613db17b4d |
| BLAKE2b-256 | 0d79603f5e20989c8363342ad1445f737de9ed8485e3f28e3710ee5921fcb922 |
File details
Details for the file saara_ai-1.3.2-py3-none-any.whl.
File metadata
- Download URL: saara_ai-1.3.2-py3-none-any.whl
- Upload date:
- Size: 108.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5d9eb5d3b592b158fcbcefbe857b6792c42a8a5ad0a1497b1de7cd126e88f7e1 |
| MD5 | ad1f6d6f89c733fdf53ca1bc3e16a4a8 |
| BLAKE2b-256 | e3e556007d527728e76e5f1686690a629295541c18360c8b70e2d01fd9880134 |