SAARA
End-to-end ML pipeline: PDF→JSON dataset creation, synthetic data generation, fine-tuning, evaluation, and edge deployment with a Gradio GUI
Features
- 📄 PDF to JSON Dataset Creation - Extract text from PDFs and generate structured training data using local LLMs (vLLM, Ollama, llama.cpp)
- 🤖 Synthetic Data Generation - Create high-quality training data in multiple formats (Factual, Reasoning, Conversational, Instruction, Code, Creative)
- 🎯 Fine-Tuning - QLoRA/LoRA fine-tuning with Unsloth for fast, memory-efficient training
- 📊 Comprehensive Evaluation - Teacher-student comparison, standard benchmarks (MMLU, GSM8K, HumanEval), performance metrics, power consumption tracking
- 📦 Model Export & Quantization - Export to multiple formats (GGUF, AWQ, GPTQ, ONNX, TensorRT, Safetensors) with 2/3/4/8-bit quantization
- 🖥️ Gradio GUI - Visual interface for the entire pipeline with auto-generation from Python scripts
Installation
```bash
# Clone the repository
git clone https://github.com/nikhil49023/saara-ai.git
cd saara-ai

# Install the package
pip install -e .

# For development
pip install -e ".[dev]"

# For edge deployment
pip install -e ".[edge]"
```
Quick Start
1. Launch GUI
```bash
saara gui
```
Or in Python:
```python
from saara import SaaraDashboard

dashboard = SaaraDashboard()
dashboard.launch()
```
2. Create Dataset from PDF
```python
from saara import DatasetBuilder
from saara.dataset.types import DataType
from saara.providers.ollama_provider import OllamaProvider, ProviderConfig

# Set up the provider
config = ProviderConfig(model="mistral", base_url="http://localhost:11434")
provider = OllamaProvider(config)

# Create the dataset
builder = DatasetBuilder(provider)
samples = builder.from_pdf(
    "document.pdf",
    data_types=[DataType.INSTRUCTION, DataType.FACTUAL],
    pairs_per_type=5,
)

# Save
builder.save(samples, "dataset.jsonl")
```
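The saved dataset is plain JSONL (one JSON object per line), so it can be inspected with the standard library alone. A minimal sketch, assuming Alpaca-style instruction/output fields (the exact schema SAARA emits may differ):

```python
import json
import tempfile
from pathlib import Path

# Illustrative Alpaca-style records; the field names are an assumption.
samples = [
    {"instruction": "Summarize the document.", "input": "", "output": "A short summary."},
    {"instruction": "List three key terms.", "input": "", "output": "term1, term2, term3"},
]

path = Path(tempfile.mkdtemp()) / "dataset.jsonl"

# JSONL: one JSON object per line.
path.write_text("\n".join(json.dumps(s) for s in samples) + "\n", encoding="utf-8")

# Read it back for inspection.
loaded = [json.loads(line) for line in path.read_text(encoding="utf-8").splitlines()]
print(len(loaded), "samples")
```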
3. Fine-Tune Model
```python
from saara import FineTuner
from saara.training.config import TrainingConfig

config = TrainingConfig(
    model_name="mistralai/Mistral-7B-v0.1",
    num_train_epochs=3,
    use_lora=True,
)

finetuner = FineTuner(config)
finetuner.train("dataset.jsonl")
finetuner.save("./output/models/my-finetune")
```
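The memory savings from LoRA come from learning a low-rank update instead of the full weight delta: for a d×d layer, a rank-r adapter trains 2·d·r values instead of d². Back-of-envelope arithmetic (the numbers are illustrative, not tied to any particular model):

```python
d, r = 4096, 8                   # hidden size and LoRA rank (illustrative)

full_update = d * d              # trainable values in a full update of one layer
lora_update = 2 * d * r          # values in the A (r x d) and B (d x r) factors

print(f"full: {full_update:,}  lora: {lora_update:,}")
print(f"trainable fraction: {lora_update / full_update:.4%}")
```

QLoRA pushes this further by keeping the frozen base weights in 4-bit precision while the small adapters train in higher precision.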
4. Evaluate with Teacher Comparison
```python
from saara import ModelEvaluator
from saara.providers.ollama_provider import OllamaProvider, ProviderConfig

student = OllamaProvider(ProviderConfig(model="mistral-7b-finetuned"))
teacher = OllamaProvider(ProviderConfig(model="llama-3-70b"))

evaluator = ModelEvaluator(student, teacher)
metrics = evaluator.evaluate(
    "test.jsonl",
    run_benchmarks=True,
    benchmark_names=["mmlu", "gsm8k"],
)
print(metrics.summary())
```
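One simple way to think about the teacher-agreement metric is as the fraction of prompts on which student and teacher produce the same answer; a toy sketch (SAARA's actual metric may judge or weight answers differently):

```python
# Toy answers for four prompts; a real run would compare model outputs.
student_answers = ["A", "B", "C", "D"]
teacher_answers = ["A", "B", "C", "A"]

matches = sum(s == t for s, t in zip(student_answers, teacher_answers))
agreement = matches / len(teacher_answers)
print(f"teacher agreement: {agreement:.0%}")
```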
5. Export & Quantize
```python
from saara import ModelExporter
from saara.export.formats import ExportFormat
from saara.export.quantization import QuantizationConfig

config = QuantizationConfig(bits=4)
exporter = ModelExporter("./output/models/my-finetune", config)
results = exporter.export(
    "./output/exports",
    formats=[ExportFormat.GGUF, ExportFormat.AWQ, ExportFormat.ONNX],
    quantize=True,
)
```
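The core idea behind 4-bit quantization, independent of the GGUF/AWQ/GPTQ specifics, is mapping floats to small integers with a shared scale. A toy symmetric round-trip (not any exporter's actual algorithm):

```python
weights = [0.12, -0.54, 0.33, 0.97, -1.20]

# The symmetric 4-bit integer range is [-8, 7]; choose a per-tensor scale
# so the largest magnitude maps to 7.
scale = max(abs(w) for w in weights) / 7
quantized = [max(-8, min(7, round(w / scale))) for w in weights]
dequantized = [q * scale for q in quantized]

max_error = max(abs(w - d) for w, d in zip(weights, dequantized))
print(quantized, f"max error {max_error:.3f}")
```

Real formats reduce the rounding error further by using per-group scales and, in AWQ's case, by rescaling activation-critical channels before quantizing.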
CLI Commands
```bash
# Dataset creation
saara dataset from-pdf document.pdf -o ./output --data-types instruction factual
saara dataset from-text text.txt -o ./output

# Training
saara train finetune dataset.jsonl --model mistralai/Mistral-7B-v0.1 --epochs 3

# Evaluation
saara eval model test.jsonl --model mistral --teacher llama-3-70b --benchmarks mmlu gsm8k

# Export
saara export model ./models/final --formats gguf awq --quantize --bits 4

# GUI
saara gui --port 7860
```
Examples
See the examples/ directory for complete workflows:
- 01_pdf_to_dataset.py - Extract and create a dataset from a PDF
- 02_synthetic_data.py - Generate synthetic training data
- 03_finetune.py - Fine-tune a model with QLoRA
- 04_evaluate.py - Evaluate with teacher comparison
- 05_export.py - Export to multiple formats
- 06_complete_pipeline.py - End-to-end workflow
- 07_gui.py - Launch the Gradio GUI
Architecture
```
saara/
├── providers/    # Model providers (vLLM, Ollama, llama.cpp)
├── dataset/      # Dataset creation (PDF extraction, synthetic generation)
├── training/     # Fine-tuning pipelines (QLoRA, LoRA)
├── evaluation/   # Evaluation (benchmarks, teacher-student, power)
├── export/       # Export & quantization (GGUF, AWQ, GPTQ, ONNX, TensorRT)
├── gui/          # Gradio components and dashboard
├── cli/          # Command-line interface
└── utils/        # Utilities (I/O, logging, memory)
```
Supported Formats
Dataset Output
- Alpaca
- ChatML
- ShareGPT
- DPO
- Completion
- JSONL
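For reference, the Alpaca and ChatML formats above differ mainly in shape: Alpaca is a flat instruction/input/output record, while ChatML wraps a role-tagged message list. Illustrative examples (the field names follow community conventions and may not match SAARA's output byte-for-byte):

```python
alpaca = {
    "instruction": "What is QLoRA?",
    "input": "",
    "output": "QLoRA fine-tunes a 4-bit quantized base model through LoRA adapters.",
}

chatml = {
    "messages": [
        {"role": "user", "content": "What is QLoRA?"},
        {"role": "assistant",
         "content": "QLoRA fine-tunes a 4-bit quantized base model through LoRA adapters."},
    ]
}
```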
Model Export
- Safetensors - HuggingFace compatible
- GGUF - llama.cpp CPU/Metal inference
- AWQ - NVIDIA GPU optimized (4-bit)
- GPTQ - Quantized inference
- ONNX - Cross-platform deployment
- TensorRT - NVIDIA Jetson edge deployment
Benchmarks
Standard benchmarks supported:
- MMLU (Massive Multitask Language Understanding)
- GSM8K (Grade School Math)
- HumanEval (Code Generation)
- BoolQ (Boolean Questions)
- HellaSwag (Commonsense NLI)
- TruthfulQA
- WinoGrande
- ARC Easy/Challenge
Metrics Tracked
- Accuracy - Task performance
- Perplexity - Language modeling quality
- Speed - Tokens/sec, latency
- Memory - VRAM usage
- Power - Watts, energy, carbon footprint
- Teacher Agreement - Student-teacher alignment
- Hallucination Rate - Factual accuracy
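Of these, perplexity has a compact definition worth spelling out: it is the exponential of the mean negative log-likelihood per token, so lower is better, and a perplexity of k roughly means the model was as uncertain as a uniform choice among k tokens. A minimal computation with made-up probabilities:

```python
import math

# Probabilities the model assigned to each reference token (illustrative).
token_probs = [0.25, 0.5, 0.125, 0.5]

mean_nll = sum(-math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(mean_nll)
print(f"perplexity: {perplexity:.3f}")
```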
Requirements
- Python 3.10+
- CUDA 11.8+ (for GPU features)
- 8GB+ VRAM recommended for fine-tuning
- 16GB+ VRAM for larger models
Development
```bash
# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/

# Format code
black saara/
ruff check saara/

# Type checking
mypy saara/
```
License
MIT License - see LICENSE
Contributing
Contributions welcome! Please read our contributing guidelines before submitting PRs.
Citation
```bibtex
@software{saara2024,
  title  = {SAARA: End-to-End ML Pipeline},
  author = {Kilani Sai Nikhil},
  year   = {2024},
  url    = {https://github.com/nikhil49023/saara-ai},
}
```