SAARA

End-to-end ML pipeline: PDF→JSON dataset creation, synthetic data generation, fine-tuning, evaluation, and edge deployment with a Gradio GUI

Features

  • 📄 PDF to JSON Dataset Creation - Extract text from PDFs and generate structured training data using local LLMs (vLLM, Ollama, llama.cpp)
  • 🤖 Synthetic Data Generation - Create high-quality training data in multiple formats (Factual, Reasoning, Conversational, Instruction, Code, Creative)
  • 🎯 Fine-Tuning - QLoRA/LoRA fine-tuning with Unsloth for fast, memory-efficient training
  • 📊 Comprehensive Evaluation - Teacher-student comparison, standard benchmarks (MMLU, GSM8K, HumanEval), performance metrics, power consumption tracking
  • 📦 Model Export & Quantization - Export to multiple formats (GGUF, AWQ, GPTQ, ONNX, TensorRT, Safetensors) with 2/3/4/8-bit quantization
  • 🖥️ Gradio GUI - Visual interface for the entire pipeline with auto-generation from Python scripts

Installation

# Clone repository
git clone https://github.com/nikhil49023/saara-ai.git
cd saara-ai

# Install package
pip install -e .

# For development
pip install -e ".[dev]"

# For edge deployment
pip install -e ".[edge]"

Quick Start

1. Launch GUI

saara gui

Or in Python:

from saara import SaaraDashboard

dashboard = SaaraDashboard()
dashboard.launch()

2. Create Dataset from PDF

from saara import DatasetBuilder
from saara.dataset.types import DataType
from saara.providers.ollama_provider import OllamaProvider, ProviderConfig

# Setup provider
config = ProviderConfig(model="mistral", base_url="http://localhost:11434")
provider = OllamaProvider(config)

# Create dataset
builder = DatasetBuilder(provider)
samples = builder.from_pdf(
    "document.pdf",
    data_types=[DataType.INSTRUCTION, DataType.FACTUAL],
    pairs_per_type=5,
)

# Save
builder.save(samples, "dataset.jsonl")
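The saved dataset is plain JSON Lines, one sample per line, so it can be sanity-checked with the standard library alone. The `instruction`/`input`/`output` field names below are illustrative (an Alpaca-like shape); the actual schema depends on the dataset format you configure:

```python
import json
import tempfile
from pathlib import Path

# Two illustrative samples; field names are an assumption, not SAARA's fixed schema.
samples = [
    {"instruction": "Summarize section 2.", "input": "", "output": "..."},
    {"instruction": "Define QLoRA.", "input": "", "output": "..."},
]

path = Path(tempfile.mkdtemp()) / "dataset.jsonl"
with path.open("w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")

# Read it back and confirm every line parses as a JSON object.
loaded = [json.loads(line) for line in path.read_text().splitlines()]
print(len(loaded))  # → 2
```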

3. Fine-Tune Model

from saara import FineTuner
from saara.training.config import TrainingConfig

config = TrainingConfig(
    model_name="mistralai/Mistral-7B-v0.1",
    num_train_epochs=3,
    use_lora=True,
)

finetuner = FineTuner(config)
finetuner.train("dataset.jsonl")
finetuner.save("./output/models/my-finetune")

4. Evaluate with Teacher Comparison

from saara import ModelEvaluator
from saara.providers.ollama_provider import OllamaProvider, ProviderConfig

student = OllamaProvider(ProviderConfig(model="mistral-7b-finetuned"))
teacher = OllamaProvider(ProviderConfig(model="llama-3-70b"))

evaluator = ModelEvaluator(student, teacher)
metrics = evaluator.evaluate(
    "test.jsonl",
    run_benchmarks=True,
    benchmark_names=["mmlu", "gsm8k"],
)

print(metrics.summary())
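Teacher agreement is, at its simplest, the fraction of prompts on which the student's answer matches the teacher's. A toy sketch of that idea (not SAARA's internal implementation, which also covers perplexity, power draw, and benchmark scores):

```python
def agreement_rate(student_answers, teacher_answers):
    """Fraction of prompts where the student's answer matches the teacher's
    after basic normalization (case and surrounding whitespace)."""
    assert len(student_answers) == len(teacher_answers)
    matches = sum(
        s.strip().lower() == t.strip().lower()
        for s, t in zip(student_answers, teacher_answers)
    )
    return matches / len(student_answers)

student = ["Paris", "4", "blue whale"]
teacher = ["paris", "4", "elephant"]
print(round(agreement_rate(student, teacher), 3))  # → 0.667
```

Real teacher-student comparison typically uses fuzzier matching (semantic similarity or judge models) rather than exact string equality; this only illustrates the underlying ratio.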

5. Export & Quantize

from saara import ModelExporter
from saara.export.formats import ExportFormat
from saara.export.quantization import QuantizationConfig

config = QuantizationConfig(bits=4)
exporter = ModelExporter("./output/models/my-finetune", config)

results = exporter.export(
    "./output/exports",
    formats=[ExportFormat.GGUF, ExportFormat.AWQ, ExportFormat.ONNX],
    quantize=True,
)
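Bit-width is the dominant factor in a quantized model's footprint. A back-of-envelope estimate (it ignores embedding tables, quantization scales/zero-points, and container overhead, so real GGUF/AWQ files run somewhat larger):

```python
def approx_size_gib(num_params: float, bits: int) -> float:
    """Rough weight storage: params × bits / 8 bytes, converted to GiB."""
    return num_params * bits / 8 / 1024**3

# A 7B-parameter model at the bit-widths SAARA's exporter supports.
for bits in (2, 3, 4, 8):
    print(f"{bits}-bit: ~{approx_size_gib(7e9, bits):.2f} GiB")
```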

CLI Commands

# Dataset creation
saara dataset from-pdf document.pdf -o ./output --data-types instruction factual
saara dataset from-text text.txt -o ./output

# Training
saara train finetune dataset.jsonl --model mistralai/Mistral-7B-v0.1 --epochs 3

# Evaluation
saara eval model test.jsonl --model mistral --teacher llama-3-70b --benchmarks mmlu gsm8k

# Export
saara export model ./models/final --formats gguf awq --quantize --bits 4

# GUI
saara gui --port 7860

Run saara --help (or saara <command> --help) for the full list of options.

Examples

See the examples/ directory for complete workflows:

  • 01_pdf_to_dataset.py - Extract and create dataset from PDF
  • 02_synthetic_data.py - Generate synthetic training data
  • 03_finetune.py - Fine-tune a model with QLoRA
  • 04_evaluate.py - Evaluate with teacher comparison
  • 05_export.py - Export to multiple formats
  • 06_complete_pipeline.py - End-to-end workflow
  • 07_gui.py - Launch Gradio GUI

Architecture

saara/
├── providers/      # Model providers (vLLM, Ollama, llama.cpp)
├── dataset/        # Dataset creation (PDF extraction, synthetic generation)
├── training/       # Fine-tuning pipelines (QLoRA, LoRA)
├── evaluation/     # Evaluation (benchmarks, teacher-student, power)
├── export/         # Export & quantization (GGUF, AWQ, GPTQ, ONNX, TensorRT)
├── gui/            # Gradio components and dashboard
├── cli/            # Command-line interface
└── utils/          # Utilities (I/O, logging, memory)

Supported Formats

Dataset Output

  • Alpaca
  • ChatML
  • ShareGPT
  • DPO
  • Completion
  • JSONL
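These formats differ mainly in how a single sample is structured. Two common shapes, sketched as Python dicts (representative of the Alpaca and ChatML conventions in general, not taken from SAARA's source):

```python
# Alpaca: flat instruction/input/output fields, one sample per record.
alpaca_sample = {
    "instruction": "Explain what QLoRA is.",
    "input": "",
    "output": "QLoRA fine-tunes a quantized base model through low-rank adapters.",
}

# ChatML: a list of role-tagged messages, one conversation per record.
chatml_sample = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain what QLoRA is."},
        {"role": "assistant", "content": "QLoRA fine-tunes a quantized base model."},
    ]
}

print(sorted(alpaca_sample))  # → ['input', 'instruction', 'output']
print(chatml_sample["messages"][0]["role"])  # → system
```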

Model Export

  • Safetensors - HuggingFace compatible
  • GGUF - llama.cpp CPU/Metal inference
  • AWQ - Activation-aware weight quantization, NVIDIA GPU optimized (4-bit)
  • GPTQ - Quantized inference
  • ONNX - Cross-platform deployment
  • TensorRT - NVIDIA Jetson edge deployment

Benchmarks

Standard benchmarks supported:

  • MMLU (Massive Multitask Language Understanding)
  • GSM8K (Grade School Math)
  • HumanEval (Code Generation)
  • BoolQ (Boolean Questions)
  • HellaSwag (Commonsense NLI)
  • TruthfulQA
  • WinoGrande
  • ARC Easy/Challenge
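Most of these benchmarks reduce to multiple-choice accuracy: the model picks one lettered option per question, and the score is the fraction answered correctly. A minimal illustration of that scoring (toy data, not the actual benchmark harness):

```python
def multiple_choice_accuracy(predictions, answer_key):
    """Fraction of questions where the predicted letter matches the key."""
    correct = sum(p == a for p, a in zip(predictions, answer_key))
    return correct / len(answer_key)

# Toy MMLU-style run: four questions, options A-D.
predictions = ["B", "C", "A", "D"]
answer_key = ["B", "C", "B", "D"]
print(multiple_choice_accuracy(predictions, answer_key))  # → 0.75
```

Generative benchmarks like GSM8K and HumanEval instead check extracted final answers or run generated code against unit tests, but the final score is still a fraction correct.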

Metrics Tracked

  • Accuracy - Task performance
  • Perplexity - Language modeling quality
  • Speed - Tokens/sec, latency
  • Memory - VRAM usage
  • Power - Watts, energy, carbon footprint
  • Teacher Agreement - Student-teacher alignment
  • Hallucination Rate - Frequency of factually unsupported outputs
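The speed and power figures follow directly from measured counts and durations. The arithmetic underneath, with an assumed illustrative grid carbon intensity (the exact fields SAARA reports may differ):

```python
generated_tokens = 1200      # tokens produced during the run
elapsed_seconds = 4.0        # wall-clock generation time
avg_power_watts = 45.0       # average draw measured during generation
grid_kgco2_per_kwh = 0.4     # assumed grid carbon intensity

tokens_per_sec = generated_tokens / elapsed_seconds
energy_kwh = avg_power_watts * elapsed_seconds / 3600 / 1000
carbon_g = energy_kwh * grid_kgco2_per_kwh * 1000

print(f"{tokens_per_sec:.0f} tok/s, {energy_kwh * 1000:.3f} Wh, {carbon_g:.3f} g CO2")
# → 300 tok/s, 0.050 Wh, 0.020 g CO2
```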

Requirements

  • Python 3.10+
  • CUDA 11.8+ (for GPU features)
  • 8GB+ VRAM recommended for fine-tuning
  • 16GB+ VRAM for larger models

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/

# Format code
black saara/
ruff check saara/

# Type checking
mypy saara/

License

MIT License - see LICENSE

Contributing

Contributions welcome! Please read our contributing guidelines before submitting PRs.

Citation

@software{saara2024,
  title = {SAARA: End-to-End ML Pipeline},
  author = {Kilani Sai Nikhil},
  year = {2024},
  url = {https://github.com/nikhil49023/saara-ai},
}


Download files

Source distribution: saara_ai-0.1.0.tar.gz (35.1 kB)
  SHA256: 73683d47d747c2dd54794e87b8bfb6409ae60d7e4504f9f08759dc715bc23942

Built distribution: saara_ai-0.1.0-py3-none-any.whl (40.4 kB, Python 3)
  SHA256: cd5582d82adbc55933b1e4aad113f79d082b47735ea9891ebfd28909386edabe
