SAARA

End-to-end ML pipeline: PDF→JSON dataset creation, synthetic data generation, fine-tuning, evaluation, and edge deployment with a Gradio GUI

Features

  • 📄 PDF to JSON Dataset Creation - Extract text from PDFs and generate structured training data using local LLMs (vLLM, Ollama, llama.cpp)
  • 🤖 Synthetic Data Generation - Create high-quality training data in multiple formats (Factual, Reasoning, Conversational, Instruction, Code, Creative)
  • 🎯 Fine-Tuning - QLoRA/LoRA fine-tuning with Unsloth for fast, memory-efficient training
  • 📊 Comprehensive Evaluation - Teacher-student comparison, standard benchmarks (MMLU, GSM8K, HumanEval), performance metrics, power consumption tracking
  • 📦 Model Export & Quantization - Export to multiple formats (GGUF, AWQ, GPTQ, ONNX, TensorRT, Safetensors) with 2/3/4/8-bit quantization
  • 🖥️ Gradio GUI - Visual interface for the entire pipeline with auto-generation from Python scripts

Installation

# Clone repository
git clone https://github.com/nikhil49023/saara-ai.git
cd saara-ai

# Install package
pip install -e .

# For development
pip install -e ".[dev]"

# For edge deployment
pip install -e ".[edge]"

Quick Start

1. Launch GUI

saara gui

Or in Python:

from saara import SaaraDashboard

dashboard = SaaraDashboard()
dashboard.launch()

2. Create Dataset from PDF

from saara import DatasetBuilder
from saara.dataset.types import DataType
from saara.providers.ollama_provider import OllamaProvider, ProviderConfig

# Setup provider
config = ProviderConfig(model="mistral", base_url="http://localhost:11434")
provider = OllamaProvider(config)

# Create dataset
builder = DatasetBuilder(provider)
samples = builder.from_pdf(
    "document.pdf",
    data_types=[DataType.INSTRUCTION, DataType.FACTUAL],
    pairs_per_type=5,
)

# Save
builder.save(samples, "dataset.jsonl")
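The saved dataset is plain JSON Lines, one sample per line, so it can be sanity-checked with the standard library alone. The `instruction`/`input`/`output` field names below are illustrative (an Alpaca-like shape); the actual schema depends on the dataset format you configure:

```python
import json
import tempfile
from pathlib import Path

# Two illustrative samples; field names are an assumption, not SAARA's fixed schema.
samples = [
    {"instruction": "Summarize section 2.", "input": "", "output": "..."},
    {"instruction": "Define QLoRA.", "input": "", "output": "..."},
]

path = Path(tempfile.mkdtemp()) / "dataset.jsonl"
with path.open("w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")

# Read it back and confirm every line parses as a JSON object.
loaded = [json.loads(line) for line in path.read_text().splitlines()]
print(len(loaded))  # → 2
```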

3. Fine-Tune Model

from saara import FineTuner
from saara.training.config import TrainingConfig

config = TrainingConfig(
    model_name="mistralai/Mistral-7B-v0.1",
    num_train_epochs=3,
    use_lora=True,
)

finetuner = FineTuner(config)
finetuner.train("dataset.jsonl")
finetuner.save("./output/models/my-finetune")

4. Evaluate with Teacher Comparison

from saara import ModelEvaluator
from saara.providers.ollama_provider import OllamaProvider, ProviderConfig

student = OllamaProvider(ProviderConfig(model="mistral-7b-finetuned"))
teacher = OllamaProvider(ProviderConfig(model="llama-3-70b"))

evaluator = ModelEvaluator(student, teacher)
metrics = evaluator.evaluate(
    "test.jsonl",
    run_benchmarks=True,
    benchmark_names=["mmlu", "gsm8k"],
)

print(metrics.summary())
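Teacher agreement is, at its simplest, the fraction of prompts on which the student's answer matches the teacher's. A toy sketch of that idea (not SAARA's internal implementation, which also covers perplexity, power draw, and benchmark scores):

```python
def agreement_rate(student_answers, teacher_answers):
    """Fraction of prompts where the student's answer matches the teacher's
    after basic normalization (case and surrounding whitespace)."""
    assert len(student_answers) == len(teacher_answers)
    matches = sum(
        s.strip().lower() == t.strip().lower()
        for s, t in zip(student_answers, teacher_answers)
    )
    return matches / len(student_answers)

student = ["Paris", "4", "blue whale"]
teacher = ["paris", "4", "elephant"]
print(round(agreement_rate(student, teacher), 3))  # → 0.667
```

Real teacher-student comparison typically uses fuzzier matching (semantic similarity or judge models) rather than exact string equality; this only illustrates the underlying ratio.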

5. Export & Quantize

from saara import ModelExporter
from saara.export.formats import ExportFormat
from saara.export.quantization import QuantizationConfig

config = QuantizationConfig(bits=4)
exporter = ModelExporter("./output/models/my-finetune", config)

results = exporter.export(
    "./output/exports",
    formats=[ExportFormat.GGUF, ExportFormat.AWQ, ExportFormat.ONNX],
    quantize=True,
)
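Bit-width is the dominant factor in a quantized model's footprint. A back-of-envelope estimate (it ignores embedding tables, quantization scales/zero-points, and container overhead, so real GGUF/AWQ files run somewhat larger):

```python
def approx_size_gib(num_params: float, bits: int) -> float:
    """Rough weight storage: params × bits / 8 bytes, converted to GiB."""
    return num_params * bits / 8 / 1024**3

# A 7B-parameter model at the bit-widths SAARA's exporter supports.
for bits in (2, 3, 4, 8):
    print(f"{bits}-bit: ~{approx_size_gib(7e9, bits):.2f} GiB")
```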

CLI Commands

# Dataset creation
saara dataset from-pdf document.pdf -o ./output --data-types instruction factual
saara dataset from-text text.txt -o ./output

# Training
saara train finetune dataset.jsonl --model mistralai/Mistral-7B-v0.1 --epochs 3

# Evaluation
saara eval model test.jsonl --model mistral --teacher llama-3-70b --benchmarks mmlu gsm8k

# Export
saara export model ./models/final --formats gguf awq --quantize --bits 4

# GUI
saara gui --port 7860

Run saara --help (or saara <command> --help) for the full list of options.

Examples

See the examples/ directory for complete workflows:

  • 01_pdf_to_dataset.py - Extract and create dataset from PDF
  • 02_synthetic_data.py - Generate synthetic training data
  • 03_finetune.py - Fine-tune a model with QLoRA
  • 04_evaluate.py - Evaluate with teacher comparison
  • 05_export.py - Export to multiple formats
  • 06_complete_pipeline.py - End-to-end workflow
  • 07_gui.py - Launch Gradio GUI

Architecture

saara/
├── providers/      # Model providers (vLLM, Ollama, llama.cpp)
├── dataset/        # Dataset creation (PDF extraction, synthetic generation)
├── training/       # Fine-tuning pipelines (QLoRA, LoRA)
├── evaluation/     # Evaluation (benchmarks, teacher-student, power)
├── export/         # Export & quantization (GGUF, AWQ, GPTQ, ONNX, TensorRT)
├── gui/            # Gradio components and dashboard
├── cli/            # Command-line interface
└── utils/          # Utilities (I/O, logging, memory)

Supported Formats

Dataset Output

  • Alpaca
  • ChatML
  • ShareGPT
  • DPO
  • Completion
  • JSONL
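These formats differ mainly in how a single sample is structured. Two common shapes, sketched as Python dicts (representative of the Alpaca and ChatML conventions in general, not taken from SAARA's source):

```python
# Alpaca: flat instruction/input/output fields, one sample per record.
alpaca_sample = {
    "instruction": "Explain what QLoRA is.",
    "input": "",
    "output": "QLoRA fine-tunes a quantized base model through low-rank adapters.",
}

# ChatML: a list of role-tagged messages, one conversation per record.
chatml_sample = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain what QLoRA is."},
        {"role": "assistant", "content": "QLoRA fine-tunes a quantized base model."},
    ]
}

print(sorted(alpaca_sample))  # → ['input', 'instruction', 'output']
print(chatml_sample["messages"][0]["role"])  # → system
```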

Model Export

  • Safetensors - HuggingFace compatible
  • GGUF - llama.cpp CPU/Metal inference
  • AWQ - Activation-aware weight quantization, NVIDIA GPU optimized (4-bit)
  • GPTQ - Quantized inference
  • ONNX - Cross-platform deployment
  • TensorRT - NVIDIA Jetson edge deployment

Benchmarks

Standard benchmarks supported:

  • MMLU (Massive Multitask Language Understanding)
  • GSM8K (Grade School Math)
  • HumanEval (Code Generation)
  • BoolQ (Boolean Questions)
  • HellaSwag (Commonsense NLI)
  • TruthfulQA
  • WinoGrande
  • ARC Easy/Challenge
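Most of these benchmarks reduce to multiple-choice accuracy: the model picks one lettered option per question, and the score is the fraction answered correctly. A minimal illustration of that scoring (toy data, not the actual benchmark harness):

```python
def multiple_choice_accuracy(predictions, answer_key):
    """Fraction of questions where the predicted letter matches the key."""
    correct = sum(p == a for p, a in zip(predictions, answer_key))
    return correct / len(answer_key)

# Toy MMLU-style run: four questions, options A-D.
predictions = ["B", "C", "A", "D"]
answer_key = ["B", "C", "B", "D"]
print(multiple_choice_accuracy(predictions, answer_key))  # → 0.75
```

Generative benchmarks like GSM8K and HumanEval instead check extracted final answers or run generated code against unit tests, but the final score is still a fraction correct.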

Metrics Tracked

  • Accuracy - Task performance
  • Perplexity - Language modeling quality
  • Speed - Tokens/sec, latency
  • Memory - VRAM usage
  • Power - Watts, energy, carbon footprint
  • Teacher Agreement - Student-teacher alignment
  • Hallucination Rate - Frequency of factually unsupported outputs
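The speed and power figures follow directly from measured counts and durations. The arithmetic underneath, with an assumed illustrative grid carbon intensity (the exact fields SAARA reports may differ):

```python
generated_tokens = 1200      # tokens produced during the run
elapsed_seconds = 4.0        # wall-clock generation time
avg_power_watts = 45.0       # average draw measured during generation
grid_kgco2_per_kwh = 0.4     # assumed grid carbon intensity

tokens_per_sec = generated_tokens / elapsed_seconds
energy_kwh = avg_power_watts * elapsed_seconds / 3600 / 1000
carbon_g = energy_kwh * grid_kgco2_per_kwh * 1000

print(f"{tokens_per_sec:.0f} tok/s, {energy_kwh * 1000:.3f} Wh, {carbon_g:.3f} g CO2")
# → 300 tok/s, 0.050 Wh, 0.020 g CO2
```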

Requirements

  • Python 3.10+
  • CUDA 11.8+ (for GPU features)
  • 8GB+ VRAM recommended for fine-tuning
  • 16GB+ VRAM for larger models

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/

# Format code
black saara/
ruff check saara/

# Type checking
mypy saara/

License

MIT License - see LICENSE

Contributing

Contributions welcome! Please read our contributing guidelines before submitting PRs.

Citation

@software{saara2024,
  title = {SAARA: End-to-End ML Pipeline},
  author = {Kilani Sai Nikhil},
  year = {2024},
  url = {https://github.com/nikhil49023/saara-ai},
}


Download files

Source distribution: saara_ai-0.1.0.tar.gz (35.1 kB)
  SHA256: 73683d47d747c2dd54794e87b8bfb6409ae60d7e4504f9f08759dc715bc23942

Built distribution: saara_ai-0.1.0-py3-none-any.whl (40.4 kB, Python 3)
  SHA256: cd5582d82adbc55933b1e4aad113f79d082b47735ea9891ebfd28909386edabe
