End-to-end ML pipeline: PDF→JSON datasets, synthetic data, fine-tuning, evaluation, and edge deployment with Gradio GUI

SAARA

SAARA is a Python package for building local LLM workflows around data creation, fine-tuning, evaluation, export, and visualization.

It is aimed at a practical teacher-student workflow:

  1. extract text from PDFs or raw text
  2. generate synthetic supervision with a stronger local model
  3. fine-tune a smaller model
  4. evaluate quality, speed, memory, and optional teacher agreement
  5. export and quantize for deployment
  6. inspect the workflow through CLI or Gradio
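Assembled from the snippets in the sections below, the six steps can be sketched as one script. All class and function names appear later in this README; combining them into a single function (and the exact model names and paths) is an illustrative assumption, not a tested pipeline:

```python
def run_pipeline(pdf_path: str = "document.pdf") -> None:
    """Sketch of an end-to-end run: PDF -> dataset -> fine-tune -> eval -> export."""
    # Imports are deferred into the function so the sketch reads as one unit;
    # in a real script they would sit at module level.
    from saara import DatasetBuilder, FineTuner, ModelEvaluator, ModelExporter
    from saara.dataset.types import DataType
    from saara.providers.ollama_provider import OllamaProvider, ProviderConfig
    from saara.training.config import TrainingConfig
    from saara.export.formats import ExportFormat
    from saara.export.quantization import QuantizationConfig

    # 1-2. extract text from the PDF and generate synthetic supervision
    #      with a stronger local teacher model
    teacher = OllamaProvider(ProviderConfig(model="mistral"))
    builder = DatasetBuilder(teacher)
    samples = builder.from_pdf(
        pdf_path, data_types=[DataType.INSTRUCTION], pairs_per_type=5
    )
    builder.save(samples, "dataset.jsonl")

    # 3. fine-tune a smaller student model with LoRA
    finetuner = FineTuner(
        TrainingConfig(model_name="mistralai/Mistral-7B-v0.1", use_lora=True)
    )
    finetuner.train("dataset.jsonl")
    finetuner.save("./output/models/my-finetune")

    # 4. evaluate the student, optionally against the teacher
    student = OllamaProvider(ProviderConfig(model="mistral-7b-finetuned"))
    metrics = ModelEvaluator(student, teacher).evaluate("dataset.jsonl")
    print(metrics.summary())

    # 5. export and quantize for deployment
    exporter = ModelExporter(
        model="./output/models/my-finetune", config=QuantizationConfig(bits=4)
    )
    exporter.export("./output/exports", formats=[ExportFormat.GGUF], quantize=True)
    # 6. inspect the results afterwards with the CLI or `saara gui`
```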

Features

  • PDF to text extraction with PyMuPDF and pdfplumber
  • synthetic dataset generation for factual, reasoning, conversational, instruction, code, and creative tasks
  • local provider abstraction for Ollama, vLLM, and llama.cpp
  • fine-tuning helpers for LoRA and QLoRA workflows
  • evaluation helpers for custom datasets, benchmarks, and teacher-student comparison
  • export helpers for safetensors, GGUF, AWQ, GPTQ, ONNX, TensorRT, and PyTorch formats
  • Gradio dashboard and CLI entrypoints
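The provider abstraction means any backend exposing the same interface can be swapped in. The method name `generate` in this stub is an assumption for illustration, not documented SAARA API; check the provider base class in saara.providers for the real contract:

```python
class EchoProvider:
    """Hypothetical stand-in provider. A real provider (Ollama, vLLM,
    llama.cpp) would forward the prompt to a running model here."""

    def __init__(self, model: str = "echo"):
        self.model = model

    def generate(self, prompt: str) -> str:
        return f"[{self.model}] {prompt}"


def ask(provider, prompt: str) -> str:
    # Anything with the same interface can be dropped in unchanged.
    return provider.generate(prompt)


print(ask(EchoProvider(), "hello"))  # → [echo] hello
```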

Install

From PyPI:

pip install saara-ai

Common installs:

pip install "saara-ai[training]"
pip install "saara-ai[export]"
pip install "saara-ai[providers,training,evaluation,export]"

From source:

git clone https://github.com/nikhil49023/saara-ai.git
cd saara-ai
python3 -m venv .venv
source .venv/bin/activate
pip install -e .

Optional extras:

pip install -e ".[dev]"
pip install -e ".[edge]"
pip install -e ".[providers,training,evaluation,export]"

The base saara-ai install is intentionally kept light: heavy native stacks such as auto-gptq, autoawq, llama-cpp-python, vllm, and the training toolchains are opt-in extras.

Note: a virtual environment is strongly recommended.

Quick Start

1. Start a local model provider

For Ollama, the server must be running before models can be pulled (on many installs it already runs as a background service):

ollama serve
ollama pull mistral

2. Build a dataset from a PDF

from saara import DatasetBuilder
from saara.dataset.types import DataType
from saara.providers.ollama_provider import OllamaProvider, ProviderConfig

provider = OllamaProvider(
    ProviderConfig(model="mistral", base_url="http://localhost:11434")
)

builder = DatasetBuilder(provider)
samples = builder.from_pdf(
    "document.pdf",
    data_types=[DataType.INSTRUCTION, DataType.FACTUAL],
    pairs_per_type=5,
    min_quality=0.65,
)

builder.save(samples, "dataset.jsonl")
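builder.save writes JSON Lines: one JSON object per sample. SAARA's exact field names are internal; this stdlib sketch uses illustrative instruction/response keys just to show how such a file can be written and inspected:

```python
import json

# Illustrative records; SAARA's actual schema may use different field names.
samples = [
    {"instruction": "Summarize the document.", "response": "..."},
    {"instruction": "What is the main topic?", "response": "..."},
]

with open("dataset.jsonl", "w", encoding="utf-8") as f:
    for s in samples:
        f.write(json.dumps(s) + "\n")

# Reading it back: one json.loads call per line.
with open("dataset.jsonl", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(len(loaded))  # → 2
```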

3. Launch the GUI

saara gui

Core Workflow

Dataset creation

from saara import DatasetBuilder
from saara.dataset.types import DataType

builder = DatasetBuilder(provider)
samples = builder.from_text(
    "Transformers use attention mechanisms for sequence modeling.",
    data_types=[DataType.FACTUAL, DataType.INSTRUCTION],
    pairs_per_type=3,
)

Fine-tuning

from saara import FineTuner
from saara.training.config import TrainingConfig

config = TrainingConfig(
    model_name="mistralai/Mistral-7B-v0.1",
    num_train_epochs=3,
    use_lora=True,
)

finetuner = FineTuner(config)
finetuner.train("dataset.jsonl")
finetuner.save("./output/models/my-finetune")

Evaluation

from saara import ModelEvaluator
from saara.providers.ollama_provider import OllamaProvider, ProviderConfig

student = OllamaProvider(ProviderConfig(model="mistral-7b-finetuned"))
teacher = OllamaProvider(ProviderConfig(model="llama-3-70b"))

evaluator = ModelEvaluator(student, teacher)
metrics = evaluator.evaluate(
    "test.jsonl",
    run_benchmarks=True,
    benchmark_names=["mmlu", "gsm8k"],
)

print(metrics.summary())
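The teacher-agreement idea can be illustrated with a toy metric. SAARA's actual scoring is internal to ModelEvaluator; this exact-match rate is only a stand-in sketch:

```python
def agreement_rate(student_answers, teacher_answers):
    """Fraction of prompts where student and teacher give the same answer,
    compared case-insensitively. A toy stand-in for SAARA's comparison."""
    pairs = list(zip(student_answers, teacher_answers))
    same = sum(s.strip().lower() == t.strip().lower() for s, t in pairs)
    return same / len(pairs) if pairs else 0.0


student = ["Paris", "4", "blue"]
teacher = ["paris", "4", "green"]
print(round(agreement_rate(student, teacher), 2))  # → 0.67
```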

Export and quantization

from saara import ModelExporter
from saara.export.formats import ExportFormat
from saara.export.quantization import QuantizationConfig

exporter = ModelExporter(
    model="./output/models/my-finetune",
    config=QuantizationConfig(bits=4),
)

results = exporter.export(
    "./output/exports",
    formats=[ExportFormat.GGUF, ExportFormat.AWQ, ExportFormat.ONNX],
    quantize=True,
)
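As a back-of-the-envelope check on why bits=4 matters for edge deployment, a weight-only size estimate (the 7B parameter count is a placeholder, and real GGUF/AWQ files add metadata and keep some tensors at higher precision):

```python
def approx_size_gb(n_params: float, bits: int) -> float:
    """Rough weight-only size estimate: params * bits / 8 bytes, in GB."""
    return n_params * bits / 8 / 1e9


n = 7e9  # a 7B-parameter model
print(round(approx_size_gb(n, 16), 1))  # → 14.0  (fp16 baseline)
print(round(approx_size_gb(n, 4), 1))   # → 3.5   (4-bit quantized)
```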

CLI

saara dataset from-pdf document.pdf -o ./output --data-types instruction factual
saara train finetune dataset.jsonl --model mistralai/Mistral-7B-v0.1 --epochs 3
saara eval model test.jsonl --model mistral --teacher llama-3-70b --benchmarks mmlu gsm8k
saara export model ./models/final --formats gguf awq --quantize --bits 4
saara gui --port 7860

Examples

  • examples/01_pdf_to_dataset.py
  • examples/02_synthetic_data.py
  • examples/03_finetune.py
  • examples/04_evaluate.py
  • examples/05_export.py
  • examples/06_complete_pipeline.py
  • examples/07_gui.py

Recommended Starting Point

If you're trying SAARA for the first time, start with:

  • OllamaProvider
  • one small PDF or a short text file
  • LoRA or QLoRA fine-tuning
  • GGUF and safetensors export

Development

pytest tests/
black saara/
ruff check saara/
mypy saara/

PyPI

https://pypi.org/project/saara-ai/ (latest release at time of writing: 0.1.1)

License

MIT. See LICENSE.
