

Saara

Saara is a local-first CLI for ML dataset workflows:

  • topic-to-dataset generation using Firecrawl-local research
  • foundations for PDF and document ingestion
  • local model provider routing for Ollama and vLLM-compatible servers
  • canonical dataset examples with provenance
  • labeling and distillation commands
  • validation reports
  • exports to JSON, JSONL, CSV, Parquet, Arrow, and Hugging Face Dataset directories

The current implementation is an MVP scaffold intended to be extended into the full CLI.

Quick Start

pip install -e .
saara splash
saara wizard
saara init
saara models health --provider ollama --model qwen
saara generate topic "robotics motion planning" --samples 20 --provider mock --format jsonl --output-dir runs/robotics
saara label .mlforge/datasets/robotics-motion-planning.jsonl --labels useful,not-useful --out labeled.jsonl
saara distill labeled.jsonl --method sft --out distilled.jsonl
saara validate .mlforge/datasets/robotics-motion-planning.jsonl

Running saara without arguments shows the splash screen and command help. Use saara wizard for the interactive guided flow, and direct subcommands for scripts or automation. Interactive sessions include terminal animations for the splash screen, menu headers, long-running operations, and completion states. Scripted or piped output automatically falls back to plain text.

Use --provider mock for deterministic local smoke tests without a running model.

Run a declarative workflow:

saara run examples/topic-dataset.json
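The workflow file is a JSON description of a run. Its exact schema is not documented here; as a sketch, assuming field names that mirror the CLI flags shown above (topic, provider, samples, format, output_dir), a workflow file could be produced like this:

```python
import json

# Hypothetical workflow description; the field names mirror the CLI flags
# (--provider, --samples, --format, --output-dir) and are assumptions,
# not the documented saara schema.
workflow = {
    "task": "generate-topic",
    "topic": "robotics motion planning",
    "provider": "mock",
    "samples": 20,
    "format": "jsonl",
    "output_dir": "runs/robotics",
}

# Write the workflow so it can be passed to `saara run`.
with open("topic-dataset.json", "w") as f:
    json.dump(workflow, f, indent=2)
```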

Installation

Development install:

python3 -m venv .venv
. .venv/bin/activate
pip install -e .

Install optional dataset exporters:

pip install -e '.[data]'

Install all optional local features:

pip install -e '.[all]'

Fresh machine runtime setup:

saara doctor
saara setup docker --dry-run
saara setup ollama --dry-run
saara setup docker ollama

On Debian/Ubuntu, Saara installs Docker Engine from Docker's official apt repository. On Linux, Ollama is installed with the official Ollama installer. Review --dry-run output before running setup commands. Saara does not pull or install models automatically; choose a model based on your hardware tier.

After installation, use saara directly like a traditional CLI. The old mlforge command remains available as a compatibility alias during development.

For an isolated user-level install, use pipx once this project is published or packaged:

pipx install .

Firecrawl Local

Topic generation can use Firecrawl-local at http://localhost:3002:

saara generate topic "dataset distillation" \
  --provider ollama \
  --model qwen \
  --research firecrawl \
  --samples 100

The Firecrawl integration is exposed as a typed agent tool named firecrawl_local. The topic workflow uses a bounded ResearchAgent that calls:

  • firecrawl_local.search(query, limit)
  • firecrawl_local.scrape(url)
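The tool surface can be pictured as a small typed client. Only the tool name and the two method signatures above come from the docs; the class shape, default base URL placement, and return types below are illustrative assumptions:

```python
from dataclasses import dataclass


@dataclass
class FirecrawlLocalTool:
    """Illustrative sketch of a typed Firecrawl-local tool.

    Only the method names (search, scrape) come from the docs; the
    return types and request handling are assumptions. Methods are
    stubbed so the sketch runs without a Firecrawl-local server.
    """

    base_url: str = "http://localhost:3002"

    def search(self, query: str, limit: int = 5) -> list:
        # Would call the Firecrawl-local search endpoint; stubbed here.
        raise NotImplementedError

    def scrape(self, url: str) -> str:
        # Would fetch a page and return its content; stubbed here.
        raise NotImplementedError
```

A typed interface like this keeps agent tool calls auditable: every call site names a concrete method with concrete parameters rather than passing free-form strings.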

LangChain is not required for the core workflow. Saara uses its own small typed tool interface so Firecrawl-local calls are deterministic, auditable, and easy to test. A small adapter is included for projects that want LangChain-compatible tools via the optional saara-ai[agents] extra.

Configurable Dataset Modes

Generation can target multiple training dataset shapes:

  • finetuning: chat/SFT-style message examples
  • pretraining: plain text examples in output.text
  • reasoning: examples with a reasoning field
  • tool-calling: examples with tools and tool_calls
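To make the four shapes concrete, the records below are hypothetical examples of what each mode could emit; only the field names called out above (chat-style messages, output.text, reasoning, tools/tool_calls) are taken from the docs, and the surrounding schemas are assumptions:

```python
# Hypothetical example records for each dataset mode; exact schemas
# beyond the documented field names are assumptions.
finetuning = {
    "messages": [
        {"role": "user", "content": "What is dataset distillation?"},
        {"role": "assistant", "content": "Compressing a dataset into fewer, denser examples."},
    ]
}
pretraining = {"output": {"text": "Plain-text passage used for pretraining."}}
reasoning = {
    "question": "2 + 2?",
    "reasoning": "Add the two operands.",
    "answer": "4",
}
tool_calling = {
    "tools": [{"name": "search", "description": "Web search"}],
    "tool_calls": [{"name": "search", "arguments": {"query": "robotics"}}],
}
```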

Most runtime and prompting behavior is user-configurable from CLI flags or workflow JSON: provider base URLs, model names, API keys, Firecrawl URL, system prompt, prompt template, temperature, max tokens, output format, and output directory. When --output-dir is used, Saara writes datasets, reports, and run artifacts into that directory.

Runtime Providers

  • mock: deterministic development provider
  • ollama: http://localhost:11434
  • vllm: OpenAI-compatible endpoint, default http://localhost:8000/v1
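Because the vllm provider targets an OpenAI-compatible endpoint, a raw request against it is a standard chat-completions call. The sketch below only builds the request (the model name and prompt are placeholders) and deliberately does not send it, so it runs without a server:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # default vLLM endpoint from the docs

# Standard OpenAI-compatible chat-completions payload; "qwen" is a
# placeholder model name.
payload = {
    "model": "qwen",
    "messages": [{"role": "user", "content": "Say hello."}],
    "max_tokens": 32,
}

req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req) would send the request; left unsent here.
```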

Dataset Formats

Supported exports:

  • json
  • jsonl
  • csv
  • parquet (requires optional pyarrow)
  • arrow (requires optional pyarrow)
  • hf (requires optional datasets)
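The jsonl and csv targets are plain stdlib formats. As an independent illustration (not Saara's internal exporter), converting a JSONL file of flat records to CSV takes only a few lines:

```python
import csv
import json


def jsonl_to_csv(jsonl_path: str, csv_path: str) -> int:
    """Convert a JSONL file of flat dicts to CSV; returns the row count."""
    with open(jsonl_path) as f:
        rows = [json.loads(line) for line in f if line.strip()]
    if rows:
        with open(csv_path, "w", newline="") as f:
            # Column order is taken from the first record's keys.
            writer = csv.DictWriter(f, fieldnames=list(rows[0]))
            writer.writeheader()
            writer.writerows(rows)
    return len(rows)
```

This assumes every record shares the first record's keys; nested structures (such as chat messages) would need flattening first, which is one reason columnar targets like parquet exist as optional extras.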

Download files

Source Distribution

  • saara_ai-1.6.9.tar.gz (44.8 kB)

Built Distribution

  • saara_ai-1.6.9-py3-none-any.whl (42.2 kB)
File details: saara_ai-1.6.9.tar.gz

  • Size: 44.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

Hashes for saara_ai-1.6.9.tar.gz

  • SHA256: a9c49fbb1efeee844ea5b84f9ee4e696a22a2e56df18d876be3ab1923dd32972
  • MD5: 8e61e5981c8619bc7a13a0227841ad83
  • BLAKE2b-256: c177a425bcb16205bcc33cc3db87c8fff976275a37f7ca6e2cdbd0c99dcdd939

File details: saara_ai-1.6.9-py3-none-any.whl

  • Size: 42.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

Hashes for saara_ai-1.6.9-py3-none-any.whl

  • SHA256: efdb35e7fb33bb1305169143e763565d9a5cc09e0c5beec40ab2d784493aef33
  • MD5: 0c2b2b58b0499e89e399114f10b6e8f4
  • BLAKE2b-256: 2e388728f9eec7e01327a3ba1297ad5e286f40ba42967b467941abff33b5b388
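To check a downloaded file against the SHA256 digests above, a stdlib comparison is enough; the filename in the commented example is a placeholder for wherever the file was saved:

```python
import hashlib


def sha256_of(path: str) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()


# Example: compare against the published digest for the sdist
# (path is a placeholder; the full digest appears in the table above).
# assert sha256_of("saara_ai-1.6.9.tar.gz").startswith("a9c49fbb")
```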
