Saara
Saara is a local-first CLI for ML dataset workflows:
- topic-to-dataset generation using Firecrawl-local research
- PDF/document ingestion foundations
- local model provider routing for Ollama and vLLM-compatible servers
- canonical dataset examples with provenance
- labeling and distillation commands
- validation reports
- exports to JSON, JSONL, CSV, Parquet, Arrow, and Hugging Face Dataset directories
The current implementation is an MVP scaffold intended to be extended into the full CLI.
Quick Start
```
pip install -e .
saara splash
saara wizard
saara init
saara models health --provider ollama --model qwen
saara generate topic "robotics motion planning" --samples 20 --provider mock --format jsonl --output-dir runs/robotics
saara label .mlforge/datasets/robotics-motion-planning.jsonl --labels useful,not-useful --out labeled.jsonl
saara distill labeled.jsonl --method sft --out distilled.jsonl
saara validate .mlforge/datasets/robotics-motion-planning.jsonl
```
Running saara without arguments shows the splash screen and command help. Use saara wizard
for the interactive guided flow, and direct subcommands for scripts or automation.
Interactive sessions include terminal animations for the splash screen, menu headers, long-running
operations, and completion states. Scripted or piped output automatically falls back to plain text.
Use --provider mock for deterministic local smoke tests without a running model.
Run a declarative workflow:
```
saara run examples/topic-dataset.json
```
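The workflow file is plain JSON describing a run. A minimal sketch follows; the field names mirror the CLI flags above but are assumptions, not the actual schema of `examples/topic-dataset.json`:

```json
{
  "command": "generate",
  "kind": "topic",
  "topic": "robotics motion planning",
  "provider": "mock",
  "samples": 20,
  "format": "jsonl",
  "output_dir": "runs/robotics"
}
```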
Installation
Development install:
```
python3 -m venv .venv
. .venv/bin/activate
pip install -e .
```
Install optional dataset exporters:
```
pip install -e '.[data]'
```
Install all optional local features:
```
pip install -e '.[all]'
```
Fresh machine runtime setup:
```
saara doctor
saara setup docker --dry-run
saara setup ollama --dry-run
saara setup docker ollama
```
On Debian/Ubuntu, Saara installs Docker Engine from Docker's official apt repository.
On Linux, Ollama is installed with the official Ollama installer. Review --dry-run
output before running setup commands. Saara does not pull or install models automatically;
choose a model based on your hardware tier.
After installation, use saara directly like a traditional CLI. The old mlforge command remains
available as a compatibility alias during development.
For an isolated user-level install, use pipx once this project is published or packaged:
```
pipx install .
```
Firecrawl Local
Topic generation can use Firecrawl-local at http://localhost:3002:
```
saara generate topic "dataset distillation" \
  --provider ollama \
  --model qwen \
  --research firecrawl \
  --samples 100
```
The Firecrawl integration is exposed as a typed agent tool named firecrawl_local.
The topic workflow uses a bounded ResearchAgent that calls:
- `firecrawl_local.search(query, limit)`
- `firecrawl_local.scrape(url)`
LangChain is not required for the core workflow. Saara uses its own small typed tool interface
so Firecrawl-local calls are deterministic, auditable, and easy to test. A small adapter is
included for projects that want LangChain-compatible tools via the optional saara-ai[agents]
extra.
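As an illustration of such a typed tool interface (class name, endpoint paths, and response shapes are assumptions, not Saara's actual code), injecting the HTTP transport as a plain callable is what makes the calls deterministic and easy to test:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class SearchResult:
    url: str
    title: str

class FirecrawlLocalTool:
    """Typed wrapper around a Firecrawl-local endpoint.

    The HTTP transport is injected as a callable so tests can
    substitute a deterministic fake (no running server required).
    """

    def __init__(self, post: Callable[[str, dict], dict],
                 base_url: str = "http://localhost:3002"):
        self._post = post
        self._base_url = base_url

    def search(self, query: str, limit: int = 5) -> list[SearchResult]:
        payload = self._post(f"{self._base_url}/v1/search",
                             {"query": query, "limit": limit})
        return [SearchResult(url=r["url"], title=r.get("title", ""))
                for r in payload["results"][:limit]]

    def scrape(self, url: str) -> str:
        payload = self._post(f"{self._base_url}/v1/scrape", {"url": url})
        return payload["markdown"]

# Deterministic fake transport standing in for the HTTP client.
def fake_post(endpoint: str, body: dict) -> dict:
    if endpoint.endswith("/v1/search"):
        return {"results": [{"url": "http://example.com/a", "title": "A"}]}
    return {"markdown": "# scraped content"}

tool = FirecrawlLocalTool(post=fake_post)
results = tool.search("dataset distillation", limit=1)
```

Swapping `fake_post` for a real HTTP client yields the production tool with no other code changes.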
Configurable Dataset Modes
Generation can target multiple training dataset shapes:
- finetuning: chat/SFT-style message examples
- pretraining: plain text examples in `output.text`
- reasoning: examples with a `reasoning` field
- tool-calling: examples with `tools` and `tool_calls`
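For illustration, one record per mode in JSONL form (a finetuning record followed by a pretraining record); any field not named above is an assumption about the exact shape:

```json
{"messages": [{"role": "user", "content": "Plan a path around an obstacle."}, {"role": "assistant", "content": "One approach is to sample waypoints..."}]}
{"output": {"text": "Plain continuation text for pretraining."}}
```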
Most runtime and prompting behavior is user-configurable from CLI flags or workflow JSON:
provider base URLs, model names, API keys, Firecrawl URL, system prompt, prompt template,
temperature, max tokens, output format, and output directory. When --output-dir is used,
Saara writes datasets, reports, and run artifacts into that directory.
Runtime Providers
- mock: deterministic development provider
- ollama: `http://localhost:11434`
- vllm: OpenAI-compatible endpoint, default `http://localhost:8000/v1`
Dataset Formats
Supported exports:
- json
- jsonl
- csv
- parquet (with optional `pyarrow`)
- arrow (with optional `pyarrow`)
- hf (with optional `datasets`)
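Outside Saara, the jsonl and csv shapes are easy to convert between with the standard library alone; a sketch assuming flat records with a shared set of keys:

```python
import csv
import io
import json

def jsonl_to_csv(jsonl_text: str) -> str:
    """Convert flat JSONL records to CSV text, taking the header
    from the keys of the first record."""
    records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(records[0]))
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

csv_text = jsonl_to_csv(
    '{"topic": "robotics", "label": "useful"}\n'
    '{"topic": "planning", "label": "not-useful"}\n'
)
```

Nested records (e.g. the finetuning `messages` shape) would need flattening first, which is why the richer exports go through `pyarrow` or `datasets`.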