A fast CLI for submitting and managing Azure ML jobs via pure REST APIs

Azure Jobs

A fast, lightweight CLI for submitting Azure ML jobs through pure REST APIs — no azure-ai-ml SDK and no amlt runtime required.

aj run adds a template inheritance layer on top of three submission backends:

  • native — direct Azure ML REST (default for AML / Singularity).
  • amlt — delegates to the amlt CLI for compatibility.
  • volcano — submits to a Kubernetes Volcano cluster via kubectl.

Install

pipx install azure_jobs

Requires an authenticated Azure CLI session (run az login first). The volcano backend additionally needs kubectl configured against your cluster.

Quickstart

mkdir my-project && cd my-project
aj init                          # scaffold .azure_jobs/, register workspace
aj pull <user>/<repo>            # (optional) clone shared templates
aj run -t gpu train.py           # submit using the "gpu" template

Scripts ending in .py run via uv run; scripts ending in .sh run via bash. To exclude paths from the upload, add a .codeignore (or .amltignore) file at the project root.
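
The extension rule above can be pictured as a small lookup from file suffix to launcher. This is only an illustrative sketch: build_command and LAUNCHERS are hypothetical names, not part of azure_jobs, and the command line the tool actually constructs may differ.

```python
from pathlib import Path

# Hypothetical mapping from script suffix to launcher prefix, following the
# rule stated above (.py via `uv run`, .sh via `bash`).
LAUNCHERS = {".py": ["uv", "run"], ".sh": ["bash"]}


def build_command(script: str, *args: str) -> list[str]:
    """Return the argv for a script, appending extra args unchanged."""
    suffix = Path(script).suffix
    if suffix not in LAUNCHERS:
        raise ValueError(f"unsupported script type: {suffix!r}")
    return [*LAUNCHERS[suffix], script, *args]


print(build_command("train.py", "--epochs", "3"))
print(build_command("setup.sh"))
```

Rejecting unknown suffixes is a choice made for this sketch; the source does not say how other file types are handled.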

Templates

Templates live under .azure_jobs/template/ as YAML files. There are three ways to add them:

  1. aj init — scaffolds .azure_jobs/ and (optionally) pulls a starter template repo.
  2. aj pull <user>/<repo> — clone a shared template repo into .azure_jobs/.
  3. Hand-author — drop a YAML file into .azure_jobs/template/.

Minimal leaf template (.azure_jobs/template/gpu.yaml):

base: [account.default, storage.default, environment.aml]
config:
  target:
    name: my-cluster
  jobs:
    - name: train
      sku: "{nodes}xA100-80GB"

base chains other YAML files in .azure_jobs/ (dotted name → .azure_jobs/<dir>/<name>.yaml); {nodes} / {processes} are substituted from CLI flags. Inheritance, merge rules, and SKU formats are documented in docs/configuration.md.
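
As a rough illustration of the two conventions just described, the sketch below maps a dotted base name to a file path and fills the SKU placeholders. The helper names resolve_base and fill_sku are hypothetical, and using str.format for the substitution is an assumption; the tool's own resolution logic is described in docs/configuration.md and may differ.

```python
from pathlib import Path

ROOT = Path(".azure_jobs")


def resolve_base(dotted: str, root: Path = ROOT) -> Path:
    """Map a dotted base name such as 'account.default' to its YAML path."""
    directory, name = dotted.split(".", 1)
    return root / directory / f"{name}.yaml"


def fill_sku(sku: str, nodes: int, processes: int) -> str:
    """Substitute the {nodes} / {processes} placeholders in a SKU string."""
    # Assumption: plain str.format-style substitution.
    return sku.format(nodes=nodes, processes=processes)


print(resolve_base("account.default").as_posix())  # .azure_jobs/account/default.yaml
print(fill_sku("{nodes}xA100-80GB", nodes=4, processes=8))  # 4xA100-80GB
```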

aj template list                # see what's available
aj template show <name>         # resolved config (after inheritance)
aj template validate            # sanity check
aj template push -m "msg"       # commit + push back upstream

aj run

aj run -t gpu train.py           # submit via REST
aj run train.py                  # reuse last template
aj run -t gpu -n 4 -p 8 train.py # 4 nodes × 8 GPUs/node
aj run -d train.py               # dry run — print config, don't submit
aj run --amlt -t gpu train.py    # submit via amlt instead

Flag    Purpose
-t      Template name
-n      Number of nodes
-p      GPUs per node (drives SKU + AJ_PROCESSES)
--ppn   Launcher processes per node (e.g. torchrun --nproc-per-node)
-d      Dry run
--amlt  Submit via amlt

Positional args after the script are forwarded verbatim to your command.

How it works

  1. Resolve the template, walk the base chain, merge configs.
  2. Apply CLI overrides (-n / -p / --ppn).
  3. Build a normalized SubmitRequest.
  4. Dispatch by backend:
    • native — register environment (SHA-deduped) → upload code (content-addressed) → PUT /jobs/{name}.
    • volcano — render Volcano Job YAML → upload code to a PVC via kubectl exec + tar → kubectl create.
    • amlt — write a submission YAML and shell out to amlt run.
  5. Append a SubmitRecord to record.jsonl and print the portal URL.
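
Steps 1 and 2 amount to layering configurations, with later layers taking precedence. The sketch below shows one common way to do that, a recursive dictionary merge. It assumes that nested mappings merge key by key and that scalars and lists are replaced wholesale; the actual merge rules are in docs/configuration.md and may differ. The deep_merge helper and the "service" and "nodes" keys are hypothetical.

```python
from typing import Any


def deep_merge(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict where `override` wins; nested dicts merge recursively."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value  # scalars and lists are replaced wholesale
    return merged


# Hypothetical configs standing in for a base template and a leaf template.
base_cfg = {"target": {"name": "shared-cluster", "service": "aml"}, "nodes": 1}
leaf_cfg = {"target": {"name": "my-cluster"}}
cli_overrides = {"nodes": 4}  # e.g. from -n 4

resolved: dict[str, Any] = {}
for layer in (base_cfg, leaf_cfg, cli_overrides):  # later layers take precedence
    resolved = deep_merge(resolved, layer)

print(resolved)
```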

Code uploads are content-addressed: identical (template + command + code) → identical hash → re-runs reuse the prior asset.
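
To make the idea of content addressing concrete, here is a minimal sketch that derives one digest from a template name, a command, and a set of code files. The use of SHA-256, the length-prefixed field layout, and the content_hash function are all assumptions for illustration; the source does not specify how azure_jobs computes its hash.

```python
import hashlib


def content_hash(template: str, command: str, files: dict[str, bytes]) -> str:
    """Hash template text, command, and code files into one stable digest."""
    digest = hashlib.sha256()
    for part in (template.encode(), command.encode()):
        digest.update(len(part).to_bytes(8, "big"))  # length-prefix each field
        digest.update(part)
    for path in sorted(files):  # sort so file order never changes the hash
        for part in (path.encode(), files[path]):
            digest.update(len(part).to_bytes(8, "big"))
            digest.update(part)
    return digest.hexdigest()


code = {"train.py": b"print('hi')\n", "utils.py": b"X = 1\n"}
first = content_hash("gpu", "train.py", code)
again = content_hash("gpu", "train.py", dict(reversed(list(code.items()))))
changed = content_hash("gpu", "train.py", {**code, "utils.py": b"X = 2\n"})

print(first == again)    # same inputs in a different order: same hash
print(first == changed)  # one file edited: different hash
```

Because the digest depends only on content, a re-run with unchanged inputs produces the same key and can reuse the earlier upload, while any edit yields a new key.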

AJ_* environment variables

These variables are exported into every job, so your training script can read them directly.

Variable                 Meaning
AJ_NAME                  Job display name
AJ_ID                    Submission ID (matches record.jsonl)
AJ_TEMPLATE              Template name used
AJ_NODES                 Number of nodes
AJ_GPUS_PER_NODE         -p value
AJ_PROCESSES             AJ_NODES × AJ_GPUS_PER_NODE
AJ_PROCESSES_PER_NODE    --ppn value
AJ_SUBMIT_TIMESTAMP_UTC  Submission timestamp

For example, to launch torchrun with the node and GPU counts requested at submission:

torchrun \
  --nnodes=$AJ_NODES \
  --nproc_per_node=$AJ_GPUS_PER_NODE \
  --node_rank=$RANK \
  --master_addr=$MASTER_ADDR \
  train.py
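
The same variables can be read from inside a Python training script. The sketch below is a hypothetical helper (read_job_env is not part of azure_jobs); the fallback values used when a variable is unset are assumptions chosen so the script also runs outside a submitted job.

```python
import os
from typing import Mapping, Optional, Union


def read_job_env(env: Optional[Mapping[str, str]] = None) -> dict[str, Union[int, str]]:
    """Collect the AJ_* values a training script typically needs."""
    env = os.environ if env is None else env
    nodes = int(env.get("AJ_NODES", "1"))
    gpus_per_node = int(env.get("AJ_GPUS_PER_NODE", "1"))
    return {
        "name": env.get("AJ_NAME", "local-run"),
        "nodes": nodes,
        "gpus_per_node": gpus_per_node,
        # Fall back to nodes x GPUs per node, matching the table above.
        "processes": int(env.get("AJ_PROCESSES", str(nodes * gpus_per_node))),
    }


# Simulated environment for a 4-node, 8-GPU-per-node submission.
info = read_job_env({"AJ_NAME": "demo", "AJ_NODES": "4", "AJ_GPUS_PER_NODE": "8"})
print(info["processes"])  # 32
```

Passing a plain dict instead of os.environ keeps the helper easy to test locally.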

aj dash

Interactive TUI dashboard for browsing and managing cloud jobs.

aj dash

Key        Action
           Move selection
           Prev / next page
enter / i  Job detail panel
l          Open logs (auto-streams if the job is running)
o          Pick a different log file
c          Cancel the selected job
r          Refresh
f / e / w  Filter by status / experiment / workspace
/          Search
F          Clear all filters
esc        Help overlay
q          Quit

Use as a Python SDK

The same engine the CLI uses is exposed at the package root, so you can build and submit jobs from your own scripts:

from azure_jobs import (
    Template,
    build_submit_request,
    submit_via_native,   # also: submit_via_volcano, submit_via_amlt
    get_workspace_config,
)

# Load the template from its YAML file.
template = Template.from_conf_path(".azure_jobs/template/gpu.yaml")

# Build the normalized request that the submission backends consume.
request = build_submit_request(
    template,
    name="my-job", sid="abc123", sku="2xA100-80GB",
    user_command="train.py", user_args=(),
    workspace=get_workspace_config(),
    template_name="gpu", nodes=2, processes=8,
    code_dir="/path/to/project",  # defaults to os.getcwd()
)

# Submit through the native Azure ML REST backend and report the outcome.
result = submit_via_native(request)
print(result.status, result.portal_url)

See docs/sdk.md for the full surface and a submit_and_record example.

Documentation

Document       Contents
Commands       aj job, aj template, aj quota, aj sku, aj dash, ...
SDK            Programmatic submission API
Architecture   Module layout, submission flow, backends
Configuration  Templates, inheritance, merge rules, SKU formats
REST API       REST client design, endpoints, job body shape
Comparison     aj vs amlt feature matrix
Roadmap        Planned features
