A fast CLI for submitting and managing Azure ML jobs via pure REST APIs

Azure Jobs

A fast, lightweight CLI for submitting Azure ML jobs through pure REST APIs — no azure-ai-ml SDK and no amlt runtime required.

aj run adds a template inheritance layer on top of three submission backends:

  • native — direct Azure ML REST (default for AML / Singularity).
  • amlt — delegates to the amlt CLI for compatibility.
  • volcano — submits to a Kubernetes Volcano cluster via kubectl.

Install

pipx install azure_jobs

Requires az login. The volcano backend additionally needs kubectl configured against your cluster.

Quickstart

mkdir my-project && cd my-project
aj init                          # scaffold .azure_jobs/, register workspace
aj pull <user>/<repo>            # (optional) clone shared templates
aj run -t gpu train.py           # submit using the "gpu" template

.py scripts run via uv run, .sh via bash. Drop a .codeignore (or .amltignore) at the project root to exclude paths from the upload.

Templates

Templates live under .azure_jobs/template/ as YAML files. There are three ways to add one:

  1. aj init — scaffolds .azure_jobs/ and (optionally) pulls a starter template repo.
  2. aj pull <user>/<repo> — clone a shared template repo into .azure_jobs/.
  3. Hand-author — drop a YAML file into .azure_jobs/template/.

Minimal leaf template (.azure_jobs/template/gpu.yaml):

base: [account.default, storage.default, environment.aml]
config:
  target:
    name: my-cluster
  jobs:
    - name: train
      sku: "{nodes}xA100-80GB"

base chains other YAML files in .azure_jobs/ (dotted name → .azure_jobs/<dir>/<name>.yaml); {nodes} / {processes} are substituted from CLI flags. Inheritance, merge rules, and SKU formats are documented in docs/configuration.md.
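
The dotted-name lookup and placeholder substitution described above can be sketched in a few lines of Python. This is an illustration of the stated mapping only, not the package's actual code: resolve_base and fill_placeholders are hypothetical helper names, and the real rules are the ones in docs/configuration.md.

```python
from pathlib import Path


def resolve_base(name: str, root: str = ".azure_jobs") -> Path:
    """Map a dotted base name "<dir>.<name>" to <root>/<dir>/<name>.yaml."""
    directory, _, leaf = name.partition(".")
    if not directory or not leaf:
        raise ValueError(f"expected '<dir>.<name>', got {name!r}")
    return Path(root) / directory / f"{leaf}.yaml"


def fill_placeholders(value: str, nodes: int, processes: int) -> str:
    """Substitute the {nodes} and {processes} placeholders in a template string."""
    return value.replace("{nodes}", str(nodes)).replace("{processes}", str(processes))


print(resolve_base("account.default").as_posix())  # .azure_jobs/account/default.yaml
print(fill_placeholders("{nodes}xA100-80GB", nodes=4, processes=8))  # 4xA100-80GB
```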

aj template list                # see what's available
aj template show <name>         # resolved config (after inheritance)
aj template validate            # sanity check
aj template push -m "msg"       # commit + push back upstream

aj run

aj run -t gpu train.py           # submit via REST
aj run train.py                  # reuse last template
aj run -t gpu -n 4 -p 8 train.py # 4 nodes × 8 GPUs/node
aj run -d train.py               # dry run — print config, don't submit
aj run --amlt -t gpu train.py    # submit via amlt instead

Flag     Purpose
-t       Template name
-n       Number of nodes
-p       GPUs per node (drives SKU + AJ_PROCESSES)
--ppn    Launcher processes per node (e.g. torchrun --nproc-per-node)
-d       Dry run
--amlt   Submit via amlt

Positional args after the script are forwarded verbatim to your command.

How it works

  1. Resolve the template, walk the base chain, merge configs.
  2. Apply CLI overrides (-n / -p / --ppn).
  3. Build a normalized SubmitRequest.
  4. Dispatch by backend:
    • native — register environment (SHA-deduped) → upload code (content-addressed) → PUT /jobs/{name}.
    • volcano — render Volcano Job YAML → upload code to a PVC via kubectl exec + tar → kubectl create.
    • amlt — write a submission YAML and shell out to amlt run.
  5. Append a SubmitRecord to record.jsonl and print the portal URL.
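
Steps 1 and 2 amount to layering configs and letting later layers win. The sketch below shows one plausible way to do that, assuming a recursive dict merge in which nested mappings are combined and any other value (including lists) is replaced outright. The actual merge rules are documented in docs/configuration.md and may differ; deep_merge, resolve, and the "service" key are hypothetical names used only for illustration.

```python
from typing import Any


def deep_merge(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict where `override` wins; nested dicts merge recursively."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value  # scalars and lists are replaced, not combined
    return merged


def resolve(layers: list[dict[str, Any]], overrides: dict[str, Any]) -> dict[str, Any]:
    """Merge base layers in order (later wins), then apply CLI overrides last."""
    config: dict[str, Any] = {}
    for layer in layers:
        config = deep_merge(config, layer)
    return deep_merge(config, overrides)


base_layer = {"target": {"name": "shared-cluster", "service": "aml"}}
leaf_layer = {"target": {"name": "my-cluster"}}
config = resolve([base_layer, leaf_layer], {"nodes": 4})
print(config)  # {'target': {'name': 'my-cluster', 'service': 'aml'}, 'nodes': 4}
```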

Code uploads are content-addressed: identical (template + command + code) → identical hash → re-runs reuse the prior asset.
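
To make the idea of content addressing concrete, here is a minimal sketch that derives one hash from the template name, the command, and the files in a code directory. It assumes SHA-256 over sorted relative paths and file bytes; the hash function, the inputs, and the content_hash helper are assumptions for illustration, not the scheme aj actually uses.

```python
import hashlib
import tempfile
from pathlib import Path


def content_hash(template: str, command: str, code_dir: Path) -> str:
    """Hash the template text, the command, and every file under code_dir."""
    digest = hashlib.sha256()
    for part in (template, command):
        digest.update(part.encode("utf-8"))
        digest.update(b"\0")  # separator so ("ab", "c") differs from ("a", "bc")
    # Sort paths so the hash does not depend on filesystem listing order.
    for path in sorted(p for p in code_dir.rglob("*") if p.is_file()):
        digest.update(path.relative_to(code_dir).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    (project / "train.py").write_text("print('hello')\n")
    first = content_hash("gpu", "train.py", project)
    second = content_hash("gpu", "train.py", project)
    (project / "train.py").write_text("print('changed')\n")
    third = content_hash("gpu", "train.py", project)

print(first == second)  # True: same inputs, same hash
print(first != third)   # True: edited code changes the hash
```

Because the hash depends only on the inputs, an unchanged project maps to the same key on every submission, which is what lets a re-run skip the upload.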

AJ_* environment variables

Exported into every job. Read them in your training script.

Variable                  Meaning
AJ_NAME                   Job display name
AJ_ID                     Submission ID (matches record.jsonl)
AJ_TEMPLATE               Template name used
AJ_NODES                  Number of nodes
AJ_GPUS_PER_NODE          -p value
AJ_PROCESSES              AJ_NODES × AJ_GPUS_PER_NODE
AJ_PROCESSES_PER_NODE     --ppn value
AJ_SUBMIT_TIMESTAMP_UTC   Submission timestamp

Example: launch torchrun with the node and GPU counts requested at submission:

torchrun \
  --nnodes=$AJ_NODES \
  --nproc_per_node=$AJ_GPUS_PER_NODE \
  --node_rank=$RANK \
  --master_addr=$MASTER_ADDR \
  train.py
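
The same variables can be read from inside a training script. The sketch below is one way to do it, assuming the variable names from the table above. The JobInfo dataclass, the read_job_info helper, and the single-GPU fallbacks for local runs are illustrative choices and not part of azure_jobs.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class JobInfo:
    name: str
    template: str
    nodes: int
    gpus_per_node: int
    processes: int


def read_job_info(env: Mapping[str, str] = os.environ) -> JobInfo:
    """Read AJ_* variables, falling back to single-GPU defaults for local runs."""
    nodes = int(env.get("AJ_NODES", "1"))
    gpus_per_node = int(env.get("AJ_GPUS_PER_NODE", "1"))
    return JobInfo(
        name=env.get("AJ_NAME", "local"),
        template=env.get("AJ_TEMPLATE", ""),
        nodes=nodes,
        gpus_per_node=gpus_per_node,
        # AJ_PROCESSES is documented as AJ_NODES x AJ_GPUS_PER_NODE.
        processes=int(env.get("AJ_PROCESSES", str(nodes * gpus_per_node))),
    )


info = read_job_info({"AJ_NAME": "train", "AJ_NODES": "4", "AJ_GPUS_PER_NODE": "8"})
print(info.processes)  # 32
```

Passing the environment as a mapping keeps the helper testable without touching the real process environment.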

aj dash

Interactive TUI dashboard for browsing and managing cloud jobs.

aj dash

Key          Action
             Move selection
             Prev / next page
enter / i    Job detail panel
l            Open logs (auto-streams if the job is running)
o            Pick a different log file
c            Cancel the selected job
r            Refresh
f / e / w    Filter by status / experiment / workspace
/            Search
F            Clear all filters
esc          Help overlay
q            Quit

Use as a Python SDK

The same engine the CLI uses is exposed at the package root, so you can build and submit jobs from your own scripts:

from azure_jobs import (
    Template,
    build_submit_request,
    submit_via_native,   # also: submit_via_volcano, submit_via_amlt
    get_workspace_config,
)

template = Template.from_conf_path(".azure_jobs/template/gpu.yaml")
request = build_submit_request(
    template,
    name="my-job", sid="abc123", sku="2xA100-80GB",
    user_command="train.py", user_args=(),
    workspace=get_workspace_config(),
    template_name="gpu", nodes=2, processes=8,
    code_dir="/path/to/project",  # defaults to os.getcwd()
)
result = submit_via_native(request)
print(result.status, result.portal_url)

See docs/sdk.md for the full surface and a submit_and_record example.

Documentation

Document        Contents
Commands        aj job, aj template, aj quota, aj sku, aj dash, ...
SDK             Programmatic submission API
Architecture    Module layout, submission flow, backends
Configuration   Templates, inheritance, merge rules, SKU formats
REST API        REST client design, endpoints, job body shape
Comparison      aj vs amlt feature matrix
Roadmap         Planned features
