A fast CLI for submitting and managing Azure ML jobs via pure REST APIs

Azure Jobs

A fast, lightweight CLI for submitting Azure ML jobs through pure REST APIs — no azure-ai-ml SDK and no amlt runtime required.

aj run adds a template inheritance layer on top of three submission backends:

  • native — direct Azure ML REST (default for AML / Singularity).
  • amlt — delegates to the amlt CLI for compatibility.
  • volcano — submits to a Kubernetes Volcano cluster via kubectl.

Install

pipx install azure_jobs

Requires az login. The volcano backend additionally needs kubectl configured against your cluster.

Quickstart

mkdir my-project && cd my-project
aj init                          # scaffold .azure_jobs/, register workspace
aj pull <user>/<repo>            # (optional) clone shared templates
aj run -t gpu train.py           # submit using the "gpu" template

.py scripts run via uv run, .sh via bash. Drop a .codeignore (or .amltignore) at the project root to exclude paths from the upload.

Templates

Templates live under .azure_jobs/template/ as YAML files. There are three ways to get one:

  1. aj init — scaffolds .azure_jobs/ and (optionally) pulls a starter template repo.
  2. aj pull <user>/<repo> — clone a shared template repo into .azure_jobs/.
  3. Hand-author — drop a YAML file into .azure_jobs/template/.

Minimal leaf template (.azure_jobs/template/gpu.yaml):

base: [account.default, storage.default, environment.aml]
config:
  target:
    name: my-cluster
  jobs:
    - name: train
      sku: "{nodes}xA100-80GB"

base chains other YAML files in .azure_jobs/: a dotted name maps to .azure_jobs/<dir>/<name>.yaml, so account.default resolves to .azure_jobs/account/default.yaml. The {nodes} and {processes} placeholders are substituted from CLI flags. Inheritance, merge rules, and SKU formats are documented in docs/configuration.md.
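The name resolution and placeholder substitution described above can be sketched in a few lines of Python. This is an illustration only, not the package's implementation: resolve_base and fill_sku are hypothetical helper names, and the sketch assumes a base name splits at its first dot.

```python
from pathlib import Path


def resolve_base(name: str, root: str = ".azure_jobs") -> Path:
    """Map a dotted base name such as 'account.default' to a YAML path.

    Hypothetical helper: assumes the part before the first dot is the
    directory and the remainder is the file stem.
    """
    directory, _, stem = name.partition(".")
    if not directory or not stem:
        raise ValueError(f"expected '<dir>.<name>', got {name!r}")
    return Path(root) / directory / f"{stem}.yaml"


def fill_sku(sku: str, nodes: int, processes: int) -> str:
    """Substitute the {nodes} and {processes} placeholders in a SKU string.

    Plain string replacement is used so that any other braces in the
    value are left untouched.
    """
    return sku.replace("{nodes}", str(nodes)).replace("{processes}", str(processes))
```

With the leaf template shown earlier, fill_sku("{nodes}xA100-80GB", nodes=4, processes=8) yields "4xA100-80GB".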

aj template list                # see what's available
aj template show <name>         # resolved config (after inheritance)
aj template validate            # sanity check
aj template push -m "msg"       # commit + push back upstream

aj run

aj run -t gpu train.py           # submit via REST
aj run train.py                  # reuse last template
aj run -t gpu -n 4 -p 8 train.py # 4 nodes × 8 GPUs/node
aj run -d train.py               # dry run — print config, don't submit
aj run -L train.py               # run locally
aj run --amlt -t gpu train.py    # submit via amlt instead

Flag    Purpose
-t      Template name
-n      Number of nodes
-p      GPUs per node (drives SKU + AJ_PROCESSES)
--ppn   Launcher processes per node (e.g. torchrun --nproc-per-node)
-d      Dry run
-y      Skip confirmation
-L      Run locally
--amlt  Submit via amlt

Positional args after the script are forwarded verbatim to your command.

How it works

  1. Resolve the template, walk the base chain, merge configs.
  2. Apply CLI overrides (-n / -p / --ppn).
  3. Build a normalized SubmitRequest.
  4. Dispatch by backend:
    • native — register environment (SHA-deduped) → upload code (content-addressed) → PUT /jobs/{name}.
    • volcano — render Volcano Job YAML → upload code to a PVC via kubectl exec + tar → kubectl create.
    • amlt — write a submission YAML and shell out to amlt run.
  5. Append a SubmitRecord to record.jsonl and print the portal URL.
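Step 1 (walking the base chain and merging configs) might look roughly like the sketch below. The actual merge rules are documented in docs/configuration.md; this sketch assumes a recursive dictionary merge in which later bases override earlier ones and the leaf template overrides them all. The names deep_merge and resolve are hypothetical, and templates are passed in as already-parsed dictionaries rather than read from disk.

```python
from typing import Any


def deep_merge(parent: dict[str, Any], child: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict where child values override parent values.

    Nested dicts are merged recursively; any other value is replaced.
    """
    merged = dict(parent)
    for key, value in child.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve(
    name: str,
    templates: dict[str, dict[str, Any]],
    _seen: tuple[str, ...] = (),
) -> dict[str, Any]:
    """Resolve a template's config by merging its bases depth-first."""
    if name in _seen:
        raise ValueError("cycle in base chain: " + " -> ".join(_seen + (name,)))
    template = templates[name]
    config: dict[str, Any] = {}
    for base in template.get("base", []):
        config = deep_merge(config, resolve(base, templates, _seen + (name,)))
    # The template's own config is applied last, so it wins over its bases.
    return deep_merge(config, template.get("config", {}))
```

The cycle check is a design choice of the sketch: without it, two templates that list each other as a base would recurse forever.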

Code uploads are content-addressed: identical (template + command + code) → identical hash → re-runs reuse the prior asset.
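To make the idea of content addressing concrete, here is a minimal sketch that derives one hash from a template, a command, and a set of files. It is not the hashing scheme aj uses: the function name content_hash, the choice of SHA-256, and the length-prefixed layout are all assumptions made for illustration.

```python
import hashlib


def content_hash(template: str, command: str, files: dict[str, bytes]) -> str:
    """Hash (template + command + code) into a single hex digest.

    Every field is length-prefixed so that adjacent fields cannot blur
    into each other, and files are visited in sorted path order so the
    result does not depend on dict insertion order.
    """
    digest = hashlib.sha256()

    def add(part: bytes) -> None:
        digest.update(len(part).to_bytes(8, "big"))
        digest.update(part)

    add(template.encode("utf-8"))
    add(command.encode("utf-8"))
    for path in sorted(files):
        add(path.encode("utf-8"))
        add(files[path])
    return digest.hexdigest()
```

Two submissions with the same inputs produce the same digest, which is what lets a re-run find and reuse an existing asset; changing a single byte of any file produces a different digest.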

AJ_* environment variables

aj exports these variables into every job, so your training script can read them directly.

Variable                 Meaning
AJ_NAME                  Job display name
AJ_ID                    Submission ID (matches record.jsonl)
AJ_TEMPLATE              Template name used
AJ_NODES                 Number of nodes
AJ_GPUS_PER_NODE         -p value
AJ_PROCESSES             AJ_NODES × AJ_GPUS_PER_NODE
AJ_PROCESSES_PER_NODE    --ppn value
AJ_SUBMIT_TIMESTAMP_UTC  Submission timestamp

Example — torchrun with whatever the user requested:

torchrun \
  --nnodes=$AJ_NODES \
  --nproc_per_node=$AJ_GPUS_PER_NODE \
  --node_rank=$RANK \
  --master_addr=$MASTER_ADDR \
  train.py
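Inside a Python training script, the same variables can be read from the environment. The sketch below is one way to do that; the AjEnv dataclass and read_aj_env function are hypothetical names, and the fallback defaults (used when the script runs outside aj) are assumptions rather than documented behaviour.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class AjEnv:
    name: str
    submission_id: str
    template: str
    nodes: int
    gpus_per_node: int
    processes: int


def read_aj_env(env: Mapping[str, str] = os.environ) -> AjEnv:
    """Collect the AJ_* variables, with fallbacks for runs outside aj."""
    nodes = int(env.get("AJ_NODES", "1"))
    gpus = int(env.get("AJ_GPUS_PER_NODE", "1"))
    return AjEnv(
        name=env.get("AJ_NAME", "local"),
        submission_id=env.get("AJ_ID", ""),
        template=env.get("AJ_TEMPLATE", ""),
        nodes=nodes,
        gpus_per_node=gpus,
        # AJ_PROCESSES is nodes x GPUs per node; recompute it if unset.
        processes=int(env.get("AJ_PROCESSES", str(nodes * gpus))),
    )
```

Accepting any mapping instead of reading os.environ directly keeps the function easy to exercise with a plain dict.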

aj dash

Interactive TUI dashboard for browsing and managing cloud jobs.

aj dash

Key        Action
           Move selection
           Prev / next page
enter / i  Job detail panel
l          Open logs (auto-streams if the job is running)
o          Pick a different log file
c          Cancel the selected job
r          Refresh
f / e / w  Filter by status / experiment / workspace
/          Search
F          Clear all filters
esc        Help overlay
q          Quit

Use as a Python SDK

The same engine the CLI uses is exposed at the package root, so you can build and submit jobs from your own scripts:

from azure_jobs import (
    Template,
    build_submit_request,
    submit_via_native,   # also: submit_via_volcano, submit_via_amlt
    get_workspace_config,
)

template = Template.from_conf_path(".azure_jobs/template/gpu.yaml")
request = build_submit_request(
    template,
    name="my-job", sid="abc123", sku="2xA100-80GB",
    user_command="train.py", user_args=(),
    workspace=get_workspace_config(),
    template_name="gpu", nodes=2, processes=8,
    code_dir="/path/to/project",  # defaults to os.getcwd()
)
result = submit_via_native(request)
print(result.status, result.portal_url)

See docs/sdk.md for the full surface and a submit_and_record example.

Documentation

Document       Contents
Commands       aj job, aj template, aj quota, aj sku, aj dash, ...
SDK            Programmatic submission API
Architecture   Module layout, submission flow, backends
Configuration  Templates, inheritance, merge rules, SKU formats
REST API       REST client design, endpoints, job body shape
Comparison     aj vs amlt feature matrix
Roadmap        Planned features
