A fast CLI for submitting and managing Azure ML jobs via pure REST APIs

Azure Jobs

A fast, lightweight CLI for submitting Azure ML jobs through pure REST APIs — no azure-ai-ml SDK and no amlt runtime required.

aj run adds a template inheritance layer on top of three submission backends:

  • native — direct Azure ML REST (default for AML / Singularity).
  • amlt — delegates to the amlt CLI for compatibility.
  • volcano — submits to a Kubernetes Volcano cluster via kubectl.

Install

pipx install azure_jobs

Requires az login. The volcano backend additionally needs kubectl configured against your cluster.

Quickstart

mkdir my-project && cd my-project
aj init                          # scaffold .azure_jobs/, register workspace
aj pull <user>/<repo>            # (optional) clone shared templates
aj run -t gpu train.py           # submit using the "gpu" template

.py scripts run via uv run, .sh via bash. Drop a .codeignore (or .amltignore) at the project root to exclude paths from the upload.

Templates

Templates live under .azure_jobs/template/ as YAML files. There are three ways to get one:

  1. aj init — scaffolds .azure_jobs/ and (optionally) pulls a starter template repo.
  2. aj pull <user>/<repo> — clone a shared template repo into .azure_jobs/.
  3. Hand-author — drop a YAML file into .azure_jobs/template/.

Minimal leaf template (.azure_jobs/template/gpu.yaml):

base: [account.default, storage.default, environment.aml]
config:
  target:
    name: my-cluster
  jobs:
    - name: train
      sku: "{nodes}xA100-80GB"

base chains other YAML files in .azure_jobs/: each dotted name <dir>.<name> resolves to .azure_jobs/<dir>/<name>.yaml, so account.default is .azure_jobs/account/default.yaml. The {nodes} and {processes} placeholders are substituted from CLI flags. Inheritance, merge rules, and SKU formats are documented in docs/configuration.md.
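The name resolution and placeholder substitution described above can be sketched in a few lines of Python. This is an illustration only: resolve_base and fill_placeholders are hypothetical names, not part of azure_jobs, and the sketch assumes every base name has exactly the form <dir>.<name>. The actual merge rules are the ones in docs/configuration.md.

```python
from pathlib import Path


def resolve_base(name: str, root: str = ".azure_jobs") -> Path:
    """Map a dotted base name such as 'account.default' to its YAML path.

    Hypothetical helper: assumes the name is exactly '<dir>.<name>'.
    """
    directory, _, stem = name.partition(".")
    if not directory or not stem:
        raise ValueError(f"expected '<dir>.<name>', got {name!r}")
    return Path(root) / directory / f"{stem}.yaml"


def fill_placeholders(value: str, nodes: int, processes: int) -> str:
    """Substitute the {nodes} and {processes} placeholders in a template value."""
    return value.replace("{nodes}", str(nodes)).replace("{processes}", str(processes))


# The base list and SKU string from the gpu.yaml example.
bases = ["account.default", "storage.default", "environment.aml"]
for base in bases:
    print(base, "->", resolve_base(base).as_posix())

print(fill_placeholders("{nodes}xA100-80GB", nodes=4, processes=8))
```

Plain string replacement is used instead of str.format so that any other braces in a template value are left untouched.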

aj template list                # see what's available
aj template show <name>         # resolved config (after inheritance)
aj template validate            # sanity check
aj template push -m "msg"       # commit + push back upstream

aj run

aj run -t gpu train.py           # submit via REST
aj run train.py                  # reuse last template
aj run -t gpu -n 4 -p 8 train.py # 4 nodes × 8 GPUs/node
aj run -d train.py               # dry run — print config, don't submit
aj run --amlt -t gpu train.py    # submit via amlt instead

Flag     Purpose
-t       Template name
-n       Number of nodes
-p       GPUs per node (drives SKU + AJ_PROCESSES)
--ppn    Launcher processes per node (e.g. torchrun --nproc-per-node)
-d       Dry run
--amlt   Submit via amlt

Positional args after the script are forwarded verbatim to your command.
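As a rough illustration of the two rules stated in this README (.py scripts run via uv run, .sh via bash, and trailing arguments are forwarded verbatim), the command line could be assembled as below. build_command is a hypothetical helper; the exact command string that aj generates is not documented here and may differ.

```python
import shlex
from pathlib import Path


def build_command(script: str, args: list[str]) -> str:
    """Pick a runner from the script extension and append the forwarded args.

    Hypothetical sketch: only the .py and .sh cases named in the README are handled.
    """
    suffix = Path(script).suffix
    if suffix == ".py":
        prefix = ["uv", "run"]
    elif suffix == ".sh":
        prefix = ["bash"]
    else:
        raise ValueError(f"unsupported script type: {script!r}")
    # shlex.join quotes each argument so it survives a shell unchanged.
    return shlex.join([*prefix, script, *args])


print(build_command("train.py", ["--lr", "1e-4"]))
print(build_command("setup.sh", ["data dir"]))
```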

How it works

  1. Resolve the template, walk the base chain, merge configs.
  2. Apply CLI overrides (-n / -p / --ppn).
  3. Build a normalized SubmitRequest.
  4. Dispatch by backend:
    • native — register environment (SHA-deduped) → upload code (content-addressed) → PUT /jobs/{name}.
    • volcano — render Volcano Job YAML → upload code to a PVC via kubectl exec + tar → kubectl create.
    • amlt — write a submission YAML and shell out to amlt run.
  5. Append a SubmitRecord to record.jsonl and print the portal URL.

Code uploads are content-addressed: identical (template + command + code) → identical hash → re-runs reuse the prior asset.
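To make the idea of content addressing concrete, here is a minimal sketch that derives one digest from the template name, the command, and every file under the code directory. The function name, the choice of SHA-256, and the byte layout are all assumptions for illustration; the hashing scheme azure_jobs actually uses is not specified in this README.

```python
import hashlib
from pathlib import Path


def content_hash(template_name: str, command: str, code_dir: str) -> str:
    """Return a hex digest that changes whenever template, command, or code changes.

    Hypothetical sketch, not the azure_jobs implementation.
    """
    digest = hashlib.sha256()
    for part in (template_name, command):
        digest.update(part.encode("utf-8"))
        digest.update(b"\0")  # separator so ("ab", "c") differs from ("a", "bc")

    root = Path(code_dir)
    # Sort paths so the digest does not depend on directory listing order.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()
```

Calling content_hash twice on an unchanged project yields the same digest, which is what lets a re-run skip the upload and reuse the earlier asset.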

AJ_* environment variables

Exported into every job. Read them in your training script.

Variable                  Meaning
AJ_NAME                   Job display name
AJ_ID                     Submission ID (matches record.jsonl)
AJ_TEMPLATE               Template name used
AJ_NODES                  Number of nodes
AJ_GPUS_PER_NODE          -p value
AJ_PROCESSES              AJ_NODES × AJ_GPUS_PER_NODE
AJ_PROCESSES_PER_NODE     --ppn value
AJ_SUBMIT_TIMESTAMP_UTC   Submission timestamp

Example: launch torchrun with the node and GPU counts requested at submit time:

torchrun \
  --nnodes=$AJ_NODES \
  --nproc_per_node=$AJ_GPUS_PER_NODE \
  --node_rank=$RANK \
  --master_addr=$MASTER_ADDR \
  train.py
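The same variables can be read from inside a Python training script. The sketch below is one possible way to do it; read_aj_env is a hypothetical helper, and the fallback values used when a variable is missing (for example during a local run outside aj) are assumptions, not documented behavior.

```python
import os
from typing import Mapping, Optional


def read_aj_env(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect the AJ_* settings into a dict, with assumed defaults for local runs."""
    env = os.environ if env is None else env
    nodes = int(env.get("AJ_NODES", "1"))
    gpus_per_node = int(env.get("AJ_GPUS_PER_NODE", "1"))
    return {
        "name": env.get("AJ_NAME", ""),
        "template": env.get("AJ_TEMPLATE", ""),
        "nodes": nodes,
        "gpus_per_node": gpus_per_node,
        # AJ_PROCESSES is documented as AJ_NODES x AJ_GPUS_PER_NODE,
        # so the product is used as the fallback when it is not set.
        "processes": int(env.get("AJ_PROCESSES", str(nodes * gpus_per_node))),
    }


# Example with an explicit mapping instead of the real environment.
settings = read_aj_env({"AJ_NODES": "4", "AJ_GPUS_PER_NODE": "8"})
print(settings["processes"])  # 32
```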

aj dash

Interactive TUI dashboard for browsing and managing cloud jobs.

aj dash

Key         Action
            Move selection
            Prev / next page
enter / i   Job detail panel
l           Open logs (auto-streams if the job is running)
o           Pick a different log file
c           Cancel the selected job
r           Refresh
f / e / w   Filter by status / experiment / workspace
/           Search
F           Clear all filters
esc         Help overlay
q           Quit

Use as a Python SDK

The same engine the CLI uses is exposed at the package root, so you can build and submit jobs from your own scripts:

from azure_jobs import (
    Template,
    build_submit_request,
    submit_via_native,   # also: submit_via_volcano, submit_via_amlt
    get_workspace_config,
)

template = Template.from_conf_path(".azure_jobs/template/gpu.yaml")
request = build_submit_request(
    template,
    name="my-job", sid="abc123", sku="2xA100-80GB",
    user_command="train.py", user_args=(),
    workspace=get_workspace_config(),
    template_name="gpu", nodes=2, processes=8,
    code_dir="/path/to/project",  # defaults to os.getcwd()
)
result = submit_via_native(request)
print(result.status, result.portal_url)

See docs/sdk.md for the full surface and a submit_and_record example.

Documentation

Document        Contents
Commands        aj job, aj template, aj quota, aj sku, aj dash, ...
SDK             Programmatic submission API
Architecture    Module layout, submission flow, backends
Configuration   Templates, inheritance, merge rules, SKU formats
REST API        REST client design, endpoints, job body shape
Comparison      aj vs amlt feature matrix
Roadmap         Planned features

