profine

Deterministic GPU training profiling and optimization CLI.

Profile and optimize PyTorch training scripts on real Modal GPUs. Six-step pipeline: read, profile, interpret, suggest, edit, benchmark.

Install

pip install -e .                 # editable install from a source checkout

Requires a Modal account and an LLM API key (OpenAI or Anthropic).

Setup

export MODAL_TOKEN_ID=...
export MODAL_TOKEN_SECRET=...
export OPENAI_API_KEY=...        # or ANTHROPIC_API_KEY
export HF_TOKEN=...              # optional, for gated models

Pipeline

read → profile → interpret → suggest → edit → benchmark

Each step reads the previous step's output from profine_output/.

1. Read — analyze the training script

profine read nanoGPT/train.py

Extracts model architecture, optimizer, dataloader, precision, and distributed strategy via AST + LLM analysis. Output: profine_output/read/architecture_record.json
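
The AST side of this step can be sketched as a small pass over the script source — here, finding which optimizer class is instantiated. The function name and optimizer list below are illustrative, not profine's actual internals:

```python
import ast

# A training snippet of the kind the read step would analyze.
SOURCE = """
import torch
model = make_model()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
"""

KNOWN_OPTIMIZERS = {"SGD", "Adam", "AdamW", "RMSprop"}

def find_optimizer(source):
    """Return the first known optimizer class instantiated in `source`, else None."""
    for node in ast.walk(ast.parse(source)):
        # Matches attribute calls like torch.optim.AdamW(...).
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            if node.func.attr in KNOWN_OPTIMIZERS:
                return node.func.attr
    return None

print(find_optimizer(SOURCE))  # AdamW
```

The LLM half of the step handles what a static pass like this cannot, such as code hidden behind helper functions or config files.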

2. Profile — run on a real GPU

profine profile nanoGPT/train.py --hardware 1x_a100 --steps 20 --warmup 10

Instruments the script, executes on Modal with torch.profiler, and collects step times, kernel breakdown, GPU utilization, and memory usage. Output: profine_output/profile/profile_record.json
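
The role of `--warmup` can be sketched in a few lines of post-processing: discard the warmup steps (which are dominated by compilation and cache effects), then summarize the steady state. Names and numbers below are illustrative, not profine's internals:

```python
from statistics import mean

def summarize(step_times_ms, warmup):
    """Discard the first `warmup` steps, report the steady-state mean step time."""
    steady = step_times_ms[warmup:]
    return {"steps_measured": len(steady), "mean_step_ms": mean(steady)}

# Hypothetical measurements: the first steps are slow, then the run settles.
times = [310.0, 95.0, 52.0, 51.0, 50.0, 51.0]
print(summarize(times, warmup=2))  # {'steps_measured': 4, 'mean_step_ms': 51.0}
```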

3. Interpret — diagnose bottlenecks

profine interpret --profile-dir profine_output/profile

Combines deterministic analysis (cost, memory utilization, per-category kernel times) with an LLM diagnosis of the bottlenecks. Output: profine_output/interpret/bottleneck_report.json
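
The deterministic half amounts to simple arithmetic over the profile record. A sketch with made-up profile values and the 1x_a100 rate from the preset table below; the function names are illustrative:

```python
def run_cost_usd(mean_step_s, steps, cost_per_hr):
    """Projected GPU cost of a training run: wall time times the hourly rate."""
    return mean_step_s * steps / 3600 * cost_per_hr

def memory_utilization(peak_bytes, vram_bytes):
    """Fraction of VRAM used at peak."""
    return peak_bytes / vram_bytes

# Hypothetical: 51 ms/step for 10,000 steps on an A100 at $3.73/hr, peaking at 24 GB.
print(round(run_cost_usd(0.051, 10_000, 3.73), 3))   # 0.528
print(memory_utilization(24 * 2**30, 80 * 2**30))    # 0.3
```

Low memory utilization like this is itself a signal — it suggests headroom for larger batches or cheaper hardware, which is the kind of observation the LLM diagnosis builds on.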

4. Suggest — rank optimizations

profine suggest --interpret-dir profine_output/interpret

Filters applicable optimizations from the catalog, then has the LLM rank them by ROI. Output: profine_output/suggest/suggestion_report.json
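
The filter-then-rank pattern can be sketched as follows. The catalog entries and context fields are invented for illustration and do not reflect profine's actual catalog schema:

```python
# Toy catalog: each entry declares the conditions under which it applies.
CATALOG = [
    {"id": "torch_compile", "requires": {"min_torch": "2.0"}},
    {"id": "mixed_precision_bf16", "requires": {"gpu_bf16": True}},
    {"id": "gradient_checkpointing", "requires": {"memory_bound": True}},
]

def applicable(entry, context):
    """Deterministic pre-filter; only surviving entries are sent to the LLM ranker."""
    req = entry["requires"]
    if req.get("gpu_bf16") and not context["gpu_bf16"]:
        return False
    if req.get("memory_bound") and not context["memory_bound"]:
        return False
    return True  # version checks omitted for brevity

context = {"gpu_bf16": True, "memory_bound": False}
candidates = [e["id"] for e in CATALOG if applicable(e, context)]
print(candidates)  # ['torch_compile', 'mixed_precision_bf16']
```

Filtering deterministically before ranking keeps the LLM from wasting tokens scoring optimizations that can never apply to this run.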

5. Edit — apply an optimization

profine edit nanoGPT/train.py --suggestion-dir profine_output/suggest
profine edit nanoGPT/train.py --suggestion-dir profine_output/suggest --optimization 2
profine edit nanoGPT/train.py --suggestion-dir profine_output/suggest --optimization torch_compile
profine edit nanoGPT/train.py --suggestion-dir profine_output/suggest --top 3

The editor is multi-file aware: it discovers local modules the entry script imports and edits whichever file actually owns the code being optimized (e.g. a Trainer class or model module in a separate file). Patched library files land under profine_output/edit/files/<rel-path> — your source tree is never modified.

--top N applies the N top-ranked candidates sequentially, each layered on the previous edit. Per-iteration artifacts go in profine_output/edit/01_<entry_id>/, 02_<entry_id>/, etc.; the cumulative result lands at the standard profine_output/edit/edited_train.py + files/ paths so profine benchmark picks it up unchanged. Optimizations the LLM declines (applied: false) are recorded in the manifest's skipped list and the loop continues.

Output: profine_output/edit/edited_train.py, profine_output/edit/files/<patched library files>, profine_output/edit/change_manifest.json
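
The `--top N` stacking loop can be sketched as follows (function names are assumptions, not profine's API): each candidate edits the output of the previous one, and a declined optimization is recorded as skipped rather than aborting the run:

```python
def apply_top(script_text, ranked_ids, try_edit):
    """try_edit(text, opt_id) -> (new_text, applied: bool). Returns final text + manifest."""
    applied, skipped = [], []
    current = script_text
    for opt_id in ranked_ids:
        current, ok = try_edit(current, opt_id)
        (applied if ok else skipped).append(opt_id)
    return current, {"applied": applied, "skipped": skipped}

# Toy editor that "declines" one optimization and applies the rest.
def toy_edit(text, opt_id):
    if opt_id == "fused_optimizer":
        return text, False  # declined: text passes through unchanged
    return text + f"\n# {opt_id} applied", True

final, manifest = apply_top("x = 1", ["torch_compile", "fused_optimizer", "bf16"], toy_edit)
print(manifest)  # {'applied': ['torch_compile', 'bf16'], 'skipped': ['fused_optimizer']}
```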

6. Benchmark — measure the improvement

profine benchmark nanoGPT/train.py --optimized profine_output/edit/edited_train.py --hardware 1x_a100 --steps 20 --warmup 10

Runs original and optimized back-to-back on the same hardware. Auto-loads patched library files from profine_output/edit/files/ as workspace overlays so multi-file edits actually take effect on the optimized run. When the entry script is unchanged (multi-file edit), the same instrumented script is reused for both runs to guarantee data parity. Loss tolerance is widened automatically for optimization classes that legitimately perturb numerics (BF16 / mixed precision: rtol 5%, quantization: rtol 10%); for stacked edits the loosest applicable tolerance wins. Output: profine_output/benchmark/
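
The tolerance-widening rule is easy to state in code. A sketch using the rtol values quoted above, with the loosest applicable tolerance winning for stacked edits (the class names and function are illustrative):

```python
# Per-class relative tolerances, as described in the text above.
CLASS_RTOL = {"bf16": 0.05, "mixed_precision": 0.05, "quantization": 0.10}
DEFAULT_RTOL = 0.01

def effective_rtol(applied_classes):
    """Loosest applicable tolerance wins; falls back to the default."""
    return max([DEFAULT_RTOL] + [CLASS_RTOL[c] for c in applied_classes if c in CLASS_RTOL])

print(effective_rtol([]))                        # 0.01
print(effective_rtol(["bf16", "quantization"]))  # 0.1
```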

Hardware Presets

Defined in profine/config/hardware.yaml. Add new GPUs by editing the YAML.

Preset    GPU   VRAM   Cost/hr
1x_t4     T4    16 GB  $0.59
1x_l4     L4    24 GB  $0.73
1x_a10g   A10G  24 GB  $1.10
1x_a100   A100  80 GB  $3.73
1x_h100   H100  80 GB  $6.98
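
A hypothetical new entry might look like the following — the field names and price are guesses, not profine's actual schema, so check an existing entry in hardware.yaml before adding your own:

```yaml
# Illustrative only: field names and cost are assumptions.
1x_l40s:
  gpu: L40S
  vram_gb: 48
  cost_per_hr: 1.95
```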

Configuration

All data tables (hardware presets, optimization catalog, kernel patterns, extractor patterns) live in profine/config/*.yaml and can be extended without code changes.

Command Reference

Global flags (all commands)

Flag          Default           Description
--provider    openai            LLM provider (openai or anthropic)
--api-key     env var           API key override
--model       provider default  Model name override
-o, --output  profine_output    Output directory
--prefs       none              Path to user preferences markdown

profine read <script>

No additional flags.

profine profile <script>

Flag         Default  Description
--hardware   1x_a100  Hardware preset name
--steps      60       Total optimizer steps
--warmup     30       Warmup steps (discarded from analysis)
--timeout    900      Modal container timeout in seconds
--warmstart  off      Reuse deployed Modal app between runs

profine interpret

Flag           Default   Description
--profile-dir  required  Directory containing profile_record.json

profine suggest

Flag             Default      Description
--interpret-dir  required     Directory containing bottleneck_report.json
--arch-dir       auto-detect  Directory containing architecture_record.json
--profile-dir    auto-detect  Directory containing profile_record.json

profine edit <script>

Flag              Default         Description
--suggestion-dir  required        Directory containing suggestion_report.json
--optimization    1 (top-ranked)  Rank number (1, 2, ...) or entry ID (torch_compile). Ignored when --top is set.
--top             unset           Apply the top N ranked optimizations sequentially, each stacked on the previous edit.

profine benchmark <script>

Flag         Default        Description
--optimized  required       Path to the optimized script
--hardware   1x_a100        Hardware preset name
--steps      60             Total optimizer steps
--warmup     30             Warmup steps
--rtol       0.01           Relative tolerance for the loss correctness check (auto-widened for precision/quantization classes)
--atol       0.0001         Absolute tolerance for the loss correctness check (auto-widened for precision/quantization classes)
--edit-dir   <output>/edit  Directory whose files/ subtree is overlaid onto the optimized run
--timeout    900            Modal container timeout in seconds
--warmstart  off            Reuse deployed Modal app between runs

License

MIT. See LICENSE.
