# profine

Deterministic GPU training profiling and optimization CLI.

Profile and optimize PyTorch training scripts on real Modal GPUs. Six-step pipeline: read, profile, interpret, suggest, edit, benchmark.
## Install

```bash
pip install -e .
```

Requires a Modal account and an LLM API key (OpenAI or Anthropic).
## Setup

```bash
export MODAL_TOKEN_ID=...
export MODAL_TOKEN_SECRET=...
export OPENAI_API_KEY=...   # or ANTHROPIC_API_KEY
export HF_TOKEN=...         # optional, for gated models
```
## Pipeline

```
read → profile → interpret → suggest → edit → benchmark
```

Each step reads the previous step's output from `profine_output/`.
### 1. Read — analyze the training script

```bash
profine read nanoGPT/train.py
```

Extracts model architecture, optimizer, dataloader, precision, and distributed strategy via AST + LLM analysis. Output: `profine_output/read/architecture_record.json`
### 2. Profile — run on a real GPU

```bash
profine profile nanoGPT/train.py --hardware 1x_a100 --steps 20 --warmup 10
```

Instruments the script, executes on Modal with `torch.profiler`, and collects step times, kernel breakdown, GPU utilization, and memory usage. Output: `profine_output/profile/profile_record.json`
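The `--steps`/`--warmup` split means every step runs, but the first `--warmup` steps are excluded from the statistics (compilation, allocator warmup, and cache effects dominate there). A minimal, illustrative sketch of that reduction (not profine's actual code):

```python
def summarize_step_times(step_times_ms, warmup):
    """Discard warmup steps, then report mean and p95 step time."""
    measured = sorted(step_times_ms[warmup:])
    if not measured:
        raise ValueError("no steps left after warmup")
    mean = sum(measured) / len(measured)
    p95 = measured[int(0.95 * (len(measured) - 1))]
    return {"steps": len(measured), "mean_ms": round(mean, 2), "p95_ms": round(p95, 2)}

# 20 steps, 10 warmup: only the last 10 are measured
times = [500, 450, 400, 380, 370, 365, 362, 361, 360, 360] + [358] * 10
print(summarize_step_times(times, warmup=10))
```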
### 3. Interpret — diagnose bottlenecks

```bash
profine interpret --profile-dir profine_output/profile
```

Deterministic analysis (cost, memory utilization, per-category kernel times) + LLM diagnosis of bottlenecks. Output: `profine_output/interpret/bottleneck_report.json`
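The deterministic half of this step can be sketched in a few lines. The numbers and category names below are hypothetical (profine's kernel patterns live in `profine/config/*.yaml`); the cost arithmetic uses the `1x_a100` rate from the hardware table:

```python
def diagnose(kernel_ms: dict, step_ms: float, cost_per_hr: float) -> dict:
    """Per-category kernel time shares plus cost per 1k optimizer steps.
    Illustrates the deterministic slice of the analysis; the real
    report also feeds an LLM diagnosis."""
    total = sum(kernel_ms.values())
    shares = {k: round(v / total, 3) for k, v in kernel_ms.items()}
    # step_ms per step -> hours per 1000 steps -> dollars
    cost_per_1k = step_ms / 1000 / 3600 * cost_per_hr * 1000
    return {"shares": shares, "cost_per_1k_steps_usd": round(cost_per_1k, 4)}

report = diagnose(
    {"matmul": 220.0, "attention": 90.0, "elementwise": 60.0, "memcpy": 30.0},
    step_ms=400.0,
    cost_per_hr=3.73,  # 1x_a100 preset rate
)
print(report)
```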
### 4. Suggest — rank optimizations

```bash
profine suggest --interpret-dir profine_output/interpret
```

Filters applicable optimizations from the catalog; the LLM ranks them by ROI. Output: `profine_output/suggest/suggestion_report.json`
### 5. Edit — apply an optimization

```bash
profine edit nanoGPT/train.py --suggestion-dir profine_output/suggest
profine edit nanoGPT/train.py --suggestion-dir profine_output/suggest --optimization 2
profine edit nanoGPT/train.py --suggestion-dir profine_output/suggest --optimization torch_compile
profine edit nanoGPT/train.py --suggestion-dir profine_output/suggest --top 3
```

The editor is multi-file aware: it discovers the local modules the entry script imports and edits whichever file actually owns the code being optimized (e.g., a Trainer class or model module in a separate file). Patched library files land under `profine_output/edit/files/<rel-path>`; your source tree is never modified.

`--top N` applies the N top-ranked candidates sequentially, each layered on the previous edit. Per-iteration artifacts go in `profine_output/edit/01_<entry_id>/`, `02_<entry_id>/`, etc.; the cumulative result lands at the standard `profine_output/edit/edited_train.py` + `files/` paths, so `profine benchmark` picks it up unchanged. Optimizations the LLM declines (`applied: false`) are recorded in the manifest's `skipped` list, and the loop continues.

Output: `profine_output/edit/edited_train.py`, `profine_output/edit/files/<patched library files>`, `profine_output/edit/change_manifest.json`
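The import-discovery part of the multi-file editor can be sketched with the stdlib `ast` module. This is an illustrative simplification (a real implementation must also resolve packages and relative imports); it finds sibling `.py` files the entry script imports:

```python
import ast
from pathlib import Path

def local_imports(entry: Path) -> list[Path]:
    """Return sibling .py files that the entry script imports by module name."""
    tree = ast.parse(entry.read_text())
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            names.add(node.module.split(".")[0])
    # Keep only modules that exist as files next to the entry script;
    # third-party imports like torch are filtered out here.
    return [p for name in sorted(names) if (p := entry.parent / f"{name}.py").exists()]
```

With the local module graph known, the editor can route each optimization to the file that owns the relevant code instead of blindly patching the entry script.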
### 6. Benchmark — measure the improvement

```bash
profine benchmark nanoGPT/train.py --optimized profine_output/edit/edited_train.py --hardware 1x_a100 --steps 20 --warmup 10
```

Runs original and optimized back-to-back on the same hardware.

- Patched library files from `profine_output/edit/files/` are auto-loaded as workspace overlays, so multi-file edits actually take effect on the optimized run.
- When the entry script is unchanged (a multi-file edit), the same instrumented script is reused for both runs to guarantee data parity.
- Loss tolerance is widened automatically for optimization classes that legitimately perturb numerics (BF16 / mixed precision: rtol 5%; quantization: rtol 10%). For stacked edits, the loosest applicable tolerance wins.

Output: `profine_output/benchmark/`
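The tolerance-widening rule can be made concrete with a short sketch. The class names in `widened` are hypothetical; the percentages and the "loosest wins" rule come from the description above:

```python
def loss_check(ref: float, opt: float, classes: set[str],
               rtol: float = 0.01, atol: float = 1e-4) -> bool:
    """Loss correctness check with auto-widened relative tolerance.
    BF16 / mixed precision widens rtol to 5%, quantization to 10%;
    for stacked edits the loosest applicable tolerance wins."""
    widened = {"mixed_precision": 0.05, "quantization": 0.10}  # hypothetical class names
    rtol = max([rtol] + [widened[c] for c in classes if c in widened])
    return abs(ref - opt) <= atol + rtol * abs(ref)

print(loss_check(2.50, 2.58, set()))                # False: 3.2% drift vs default 1% rtol
print(loss_check(2.50, 2.58, {"mixed_precision"}))  # True: widened to 5%
```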
## Hardware Presets

Defined in `profine/config/hardware.yaml`. Add new GPUs by editing the YAML.

| Preset | GPU | VRAM | Cost/hr |
|---|---|---|---|
| `1x_t4` | T4 | 16 GB | $0.59 |
| `1x_l4` | L4 | 24 GB | $0.73 |
| `1x_a10g` | A10G | 24 GB | $1.10 |
| `1x_a100` | A100 | 80 GB | $3.73 |
| `1x_h100` | H100 | 80 GB | $6.98 |
## Configuration

All data tables (hardware presets, optimization catalog, kernel patterns, extractor patterns) live in `profine/config/*.yaml` and can be extended without code changes.
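As a sketch, a hardware preset entry might look like the fragment below. The field names are illustrative only; check `profine/config/hardware.yaml` for the actual schema. The values mirror the `1x_a100` row from the table above:

```yaml
# profine/config/hardware.yaml (field names illustrative)
1x_a100:
  gpu: A100
  vram_gb: 80
  cost_per_hr_usd: 3.73
```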
## Command Reference

### Global flags (all commands)

| Flag | Default | Description |
|---|---|---|
| `--provider` | `openai` | LLM provider (`openai` or `anthropic`) |
| `--api-key` | env var | API key override |
| `--model` | provider default | Model name override |
| `-o, --output` | `profine_output` | Output directory |
| `--prefs` | none | Path to user preferences markdown |
### `profine read <script>`

No additional flags.
### `profine profile <script>`

| Flag | Default | Description |
|---|---|---|
| `--hardware` | `1x_a100` | Hardware preset name |
| `--steps` | 60 | Total optimizer steps |
| `--warmup` | 30 | Warmup steps (discarded from analysis) |
| `--timeout` | 900 | Modal container timeout in seconds |
| `--warmstart` | off | Reuse deployed Modal app between runs |
### `profine interpret`

| Flag | Default | Description |
|---|---|---|
| `--profile-dir` | required | Directory containing `profile_record.json` |
### `profine suggest`

| Flag | Default | Description |
|---|---|---|
| `--interpret-dir` | required | Directory containing `bottleneck_report.json` |
| `--arch-dir` | auto-detect | Directory containing `architecture_record.json` |
| `--profile-dir` | auto-detect | Directory containing `profile_record.json` |
### `profine edit <script>`

| Flag | Default | Description |
|---|---|---|
| `--suggestion-dir` | required | Directory containing `suggestion_report.json` |
| `--optimization` | 1 (top-ranked) | Rank number (1, 2, ...) or entry ID (`torch_compile`). Ignored when `--top` is set. |
| `--top` | unset | Apply the top N ranked optimizations sequentially, each stacked on the previous edit. |
### `profine benchmark <script>`

| Flag | Default | Description |
|---|---|---|
| `--optimized` | required | Path to the optimized script |
| `--hardware` | `1x_a100` | Hardware preset name |
| `--steps` | 60 | Total optimizer steps |
| `--warmup` | 30 | Warmup steps |
| `--rtol` | 0.01 | Relative tolerance for loss correctness check (auto-widened for precision/quantization classes) |
| `--atol` | 0.0001 | Absolute tolerance for loss correctness check (auto-widened for precision/quantization classes) |
| `--edit-dir` | `<output>/edit` | Directory whose `files/` subtree is overlaid onto the optimized run |
| `--timeout` | 900 | Modal container timeout in seconds |
| `--warmstart` | off | Reuse deployed Modal app between runs |
## License

MIT. See LICENSE.