# profine

Deterministic GPU training profiling and optimization CLI for PyTorch models.

Website: [profine.ai](https://profine.ai)

Profile your PyTorch code on real GPUs. Get a transparent rewrite. Ship measured speedups before the multi-hour run.
## Demo

## Results

On Karpathy's minGPT, on a single A100:
| Metric | Baseline | profine | Δ |
|---|---|---|---|
| Step time | 1.00× | 3.1× faster | −67.7% time/step |
| Peak memory | 1.00× | −66.4% | headroom for larger batches |
Reproducible, with some run-to-run variation, via the command shown in the demo:

```bash
profine run-all examples/minGPT/projects/chargpt/chargpt.py --hardware 1x_a100 --steps 25 --warmup 10
```
## Install

```bash
pip install profine
```
Requires:

- A Modal account (GPU execution backend)
- An LLM: OpenAI, Anthropic, or any OpenAI-compatible local server (Ollama, vLLM, LM Studio, llama.cpp, LiteLLM)
```bash
export MODAL_TOKEN_ID=...
export MODAL_TOKEN_SECRET=...

# Pick one LLM:
export OPENAI_API_KEY=...      # OpenAI
export ANTHROPIC_API_KEY=...   # Anthropic
# ...or run a local server (no API key needed); see "Local LLMs" below

export HF_TOKEN=...            # optional, for gated models
```
## Local LLMs

profine talks to any OpenAI-compatible server. Run with `--provider local`, supply `--model`, and (optionally) `--base-url`.
Ollama (default endpoint `http://localhost:11434/v1`):

```bash
ollama serve &
ollama pull llama3.1:8b
profine run-all path/to/train.py --provider local --model llama3.1:8b
```
vLLM:

```bash
profine run-all path/to/train.py \
  --provider local \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --base-url http://localhost:8000/v1
```
LM Studio / llama.cpp server / LiteLLM: point `--base-url` at the server. The endpoint can also be set via the `PROFINE_LOCAL_BASE_URL` environment variable.
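For instance, with LM Studio's usual local endpoint (the port is LM Studio's default, not something profine mandates, and the model name is a placeholder):

```bash
export PROFINE_LOCAL_BASE_URL=http://localhost:1234/v1
profine run-all path/to/train.py --provider local --model <model-name-as-served>
```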
Note: the agent loop expects strong instruction-following and clean JSON output. Smaller open models (≤7B) may struggle on the interpret and suggest steps; we recommend 70B-class models or larger for end-to-end reliability.
## Pipeline

`read → profile → interpret → suggest → edit → benchmark`

Each step reads the previous step's output from `profine_output/`.

Global flags (all commands): `--provider {openai,anthropic,local}` (default `openai`), `--api-key`, `--model`, `--base-url` (for local), `-o/--output` (default `profine_output`), `--prefs`.
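run-all wires these stages together, but they can also be invoked by hand. A sketch of an equivalent manual run, using only the commands documented below (run-all additionally handles aborts and artifact wiring, so details may differ):

```bash
profine read nanoGPT/train.py
profine profile nanoGPT/train.py --hardware 1x_a100 --steps 20 --warmup 10
profine interpret --profile-dir profine_output/profile
profine suggest --interpret-dir profine_output/interpret
profine edit nanoGPT/train.py --suggestion-dir profine_output/suggest
profine benchmark nanoGPT/train.py --optimized profine_output/edit/edited_train.py --hardware 1x_a100 --steps 20 --warmup 10
```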
## Auto (run-all)

Run the entire pipeline end-to-end on one script.

```bash
profine run-all examples/minGPT/projects/chargpt/chargpt.py --hardware 1x_a100
```
| Flag | Default | Description |
|---|---|---|
| `--hardware` | `1x_a100` | Hardware preset |
| `--steps` | 60 | Total optimizer steps |
| `--warmup` | 30 | Warmup steps |
| `--timeout` | 900 | Modal container timeout (s) |
| `--warmstart` | off | Reuse deployed Modal app between runs |
| `--top` | all | Apply top N ranked optimizations |
| `--rtol` / `--atol` | 0.01 / 0.0001 | Loss tolerances (auto-widened for precision/quantization) |
Aborts on any failed step. Per-step artifacts land in their usual subdirectories under `profine_output/`.
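A fuller invocation that stacks the top three suggestions (values illustrative; `--warmstart` is assumed to be a boolean toggle given its `off` default):

```bash
profine run-all path/to/train.py --hardware 1x_h100 \
  --steps 100 --warmup 40 --timeout 1800 \
  --warmstart --top 3 --rtol 0.05
```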
### 1. Read

Extract model architecture, optimizer, dataloader, precision, and distributed strategy via AST + LLM.

```bash
profine read nanoGPT/train.py
```

No additional flags. Output: `profine_output/read/architecture_record.json`
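The record is plain JSON, so it can be inspected with standard tooling (the schema isn't documented here, so field names are whatever the extractor emits):

```bash
python -m json.tool profine_output/read/architecture_record.json
```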
### 2. Profile

Instrument the script and run it on Modal with `torch.profiler`; collect step times, kernel breakdown, GPU utilization, and memory.

```bash
profine profile nanoGPT/train.py --hardware 1x_a100 --steps 20 --warmup 10
```
| Flag | Default | Description |
|---|---|---|
| `--hardware` | `1x_a100` | Hardware preset name |
| `--steps` | 60 | Total optimizer steps |
| `--warmup` | 30 | Warmup steps (discarded) |
| `--timeout` | 900 | Modal container timeout (s) |
| `--warmstart` | off | Reuse deployed Modal app between runs |
Output: `profine_output/profile/profile_record.json`
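When iterating, a cheaper preset plus `--warmstart` keeps turnaround short (preset choice illustrative):

```bash
profine profile nanoGPT/train.py --hardware 1x_t4 --steps 20 --warmup 10 --warmstart
```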
### 3. Interpret

Deterministic analysis (cost, memory utilization, per-category kernel times) + LLM bottleneck diagnosis.

```bash
profine interpret --profile-dir profine_output/profile
```
| Flag | Default | Description |
|---|---|---|
| `--profile-dir` | required | Directory containing `profile_record.json` |
Output: `profine_output/interpret/bottleneck_report.json`
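The global `-o/--output` flag redirects artifacts, which helps when keeping multiple runs side by side (directory names illustrative):

```bash
profine interpret --profile-dir runs/a100/profile -o runs/a100
```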
### 4. Suggest

Filter applicable optimizations from the catalog; the LLM ranks them by ROI.

```bash
profine suggest --interpret-dir profine_output/interpret
```
| Flag | Default | Description |
|---|---|---|
| `--interpret-dir` | required | Directory containing `bottleneck_report.json` |
| `--arch-dir` | auto | Directory containing `architecture_record.json` |
| `--profile-dir` | auto | Directory containing `profile_record.json` |
Output: `profine_output/suggest/suggestion_report.json`
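When artifacts live outside the default layout, point the stage at them explicitly (paths illustrative):

```bash
profine suggest --interpret-dir runs/a100/interpret \
  --arch-dir runs/a100/read --profile-dir runs/a100/profile
```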
### 5. Edit

Apply an optimization. Multi-file aware: discovers local modules the entry script imports and edits whichever file owns the code being optimized. Patched library files land under `profine_output/edit/files/<rel-path>`; your source tree is never modified.

```bash
profine edit nanoGPT/train.py --suggestion-dir profine_output/suggest
profine edit nanoGPT/train.py --suggestion-dir profine_output/suggest --optimization torch_compile
profine edit nanoGPT/train.py --suggestion-dir profine_output/suggest --top 3
```
| Flag | Default | Description |
|---|---|---|
| `--suggestion-dir` | required | Directory containing `suggestion_report.json` |
| `--optimization` | 1 | Rank (1, 2, ...) or entry ID (e.g. `torch_compile`). Ignored when `--top` is set. |
| `--top` | unset | Apply the top N ranked optimizations sequentially, stacked. |
With `--top N`, per-iteration artifacts go in `profine_output/edit/01_<entry_id>/`, `02_<entry_id>/`, etc.; the cumulative result lands at `profine_output/edit/edited_train.py`. Optimizations the LLM declines are recorded in the manifest's `skipped` list and the loop continues.

Output: `profine_output/edit/edited_train.py`, `profine_output/edit/files/`, `profine_output/edit/change_manifest.json`
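Before benchmarking, it's worth reviewing the rewrite and the manifest with standard tools:

```bash
diff -u nanoGPT/train.py profine_output/edit/edited_train.py
python -m json.tool profine_output/edit/change_manifest.json
```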
### 6. Benchmark

Run original and optimized back-to-back on the same hardware. Patched library files in `profine_output/edit/files/` are overlaid on the optimized run. Loss tolerance auto-widens for numerics-perturbing classes (BF16/mixed precision: rtol 5%, quantization: rtol 10%).

```bash
profine benchmark nanoGPT/train.py --optimized profine_output/edit/edited_train.py --hardware 1x_a100 --steps 20 --warmup 10
```
| Flag | Default | Description |
|---|---|---|
| `--optimized` | required | Path to the optimized script |
| `--hardware` | `1x_a100` | Hardware preset name |
| `--steps` | 60 | Total optimizer steps |
| `--warmup` | 30 | Warmup steps |
| `--rtol` | 0.01 | Relative tolerance for loss check (auto-widened) |
| `--atol` | 0.0001 | Absolute tolerance for loss check (auto-widened) |
| `--edit-dir` | `<output>/edit` | Directory whose `files/` subtree is overlaid |
| `--timeout` | 900 | Modal container timeout (s) |
| `--warmstart` | off | Reuse deployed Modal app between runs |
Output: `profine_output/benchmark/`
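Tolerances can also be set explicitly rather than relying on auto-widening, e.g. for a mixed-precision edit (value illustrative):

```bash
profine benchmark nanoGPT/train.py \
  --optimized profine_output/edit/edited_train.py \
  --hardware 1x_a100 --steps 20 --warmup 10 --rtol 0.05
```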
## Hardware Presets

Defined in `profine/config/hardware.yaml`.
| Preset | GPU | VRAM | Cost/hr |
|---|---|---|---|
| `1x_t4` | T4 | 16 GB | $0.59 |
| `1x_l4` | L4 | 24 GB | $0.80 |
| `1x_a10g` | A10G | 24 GB | $1.10 |
| `1x_a100` | A100 | 80 GB | $2.50 |
| `1x_h100` | H100 | 80 GB | $3.95 |

Prices from [modal.com/pricing](https://modal.com/pricing).
All data tables (hardware, optimization catalog, kernel patterns, extractor patterns) live in `profine/config/*.yaml` and can be extended without code changes.
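A minimal sketch of extending the hardware table, assuming the YAML schema mirrors the preset table above (key names and the price are illustrative; check the shipped `hardware.yaml` for the real layout):

```bash
# Hypothetical preset entry; key names assumed, not taken from the real schema
cat >> profine/config/hardware.yaml <<'EOF'
1x_l40s:
  gpu: L40S
  vram_gb: 48
  cost_per_hr: 1.95
EOF
```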
## License

MIT. See LICENSE.