llm-gpu-lab

One GPU. Full LLM workflow. Real benchmarks. No cloud required.

llm-gpu-lab architecture

llm-gpu-lab is a hands-on, end-to-end LLM toolkit you can run on a single NVIDIA GPU. It walks you from "I have a 4080 and curiosity" to a fine-tuned small LLM, a GGUF deployment, and a self-contained HTML benchmark report — without any paid hosted inference and without any cloud training.

Every number in the report comes from a JSON artifact written on the machine that ran it. There are no fake screenshots, no hand-edited numbers, and no "expected" results substituted for measured ones.

Who this is for

A developer with one consumer NVIDIA GPU (RTX 4080 / 4070 Ti SUPER / 3090 class, 12–24 GB) who wants to learn — or demonstrate — the full local LLM workflow:

environment diagnosis
tokenizer training
tiny GPT pretraining from scratch
generation from the trained checkpoint
LoRA / QLoRA supervised fine-tuning of a small open model
lightweight evaluation
(optional) lm-evaluation-harness integration
GGUF export and llama.cpp serving
GPU benchmarking and a reproducible HTML report

If you already know all of those, this is also a clean, opinionated template you can extend.

Architecture

Two parallel paths share the tokenizer + eval + benchmark + report infrastructure: a from-scratch TinyGPT pretraining path (top) and an open-Hub model + LoRA / QLoRA SFT path (middle), with an optional GGUF export + llama.cpp serve branch off the SFT path. Every box writes a JSON artifact under results/<gpu>/, and the final HTML report renders all of them together.

Quickstart

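# 1. Create a venv (uv recommended; plain python -m venv also works)
uv venv --python 3.11 --seed .venv     # --seed installs pip into the venv; the steps below call pip
source .venv/bin/activate              # Linux / macOS
# source .venv/Scripts/activate        # Windows (Git Bash)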

# 2. Install PyTorch with the CUDA wheel that matches your driver
pip install --index-url https://download.pytorch.org/whl/cu124 "torch>=2.4"

# 3. Install the project
pip install -e ".[dev,nlp,hub]"

# 4. Verify the environment
python -m llm_gpu_lab doctor --out results/rtx4080/environment.json

# 5. Run the full smoke pipeline
make smoke

# 6. Open the HTML report
xdg-open results/rtx4080/report.html      # Linux
open    results/rtx4080/report.html      # macOS
start   results\rtx4080\report.html      # Windows

The smoke pipeline takes about 30 seconds of compute on an RTX 4080 plus a one-time ~270 MB Hugging Face download for SmolLM2-135M.
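
If step 4 reports a problem, it can help to confirm by hand that the installed PyTorch wheel actually sees the GPU. The sketch below is not the project's `doctor` command. It is a minimal, hypothetical check (the function name `cuda_summary` and the dictionary keys are made up for this example) that uses only standard PyTorch calls and returns a harmless default when `torch` is not installed.

```python
import importlib.util
import json


def cuda_summary() -> dict:
    """Return a small dict describing whether PyTorch can see a CUDA GPU."""
    info = {"torch_installed": False, "cuda_available": False, "devices": []}
    if importlib.util.find_spec("torch") is None:
        return info  # PyTorch is not installed in this environment

    import torch

    info["torch_installed"] = True
    info["torch_version"] = torch.__version__
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        for i in range(torch.cuda.device_count()):
            props = torch.cuda.get_device_properties(i)
            info["devices"].append(
                {"name": props.name, "vram_gb": round(props.total_memory / 1024**3, 1)}
            )
    return info


if __name__ == "__main__":
    print(json.dumps(cuda_summary(), indent=2))
```

If `cuda_available` comes back false on a machine with a working NVIDIA driver, the usual suspect is a CPU-only or mismatched CUDA wheel from step 2.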

What runs on an RTX 4080

| Step | Config | Wall clock | Notes |
| --- | --- | --- | --- |
| Pretrain (smoke) | `configs/pretrain/tiny_10m_smoke.yaml` | ~1.5 s | 200 steps, 1.3 M params, ~600 MB VRAM |
| Pretrain (30 M) | `configs/pretrain/tiny_30m_4080.yaml` | ~30 s | 2 000 steps, ~30 M params |
| Pretrain (100 M) | `configs/pretrain/tiny_100m_4080.yaml` | ~30 min | TinyStories, ~100 M params |
| SFT LoRA (smoke) | `configs/sft/smollm2_135m_lora_fallback.yaml` | ~8 s | SmolLM2-135M, LoRA r=8, peak ~1.2 GB VRAM |
| SFT LoRA (Qwen) | `configs/sft/qwen2_5_0_5b_lora_smoke.yaml` | ~30 s | Qwen2.5-0.5B-Instruct, LoRA r=8 |
| SFT QLoRA | `configs/sft/qwen3_0_6b_qlora_4080.yaml` | ~1 min | Qwen3-0.6B, bnb 4-bit, LoRA r=16 |

Measured benchmarks (RTX 4080 16 GB, May 2026)

These are the literal numbers from results/rtx4080/benchmark_summary.json and results/rtx4080/pretrain_metrics.json after running make smoke on the maintainer's machine. Re-run on yours and the report will overwrite them with your numbers.

| Metric | Value |
| --- | --- |
| Pretrain throughput (TinyGPT, 1.3 M params, bf16) | 274 152 tokens / s |
| Pretrain loss (200 steps, smoke config) | 8.37 → 2.11 (eval 2.18) |
| TinyGPT autoregressive generation (cuda, 64 new tokens) | ~271 tokens / s |
| Matmul TFLOPS, 2048 × 2048 FP16 | 36.9 TFLOPS |
| Matmul TFLOPS, 2048 × 2048 BF16 | 34.5 TFLOPS |
| Matmul TFLOPS, 2048 × 2048 FP32 | 12.6 TFLOPS |
| LoRA SFT on SmolLM2-135M (30 steps, r=8, seq 256) | loss 3.30 → 2.40 (eval 2.66) in ~8 s |
| Basic eval pass rate (12 prompts, SFT'd SmolLM2-135M) | 6 / 12 = 0.50 |
| GGUF F16 export | 269 MB |
| GGUF Q4_K_M quantization | 101 MB, 6.17 BPW |
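
To sanity-check the matmul rows: a dense N × N by N × N matrix multiply is conventionally counted as 2·N³ floating-point operations, so TFLOPS follows directly from the measured time per multiply. The sketch below shows only that arithmetic. The function names are illustrative, and it is an assumption that the project's `bench-gpu` command uses the same 2·N³ convention.

```python
def matmul_flops(n: int) -> int:
    """FLOPs for one n x n @ n x n matmul: n^3 multiplies plus n^3 adds."""
    return 2 * n**3


def tflops(n: int, seconds_per_matmul: float) -> float:
    """Convert a measured time per matmul into TFLOPS."""
    return matmul_flops(n) / seconds_per_matmul / 1e12


def seconds_for(n: int, reported_tflops: float) -> float:
    """Invert the formula: time per matmul implied by a TFLOPS figure."""
    return matmul_flops(n) / (reported_tflops * 1e12)


if __name__ == "__main__":
    n = 2048
    # Values from the table above: FP16 36.9, BF16 34.5, FP32 12.6 TFLOPS.
    for dtype, value in [("fp16", 36.9), ("bf16", 34.5), ("fp32", 12.6)]:
        ms = seconds_for(n, value) * 1e3
        print(f"{dtype}: {value} TFLOPS -> {ms:.3f} ms per {n}x{n} matmul")
```

At 36.9 TFLOPS a single 2048 × 2048 multiply takes under half a millisecond, which is why a benchmark of this kind normally times many iterations and synchronizes the GPU before reading the clock.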

The benchmark table deliberately includes a small failure mode: the basic eval scores 50 % because, for some of its arithmetic prompts, the scorer takes the first regex-matched number in the response as the answer, and SmolLM2-135M echoes the operands before giving the result. We keep that visible: it is a real artifact of the chosen extraction rule, not a bug we hid.
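
This failure mode is easy to reproduce in isolation. The sketch below is not the project's scorer. The regex, the function names, and the sample response are hypothetical, and they only illustrate how a first-match rule picks up an echoed operand where a last-match rule would not.

```python
import re
from typing import Optional

NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def first_number(text: str) -> Optional[str]:
    """First-match rule: take the first number that appears in the response."""
    match = NUMBER.search(text)
    return match.group(0) if match else None


def last_number(text: str) -> Optional[str]:
    """Alternative rule for comparison: take the last number in the response."""
    matches = NUMBER.findall(text)
    return matches[-1] if matches else None


if __name__ == "__main__":
    # Hypothetical response in which the model echoes the operands first.
    response = "17 + 25 = 42"
    expected = "42"
    print("first-match correct:", first_number(response) == expected)
    print("last-match correct: ", last_number(response) == expected)
```

A last-match rule has its own blind spots (for example, a response that keeps talking after the answer), so neither rule is free of extraction artifacts.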

CLI

python -m llm_gpu_lab doctor           --out results/rtx4080/environment.json
python -m llm_gpu_lab train-tokenizer  --config configs/pretrain/tiny_10m_smoke.yaml
python -m llm_gpu_lab pretrain         --config configs/pretrain/tiny_10m_smoke.yaml
python -m llm_gpu_lab generate         --checkpoint artifacts/checkpoints/tiny_10m_smoke/final.safetensors \
                                       --prompts examples/prompts/generation_prompts.txt \
                                       --out results/rtx4080/generation_samples.json
python -m llm_gpu_lab sft              --config configs/sft/smollm2_135m_lora_fallback.yaml
python -m llm_gpu_lab eval             --config configs/eval/smoke_eval.yaml
python -m llm_gpu_lab bench-gpu        --out results/rtx4080/benchmark_summary.json
python -m llm_gpu_lab export-gguf      --config configs/export/gguf_q4_k_m.yaml
python -m llm_gpu_lab serve-llamacpp   --model artifacts/gguf/smollm2_135m_lora.Q4_K_M.gguf --port 8080
python -m llm_gpu_lab report           --results-dir results/rtx4080 --out results/rtx4080/report.html
python -m llm_gpu_lab lm-eval          --base-model HuggingFaceTB/SmolLM2-135M-Instruct --task arc_easy --limit 20

Every command writes a machine-readable JSON artifact under results/rtx4080/. The HTML report only renders sections for artifacts that actually exist on disk.
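
One way to get this "render only what exists" behaviour is to map each known artifact file name to a report section and skip the files that are missing. The sketch below is an illustration, not the project's report code: the `SECTIONS` mapping, the section titles, and the function names are hypothetical, although the four file names are ones that appear elsewhere in this README.

```python
import json
from pathlib import Path

# Hypothetical mapping of artifact file name -> report section title.
SECTIONS = {
    "environment.json": "Environment",
    "pretrain_metrics.json": "Pretraining",
    "generation_samples.json": "Generation samples",
    "benchmark_summary.json": "GPU benchmarks",
}


def load_existing_artifacts(results_dir: Path) -> dict:
    """Load only the artifacts that are actually present on disk."""
    found = {}
    for filename, title in SECTIONS.items():
        path = results_dir / filename
        if path.is_file():
            found[title] = json.loads(path.read_text(encoding="utf-8"))
    return found


def render_outline(results_dir: Path) -> list:
    """Return the section titles a report would contain, in mapping order."""
    return list(load_existing_artifacts(results_dir))


if __name__ == "__main__":
    print(render_outline(Path("results/rtx4080")))
```

Because missing files are skipped rather than replaced with defaults, a partial run yields a shorter report instead of one padded with placeholder values.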

Project layout

llm-gpu-lab/
├── src/llm_gpu_lab/        # Python package (CLI, models, train, eval, …)
├── configs/                # YAML configs for pretrain / sft / eval / export
├── scripts/                # Shell helpers (setup_env, setup_llamacpp, smoke)
├── examples/prompts/       # eval prompts (jsonl) + generation prompts (txt)
├── docs/                   # quickstart, design, troubleshooting, licenses, …
├── tests/                  # pytest suite (CPU-friendly, GPU tests auto-skip)
├── results/rtx4080/        # committed JSON / HTML artifacts from real runs
├── pyproject.toml          # Python project metadata + ruff + pytest config
├── Makefile                # `make smoke`, `make test`, `make lint`, …
└── .github/workflows/ci.yml # CPU-only CI (lint + tests + audit)

License

Apache-2.0 — see LICENSE. The licence choice and the licences of major dependencies are explained in docs/licenses.md.

Troubleshooting

Start with docs/troubleshooting.md. It covers the issues we actually hit while building this repo: CUDA not available, bitsandbytes import errors, OOM during QLoRA, HF rate limits, llama.cpp build failures, GGUF conversion edge cases, Windows Unicode console issues, and more.

Roadmap

See docs/roadmap.md — that is also the only file where TODO / FIXME are allowed (enforced by scripts/audit_placeholders.sh).
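
The placeholder rule is simple enough to sketch. The Python below is not the contents of scripts/audit_placeholders.sh, which is a shell script and is not reproduced here. It is a hypothetical equivalent, with made-up function names and an assumed set of file suffixes, that reports every TODO / FIXME outside the one allowed file.

```python
import re
from pathlib import Path

PLACEHOLDER = re.compile(r"\b(TODO|FIXME)\b")
ALLOWED = Path("docs/roadmap.md")            # the one file allowed to contain them
SUFFIXES = {".py", ".md", ".yaml", ".sh"}    # assumption: file types worth scanning


def find_placeholders(root: Path) -> list:
    """Return (relative path, line number, line) for each disallowed placeholder."""
    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in SUFFIXES:
            continue
        rel = path.relative_to(root)
        if rel == ALLOWED:
            continue
        lines = path.read_text(encoding="utf-8", errors="replace").splitlines()
        for lineno, line in enumerate(lines, start=1):
            if PLACEHOLDER.search(line):
                hits.append((rel.as_posix(), lineno, line.strip()))
    return hits


def main(root: Path = Path(".")) -> int:
    """Print each hit and return a shell-style exit code (0 means clean)."""
    problems = find_placeholders(root)
    for rel, lineno, line in problems:
        print(f"{rel}:{lineno}: {line}")
    return 1 if problems else 0
```

Calling `main()` from the repository root and passing its return value to `sys.exit` would give CI a non-zero exit status whenever a stray placeholder appears. A real version would also need to skip virtual environments and its own source file, since the regex line above matches itself.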

Contributing

See CONTRIBUTING.md. Short version:

make lint, make test, make audit must pass.
Numbers in docs must come from a JSON in results/<gpu>/. No fake benchmark numbers.
Public datasets / models only, with licences documented.

Project details

Version 0.1.0, released May 16, 2026.

