Skip to main content

One GPU. Full LLM workflow. Real benchmarks. No cloud required.

Project description

llm-gpu-lab — one GPU, full LLM workflow, real benchmarks: from-scratch pretraining to LoRA/QLoRA fine-tuning, GGUF deployment and reproducible HTML reports.

CI PyPI Python License

One GPU. Full LLM workflow. Real benchmarks. No cloud required.

llm-gpu-lab architecture

llm-gpu-lab is a hands-on, end-to-end LLM toolkit you can run on a single NVIDIA GPU. It walks you from "I have a 4080 and curiosity" to a fine-tuned small LLM, a GGUF deployment, and a self-contained HTML benchmark report — without any paid hosted inference and without any cloud training.

Every number in the report comes from a JSON artifact written on the machine that ran it. There are no fake screenshots, no hand-edited numbers, and no "expected" results substituted for measured ones.

Who this is for

A developer with one consumer NVIDIA GPU (RTX 4080 / 4070 Ti SUPER / 3090 class, 12–24 GB) who wants to learn — or demonstrate — the full local LLM workflow:

  1. environment diagnosis
  2. tokenizer training
  3. tiny GPT pretraining from scratch
  4. generation from the trained checkpoint
  5. LoRA / QLoRA supervised fine-tuning of a small open model
  6. lightweight evaluation
  7. (optional) lm-evaluation-harness integration
  8. GGUF export and llama.cpp serving
  9. GPU benchmarking and a reproducible HTML report

If you already know all of those, this is also a clean, opinionated template you can extend.

Architecture

Two parallel paths share the tokenizer + eval + benchmark + report infrastructure: a from-scratch TinyGPT pretraining path (top) and an open-Hub model + LoRA / QLoRA SFT path (middle), with an optional GGUF export + llama.cpp serve branch off the SFT path. Every box writes a JSON artifact under results/<gpu>/, and the final HTML report renders all of them together.

Quickstart

# 1. Create a venv (uv recommended; plain python -m venv also works)
uv venv --python 3.11 .venv
source .venv/Scripts/activate

# 2. Install PyTorch with the CUDA wheel that matches your driver
pip install --index-url https://download.pytorch.org/whl/cu124 "torch>=2.4"

# 3a. Install from PyPI (most users)
pip install "llm-gpu-lab[nlp,hub]"

# 3b. ...or from source if you want to hack on it
git clone https://github.com/DaoyuanLi2816/llm-gpu-lab && cd llm-gpu-lab
pip install -e ".[dev,nlp,hub]"

# 4. Verify the environment
python -m llm_gpu_lab doctor --out results/rtx4080/environment.json

# 5. Run the full smoke pipeline
make smoke

# 6. Open the HTML report
xdg-open results/rtx4080/report.html      # Linux
open    results/rtx4080/report.html      # macOS
start   results\rtx4080\report.html      # Windows

The smoke pipeline takes about 30 seconds of compute on an RTX 4080 plus a one-time ~270 MB Hugging Face download for SmolLM2-135M.

What runs on an RTX 4080

Step Config Wall clock Notes
Pretrain (smoke) configs/pretrain/tiny_10m_smoke.yaml ~1.5 s 200 steps, 1.3 M params, ~600 MB VRAM
Pretrain (30 M) configs/pretrain/tiny_30m_4080.yaml ~30 s 2 000 steps, ~30 M params
Pretrain (100 M) configs/pretrain/tiny_100m_4080.yaml ~30 min TinyStories, ~100 M params
SFT LoRA (smoke) configs/sft/smollm2_135m_lora_fallback.yaml ~8 s SmolLM2-135M, LoRA r=8, peak ~1.2 GB VRAM
SFT LoRA (Qwen) configs/sft/qwen2_5_0_5b_lora_smoke.yaml ~30 s Qwen2.5-0.5B-Instruct, LoRA r=8
SFT QLoRA configs/sft/qwen3_0_6b_qlora_4080.yaml ~1 min Qwen3-0.6B, bnb 4-bit, LoRA r=16

Measured benchmarks (RTX 4080 16 GB)

These are the literal numbers from results/rtx4080/benchmark_summary.json and results/rtx4080/pretrain_metrics.json after running make smoke on the maintainer's machine. Re-run on yours and the report will overwrite them with your numbers.

Metric Value
Pretrain throughput (TinyGPT, 1.3 M params, bf16) 274 152 tokens / s
Pretrain loss (200 steps, smoke config) 8.37 → 2.11 (eval 2.18)
Tiny-GPT autoregressive generation (cuda, 64 new tokens) ~271 tokens / s
Matmul TFLOPS — 2048 × 2048 FP16 36.9 TFLOPS
Matmul TFLOPS — 2048 × 2048 BF16 34.5 TFLOPS
Matmul TFLOPS — 2048 × 2048 FP32 12.6 TFLOPS
LoRA SFT on SmolLM2-135M (30 steps, r=8, seq 256) loss 3.30 → 2.40 (eval 2.66) in ~8 s
Basic eval pass rate (12 prompts, SFT'd SmolLM2-135M) 6 / 12 = 0.50
GGUF F16 export 269 MB
GGUF Q4_K_M quantization 101 MB, 6.17 BPW

These deliberately include a small failure mode: the basic eval scores 50 % because some of its arithmetic prompts use the first regex-matched number as the answer, and SmolLM2-135M echoes the operands first. We keep that honest — it is a real artifact of the chosen extraction rule, not a bug we hid.

CLI

python -m llm_gpu_lab doctor           --out results/rtx4080/environment.json
python -m llm_gpu_lab train-tokenizer  --config configs/pretrain/tiny_10m_smoke.yaml
python -m llm_gpu_lab pretrain         --config configs/pretrain/tiny_10m_smoke.yaml
python -m llm_gpu_lab generate         --checkpoint artifacts/checkpoints/tiny_10m_smoke/final.safetensors \
                                       --prompts examples/prompts/generation_prompts.txt \
                                       --out results/rtx4080/generation_samples.json
python -m llm_gpu_lab sft              --config configs/sft/smollm2_135m_lora_fallback.yaml
python -m llm_gpu_lab eval             --config configs/eval/smoke_eval.yaml
python -m llm_gpu_lab bench-gpu        --out results/rtx4080/benchmark_summary.json
python -m llm_gpu_lab export-gguf      --config configs/export/gguf_q4_k_m.yaml
python -m llm_gpu_lab serve-llamacpp   --model artifacts/gguf/smollm2_135m_lora.Q4_K_M.gguf --port 8080
python -m llm_gpu_lab report           --results-dir results/rtx4080 --out results/rtx4080/report.html
python -m llm_gpu_lab lm-eval          --base-model HuggingFaceTB/SmolLM2-135M-Instruct --task arc_easy --limit 20

Every command writes a machine-readable JSON artifact under results/rtx4080/. The HTML report only renders sections for artifacts that actually exist on disk.

Project layout

llm-gpu-lab/
├── src/llm_gpu_lab/        # Python package (CLI, models, train, eval, …)
├── configs/                # YAML configs for pretrain / sft / eval / export
├── scripts/                # Shell helpers (setup_env, setup_llamacpp, smoke)
├── examples/prompts/       # eval prompts (jsonl) + generation prompts (txt)
├── docs/                   # quickstart, design, troubleshooting, licenses, …
├── tests/                  # pytest suite (CPU-friendly, GPU tests auto-skip)
├── results/rtx4080/        # committed JSON / HTML artifacts from real runs
├── pyproject.toml          # Python project metadata + ruff + pytest config
├── Makefile                # `make smoke`, `make test`, `make lint`, …
└── .github/workflows/ci.yml# CPU-only CI (lint + tests + audit)

License

Apache-2.0 — see LICENSE. The licence choice and the licences of major dependencies are explained in docs/licenses.md.

Troubleshooting

Start with docs/troubleshooting.md. It covers the issues we actually hit while building this repo: CUDA not available, bitsandbytes import errors, OOM during QLoRA, HF rate limits, llama.cpp build failures, GGUF conversion edge cases, Windows

  • Unicode console issues, and more.

Roadmap

See docs/roadmap.md — that is also the only file where TODO / FIXME are allowed (enforced by scripts/audit_placeholders.sh).

Contributing

CONTRIBUTING.md. Short version:

  • make lint, make test, make audit must pass.
  • Numbers in docs must come from a JSON in results/<gpu>/. No fake benchmark numbers.
  • Public datasets / models only, with licences documented.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

llm_gpu_lab-0.1.1.tar.gz (58.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

llm_gpu_lab-0.1.1-py3-none-any.whl (61.1 kB view details)

Uploaded Python 3

File details

Details for the file llm_gpu_lab-0.1.1.tar.gz.

File metadata

  • Download URL: llm_gpu_lab-0.1.1.tar.gz
  • Upload date:
  • Size: 58.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for llm_gpu_lab-0.1.1.tar.gz
Algorithm Hash digest
SHA256 82bfa36ab9ca72f82980524a2cbe3a4a5c9c231e604c6fb2aab960a4de1d0233
MD5 b7f9fd0b70403d5f7eac9eddbd0197bf
BLAKE2b-256 b21db953f3b8ec1fa285226b366c4e54c720eff7d8152b079844f834024ba545

See more details on using hashes here.

Provenance

The following attestation bundles were made for llm_gpu_lab-0.1.1.tar.gz:

Publisher: release.yml on DaoyuanLi2816/llm-gpu-lab

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file llm_gpu_lab-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: llm_gpu_lab-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 61.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for llm_gpu_lab-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 0e10c8327a92b0dff55d43030c6a8d9dd11edce6918a5a18912ebfdf6cb4c1d0
MD5 d39ab21722092a806164833d4d1eca46
BLAKE2b-256 28e73630ba7c3435efe82f3dc58c74e9b14cc9a39ac06caa8409e83af2f02d92

See more details on using hashes here.

Provenance

The following attestation bundles were made for llm_gpu_lab-0.1.1-py3-none-any.whl:

Publisher: release.yml on DaoyuanLi2816/llm-gpu-lab

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page