forge — Self-Improving SLM Training Platform
Autonomous fine-tuning with Claude-powered experiment planning.
Turn any directory into a fine-tuning project that trains, evaluates, and improves itself toward a target accuracy.
forge is a CLI tool that wraps the full small-language-model training lifecycle — data generation, training, evaluation, promotion, and augmentation — into a single autonomous loop called the flywheel. You define the task; forge does the rest.
Quick Start
pip install slm-forge # or: uv pip install slm-forge
forge init my-task # scaffold a new project
cd my-task
# edit BACKGROUND.md and mission.md, add examples to dataset/gold.jsonl
forge flywheel --iters 10 # start the autonomous loop
Running on a Remote GPU Host
forge is designed to run on any CUDA machine. For a remote GPU host accessed via SSH:
1. SSH in and activate the project
ssh -i ~/.ssh/id_ed25519 user@your-gpu-host
cd /path/to/your-forge-project
source .venv/bin/activate
2. Run in a detached tmux session
# Launch flywheel in background (survives SSH disconnect)
tmux new-session -d -s forge-run \
'forge flywheel --iters 10 2>&1 | tee output/flywheel.log'
# Attach to watch live output
tmux attach -t forge-run
# Detach without killing: Ctrl-B then D
# Tail log without attaching
tail -f output/flywheel.log
3. Monitor progress
# Quick accuracy check from last eval
cat output/eval_results.json | python3 -c \
"import sys,json; r=json.load(sys.stdin); print(f\"{r['accuracy']:.1%} ({r['correct']}/{r['total']})\")"
# Per-command breakdown
cat output/eval_results.json | python3 -c "
import sys, json
r = json.load(sys.stdin)
print(f'Accuracy: {r[\"accuracy\"]:.1%} Failures: {r[\"failure_count\"]}')
for cmd, d in sorted(r['per_cmd'].items()):
    pct = d['correct']/d['total']*100 if d['total'] else 0
    bar = '✅' if pct == 100 else ('⚠️' if pct >= 80 else '❌')
    print(f'  {bar} {cmd:20s} {d[\"correct\"]}/{d[\"total\"]} ({pct:.0f}%)')
"
# Watch heartbeat file (updated after every iteration)
watch -n 10 cat output/flywheel_heartbeat.json
4. Push a manual dataset patch mid-run
# From your local machine: copy patch file to the GPU host
scp -i ~/.ssh/id_ed25519 /tmp/patch.jsonl user@your-gpu-host:/path/to/project/dataset/
# On the GPU host: merge and bump version
python3 -c "
import json
base = [json.loads(l) for l in open('dataset/canonical.jsonl')]
patch = [json.loads(l) for l in open('dataset/patch.jsonl')]
existing = {next(m['content'] for m in e['messages'] if m['role']=='user') for e in base}
new = [e for e in patch if next(m['content'] for m in e['messages'] if m['role']=='user') not in existing]
print(f'Net new: {len(new)}')
with open('dataset/canonical.jsonl', 'a') as f:
    for e in new: f.write(json.dumps(e) + '\n')
"
# Then retrain: forge train && forge eval
How It Works
The flywheel is a closed loop. Each iteration, Claude acts as an ML experiment planner: it reads what has been tried, hypothesizes what might work better, patches config, generates targeted training data, trains the model, and evaluates it. If the model improved, the adapter is promoted. Then it commits everything to git and repeats.
┌─────────────────────────────────────────────────────────────┐
│ forge flywheel loop │
│ │
│ ┌───────┐ ┌───────┐ ┌──────┐ ┌─────────┐ │
│ │ plan │───▶│ train │───▶│ eval │───▶│ promote │ │
│ └───────┘ └───────┘ └──────┘ └────┬────┘ │
│ ▲ │ │
│ │ ┌─────────┐ ┌────────┐ │ │
│ └─────────│ commit │◀───│augment │◀────┘ │
│ └─────────┘ └────────┘ │
│ │
│ Stops when: target_accuracy reached OR max_iterations hit │
└─────────────────────────────────────────────────────────────┘
Step by step:
- Plan — Claude reads `BACKGROUND.md`, `AGENT_CONTEXT.md`, `PLAN.md`, and recent eval failures. It returns an `ExperimentPlan` with: a hypothesis, config patches to try this iteration, new background knowledge to append, and an augmentation focus.
- Train — forge applies config patches and trains the LoRA adapter for the planned number of epochs.
- Eval — forge evaluates the adapter against `dataset/gold.jsonl`. No cheating: gold data is never used in training.
- Promote — if accuracy improved, the adapter is snapshotted to `output/best_adapter_<acc>/`.
- Augment — Claude generates targeted training examples focused on the eval failures.
- Commit — git commit captures the full state: data, config, context files, and adapter snapshot.
- Repeat — `AGENT_CONTEXT.md` and `PLAN.md` are updated, and the loop restarts.
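The control flow above can be sketched in a few lines of Python. This is a toy illustration of the stop conditions and promotion rule only, not forge's actual implementation; the `train_and_eval` callback stands in for the plan/train/eval/augment/commit steps.

```python
# Toy sketch of the flywheel control flow (all heavy steps stubbed out;
# the real loop calls Claude, Unsloth, and git).

def flywheel(train_and_eval, max_iterations=10, target_accuracy=0.95):
    """Run iterations until target_accuracy is reached or max_iterations hit.

    `train_and_eval` is a callback taking the iteration number and
    returning that iteration's eval accuracy.
    """
    best, history = 0.0, []
    for iteration in range(1, max_iterations + 1):
        accuracy = train_and_eval(iteration)   # plan + train + eval
        promoted = accuracy > best             # promote only on improvement
        if promoted:
            best = accuracy
        history.append({"iteration": iteration, "accuracy": accuracy,
                        "promoted": promoted})
        if best >= target_accuracy:            # stop condition 1
            break
    return best, history                       # stop condition 2 is the range()

# Simulated run where accuracy climbs 5 points per iteration
best, history = flywheel(lambda i: 0.75 + 0.05 * i)
```

Note that a non-improving iteration still commits and augments; only promotion is gated on beating the best score.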
Installation
From PyPI
uv pip install slm-forge
# or
pip install slm-forge
From source
git clone https://github.com/your-org/slm-pipeline
cd slm-pipeline
uv pip install -e .
Requirements
| Requirement | Notes |
|---|---|
| Python 3.10+ | 3.11 recommended |
| CUDA GPU | 16 GB+ VRAM recommended (24 GB for larger models) |
| Unsloth | For fast LoRA training — pip install unsloth |
| `ANTHROPIC_API_KEY` | For datagen and planning — set in environment |
No GPU for experimentation? You can still run `forge generate`, `forge audit`, and `forge status` on CPU. Only `forge train` and `forge eval` need CUDA.
Project Structure
Running forge init my-task creates:
my-task/
├── forge.yaml # All config: model, LoRA, training, eval, flywheel
├── BACKGROUND.md # Domain knowledge for Claude datagen (edit this!)
├── mission.md # Task description for Claude datagen
├── system_prompt.md # Exact system prompt used at train + eval time
├── llm.txt # Briefing doc for any AI agent starting fresh
├── .gitignore # Ignores output/ except best_adapter/ and best_score.json
│
├── dataset/
│ ├── seed.jsonl # Your hand-crafted seed examples (optional)
│ ├── gold.jsonl # LOCKED eval set — never train on this
│ └── canonical.jsonl # Training data (forge generates + audits this)
│
└── output/ # Created at runtime
├── adapter/ # Current training output
├── best_adapter_*/ # Snapshots of best adapters (one per improvement)
├── best_score.json # Best accuracy achieved so far
├── experiments.jsonl # Full log of every iteration
├── flywheel_heartbeat.json # Live status for monitoring
└── flywheel.log # Verbose run log
Auto-created during flywheel:
├── AGENT_CONTEXT.md # Auto-updated iteration history (scores, failures)
└── PLAN.md # Claude's current strategy and hypotheses
Key Files Explained
forge.yaml — All Configuration
The single source of truth for everything. Here is a fully-annotated example:
name: my-task # project name (informational)
model:
base: unsloth/functiongemma-270m-it # HuggingFace model ID to fine-tune
max_seq_len: 2048 # maximum token sequence length
load_in_4bit: true # 4-bit quantization (saves VRAM, recommended)
lora:
r: 32 # LoRA rank — higher = more capacity, slower training
alpha: 32 # LoRA alpha — usually set equal to r
dropout: 0.05 # LoRA dropout regularization
target_modules: # which linear layers to train (model-specific)
- q_proj
- k_proj
- v_proj
- o_proj
- gate_proj
- up_proj
- down_proj
training:
epochs_min: 2 # minimum epochs to run (even if loss converges early)
epochs_max: 4 # maximum epochs before stopping
batch_size: 4 # examples per gradient step
gradient_accumulation_steps: 1 # effective batch = batch_size × this
learning_rate: 5.0e-5 # AdamW learning rate
warmup_ratio: 0.1 # fraction of steps for LR warmup
weight_decay: 0.01 # L2 regularization
lr_scheduler: cosine # cosine | linear | constant
seed: 42 # random seed for reproducibility
eval:
target_accuracy: 0.95 # stop the flywheel when this accuracy is reached
scorer: json_cmd # exact | json_cmd | custom
max_new_tokens: 256 # max tokens for model output during eval
forbidden_commands: [] # outputs that are always wrong (safety)
dataset:
seed: dataset/seed.jsonl # seed examples (copied into canonical on first run)
gold: dataset/gold.jsonl # locked eval set
canonical: dataset/canonical.jsonl # training data (grows each iteration)
promotion:
hf_repo: null # HuggingFace repo to push to (e.g. my-org/my-model)
private: true # push as private repo
flywheel:
max_iterations: 10 # max autonomous iterations before stopping
augment_per_failure: 20 # synthetic examples to generate per failure category
datagen_model: claude-sonnet-4-6 # Claude model for data generation
planner_model: claude-sonnet-4-6 # Claude model for experiment planning
compute:
device: cuda:0 # which GPU to use (cuda:0, cuda:1, cpu)
remote: null # e.g. ssh://user@host for remote execution
BACKGROUND.md — Domain Knowledge for Datagen
This is the most important file for data quality. Claude reads it in full (never truncated) before generating every training example. Think of it as your prompt engineering for synthetic data.
See Writing Good BACKGROUND.md below.
What goes here:
- The exact task description
- The required output format with a complete example
- Every edge case you know about (with correct answers)
- Common failure patterns you've observed
- What makes a good vs bad training example
What does NOT go here:
- Iteration history, scores, or run logs — those belong in `AGENT_CONTEXT.md` (auto-managed)
- Speculation or hypotheses — those belong in `PLAN.md` (auto-managed)
system_prompt.md — The Model's System Prompt
The exact system prompt that is injected at both training time and eval time. This is the single source of truth — train.py, eval.py, and datagen all load from this file.
Edit with care. Changing this file mid-training is a major disruption because the model was trained on a different prompt.
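Because every component reads the prompt from the same file, building a chat request reduces to loading it from disk. A minimal sketch, assuming the messages format used elsewhere in this README (`build_messages` is a hypothetical helper name, not forge's API):

```python
import pathlib

def build_messages(user_input, prompt_path="system_prompt.md"):
    """Prepend the shared on-disk system prompt to a user turn.

    Training, eval, and datagen all loading from the same file is what
    keeps the prompt consistent across the pipeline.
    """
    system = pathlib.Path(prompt_path).read_text().strip()
    return [{"role": "system", "content": system},
            {"role": "user", "content": user_input}]
```

If you must change the prompt, retrain from scratch afterward so the adapter and prompt stay in sync.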
mission.md — Task Description for Claude
A natural language description of the task, fed to Claude when generating training data. Unlike BACKGROUND.md (which is highly structured), mission.md is free-form prose explaining why the task matters, what the inputs look like, and any constraints Claude should respect.
BACKGROUND.md vs AGENT_CONTEXT.md
| | `BACKGROUND.md` | `AGENT_CONTEXT.md` |
|---|---|---|
| Author | You (the human) | forge (auto-managed) |
| Content | Static domain knowledge | Dynamic iteration history |
| Update frequency | Rarely, manually | After every eval |
| Truncated? | Never | Yes (keeps most recent) |
| Purpose | Ground truth for datagen | Context for planning |
BACKGROUND.md should be written once and kept accurate. AGENT_CONTEXT.md is a running log of scores, failures, and hypotheses — never edit it manually.
CLI Reference
forge init <name>
Scaffold a new forge project.
forge init my-router
forge init my-router --dir /path/to/existing-dir
forge init my-router --from /path/to/old-project # migrate from llamadrone format
forge generate
Generate synthetic training data via batched Claude API calls. Fires all batches concurrently, validates each example (assistant must be valid JSON, user must be non-empty), and deduplicates against existing canonical.jsonl.
forge generate # 500 examples, batch size 50
forge generate --n 200 --batch-size 25 # 200 examples in batches of 25
forge generate --n 1000 --output dataset/extra.jsonl
forge generate -p /path/to/project # specify project dir explicitly
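The validation and dedup pass can be sketched as below. This is an illustration of the checks described above (valid-JSON assistant turn, non-empty user turn, dedup on user content), not forge's actual datagen code:

```python
import json

def validate_and_dedup(candidates, canonical):
    """Filter generated examples against the rules forge applies:
    assistant content must parse as JSON, user content must be non-empty,
    and the user content must not already exist in canonical."""
    seen = set()
    for ex in canonical:
        seen.add(next(m["content"] for m in ex["messages"]
                      if m["role"] == "user"))
    kept = []
    for ex in candidates:
        msgs = {m["role"]: m["content"] for m in ex["messages"]}
        if not msgs.get("user", "").strip():
            continue                        # reject empty user input
        try:
            json.loads(msgs["assistant"])   # assistant must be valid JSON
        except (KeyError, json.JSONDecodeError):
            continue
        if msgs["user"] in seen:
            continue                        # dedup against canonical.jsonl
        seen.add(msgs["user"])
        kept.append(ex)
    return kept
```

Deduplicating within the candidate batch as well (via `seen.add`) prevents concurrent batches from contributing near-identical examples.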
forge audit [--fix]
Claude reviews every example in canonical.jsonl for format errors, factual mistakes, and quality issues. With --fix, it rewrites bad examples in place.
forge audit # print report, no changes
forge audit --fix # fix errors automatically
forge audit --fix --min-severity warning # fix warnings too
Audit report columns: index, severity (error/warning/info/ok), issue, suggested_fix.
forge train [--epochs N]
Train the LoRA adapter on dataset/canonical.jsonl. Output goes to output/adapter/.
forge train # use epochs from forge.yaml
forge train --epochs 3 # override epoch count
forge eval
Evaluate the current adapter against dataset/gold.jsonl. Prints accuracy and writes failures to output/eval_failures.jsonl.
forge eval
forge eval -p /path/to/project
forge flywheel --iters N
Run the full autonomous loop. This is the main command.
forge flywheel # run until target_accuracy or max_iterations
forge flywheel --iters 5 # run exactly 5 iterations
forge flywheel --skip-train # skip training on first iteration (eval existing adapter)
forge flywheel --iters 10 -p ./my-task
forge status
Show current project state: best accuracy, dataset sizes, last iteration, heartbeat status.
forge status
forge status -p /path/to/project
forge promote
Manually snapshot the current adapter to output/best_adapter_<timestamp>/. The flywheel does this automatically when accuracy improves.
forge promote
forge push
Upload the best adapter to HuggingFace (requires hf_repo in forge.yaml and HF_TOKEN).
forge push
HF_TOKEN=hf_xxx forge push
forge clean
Remove all augmented (flywheel-generated) examples from canonical.jsonl, keeping only your seed examples and hand-crafted data. Useful when starting a fresh training run.
forge clean
forge clean --keep-seed # keep seed.jsonl examples only
forge patch "description"
One-shot targeted data generation based on a plain-English description of what to generate. Useful for quickly patching gaps without running the full flywheel.
forge patch "more examples where the input contains coordinates as lat/lon floats"
forge patch "edge cases where the user says 'abort' vs 'cancel'" --n 50
forge augment
Generate training data targeted at the most recent eval failures (reads output/eval_failures.jsonl). The flywheel does this automatically; use this command for manual augmentation.
forge augment
forge augment --n 100 # generate 100 examples
The Autonomous Flywheel
The flywheel is forge's key innovation. Each iteration, Claude acts as an ML experiment planner with full context of everything that has been tried.
What Claude Reads Each Iteration
- `BACKGROUND.md` — full domain knowledge (never truncated)
- `AGENT_CONTEXT.md` — last ~5000 chars of iteration history (scores, failure taxonomy, what worked)
- `PLAN.md` — current strategy, hypotheses queue, what has worked/failed
- Live failures — the exact inputs, expected outputs, and actual outputs from the latest eval
What Claude Returns: ExperimentPlan
{
"hypothesis": "The model confuses 'hover' with 'loiter'. Generating 50 targeted examples should fix this.",
"config_patches": {
"training": {"learning_rate": 3e-5, "epochs_max": 5}
},
"forge_yaml_patches": {
"lora": {"r": 64}
},
"background_additions": "## Discovered Pattern\n\n'hover' means maintain altitude in place. 'loiter' means circle a point. These are different commands.",
"augment_focus": "Examples distinguishing hover vs loiter commands with varied phrasing",
"augment_n": 60
}
- `hypothesis` — plain-English description of what we think will help
- `config_patches` — temporary config overrides for this iteration only (not written to disk)
- `forge_yaml_patches` — permanent config changes written to `forge.yaml` on disk
- `background_additions` — new domain knowledge appended to `BACKGROUND.md`
- `augment_focus` — what kind of examples to generate this iteration
- `augment_n` — how many examples to generate
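Applying `config_patches` amounts to a recursive dictionary merge over the loaded config. A sketch under the assumption that patch keys mirror the `forge.yaml` nesting (`apply_patches` is an illustrative name, not forge's internal function):

```python
def apply_patches(config, patches):
    """Recursively merge `patches` into `config`, returning a new dict.

    Nested dicts are merged key by key; scalar values from `patches` win.
    The input config is left untouched, which is what makes per-iteration
    overrides temporary."""
    merged = dict(config)
    for key, value in patches.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_patches(merged[key], value)
        else:
            merged[key] = value
    return merged

config = {"training": {"learning_rate": 5e-5, "epochs_max": 4, "seed": 42}}
patched = apply_patches(config,
                        {"training": {"learning_rate": 3e-5, "epochs_max": 5}})
```

Untouched keys (like `seed` here) survive the merge, so a patch only needs to name what changes.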
After Each Iteration
- `AGENT_CONTEXT.md` is updated with the new score, hypothesis result, and failure counts
- `PLAN.md` is updated with the new strategy and hypotheses queue
- Git commit: `forge iter N: 87.3% — hypothesis text (first 60 chars)`
- Heartbeat written to `output/flywheel_heartbeat.json`
Heartbeat File
{
"iteration": 4,
"status": "running",
"accuracy": 0.873,
"timestamp": "2025-01-15T14:23:01Z"
}
Status values: running, completed, error. Check this file to monitor long runs without attaching to the process.
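A minimal poller over the heartbeat file might look like this. It is a sketch that assumes only the fields shown in the example above; the function name is hypothetical:

```python
import json
import pathlib
import time

def wait_for_flywheel(heartbeat="output/flywheel_heartbeat.json", poll_s=10):
    """Block until the heartbeat reports a terminal status, printing progress."""
    path = pathlib.Path(heartbeat)
    while True:
        hb = json.loads(path.read_text())
        print(f"iter {hb['iteration']}: {hb['accuracy']:.1%} [{hb['status']}]")
        if hb["status"] in ("completed", "error"):
            return hb        # terminal: flywheel finished or crashed
        time.sleep(poll_s)   # still "running" — poll again
```

This is handy in CI or a notification script where `watch`/tmux aren't available.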
forge.yaml Reference
| Field | Type | Default | Description |
|---|---|---|---|
| `model.base` | string | `unsloth/functiongemma-270m-it` | HuggingFace model ID |
| `model.max_seq_len` | int | `2048` | Max token sequence length |
| `model.load_in_4bit` | bool | `true` | 4-bit quantization |
| `lora.r` | int | `32` | LoRA rank |
| `lora.alpha` | int | `32` | LoRA alpha (usually = r) |
| `lora.dropout` | float | `0.05` | LoRA dropout |
| `lora.target_modules` | list | `[q_proj, k_proj, ...]` | Layers to train |
| `training.epochs_min` | int | `2` | Minimum epochs |
| `training.epochs_max` | int | `4` | Maximum epochs |
| `training.batch_size` | int | `4` | Examples per gradient step |
| `training.gradient_accumulation_steps` | int | `1` | Effective batch multiplier |
| `training.learning_rate` | float | `5.0e-5` | AdamW learning rate |
| `training.warmup_ratio` | float | `0.1` | Fraction of steps for LR warmup |
| `training.weight_decay` | float | `0.01` | L2 regularization |
| `training.lr_scheduler` | string | `cosine` | LR schedule (cosine/linear/constant) |
| `training.seed` | int | `42` | Random seed |
| `eval.target_accuracy` | float | `0.95` | Stop flywheel when reached |
| `eval.scorer` | string | `json_cmd` | Scoring method (see below) |
| `eval.max_new_tokens` | int | `256` | Max output tokens during eval |
| `eval.forbidden_commands` | list | `[]` | Outputs that are always wrong |
| `dataset.seed` | path | `dataset/seed.jsonl` | Seed examples |
| `dataset.gold` | path | `dataset/gold.jsonl` | Eval set (locked) |
| `dataset.canonical` | path | `dataset/canonical.jsonl` | Training data |
| `promotion.hf_repo` | string | `null` | HuggingFace push target |
| `promotion.private` | bool | `true` | Push as private repo |
| `flywheel.max_iterations` | int | `10` | Max autonomous iterations |
| `flywheel.augment_per_failure` | int | `20` | Examples per failure category |
| `flywheel.datagen_model` | string | `claude-sonnet-4-6` | Claude model for datagen |
| `flywheel.planner_model` | string | `claude-sonnet-4-6` | Claude model for planning |
| `compute.device` | string | `cuda:0` | Training device |
| `compute.remote` | string | `null` | SSH remote (e.g. `ssh://user@host`) |
Scorer types:
| Scorer | Description |
|---|---|
| `exact` | Exact string match (case-sensitive) |
| `json_cmd` | Parses JSON, checks `cmd` field only |
| `custom` | Calls `scorer.py` in project root with `(expected, actual) -> float` |
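A custom scorer is a `scorer.py` in the project root exposing a `(expected, actual) -> float` function. Here is an illustrative sketch; the entry-point name `score` and the partial-credit policy are assumptions for the example, not documented forge behavior:

```python
# scorer.py — example custom scorer for the drone-router task.
# ASSUMPTION: forge invokes a function `score(expected, actual) -> float`;
# check your forge version for the exact entry-point name.
import json

def score(expected: str, actual: str) -> float:
    """Partial credit: 1.0 for matching cmd AND args, 0.5 for cmd only,
    0.0 for a wrong cmd or unparseable output."""
    try:
        exp, act = json.loads(expected), json.loads(actual)
    except json.JSONDecodeError:
        return 0.0                       # model emitted non-JSON
    if exp.get("cmd") != act.get("cmd"):
        return 0.0                       # wrong command is always wrong
    return 1.0 if exp.get("args") == act.get("args") else 0.5
```

Returning a float (rather than a bool) lets accuracy reflect near-misses, which gives the planner a smoother signal.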
Writing Good BACKGROUND.md
BACKGROUND.md is the most important document in your project. Claude reads it in full before generating every training example. The better this file, the more realistic and diverse your training data.
Here is a complete example for a drone command routing task:
# Background — drone-router
## Task
Route incoming natural language user messages to one of 8 drone commands.
Return a structured JSON object with the command name, arguments, and confidence.
Commands (exact names — no synonyms):
- `takeoff` — lift off from current position; args: {altitude_m: float}
- `land` — descend and land; args: {}
- `goto_waypoint` — fly to a location; args: {grid: str} OR {lat: float, lon: float}
- `rtl` — return to launch point; args: {}
- `hover` — maintain current position and altitude; args: {duration_s: int | null}
- `loiter` — circle a point; args: {radius_m: float, duration_s: int | null}
- `ascend` — increase altitude; args: {delta_m: float}
- `descend` — decrease altitude; args: {delta_m: float}
- `unknown` — input doesn't map to a valid command; args: {}
## Output Format
Always return valid JSON. Never return plain text, never include prose before or after.
Schema:
{
"cmd": "<command_name>",
"args": { ... },
"confidence": <0.0–1.0>
}
Correct example:
Input: "Fly to waypoint B4"
Output: {"cmd": "goto_waypoint", "args": {"grid": "B4"}, "confidence": 0.97}
Wrong (missing args key):
Output: {"cmd": "goto_waypoint", "confidence": 0.97}
Wrong (invalid cmd name):
Output: {"cmd": "fly_to", "args": {"grid": "B4"}, "confidence": 0.97}
Wrong (plain text):
Output: The drone should go to B4.
## Edge Cases
- "Return home" / "go home" / "RTL" → cmd:rtl (NOT goto_waypoint)
- "Hover" / "hold position" / "stay here" → cmd:hover (NOT loiter)
- "Circle the area" / "loiter over" → cmd:loiter (NOT hover)
- Coordinates can be grid refs ("B4") or lat/lon: "37.7749,-122.4194"
- "Go up 10 meters" → cmd:ascend, args.delta_m:10.0
- "Go up" with no distance → cmd:ascend, args.delta_m:5.0 (default)
- Empty or gibberish input → cmd:unknown, confidence:0.0
- Multiple commands in one message → pick the primary intent
- "Land at base" vs "Land now" → both cmd:land, args:{} (base is default)
## Common Failure Patterns
- Model confuses `hover` and `loiter` — they are DIFFERENT commands
- Drops `args` key entirely when no args needed — must always include args: {}
- Outputs plain text when input contains a question mark
- Hallucinates command names: "ascend_fast", "fly_to", "go_to" — ALL INVALID
- Omits `confidence` field when input is ambiguous
- Returns confidence:1.0 for ambiguous inputs (should be 0.5–0.7)
## What Good Examples Look Like
Input variety:
- Very short: "up", "land", "B4"
- Casual: "take it up a bit", "bring it back"
- Formal: "initiate return-to-launch sequence"
- Ambiguous: "go higher" (ascend), "stay there" (hover)
- With noise: "uh, like, go to... B4 please?"
Output correctness:
- cmd is always from the exact allowed list (case-sensitive)
- args always present, even if {}
- confidence reflects actual certainty (not always 1.0)
Difficulty range:
- 30% easy (unambiguous, direct)
- 50% medium (some inference needed)
- 20% hard (edge cases, ambiguity, unusual phrasing)
Eval Set Design
The eval set (dataset/gold.jsonl) is the only objective measure of progress. Treat it like a test suite.
Golden rules:
- Never train on it. forge enforces this — gold examples are never included in `canonical.jsonl`.
- Lock it early. Create your gold set before any training and never modify it. If you keep adding to it, your accuracy numbers aren't comparable across iterations.
- Make it representative. Your gold set should cover all difficulty levels and edge cases, not just easy examples.
- Aim for 50–200 examples. Too few → noisy accuracy. Too many → slow eval loop.
How to create a gold set:
# Start with your best hand-crafted examples
# Format: one JSON per line, messages format
echo '{"messages":[{"role":"user","content":"Fly to B4"},{"role":"assistant","content":"{\"cmd\":\"goto_waypoint\",\"args\":{\"grid\":\"B4\"},\"confidence\":0.97}"}]}' >> dataset/gold.jsonl
# Or use forge generate to generate candidates, then hand-review
forge generate --n 200 --output /tmp/candidates.jsonl
# Review /tmp/candidates.jsonl, pick the ones you trust, add to gold.jsonl
Format:
{"messages": [{"role": "user", "content": "<input>"}, {"role": "assistant", "content": "<expected output>"}]}
Monitoring Long Runs
A flywheel run can take hours. Here's how to monitor without babysitting it.
Heartbeat File
watch -n 10 cat output/flywheel_heartbeat.json
{
"iteration": 7,
"status": "running",
"accuracy": 0.912,
"timestamp": "2025-01-15T14:23:01Z"
}
Experiment Log
output/experiments.jsonl — one record per iteration with full metadata:
# Show accuracy progression
cat output/experiments.jsonl | python3 -c "
import sys, json
for line in sys.stdin:
    r = json.loads(line)
    print(f\"iter {r['iteration']}: {r['accuracy']:.1%} — {r['hypothesis'][:60]}\")
"
Agent Context
AGENT_CONTEXT.md — human-readable log of scores, failure patterns, and hypotheses. Updated after every eval.
tail -100 AGENT_CONTEXT.md
Git Log
Every iteration is committed. git log --oneline shows the full history at a glance:
a1b2c3d forge iter 8: 93.2% — Increasing LoRA rank to 64 for better capacity
b2c3d4e forge iter 7: 91.5% — Targeting hover/loiter confusion with 60 new examples
c3d4e5f forge iter 6: 88.1% — Reducing LR after oscillation detected
...
Crash Recovery
- Check `output/flywheel_heartbeat.json` — `status` will be `"error"` with an `error` field
- Check `output/flywheel.log` for the traceback
- Run `forge status` to see last known accuracy
- Fix the issue (bad data, OOM, etc.)
- Resume:

forge flywheel --iters 5 --skip-train   # eval existing adapter first
# or
forge flywheel --iters 5                # retrain from scratch
Examples
forge-math-demo Walkthrough
A working example lives in examples/forge-math-demo/. It trains a model to answer arithmetic questions as structured JSON.
cd examples/forge-math-demo
# Look at what's pre-populated
cat BACKGROUND.md # describes the math→JSON task
cat dataset/gold.jsonl # 20 held-out eval examples
# Run the flywheel (no GPU needed for demo — uses cpu + tiny model)
forge flywheel --iters 3
# Check what happened
forge status
git log --oneline
cat output/experiments.jsonl | python3 -m json.tool --json-lines | head -40
Expected output after 3 iterations:
forge status
─────────────────────────────────────
Project: forge-math-demo
Best score: 0.850 (iter 2)
Last score: 0.850
Dataset: 142 training / 20 gold examples
Iterations: 3 / 10
─────────────────────────────────────
Domain Knowledge Workflow (Pretrain → Probe → Fine-tune)
For tasks that require deep domain knowledge — medical, legal, scientific, or proprietary — you can pretrain the base model on raw documents before task fine-tuning. This "smart base" absorbs domain knowledge first, making subsequent LoRA fine-tuning much more effective.
┌──────────────────────────────────────────────────────────────┐
│ Domain Knowledge Acquisition Pipeline │
│ │
│ ┌──────────┐ ┌───────────┐ ┌──────────┐ ┌─────────┐ │
│ │ probe │ │ pretrain │ │ probe │ │flywheel │ │
│ │ --tag pre│──▶│ (corpus) │──▶│ --tag │──▶│(on smart│ │
│ │ baseline │ │ + merge │ │ post │ │ base) │ │
│ └──────────┘ └───────────┘ └──────────┘ └─────────┘ │
│ │ │ │ │
│ Scores 30 Qs Causal LM on Scores same Task LoRA on │
│ before domain raw text docs 30 Qs after merged base │
│ training (no chat fmt) training model │
└──────────────────────────────────────────────────────────────┘
Step-by-step
1. Create probe questions (dataset/probe.jsonl):
{"question": "What is the half-life of carbon-14?", "ideal": "Approximately 5,730 years."}
{"question": "What does HIPAA regulate?", "ideal": "The privacy and security of protected health information (PHI)."}
30 questions is a good number — enough to measure meaningful delta, fast enough to run twice.
2. Baseline probe (run once — permanent record):
forge probe --tag pre
3. Add domain corpus to dataset/corpus/:
dataset/corpus/
├── technical_manual.pdf.txt # convert PDFs externally
├── domain_docs.md
└── training_data.jsonl # {"text": "..."}
4. Pretrain (causal LM on raw text, auto-merges to 16-bit):
forge pretrain # uses corpus_dir from forge.yaml
forge pretrain --epochs 2 # more epochs for smaller corpora
Output: output/pretrain_merged/ — this is your new base model.
5. Post-pretrain probe:
forge probe --tag post --adapter output/pretrain_merged
forge probe --compare # shows delta table
6. Update forge.yaml to use the merged model:
model:
base: output/pretrain_merged # was: unsloth/functiongemma-270m-it
7. Run task fine-tuning on the smarter base:
forge flywheel --iters 10
Key rules
- Never fine-tune LoRA on top of the pretrain LoRA adapter — always merge first (LoRA-on-LoRA = catastrophic forgetting)
- The merge uses Unsloth's `save_pretrained_merged` with `merged_16bit` — avoids the Gemma3 tokenizer breakage from `AutoModelForCausalLM`
- Probe questions must stay fixed after the pre-probe run — changing them breaks the comparison
- Pretrain LR (2e-5) is intentionally lower than fine-tune LR (5e-5) — domain absorption needs gentle updates
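The pre/post comparison that `forge probe --compare` reports boils down to a per-question delta. A sketch, assuming probe results reduce to a `{question: score}` mapping (the on-disk result format is an assumption here):

```python
def probe_delta(pre, post):
    """Summarize per-question score changes between two probe runs.

    `pre` and `post` map each probe question to a score; since the
    question set is fixed, deltas are computed pairwise."""
    deltas = {q: post[q] - pre[q] for q in pre if q in post}
    improved = sum(1 for d in deltas.values() if d > 0)
    regressed = sum(1 for d in deltas.values() if d < 0)
    mean = sum(deltas.values()) / len(deltas) if deltas else 0.0
    return {"improved": improved, "regressed": regressed, "mean_delta": mean}
```

A positive mean delta with few regressions is the signal that pretraining absorbed domain knowledge without damaging general ability.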
Architecture
Each module in forge/core/ has a single responsibility:
| Module | Description |
|---|---|
| `forge/core/config.py` | Load and validate `forge.yaml`. Handles defaults, field merging, and project root discovery by walking up from cwd. |
| `forge/core/trainer.py` | LoRA fine-tuning via Unsloth. Wraps `FastLanguageModel`, applies training config, saves adapter to `output/adapter/`. |
| `forge/core/eval_runner.py` | Run model inference on `gold.jsonl`, score each example via the configured scorer, return accuracy + failure list. |
| `forge/core/scorer.py` | Scoring strategies: `exact` (string match), `json_cmd` (parse JSON, check `cmd` field), `custom` (call `scorer.py`). |
| `forge/core/augmentor.py` | Build Claude prompts from `BACKGROUND.md` + `AGENT_CONTEXT.md` + failures, call the Claude API, parse and append to `canonical.jsonl`. |
| `forge/core/planner.py` | Maintain `AGENT_CONTEXT.md` and `PLAN.md`. Call Claude with full context to get an `ExperimentPlan`. Apply `forge_yaml_patches` to disk. |
| `forge/core/experiment_planner.py` | `ExperimentPlan` dataclass and JSON parsing/validation. |
| `forge/core/dataset.py` | JSONL loading, deduplication, seed merging, train/gold split helpers. |
| `forge/core/context.py` | Read/write `AGENT_CONTEXT.md`: append iteration summaries, truncate to `MAX_CONTEXT_CHARS`. |
CLI layer (forge/commands/): thin Click commands that call into forge/core/. Each command is a separate file for easy testing and extension.
Contributing
git clone https://github.com/your-org/slm-pipeline
cd slm-pipeline
uv pip install -e ".[dev]"
pytest tests/ -q # 100+ tests, all mocked (no GPU, no API key needed)
Contributions welcome. Please add tests for any new command or core module change.
License
Apache 2.0 — see LICENSE.