
forge — Self-Improving SLM Training Platform

Turn any directory into a fine-tuning project that trains, evaluates, and improves itself toward a target accuracy.

forge is a CLI tool that wraps the full small-language-model training lifecycle — data generation, training, evaluation, promotion, and augmentation — into a single autonomous loop called the flywheel. You define the task; forge does the rest.


Quick Start

pip install slm-forge          # or: uv pip install slm-forge
forge init my-task             # scaffold a new project
cd my-task
# edit BACKGROUND.md and mission.md, add examples to dataset/gold.jsonl
forge flywheel --iters 10      # start the autonomous loop

Running on a Remote GPU (via SSH)

forge is designed to run on any CUDA machine. For a remote GPU host accessed via SSH:

1. SSH in and activate the project

ssh -i ~/.ssh/id_ed25519 user@your-gpu-host
cd /path/to/your-forge-project
source .venv/bin/activate

2. Run in a detached tmux session

# Launch flywheel in background (survives SSH disconnect)
tmux new-session -d -s forge-run \
  'forge flywheel --iters 10 2>&1 | tee output/flywheel.log'

# Attach to watch live output
tmux attach -t forge-run
# Detach without killing: Ctrl-B then D

# Tail log without attaching
tail -f output/flywheel.log

3. Monitor progress

# Quick accuracy check from last eval
cat output/eval_results.json | python3 -c \
  "import sys,json; r=json.load(sys.stdin); print(f\"{r['accuracy']:.1%} ({r['correct']}/{r['total']})\")"

# Per-command breakdown
cat output/eval_results.json | python3 -c "
import sys, json
r = json.load(sys.stdin)
print(f'Accuracy: {r[\"accuracy\"]:.1%}  Failures: {r[\"failure_count\"]}')
for cmd, d in sorted(r['per_cmd'].items()):
    pct = d['correct']/d['total']*100 if d['total'] else 0
    bar = '✅' if pct == 100 else ('⚠️' if pct >= 80 else '❌')
    print(f'  {bar} {cmd:20s} {d[\"correct\"]}/{d[\"total\"]} ({pct:.0f}%)')
"

# Watch heartbeat file (updated after every iteration)
watch -n 10 cat output/flywheel_heartbeat.json

4. Push a manual dataset patch mid-run

# From your local machine: copy patch file to the GPU host
scp -i ~/.ssh/id_ed25519 /tmp/patch.jsonl user@your-gpu-host:/path/to/project/dataset/

# On the GPU host: merge and bump version
python3 -c "
import json
base = [json.loads(l) for l in open('dataset/canonical.jsonl')]
patch = [json.loads(l) for l in open('dataset/patch.jsonl')]
existing = {next(m['content'] for m in e['messages'] if m['role']=='user') for e in base}
new = [e for e in patch if next(m['content'] for m in e['messages'] if m['role']=='user') not in existing]
print(f'Net new: {len(new)}')
with open('dataset/canonical.jsonl', 'a') as f:
    for e in new: f.write(json.dumps(e) + '\n')
"
# Then retrain: forge train && forge eval

How It Works

The flywheel is a closed loop. Each iteration, Claude acts as an ML experiment planner: it reads what has been tried, hypothesizes what might work better, patches config, generates targeted training data, trains the model, and evaluates it. If the model improved, the adapter is promoted. Then it commits everything to git and repeats.

┌─────────────────────────────────────────────────────────────┐
│                     forge flywheel loop                      │
│                                                             │
│  ┌───────┐    ┌───────┐    ┌──────┐    ┌─────────┐         │
│  │ plan  │───▶│ train │───▶│ eval │───▶│ promote │         │
│  └───────┘    └───────┘    └──────┘    └────┬────┘         │
│      ▲                                       │              │
│      │         ┌─────────┐    ┌────────┐     │              │
│      └─────────│  commit │◀───│augment │◀────┘              │
│                └─────────┘    └────────┘                    │
│                                                             │
│  Stops when: target_accuracy reached OR max_iterations hit  │
└─────────────────────────────────────────────────────────────┘

Step by step:

  1. Plan — Claude reads BACKGROUND.md, AGENT_CONTEXT.md, PLAN.md, and recent eval failures. It returns an ExperimentPlan with: a hypothesis, config patches to try this iteration, new background knowledge to append, and an augmentation focus.
  2. Train — forge applies config patches and trains the LoRA adapter for the planned number of epochs.
  3. Eval — forge evaluates the adapter against dataset/gold.jsonl. No cheating: gold data is never used in training.
  4. Promote — if accuracy improved, the adapter is snapshotted to output/best_adapter_<acc>/.
  5. Augment — Claude generates targeted training examples focused on the eval failures.
  6. Commit — git commit captures the full state: data, config, context files, and adapter snapshot.
  7. Repeat — AGENT_CONTEXT.md and PLAN.md are updated, and the loop restarts.
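
The loop above can be sketched in a few lines of Python. This is a simplified, hypothetical skeleton (the step functions are stand-ins, not forge's actual internals):

```python
# Hypothetical, simplified skeleton of the flywheel loop — not the actual
# forge implementation. The step callables stand in for the real stages.
def run_flywheel(target_accuracy, max_iterations, plan, train, evaluate,
                 promote, augment, commit):
    best = 0.0
    for i in range(1, max_iterations + 1):
        experiment = plan()              # Claude proposes hypothesis + patches
        train(experiment)                # train LoRA with patched config
        accuracy, failures = evaluate()  # score against the gold set
        if accuracy > best:              # promote only on improvement
            best = accuracy
            promote(accuracy)
        augment(failures)                # targeted examples for next round
        commit(i, accuracy)              # git commit the full state
        if accuracy >= target_accuracy:  # first stop condition
            break
    return best
```

The two stop conditions from the diagram map to the `break` on reaching `target_accuracy` and the bounded `range` over `max_iterations`.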

Installation

From PyPI

uv pip install slm-forge
# or
pip install slm-forge

From source

git clone https://github.com/your-org/slm-pipeline
cd slm-pipeline
uv pip install -e .

Requirements

Requirement        Notes
Python 3.10+       3.11 recommended
CUDA GPU           16 GB+ VRAM recommended (24 GB for larger models)
Unsloth            For fast LoRA training — pip install unsloth
ANTHROPIC_API_KEY  For datagen and planning — set in environment

No GPU for experimentation? You can still run forge generate, forge audit, and forge status on CPU. Only forge train and forge eval need CUDA.


Project Structure

Running forge init my-task creates:

my-task/
├── forge.yaml              # All config: model, LoRA, training, eval, flywheel
├── BACKGROUND.md           # Domain knowledge for Claude datagen (edit this!)
├── mission.md              # Task description for Claude datagen
├── system_prompt.md        # Exact system prompt used at train + eval time
├── llm.txt                 # Briefing doc for any AI agent starting fresh
├── .gitignore              # Ignores output/ except best_adapter/ and best_score.json
│
├── dataset/
│   ├── seed.jsonl          # Your hand-crafted seed examples (optional)
│   ├── gold.jsonl          # LOCKED eval set — never train on this
│   └── canonical.jsonl     # Training data (forge generates + audits this)
│
└── output/                 # Created at runtime
    ├── adapter/            # Current training output
    ├── best_adapter_*/     # Snapshots of best adapters (one per improvement)
    ├── best_score.json     # Best accuracy achieved so far
    ├── experiments.jsonl   # Full log of every iteration
    ├── flywheel_heartbeat.json  # Live status for monitoring
    └── flywheel.log        # Verbose run log

Auto-created during flywheel:

├── AGENT_CONTEXT.md        # Auto-updated iteration history (scores, failures)
└── PLAN.md                 # Claude's current strategy and hypotheses

Key Files Explained

forge.yaml — All Configuration

The single source of truth for everything. Here is a fully-annotated example:

name: my-task              # project name (informational)

model:
  base: unsloth/functiongemma-270m-it  # HuggingFace model ID to fine-tune
  max_seq_len: 2048        # maximum token sequence length
  load_in_4bit: true       # 4-bit quantization (saves VRAM, recommended)

lora:
  r: 32                    # LoRA rank — higher = more capacity, slower training
  alpha: 32                # LoRA alpha — usually set equal to r
  dropout: 0.05            # LoRA dropout regularization
  target_modules:          # which linear layers to train (model-specific)
    - q_proj
    - k_proj
    - v_proj
    - o_proj
    - gate_proj
    - up_proj
    - down_proj

training:
  epochs_min: 2            # minimum epochs to run (even if loss converges early)
  epochs_max: 4            # maximum epochs before stopping
  batch_size: 4            # examples per gradient step
  gradient_accumulation_steps: 1  # effective batch = batch_size × this
  learning_rate: 5.0e-5    # AdamW learning rate
  warmup_ratio: 0.1        # fraction of steps for LR warmup
  weight_decay: 0.01       # L2 regularization
  lr_scheduler: cosine     # cosine | linear | constant
  seed: 42                 # random seed for reproducibility

eval:
  target_accuracy: 0.95    # stop the flywheel when this accuracy is reached
  scorer: json_cmd         # exact | json_cmd | custom
  max_new_tokens: 256      # max tokens for model output during eval
  forbidden_commands: []   # outputs that are always wrong (safety)

dataset:
  seed: dataset/seed.jsonl      # seed examples (copied into canonical on first run)
  gold: dataset/gold.jsonl      # locked eval set
  canonical: dataset/canonical.jsonl  # training data (grows each iteration)

promotion:
  hf_repo: null            # HuggingFace repo to push to (e.g. my-org/my-model)
  private: true            # push as private repo

flywheel:
  max_iterations: 10       # max autonomous iterations before stopping
  augment_per_failure: 20  # synthetic examples to generate per failure category
  datagen_model: claude-sonnet-4-6   # Claude model for data generation
  planner_model: claude-sonnet-4-6   # Claude model for experiment planning

compute:
  device: cuda:0           # which GPU to use (cuda:0, cuda:1, cpu)
  remote: null             # e.g. ssh://user@host for remote execution

BACKGROUND.md — Domain Knowledge for Datagen

This is the most important file for data quality. Claude reads it in full (never truncated) before generating every training example. Think of it as your prompt engineering for synthetic data.

See Writing Good BACKGROUND.md below.

What goes here:

  • The exact task description
  • The required output format with a complete example
  • Every edge case you know about (with correct answers)
  • Common failure patterns you've observed
  • What makes a good vs bad training example

What does NOT go here:

  • Iteration history, scores, or run logs — those belong in AGENT_CONTEXT.md (auto-managed)
  • Speculation or hypotheses — those belong in PLAN.md (auto-managed)

system_prompt.md — The Model's System Prompt

The exact system prompt that is injected at both training time and eval time. This is the single source of truth — train.py, eval.py, and datagen all load from this file.

Edit with care. Changing this file mid-training is a major disruption because the model was trained on a different prompt.
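
Since train.py, eval.py, and datagen all load this one file, a simple guard against accidental mid-run edits is to record a content hash before a run and compare later. A hypothetical sketch (forge does not do this itself):

```python
import hashlib

# Hypothetical guard: fingerprint system_prompt.md before a training run,
# then compare the fingerprint later to detect accidental mid-run edits.
def prompt_fingerprint(path: str = "system_prompt.md") -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()[:12]
```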

mission.md — Task Description for Claude

A natural language description of the task, fed to Claude when generating training data. Unlike BACKGROUND.md (which is highly structured), mission.md is free-form prose explaining why the task matters, what the inputs look like, and any constraints Claude should respect.

BACKGROUND.md vs AGENT_CONTEXT.md

                  BACKGROUND.md             AGENT_CONTEXT.md
Author            You (the human)           forge (auto-managed)
Content           Static domain knowledge   Dynamic iteration history
Update frequency  Rarely, manually          After every eval
Truncated?        Never                     Yes (keeps most recent)
Purpose           Ground truth for datagen  Context for planning

BACKGROUND.md should be written once and kept accurate. AGENT_CONTEXT.md is a running log of scores, failures, and hypotheses — never edit it manually.


CLI Reference

forge init <name>

Scaffold a new forge project.

forge init my-router
forge init my-router --dir /path/to/existing-dir
forge init my-router --from /path/to/old-project  # migrate from llamadrone format

forge generate

Generate synthetic training data via batched Claude API calls. Fires all batches concurrently, validates each example (assistant must be valid JSON, user must be non-empty), and deduplicates against existing canonical.jsonl.

forge generate                          # 500 examples, batch size 50
forge generate --n 200 --batch-size 25  # 200 examples in batches of 25
forge generate --n 1000 --output dataset/extra.jsonl
forge generate -p /path/to/project      # specify project dir explicitly

forge audit [--fix]

Claude reviews every example in canonical.jsonl for format errors, factual mistakes, and quality issues. With --fix, it rewrites bad examples in place.

forge audit              # print report, no changes
forge audit --fix        # fix errors automatically
forge audit --fix --min-severity warning  # fix warnings too

Audit report columns: index, severity (error/warning/info/ok), issue, suggested_fix.

forge train [--epochs N]

Train the LoRA adapter on dataset/canonical.jsonl. Output goes to output/adapter/.

forge train              # use epochs from forge.yaml
forge train --epochs 3   # override epoch count

forge eval

Evaluate the current adapter against dataset/gold.jsonl. Prints accuracy and writes failures to output/eval_failures.jsonl.

forge eval
forge eval -p /path/to/project

forge flywheel --iters N

Run the full autonomous loop. This is the main command.

forge flywheel                     # run until target_accuracy or max_iterations
forge flywheel --iters 5           # run exactly 5 iterations
forge flywheel --skip-train        # skip training on first iteration (eval existing adapter)
forge flywheel --iters 10 -p ./my-task

forge status

Show current project state: best accuracy, dataset sizes, last iteration, heartbeat status.

forge status
forge status -p /path/to/project

forge promote

Manually snapshot the current adapter to output/best_adapter_<timestamp>/. The flywheel does this automatically when accuracy improves.

forge promote

forge push

Upload the best adapter to HuggingFace (requires hf_repo in forge.yaml and HF_TOKEN).

forge push
HF_TOKEN=hf_xxx forge push

forge clean

Remove all augmented (flywheel-generated) examples from canonical.jsonl, keeping only your seed examples and hand-crafted data. Useful when starting a fresh training run.

forge clean
forge clean --keep-seed   # keep seed.jsonl examples only

forge patch "description"

One-shot targeted data generation based on a plain-English description of what to generate. Useful for quickly patching gaps without running the full flywheel.

forge patch "more examples where the input contains coordinates as lat/lon floats"
forge patch "edge cases where the user says 'abort' vs 'cancel'" --n 50

forge augment

Generate training data targeted at the most recent eval failures (reads output/eval_failures.jsonl). The flywheel does this automatically; use this command for manual augmentation.

forge augment
forge augment --n 100    # generate 100 examples

The Autonomous Flywheel

The flywheel is forge's key innovation. Each iteration, Claude acts as an ML experiment planner with full context of everything that has been tried.

What Claude Reads Each Iteration

  1. BACKGROUND.md — full domain knowledge (never truncated)
  2. AGENT_CONTEXT.md — last ~5000 chars of iteration history (scores, failure taxonomy, what worked)
  3. PLAN.md — current strategy, hypotheses queue, what has worked/failed
  4. Live failures — the exact inputs, expected outputs, and actual outputs from the latest eval

What Claude Returns: ExperimentPlan

{
  "hypothesis": "The model confuses 'hover' with 'loiter'. Generating 50 targeted examples should fix this.",
  "config_patches": {
    "training": {"learning_rate": 3e-5, "epochs_max": 5}
  },
  "forge_yaml_patches": {
    "lora": {"r": 64}
  },
  "background_additions": "## Discovered Pattern\n\n'hover' means maintain altitude in place. 'loiter' means circle a point. These are different commands.",
  "augment_focus": "Examples distinguishing hover vs loiter commands with varied phrasing",
  "augment_n": 60
}
  • hypothesis — plain English description of what we think will help
  • config_patches — temporary config overrides for this iteration only (not written to disk)
  • forge_yaml_patches — permanent config changes written to forge.yaml on disk
  • background_additions — new domain knowledge appended to BACKGROUND.md
  • augment_focus — what kind of examples to generate this iteration
  • augment_n — how many examples to generate
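
A minimal parser for this structure might look like the following. This is a hypothetical sketch: the field names come from the JSON above, but the defaults for missing fields are assumptions, not forge's documented behavior:

```python
import json
from dataclasses import dataclass, field

# Hypothetical sketch of parsing an ExperimentPlan from Claude's JSON reply.
# Defaults for omitted fields are assumptions, not forge's actual behavior.
@dataclass
class ExperimentPlan:
    hypothesis: str
    config_patches: dict = field(default_factory=dict)
    forge_yaml_patches: dict = field(default_factory=dict)
    background_additions: str = ""
    augment_focus: str = ""
    augment_n: int = 0

def parse_plan(raw: str) -> ExperimentPlan:
    data = json.loads(raw)
    if "hypothesis" not in data:
        raise ValueError("ExperimentPlan must include a hypothesis")
    # Drop unknown keys so a slightly different reply doesn't crash the loop.
    known = ExperimentPlan.__dataclass_fields__
    return ExperimentPlan(**{k: v for k, v in data.items() if k in known})
```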

After Each Iteration

  1. AGENT_CONTEXT.md is updated with the new score, hypothesis result, and failure counts
  2. PLAN.md is updated with the new strategy and hypotheses queue
  3. Git commit: forge iter N: 87.3% — hypothesis text (first 60 chars)
  4. Heartbeat written to output/flywheel_heartbeat.json

Heartbeat File

{
  "iteration": 4,
  "status": "running",
  "accuracy": 0.873,
  "timestamp": "2025-01-15T14:23:01Z"
}

Status values: running, completed, error. Check this file to monitor long runs without attaching to the process.
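
Because the heartbeat carries a timestamp, a stalled run can be detected by comparing it against the clock. A hypothetical sketch (the 30-minute threshold is an assumption, not a forge default):

```python
import json
from datetime import datetime, timezone

# Hypothetical staleness check for the heartbeat file: a run that still says
# "running" but hasn't written a heartbeat for max_age_s seconds is probably
# stuck or dead.
def heartbeat_is_stale(path, max_age_s=1800, now=None):
    with open(path) as f:
        hb = json.load(f)
    ts = datetime.fromisoformat(hb["timestamp"].replace("Z", "+00:00"))
    now = now or datetime.now(timezone.utc)
    age = (now - ts).total_seconds()
    return hb["status"] == "running" and age > max_age_s
```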


forge.yaml Reference

Field                                 Type    Default                        Description
model.base                            string  unsloth/functiongemma-270m-it  HuggingFace model ID
model.max_seq_len                     int     2048                           Max token sequence length
model.load_in_4bit                    bool    true                           4-bit quantization
lora.r                                int     32                             LoRA rank
lora.alpha                            int     32                             LoRA alpha (usually = r)
lora.dropout                          float   0.05                           LoRA dropout
lora.target_modules                   list    [q_proj, k_proj, ...]          Layers to train
training.epochs_min                   int     2                              Minimum epochs
training.epochs_max                   int     4                              Maximum epochs
training.batch_size                   int     4                              Examples per gradient step
training.gradient_accumulation_steps  int     1                              Effective batch multiplier
training.learning_rate                float   5.0e-5                         AdamW learning rate
training.warmup_ratio                 float   0.1                            Fraction of steps for LR warmup
training.weight_decay                 float   0.01                           L2 regularization
training.lr_scheduler                 string  cosine                         LR schedule (cosine/linear/constant)
training.seed                         int     42                             Random seed
eval.target_accuracy                  float   0.95                           Stop flywheel when reached
eval.scorer                           string  json_cmd                       Scoring method (see below)
eval.max_new_tokens                   int     256                            Max output tokens during eval
eval.forbidden_commands               list    []                             Outputs that are always wrong
dataset.seed                          path    dataset/seed.jsonl             Seed examples
dataset.gold                          path    dataset/gold.jsonl             Eval set (locked)
dataset.canonical                     path    dataset/canonical.jsonl        Training data
promotion.hf_repo                     string  null                           HuggingFace push target
promotion.private                     bool    true                           Push as private repo
flywheel.max_iterations               int     10                             Max autonomous iterations
flywheel.augment_per_failure          int     20                             Examples per failure category
flywheel.datagen_model                string  claude-sonnet-4-6              Claude model for datagen
flywheel.planner_model                string  claude-sonnet-4-6              Claude model for planning
compute.device                        string  cuda:0                         Training device
compute.remote                        string  null                           SSH remote (e.g. ssh://user@host)

Scorer types:

Scorer    Description
exact     Exact string match (case-sensitive)
json_cmd  Parses JSON, checks cmd field only
custom    Calls scorer.py in project root with (expected, actual) -> float
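
The json_cmd behavior described above can be approximated in a few lines. A hypothetical sketch, not forge's actual scorer:

```python
import json

# Hypothetical sketch of a json_cmd-style scorer: parse both sides as JSON
# and compare only the "cmd" field. Unparseable model output scores 0.
def score_json_cmd(expected: str, actual: str) -> float:
    try:
        exp = json.loads(expected)
        act = json.loads(actual)
    except (json.JSONDecodeError, TypeError):
        return 0.0
    return 1.0 if exp.get("cmd") == act.get("cmd") else 0.0
```

Note the asymmetry with the exact scorer: here extra or differing args and confidence fields are ignored, so only command routing is graded.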

Writing Good BACKGROUND.md

BACKGROUND.md is the most important document in your project. Claude reads it in full before generating every training example. The better this file, the more realistic and diverse your training data.

Here is a complete example for a drone command routing task:

# Background — drone-router

## Task

Route incoming natural language user messages to one of 8 drone commands.
Return a structured JSON object with the command name, arguments, and confidence.

Commands (exact names — no synonyms):
- `takeoff` — lift off from current position; args: {altitude_m: float}
- `land` — descend and land; args: {}
- `goto_waypoint` — fly to a location; args: {grid: str} OR {lat: float, lon: float}
- `rtl` — return to launch point; args: {}
- `hover` — maintain current position and altitude; args: {duration_s: int | null}
- `loiter` — circle a point; args: {radius_m: float, duration_s: int | null}
- `ascend` — increase altitude; args: {delta_m: float}
- `descend` — decrease altitude; args: {delta_m: float}
- `unknown` — input doesn't map to a valid command; args: {}

## Output Format

Always return valid JSON. Never return plain text, never include prose before or after.

Schema:
{
  "cmd": "<command_name>",
  "args": { ... },
  "confidence": <0.0–1.0>
}

Correct example:
Input:  "Fly to waypoint B4"
Output: {"cmd": "goto_waypoint", "args": {"grid": "B4"}, "confidence": 0.97}

Wrong (missing args key):
Output: {"cmd": "goto_waypoint", "confidence": 0.97}

Wrong (invalid cmd name):
Output: {"cmd": "fly_to", "args": {"grid": "B4"}, "confidence": 0.97}

Wrong (plain text):
Output: The drone should go to B4.

## Edge Cases

- "Return home" / "go home" / "RTL" → cmd:rtl (NOT goto_waypoint)
- "Hover" / "hold position" / "stay here" → cmd:hover (NOT loiter)
- "Circle the area" / "loiter over" → cmd:loiter (NOT hover)
- Coordinates can be grid refs ("B4") or lat/lon: "37.7749,-122.4194"
- "Go up 10 meters" → cmd:ascend, args.delta_m:10.0
- "Go up" with no distance → cmd:ascend, args.delta_m:5.0 (default)
- Empty or gibberish input → cmd:unknown, confidence:0.0
- Multiple commands in one message → pick the primary intent
- "Land at base" vs "Land now" → both cmd:land, args:{} (base is default)

## Common Failure Patterns

- Model confuses `hover` and `loiter` — they are DIFFERENT commands
- Drops `args` key entirely when no args needed — must always include args: {}
- Outputs plain text when input contains a question mark
- Hallucinates command names: "ascend_fast", "fly_to", "go_to" — ALL INVALID
- Omits `confidence` field when input is ambiguous
- Returns confidence:1.0 for ambiguous inputs (should be 0.5–0.7)

## What Good Examples Look Like

Input variety:
- Very short: "up", "land", "B4"
- Casual: "take it up a bit", "bring it back"
- Formal: "initiate return-to-launch sequence"
- Ambiguous: "go higher" (ascend), "stay there" (hover)
- With noise: "uh, like, go to... B4 please?"

Output correctness:
- cmd is always from the exact allowed list (case-sensitive)
- args always present, even if {}
- confidence reflects actual certainty (not always 1.0)

Difficulty range:
- 30% easy (unambiguous, direct)
- 50% medium (some inference needed)
- 20% hard (edge cases, ambiguity, unusual phrasing)

Eval Set Design

The eval set (dataset/gold.jsonl) is the only objective measure of progress. Treat it like a test suite.

Golden rules:

  1. Never train on it. forge enforces this — gold examples are never included in canonical.jsonl.
  2. Lock it early. Create your gold set before any training and never modify it. If you keep adding to it, your accuracy numbers aren't comparable across iterations.
  3. Make it representative. Your gold set should cover all difficulty levels and edge cases, not just easy examples.
  4. Aim for 50–200 examples. Too few → noisy accuracy. Too many → slow eval loop.
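
The noise claim in rule 4 is easy to quantify with the binomial standard error (a rough approximation that assumes independent examples):

```python
import math

# Rough standard error of measured accuracy on an n-example gold set
# (binomial approximation). At 90% true accuracy, a 50-example set gives
# roughly ±4 points of eval-to-eval noise; 200 examples roughly halves it.
def accuracy_stderr(p: float, n: int) -> float:
    return math.sqrt(p * (1.0 - p) / n)
```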

How to create a gold set:

# Start with your best hand-crafted examples
# Format: one JSON per line, messages format
echo '{"messages":[{"role":"user","content":"Fly to B4"},{"role":"assistant","content":"{\"cmd\":\"goto_waypoint\",\"args\":{\"grid\":\"B4\"},\"confidence\":0.97}"}]}' >> dataset/gold.jsonl

# Or use forge generate to generate candidates, then hand-review
forge generate --n 200 --output /tmp/candidates.jsonl
# Review /tmp/candidates.jsonl, pick the ones you trust, add to gold.jsonl

Format:

{"messages": [{"role": "user", "content": "<input>"}, {"role": "assistant", "content": "<expected output>"}]}

Monitoring Long Runs

A flywheel run can take hours. Here's how to monitor without babysitting it.

Heartbeat File

watch -n 10 cat output/flywheel_heartbeat.json
{
  "iteration": 7,
  "status": "running",
  "accuracy": 0.912,
  "timestamp": "2025-01-15T14:23:01Z"
}

Experiment Log

output/experiments.jsonl — one record per iteration with full metadata:

# Show accuracy progression
cat output/experiments.jsonl | python3 -c "
import sys, json
for line in sys.stdin:
    r = json.loads(line)
    print(f\"iter {r['iteration']}: {r['accuracy']:.1%} — {r['hypothesis'][:60]}\")
"

Agent Context

AGENT_CONTEXT.md — human-readable log of scores, failure patterns, and hypotheses. Updated after every eval.

tail -100 AGENT_CONTEXT.md

Git Log

Every iteration is committed. git log --oneline shows the full history at a glance:

a1b2c3d forge iter 8: 93.2% — Increasing LoRA rank to 64 for better capacity
b2c3d4e forge iter 7: 91.5% — Targeting hover/loiter confusion with 60 new examples
c3d4e5f forge iter 6: 88.1% — Reducing LR after oscillation detected
...

Crash Recovery

  1. Check output/flywheel_heartbeat.json — status will be "error" with an error field
  2. Check output/flywheel.log for the traceback
  3. Run forge status to see last known accuracy
  4. Fix the issue (bad data, OOM, etc.)
  5. Resume:
    forge flywheel --iters 5 --skip-train  # eval existing adapter first
    # or
    forge flywheel --iters 5               # retrain from scratch
    

Examples

forge-math-demo Walkthrough

A working example lives in examples/forge-math-demo/. It trains a model to answer arithmetic questions as structured JSON.

cd examples/forge-math-demo

# Look at what's pre-populated
cat BACKGROUND.md        # describes the math→JSON task
cat dataset/gold.jsonl   # 20 held-out eval examples

# Run the flywheel (no GPU needed for demo — uses cpu + tiny model)
forge flywheel --iters 3

# Check what happened
forge status
git log --oneline
cat output/experiments.jsonl | python3 -m json.tool | head -40

Expected output after 3 iterations:

forge status
─────────────────────────────────────
  Project:    forge-math-demo
  Best score: 0.850 (iter 2)
  Last score: 0.850
  Dataset:    142 training / 20 gold examples
  Iterations: 3 / 10
─────────────────────────────────────

Domain Knowledge Workflow (Pretrain → Probe → Fine-tune)

For tasks that require deep domain knowledge — medical, legal, scientific, or proprietary — you can pretrain the base model on raw documents before task fine-tuning. This "smart base" absorbs domain knowledge first, making subsequent LoRA fine-tuning much more effective.

┌──────────────────────────────────────────────────────────────┐
│              Domain Knowledge Acquisition Pipeline            │
│                                                              │
│  ┌──────────┐   ┌───────────┐   ┌──────────┐   ┌─────────┐  │
│  │ probe    │   │ pretrain  │   │  probe   │   │flywheel │  │
│  │ --tag pre│──▶│ (corpus)  │──▶│ --tag    │──▶│(on smart│  │
│  │ baseline │   │ + merge   │   │ post     │   │  base)  │  │
│  └──────────┘   └───────────┘   └──────────┘   └─────────┘  │
│       │              │               │                        │
│  Scores 30 Qs   Causal LM on    Scores same    Task LoRA on  │
│  before domain  raw text docs   30 Qs after    merged base   │
│  training       (no chat fmt)   training       model         │
└──────────────────────────────────────────────────────────────┘

Step-by-step

1. Create probe questions (dataset/probe.jsonl):

{"question": "What is the half-life of carbon-14?", "ideal": "Approximately 5,730 years."}
{"question": "What does HIPAA regulate?", "ideal": "The privacy and security of protected health information (PHI)."}

30 questions is a good number — enough to measure meaningful delta, fast enough to run twice.

2. Baseline probe (run once — permanent record):

forge probe --tag pre

3. Add domain corpus to dataset/corpus/:

dataset/corpus/
├── technical_manual.pdf.txt    # convert PDFs externally
├── domain_docs.md
└── training_data.jsonl         # {"text": "..."}
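
Plain-text documents can be packed into the {"text": ...} JSONL shape with a short helper. A hypothetical sketch (the chunk size is an assumption; tune it to your model's context length):

```python
import json
from pathlib import Path

# Hypothetical helper: pack .txt/.md files into {"text": ...} JSONL records,
# splitting long documents into fixed-size character chunks. The 4000-char
# default is a guess, not a forge setting.
def pack_corpus(src_dir: str, out_path: str, chunk_chars: int = 4000) -> int:
    n = 0
    with open(out_path, "w") as out:
        for path in sorted(Path(src_dir).glob("*")):
            if path.suffix not in {".txt", ".md"}:
                continue
            text = path.read_text()
            for i in range(0, len(text), chunk_chars):
                out.write(json.dumps({"text": text[i:i + chunk_chars]}) + "\n")
                n += 1
    return n
```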

4. Pretrain (causal LM on raw text, auto-merges to 16-bit):

forge pretrain                  # uses corpus_dir from forge.yaml
forge pretrain --epochs 2       # more epochs for smaller corpora

Output: output/pretrain_merged/ — this is your new base model.

5. Post-pretrain probe:

forge probe --tag post --adapter output/pretrain_merged
forge probe --compare           # shows delta table

6. Update forge.yaml to use the merged model:

model:
  base: output/pretrain_merged   # was: unsloth/functiongemma-270m-it

7. Run task fine-tuning on the smarter base:

forge flywheel --iters 10

Key rules

  • Never fine-tune LoRA on top of the pretrain LoRA adapter — always merge first (LoRA-on-LoRA = catastrophic forgetting)
  • The merge uses Unsloth's save_pretrained_merged with merged_16bit — avoids the Gemma3 tokenizer breakage from AutoModelForCausalLM
  • Probe questions must stay fixed after the pre-probe run — changing them breaks the comparison
  • Pretrain LR (2e-5) is intentionally lower than fine-tune LR (5e-5) — domain absorption needs gentle updates

Architecture

Each module in forge/core/ has a single responsibility:

Module                            Description
forge/core/config.py              Load and validate forge.yaml. Handles defaults, field merging, and project root discovery by walking up from cwd.
forge/core/trainer.py             LoRA fine-tuning via Unsloth. Wraps FastLanguageModel, applies training config, saves adapter to output/adapter/.
forge/core/eval_runner.py         Run model inference on gold.jsonl, score each example via the configured scorer, return accuracy + failure list.
forge/core/scorer.py              Scoring strategies: exact (string match), json_cmd (parse JSON, check cmd field), custom (call scorer.py).
forge/core/augmentor.py           Build Claude prompts from BACKGROUND.md + AGENT_CONTEXT.md + failures, call Claude API, parse and append to canonical.jsonl.
forge/core/planner.py             Maintain AGENT_CONTEXT.md and PLAN.md. Call Claude with full context to get ExperimentPlan. Apply forge_yaml_patches to disk.
forge/core/experiment_planner.py  ExperimentPlan dataclass and JSON parsing/validation.
forge/core/dataset.py             JSONL loading, deduplication, seed merging, train/gold split helpers.
forge/core/context.py             Read/write AGENT_CONTEXT.md: append iteration summaries, truncate to MAX_CONTEXT_CHARS.

CLI layer (forge/commands/): thin Click commands that call into forge/core/. Each command is a separate file for easy testing and extension.


Contributing

git clone https://github.com/your-org/slm-pipeline
cd slm-pipeline
uv pip install -e ".[dev]"
pytest tests/ -q           # 100+ tests, all mocked (no GPU, no API key needed)

Contributions welcome. Please add tests for any new command or core module change.


License

Apache 2.0 — see LICENSE.
