
forge — Self-Improving SLM Training Platform

Turn any directory into a fine-tuning project that trains, evaluates, and improves itself toward a target accuracy.

forge is a CLI tool that wraps the full small-language-model training lifecycle — data generation, training, evaluation, promotion, and augmentation — into a single autonomous loop called the flywheel. You define the task; forge does the rest.


Quick Start

pip install slm-forge          # or: uv pip install slm-forge
forge init my-task             # scaffold a new project
cd my-task
# edit BACKGROUND.md and mission.md, add examples to dataset/gold.jsonl
forge flywheel --iters 10      # start the autonomous loop

Running on a Remote GPU (via SSH)

forge is designed to run on any CUDA machine. For a remote GPU host accessed via SSH:

1. SSH in and activate the project

ssh -i ~/.ssh/id_ed25519 user@your-gpu-host
cd /path/to/your-forge-project
source .venv/bin/activate

2. Run in a detached tmux session

# Launch flywheel in background (survives SSH disconnect)
tmux new-session -d -s forge-run \
  'forge flywheel --iters 10 2>&1 | tee output/flywheel.log'

# Attach to watch live output
tmux attach -t forge-run
# Detach without killing: Ctrl-B then D

# Tail log without attaching
tail -f output/flywheel.log

3. Monitor progress

# Quick accuracy check from last eval
cat output/eval_results.json | python3 -c \
  "import sys,json; r=json.load(sys.stdin); print(f\"{r['accuracy']:.1%} ({r['correct']}/{r['total']})\")"

# Per-command breakdown
cat output/eval_results.json | python3 -c "
import sys, json
r = json.load(sys.stdin)
print(f'Accuracy: {r[\"accuracy\"]:.1%}  Failures: {r[\"failure_count\"]}')
for cmd, d in sorted(r['per_cmd'].items()):
    pct = d['correct']/d['total']*100 if d['total'] else 0
    bar = '✅' if pct == 100 else ('⚠️' if pct >= 80 else '❌')
    print(f'  {bar} {cmd:20s} {d[\"correct\"]}/{d[\"total\"]} ({pct:.0f}%)')
"

# Watch heartbeat file (updated after every iteration)
watch -n 10 cat output/flywheel_heartbeat.json

4. Push a manual dataset patch mid-run

# From your local machine: copy patch file to the GPU host
scp -i ~/.ssh/id_ed25519 /tmp/patch.jsonl user@your-gpu-host:/path/to/project/dataset/

# On the GPU host: merge and bump version
python3 -c "
import json
base = [json.loads(l) for l in open('dataset/canonical.jsonl')]
patch = [json.loads(l) for l in open('dataset/patch.jsonl')]
existing = {next(m['content'] for m in e['messages'] if m['role']=='user') for e in base}
new = [e for e in patch if next(m['content'] for m in e['messages'] if m['role']=='user') not in existing]
print(f'Net new: {len(new)}')
with open('dataset/canonical.jsonl', 'a') as f:
    for e in new: f.write(json.dumps(e) + '\n')
"
# Then retrain: forge train && forge eval

How It Works

The flywheel is a closed loop. Each iteration, Claude acts as an ML experiment planner: it reads what has been tried, hypothesizes what might work better, patches config, generates targeted training data, trains the model, and evaluates it. If the model improved, the adapter is promoted. Then it commits everything to git and repeats.

┌─────────────────────────────────────────────────────────────┐
│                     forge flywheel loop                      │
│                                                             │
│  ┌───────┐    ┌───────┐    ┌──────┐    ┌─────────┐         │
│  │ plan  │───▶│ train │───▶│ eval │───▶│ promote │         │
│  └───────┘    └───────┘    └──────┘    └────┬────┘         │
│      ▲                                       │              │
│      │         ┌─────────┐    ┌────────┐     │              │
│      └─────────│  commit │◀───│augment │◀────┘              │
│                └─────────┘    └────────┘                    │
│                                                             │
│  Stops when: target_accuracy reached OR max_iterations hit  │
└─────────────────────────────────────────────────────────────┘

Step by step:

  1. Plan — Claude reads BACKGROUND.md, AGENT_CONTEXT.md, PLAN.md, and recent eval failures. It returns an ExperimentPlan with: a hypothesis, config patches to try this iteration, new background knowledge to append, and an augmentation focus.
  2. Train — forge applies config patches and trains the LoRA adapter for the planned number of epochs.
  3. Eval — forge evaluates the adapter against dataset/gold.jsonl. No cheating: gold data is never used in training.
  4. Promote — if accuracy improved, the adapter is snapshotted to output/best_adapter_<acc>/.
  5. Augment — Claude generates targeted training examples focused on the eval failures.
  6. Commit — git commit captures the full state: data, config, context files, and adapter snapshot.
  7. Repeat — AGENT_CONTEXT.md and PLAN.md are updated, and the loop restarts.
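
The loop above can be sketched in a few lines of Python. This is a simplified, hypothetical skeleton (the step functions are stand-ins, not forge's actual internals):

```python
# Hypothetical, simplified skeleton of the flywheel loop — not the actual
# forge implementation. The step callables stand in for the real stages.
def run_flywheel(target_accuracy, max_iterations, plan, train, evaluate,
                 promote, augment, commit):
    best = 0.0
    for i in range(1, max_iterations + 1):
        experiment = plan()              # Claude proposes hypothesis + patches
        train(experiment)                # train LoRA with patched config
        accuracy, failures = evaluate()  # score against the gold set
        if accuracy > best:              # promote only on improvement
            best = accuracy
            promote(accuracy)
        augment(failures)                # targeted examples for next round
        commit(i, accuracy)              # git commit the full state
        if accuracy >= target_accuracy:  # first stop condition
            break
    return best
```

The two stop conditions from the diagram map to the `break` on reaching `target_accuracy` and the bounded `range` over `max_iterations`.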

Installation

From PyPI

uv pip install slm-forge
# or
pip install slm-forge

From source

git clone https://github.com/your-org/slm-pipeline
cd slm-pipeline
uv pip install -e .

Requirements

Requirement        Notes
Python 3.10+       3.11 recommended
CUDA GPU           16 GB+ VRAM recommended (24 GB for larger models)
Unsloth            For fast LoRA training — pip install unsloth
ANTHROPIC_API_KEY  For datagen and planning — set in environment

No GPU for experimentation? You can still run forge generate, forge audit, and forge status on CPU. Only forge train and forge eval need CUDA.


Project Structure

Running forge init my-task creates:

my-task/
├── forge.yaml              # All config: model, LoRA, training, eval, flywheel
├── BACKGROUND.md           # Domain knowledge for Claude datagen (edit this!)
├── mission.md              # Task description for Claude datagen
├── system_prompt.md        # Exact system prompt used at train + eval time
├── llm.txt                 # Briefing doc for any AI agent starting fresh
├── .gitignore              # Ignores output/ except best_adapter/ and best_score.json
│
├── dataset/
│   ├── seed.jsonl          # Your hand-crafted seed examples (optional)
│   ├── gold.jsonl          # LOCKED eval set — never train on this
│   └── canonical.jsonl     # Training data (forge generates + audits this)
│
└── output/                 # Created at runtime
    ├── adapter/            # Current training output
    ├── best_adapter_*/     # Snapshots of best adapters (one per improvement)
    ├── best_score.json     # Best accuracy achieved so far
    ├── experiments.jsonl   # Full log of every iteration
    ├── flywheel_heartbeat.json  # Live status for monitoring
    └── flywheel.log        # Verbose run log

Auto-created during flywheel:

├── AGENT_CONTEXT.md        # Auto-updated iteration history (scores, failures)
└── PLAN.md                 # Claude's current strategy and hypotheses

Key Files Explained

forge.yaml — All Configuration

The single source of truth for everything. Here is a fully-annotated example:

name: my-task              # project name (informational)

model:
  base: unsloth/functiongemma-270m-it  # HuggingFace model ID to fine-tune
  max_seq_len: 2048        # maximum token sequence length
  load_in_4bit: true       # 4-bit quantization (saves VRAM, recommended)

lora:
  r: 32                    # LoRA rank — higher = more capacity, slower training
  alpha: 32                # LoRA alpha — usually set equal to r
  dropout: 0.05            # LoRA dropout regularization
  target_modules:          # which linear layers to train (model-specific)
    - q_proj
    - k_proj
    - v_proj
    - o_proj
    - gate_proj
    - up_proj
    - down_proj

training:
  epochs_min: 2            # minimum epochs to run (even if loss converges early)
  epochs_max: 4            # maximum epochs before stopping
  batch_size: 4            # examples per gradient step
  gradient_accumulation_steps: 1  # effective batch = batch_size × this
  learning_rate: 5.0e-5    # AdamW learning rate
  warmup_ratio: 0.1        # fraction of steps for LR warmup
  weight_decay: 0.01       # L2 regularization
  lr_scheduler: cosine     # cosine | linear | constant
  seed: 42                 # random seed for reproducibility

eval:
  target_accuracy: 0.95    # stop the flywheel when this accuracy is reached
  scorer: json_cmd         # exact | json_cmd | custom
  max_new_tokens: 256      # max tokens for model output during eval
  forbidden_commands: []   # outputs that are always wrong (safety)

dataset:
  seed: dataset/seed.jsonl      # seed examples (copied into canonical on first run)
  gold: dataset/gold.jsonl      # locked eval set
  canonical: dataset/canonical.jsonl  # training data (grows each iteration)

promotion:
  hf_repo: null            # HuggingFace repo to push to (e.g. my-org/my-model)
  private: true            # push as private repo

flywheel:
  max_iterations: 10       # max autonomous iterations before stopping
  augment_per_failure: 20  # synthetic examples to generate per failure category
  datagen_model: claude-sonnet-4-6   # Claude model for data generation
  planner_model: claude-sonnet-4-6   # Claude model for experiment planning

compute:
  device: cuda:0           # which GPU to use (cuda:0, cuda:1, cpu)
  remote: null             # e.g. ssh://user@host for remote execution

BACKGROUND.md — Domain Knowledge for Datagen

This is the most important file for data quality. Claude reads it in full (never truncated) before generating every training example. Think of it as your prompt engineering for synthetic data.

See Writing Good BACKGROUND.md below.

What goes here:

  • The exact task description
  • The required output format with a complete example
  • Every edge case you know about (with correct answers)
  • Common failure patterns you've observed
  • What makes a good vs bad training example

What does NOT go here:

  • Iteration history, scores, or run logs — those belong in AGENT_CONTEXT.md (auto-managed)
  • Speculation or hypotheses — those belong in PLAN.md (auto-managed)

system_prompt.md — The Model's System Prompt

The exact system prompt that is injected at both training time and eval time. This is the single source of truth — train.py, eval.py, and datagen all load from this file.

Edit with care. Changing this file mid-training is a major disruption because the model was trained on a different prompt.
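
Since train.py, eval.py, and datagen all load this one file, a simple guard against accidental mid-run edits is to record a content hash before a run and compare later. A hypothetical sketch (forge does not do this itself):

```python
import hashlib

# Hypothetical guard: fingerprint system_prompt.md before a training run,
# then compare the fingerprint later to detect accidental mid-run edits.
def prompt_fingerprint(path: str = "system_prompt.md") -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()[:12]
```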

mission.md — Task Description for Claude

A natural language description of the task, fed to Claude when generating training data. Unlike BACKGROUND.md (which is highly structured), mission.md is free-form prose explaining why the task matters, what the inputs look like, and any constraints Claude should respect.

BACKGROUND.md vs AGENT_CONTEXT.md

                  BACKGROUND.md             AGENT_CONTEXT.md
Author            You (the human)           forge (auto-managed)
Content           Static domain knowledge   Dynamic iteration history
Update frequency  Rarely, manually          After every eval
Truncated?        Never                     Yes (keeps most recent)
Purpose           Ground truth for datagen  Context for planning

BACKGROUND.md should be written once and kept accurate. AGENT_CONTEXT.md is a running log of scores, failures, and hypotheses — never edit it manually.


CLI Reference

forge init <name>

Scaffold a new forge project.

forge init my-router
forge init my-router --dir /path/to/existing-dir
forge init my-router --from /path/to/old-project  # migrate from llamadrone format

forge generate

Generate synthetic training data via batched Claude API calls. Fires all batches concurrently, validates each example (assistant must be valid JSON, user must be non-empty), and deduplicates against existing canonical.jsonl.

forge generate                          # 500 examples, batch size 50
forge generate --n 200 --batch-size 25  # 200 examples in batches of 25
forge generate --n 1000 --output dataset/extra.jsonl
forge generate -p /path/to/project      # specify project dir explicitly

forge audit [--fix]

Claude reviews every example in canonical.jsonl for format errors, factual mistakes, and quality issues. With --fix, it rewrites bad examples in place.

forge audit              # print report, no changes
forge audit --fix        # fix errors automatically
forge audit --fix --min-severity warning  # fix warnings too

Audit report columns: index, severity (error/warning/info/ok), issue, suggested_fix.

forge train [--epochs N]

Train the LoRA adapter on dataset/canonical.jsonl. Output goes to output/adapter/.

forge train              # use epochs from forge.yaml
forge train --epochs 3   # override epoch count

forge eval

Evaluate the current adapter against dataset/gold.jsonl. Prints accuracy and writes failures to output/eval_failures.jsonl.

forge eval
forge eval -p /path/to/project

forge flywheel --iters N

Run the full autonomous loop. This is the main command.

forge flywheel                     # run until target_accuracy or max_iterations
forge flywheel --iters 5           # run exactly 5 iterations
forge flywheel --skip-train        # skip training on first iteration (eval existing adapter)
forge flywheel --iters 10 -p ./my-task

forge status

Show current project state: best accuracy, dataset sizes, last iteration, heartbeat status.

forge status
forge status -p /path/to/project

forge promote

Manually snapshot the current adapter to output/best_adapter_<timestamp>/. The flywheel does this automatically when accuracy improves.

forge promote

forge push

Upload the best adapter to HuggingFace (requires hf_repo in forge.yaml and HF_TOKEN).

forge push
HF_TOKEN=hf_xxx forge push

forge clean

Remove all augmented (flywheel-generated) examples from canonical.jsonl, keeping only your seed examples and hand-crafted data. Useful when starting a fresh training run.

forge clean
forge clean --keep-seed   # keep seed.jsonl examples only

forge patch "description"

One-shot targeted data generation based on a plain-English description of what to generate. Useful for quickly patching gaps without running the full flywheel.

forge patch "more examples where the input contains coordinates as lat/lon floats"
forge patch "edge cases where the user says 'abort' vs 'cancel'" --n 50

forge augment

Generate training data targeted at the most recent eval failures (reads output/eval_failures.jsonl). The flywheel does this automatically; use this command for manual augmentation.

forge augment
forge augment --n 100    # generate 100 examples

The Autonomous Flywheel

The flywheel is forge's key innovation. Each iteration, Claude acts as an ML experiment planner with full context of everything that has been tried.

What Claude Reads Each Iteration

  1. BACKGROUND.md — full domain knowledge (never truncated)
  2. AGENT_CONTEXT.md — last ~5000 chars of iteration history (scores, failure taxonomy, what worked)
  3. PLAN.md — current strategy, hypotheses queue, what has worked/failed
  4. Live failures — the exact inputs, expected outputs, and actual outputs from the latest eval

What Claude Returns: ExperimentPlan

{
  "hypothesis": "The model confuses 'hover' with 'loiter'. Generating 50 targeted examples should fix this.",
  "config_patches": {
    "training": {"learning_rate": 3e-5, "epochs_max": 5}
  },
  "forge_yaml_patches": {
    "lora": {"r": 64}
  },
  "background_additions": "## Discovered Pattern\n\n'hover' means maintain altitude in place. 'loiter' means circle a point. These are different commands.",
  "augment_focus": "Examples distinguishing hover vs loiter commands with varied phrasing",
  "augment_n": 60
}
  • hypothesis — plain English description of what we think will help
  • config_patches — temporary config overrides for this iteration only (not written to disk)
  • forge_yaml_patches — permanent config changes written to forge.yaml on disk
  • background_additions — new domain knowledge appended to BACKGROUND.md
  • augment_focus — what kind of examples to generate this iteration
  • augment_n — how many examples to generate
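
A minimal parser for this structure might look like the following. This is a hypothetical sketch: the field names come from the JSON above, but the defaults for missing fields are assumptions, not forge's documented behavior:

```python
import json
from dataclasses import dataclass, field

# Hypothetical sketch of parsing an ExperimentPlan from Claude's JSON reply.
# Defaults for omitted fields are assumptions, not forge's actual behavior.
@dataclass
class ExperimentPlan:
    hypothesis: str
    config_patches: dict = field(default_factory=dict)
    forge_yaml_patches: dict = field(default_factory=dict)
    background_additions: str = ""
    augment_focus: str = ""
    augment_n: int = 0

def parse_plan(raw: str) -> ExperimentPlan:
    data = json.loads(raw)
    if "hypothesis" not in data:
        raise ValueError("ExperimentPlan must include a hypothesis")
    # Drop unknown keys so a slightly different reply doesn't crash the loop.
    known = ExperimentPlan.__dataclass_fields__
    return ExperimentPlan(**{k: v for k, v in data.items() if k in known})
```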

After Each Iteration

  1. AGENT_CONTEXT.md is updated with the new score, hypothesis result, and failure counts
  2. PLAN.md is updated with the new strategy and hypotheses queue
  3. Git commit: forge iter N: 87.3% — hypothesis text (first 60 chars)
  4. Heartbeat written to output/flywheel_heartbeat.json

Heartbeat File

{
  "iteration": 4,
  "status": "running",
  "accuracy": 0.873,
  "timestamp": "2025-01-15T14:23:01Z"
}

Status values: running, completed, error. Check this file to monitor long runs without attaching to the process.
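
Because the heartbeat carries a timestamp, a stalled run can be detected by comparing it against the clock. A hypothetical sketch (the 30-minute threshold is an assumption, not a forge default):

```python
import json
from datetime import datetime, timezone

# Hypothetical staleness check for the heartbeat file: a run that still says
# "running" but hasn't written a heartbeat for max_age_s seconds is probably
# stuck or dead.
def heartbeat_is_stale(path, max_age_s=1800, now=None):
    with open(path) as f:
        hb = json.load(f)
    ts = datetime.fromisoformat(hb["timestamp"].replace("Z", "+00:00"))
    now = now or datetime.now(timezone.utc)
    age = (now - ts).total_seconds()
    return hb["status"] == "running" and age > max_age_s
```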


forge.yaml Reference

Field                                 Type    Default                        Description
model.base                            string  unsloth/functiongemma-270m-it  HuggingFace model ID
model.max_seq_len                     int     2048                           Max token sequence length
model.load_in_4bit                    bool    true                           4-bit quantization
lora.r                                int     32                             LoRA rank
lora.alpha                            int     32                             LoRA alpha (usually = r)
lora.dropout                          float   0.05                           LoRA dropout
lora.target_modules                   list    [q_proj, k_proj, ...]          Layers to train
training.epochs_min                   int     2                              Minimum epochs
training.epochs_max                   int     4                              Maximum epochs
training.batch_size                   int     4                              Examples per gradient step
training.gradient_accumulation_steps  int     1                              Effective batch multiplier
training.learning_rate                float   5.0e-5                         AdamW learning rate
training.warmup_ratio                 float   0.1                            Fraction of steps for LR warmup
training.weight_decay                 float   0.01                           L2 regularization
training.lr_scheduler                 string  cosine                         LR schedule (cosine/linear/constant)
training.seed                         int     42                             Random seed
eval.target_accuracy                  float   0.95                           Stop flywheel when reached
eval.scorer                           string  json_cmd                       Scoring method (see below)
eval.max_new_tokens                   int     256                            Max output tokens during eval
eval.forbidden_commands               list    []                             Outputs that are always wrong
dataset.seed                          path    dataset/seed.jsonl             Seed examples
dataset.gold                          path    dataset/gold.jsonl             Eval set (locked)
dataset.canonical                     path    dataset/canonical.jsonl        Training data
promotion.hf_repo                     string  null                           HuggingFace push target
promotion.private                     bool    true                           Push as private repo
flywheel.max_iterations               int     10                             Max autonomous iterations
flywheel.augment_per_failure          int     20                             Examples per failure category
flywheel.datagen_model                string  claude-sonnet-4-6              Claude model for datagen
flywheel.planner_model                string  claude-sonnet-4-6              Claude model for planning
compute.device                        string  cuda:0                         Training device
compute.remote                        string  null                           SSH remote (e.g. ssh://user@host)

Scorer types:

Scorer    Description
exact     Exact string match (case-sensitive)
json_cmd  Parses JSON, checks cmd field only
custom    Calls scorer.py in project root with (expected, actual) -> float
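
The json_cmd behavior described above can be approximated in a few lines. A hypothetical sketch, not forge's actual scorer:

```python
import json

# Hypothetical sketch of a json_cmd-style scorer: parse both sides as JSON
# and compare only the "cmd" field. Unparseable model output scores 0.
def score_json_cmd(expected: str, actual: str) -> float:
    try:
        exp = json.loads(expected)
        act = json.loads(actual)
    except (json.JSONDecodeError, TypeError):
        return 0.0
    return 1.0 if exp.get("cmd") == act.get("cmd") else 0.0
```

Note the asymmetry with the exact scorer: here extra or differing args and confidence fields are ignored, so only command routing is graded.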

Writing Good BACKGROUND.md

BACKGROUND.md is the most important document in your project. Claude reads it in full before generating every training example. The better this file, the more realistic and diverse your training data.

Here is a complete example for a drone command routing task:

# Background — drone-router

## Task

Route incoming natural language user messages to one of 8 drone commands.
Return a structured JSON object with the command name, arguments, and confidence.

Commands (exact names — no synonyms):
- `takeoff` — lift off from current position; args: {altitude_m: float}
- `land` — descend and land; args: {}
- `goto_waypoint` — fly to a location; args: {grid: str} OR {lat: float, lon: float}
- `rtl` — return to launch point; args: {}
- `hover` — maintain current position and altitude; args: {duration_s: int | null}
- `loiter` — circle a point; args: {radius_m: float, duration_s: int | null}
- `ascend` — increase altitude; args: {delta_m: float}
- `descend` — decrease altitude; args: {delta_m: float}
- `unknown` — input doesn't map to a valid command; args: {}

## Output Format

Always return valid JSON. Never return plain text, never include prose before or after.

Schema:
{
  "cmd": "<command_name>",
  "args": { ... },
  "confidence": <0.0–1.0>
}

Correct example:
Input:  "Fly to waypoint B4"
Output: {"cmd": "goto_waypoint", "args": {"grid": "B4"}, "confidence": 0.97}

Wrong (missing args key):
Output: {"cmd": "goto_waypoint", "confidence": 0.97}

Wrong (invalid cmd name):
Output: {"cmd": "fly_to", "args": {"grid": "B4"}, "confidence": 0.97}

Wrong (plain text):
Output: The drone should go to B4.

## Edge Cases

- "Return home" / "go home" / "RTL" → cmd:rtl (NOT goto_waypoint)
- "Hover" / "hold position" / "stay here" → cmd:hover (NOT loiter)
- "Circle the area" / "loiter over" → cmd:loiter (NOT hover)
- Coordinates can be grid refs ("B4") or lat/lon: "37.7749,-122.4194"
- "Go up 10 meters" → cmd:ascend, args.delta_m:10.0
- "Go up" with no distance → cmd:ascend, args.delta_m:5.0 (default)
- Empty or gibberish input → cmd:unknown, confidence:0.0
- Multiple commands in one message → pick the primary intent
- "Land at base" vs "Land now" → both cmd:land, args:{} (base is default)

## Common Failure Patterns

- Model confuses `hover` and `loiter` — they are DIFFERENT commands
- Drops `args` key entirely when no args needed — must always include args: {}
- Outputs plain text when input contains a question mark
- Hallucinates command names: "ascend_fast", "fly_to", "go_to" — ALL INVALID
- Omits `confidence` field when input is ambiguous
- Returns confidence:1.0 for ambiguous inputs (should be 0.5–0.7)

## What Good Examples Look Like

Input variety:
- Very short: "up", "land", "B4"
- Casual: "take it up a bit", "bring it back"
- Formal: "initiate return-to-launch sequence"
- Ambiguous: "go higher" (ascend), "stay there" (hover)
- With noise: "uh, like, go to... B4 please?"

Output correctness:
- cmd is always from the exact allowed list (case-sensitive)
- args always present, even if {}
- confidence reflects actual certainty (not always 1.0)

Difficulty range:
- 30% easy (unambiguous, direct)
- 50% medium (some inference needed)
- 20% hard (edge cases, ambiguity, unusual phrasing)

Eval Set Design

The eval set (dataset/gold.jsonl) is the only objective measure of progress. Treat it like a test suite.

Golden rules:

  1. Never train on it. forge enforces this — gold examples are never included in canonical.jsonl.
  2. Lock it early. Create your gold set before any training and never modify it. If you keep adding to it, your accuracy numbers aren't comparable across iterations.
  3. Make it representative. Your gold set should cover all difficulty levels and edge cases, not just easy examples.
  4. Aim for 50–200 examples. Too few → noisy accuracy. Too many → slow eval loop.
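
The noise claim in rule 4 is easy to quantify with the binomial standard error (a rough approximation that assumes independent examples):

```python
import math

# Rough standard error of measured accuracy on an n-example gold set
# (binomial approximation). At 90% true accuracy, a 50-example set gives
# roughly ±4 points of eval-to-eval noise; 200 examples roughly halves it.
def accuracy_stderr(p: float, n: int) -> float:
    return math.sqrt(p * (1.0 - p) / n)
```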

How to create a gold set:

# Start with your best hand-crafted examples
# Format: one JSON per line, messages format
echo '{"messages":[{"role":"user","content":"Fly to B4"},{"role":"assistant","content":"{\"cmd\":\"goto_waypoint\",\"args\":{\"grid\":\"B4\"},\"confidence\":0.97}"}]}' >> dataset/gold.jsonl

# Or use forge generate to generate candidates, then hand-review
forge generate --n 200 --output /tmp/candidates.jsonl
# Review /tmp/candidates.jsonl, pick the ones you trust, add to gold.jsonl

Format:

{"messages": [{"role": "user", "content": "<input>"}, {"role": "assistant", "content": "<expected output>"}]}

Monitoring Long Runs

A flywheel run can take hours. Here's how to monitor without babysitting it.

Heartbeat File

watch -n 10 cat output/flywheel_heartbeat.json
{
  "iteration": 7,
  "status": "running",
  "accuracy": 0.912,
  "timestamp": "2025-01-15T14:23:01Z"
}

Experiment Log

output/experiments.jsonl — one record per iteration with full metadata:

# Show accuracy progression
cat output/experiments.jsonl | python3 -c "
import sys, json
for line in sys.stdin:
    r = json.loads(line)
    print(f\"iter {r['iteration']}: {r['accuracy']:.1%} — {r['hypothesis'][:60]}\")
"

Agent Context

AGENT_CONTEXT.md — human-readable log of scores, failure patterns, and hypotheses. Updated after every eval.

tail -100 AGENT_CONTEXT.md

Git Log

Every iteration is committed. git log --oneline shows the full history at a glance:

a1b2c3d forge iter 8: 93.2% — Increasing LoRA rank to 64 for better capacity
b2c3d4e forge iter 7: 91.5% — Targeting hover/loiter confusion with 60 new examples
c3d4e5f forge iter 6: 88.1% — Reducing LR after oscillation detected
...

Crash Recovery

  1. Check output/flywheel_heartbeat.json — status will be "error" with an error field
  2. Check output/flywheel.log for the traceback
  3. Run forge status to see last known accuracy
  4. Fix the issue (bad data, OOM, etc.)
  5. Resume:
    forge flywheel --iters 5 --skip-train  # eval existing adapter first
    # or
    forge flywheel --iters 5               # retrain from scratch
    

Examples

forge-math-demo Walkthrough

A working example lives in examples/forge-math-demo/. It trains a model to answer arithmetic questions as structured JSON.

cd examples/forge-math-demo

# Look at what's pre-populated
cat BACKGROUND.md        # describes the math→JSON task
cat dataset/gold.jsonl   # 20 held-out eval examples

# Run the flywheel (no GPU needed for demo — uses cpu + tiny model)
forge flywheel --iters 3

# Check what happened
forge status
git log --oneline
cat output/experiments.jsonl | python3 -m json.tool | head -40

Expected output after 3 iterations:

forge status
─────────────────────────────────────
  Project:    forge-math-demo
  Best score: 0.850 (iter 2)
  Last score: 0.850
  Dataset:    142 training / 20 gold examples
  Iterations: 3 / 10
─────────────────────────────────────

Domain Knowledge Workflow (Pretrain → Probe → Fine-tune)

For tasks that require deep domain knowledge — medical, legal, scientific, or proprietary — you can pretrain the base model on raw documents before task fine-tuning. This "smart base" absorbs domain knowledge first, making subsequent LoRA fine-tuning much more effective.

┌──────────────────────────────────────────────────────────────┐
│              Domain Knowledge Acquisition Pipeline            │
│                                                              │
│  ┌──────────┐   ┌───────────┐   ┌──────────┐   ┌─────────┐  │
│  │ probe    │   │ pretrain  │   │  probe   │   │flywheel │  │
│  │ --tag pre│──▶│ (corpus)  │──▶│ --tag    │──▶│(on smart│  │
│  │ baseline │   │ + merge   │   │ post     │   │  base)  │  │
│  └──────────┘   └───────────┘   └──────────┘   └─────────┘  │
│       │              │               │                        │
│  Scores 30 Qs   Causal LM on    Scores same    Task LoRA on  │
│  before domain  raw text docs   30 Qs after    merged base   │
│  training       (no chat fmt)   training       model         │
└──────────────────────────────────────────────────────────────┘

Step-by-step

1. Create probe questions (dataset/probe.jsonl):

{"question": "What is the half-life of carbon-14?", "ideal": "Approximately 5,730 years."}
{"question": "What does HIPAA regulate?", "ideal": "The privacy and security of protected health information (PHI)."}

30 questions is a good number — enough to measure meaningful delta, fast enough to run twice.

2. Baseline probe (run once — permanent record):

forge probe --tag pre

3. Add domain corpus to dataset/corpus/:

dataset/corpus/
├── technical_manual.pdf.txt    # convert PDFs externally
├── domain_docs.md
└── training_data.jsonl         # {"text": "..."}
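
Plain-text documents can be packed into the {"text": ...} JSONL shape with a short helper. A hypothetical sketch (the chunk size is an assumption; tune it to your model's context length):

```python
import json
from pathlib import Path

# Hypothetical helper: pack .txt/.md files into {"text": ...} JSONL records,
# splitting long documents into fixed-size character chunks. The 4000-char
# default is a guess, not a forge setting.
def pack_corpus(src_dir: str, out_path: str, chunk_chars: int = 4000) -> int:
    n = 0
    with open(out_path, "w") as out:
        for path in sorted(Path(src_dir).glob("*")):
            if path.suffix not in {".txt", ".md"}:
                continue
            text = path.read_text()
            for i in range(0, len(text), chunk_chars):
                out.write(json.dumps({"text": text[i:i + chunk_chars]}) + "\n")
                n += 1
    return n
```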

4. Pretrain (causal LM on raw text, auto-merges to 16-bit):

forge pretrain                  # uses corpus_dir from forge.yaml
forge pretrain --epochs 2       # more epochs for smaller corpora

Output: output/pretrain_merged/ — this is your new base model.

5. Post-pretrain probe:

forge probe --tag post --adapter output/pretrain_merged
forge probe --compare           # shows delta table

6. Update forge.yaml to use the merged model:

model:
  base: output/pretrain_merged   # was: unsloth/functiongemma-270m-it

7. Run task fine-tuning on the smarter base:

forge flywheel --iters 10

Key rules

  • Never fine-tune LoRA on top of the pretrain LoRA adapter — always merge first (LoRA-on-LoRA = catastrophic forgetting)
  • The merge uses Unsloth's save_pretrained_merged with merged_16bit — avoids the Gemma3 tokenizer breakage from AutoModelForCausalLM
  • Probe questions must stay fixed after the pre-probe run — changing them breaks the comparison
  • Pretrain LR (2e-5) is intentionally lower than fine-tune LR (5e-5) — domain absorption needs gentle updates

Architecture

Each module in forge/core/ has a single responsibility:

Module                            Description
forge/core/config.py              Load and validate forge.yaml. Handles defaults, field merging, and project root discovery by walking up from cwd.
forge/core/trainer.py             LoRA fine-tuning via Unsloth. Wraps FastLanguageModel, applies training config, saves adapter to output/adapter/.
forge/core/eval_runner.py         Run model inference on gold.jsonl, score each example via the configured scorer, return accuracy + failure list.
forge/core/scorer.py              Scoring strategies: exact (string match), json_cmd (parse JSON, check cmd field), custom (call scorer.py).
forge/core/augmentor.py           Build Claude prompts from BACKGROUND.md + AGENT_CONTEXT.md + failures, call Claude API, parse and append to canonical.jsonl.
forge/core/planner.py             Maintain AGENT_CONTEXT.md and PLAN.md. Call Claude with full context to get ExperimentPlan. Apply forge_yaml_patches to disk.
forge/core/experiment_planner.py  ExperimentPlan dataclass and JSON parsing/validation.
forge/core/dataset.py             JSONL loading, deduplication, seed merging, train/gold split helpers.
forge/core/context.py             Read/write AGENT_CONTEXT.md: append iteration summaries, truncate to MAX_CONTEXT_CHARS.

CLI layer (forge/commands/): thin Click commands that call into forge/core/. Each command is a separate file for easy testing and extension.


Contributing

git clone https://github.com/your-org/slm-pipeline
cd slm-pipeline
uv pip install -e ".[dev]"
pytest tests/ -q           # 100+ tests, all mocked (no GPU, no API key needed)

Contributions welcome. Please add tests for any new command or core module change.


License

Apache 2.0 — see LICENSE.
