Per-grammar-role loss decomposition for fine-tuned structured JSON output

slotloss

Fine-tuning your LLM for JSON output? Your aggregate metrics might be lying to you.

slotloss decomposes fine-tuning loss by grammar role (structural tokens, schema keys, enum values, booleans, free text) and compares baseline vs fine-tuned performance. It reveals per-role regressions that aggregate metrics hide.

The Problem

Standard LoRA fine-tuning combined with grammar-constrained decoding produces valid JSON at all scales. Aggregate loss improves. Everything looks great.

But at 32B parameters, fine-tuning can degrade specific grammar roles while aggregate loss improves:

slotloss: Per-Grammar-Role Loss Report
======================================================================

Role             Baseline   Fine-tuned     Change  Status
----------------------------------------------------------------------
STRUCTURAL         5.3298       0.0002     -100.0%  OK (-100%)
KEY                0.4736       0.0001     -100.0%  OK (-100%)
ENUM_VALUE         0.3313       0.3029       -8.6%  OK (-9%)
BOOLEAN            0.4568       1.0498     +129.8%  !! REGRESSION (+130%)
FREE_TEXT          1.3287       0.6289      -52.7%  OK (-53%)
----------------------------------------------------------------------
TOTAL              0.5544       0.1742      -68.6%

WARNING: 1 grammar role(s) REGRESSED after fine-tuning:
  BOOLEAN: 0.4568 -> 1.0498 (+130%)

Your model may be memorizing majority values for constrained fields.

Aggregate loss improved 69%. BOOLEAN loss got 130% worse. Without slotloss, you'd never know.
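The Change column can be reproduced from the two loss columns. The sketch below copies the numbers from the report above and recomputes the relative change per role. The rule used to flag a regression here (any increase in loss) is an assumption for illustration; the threshold slotloss actually applies is not stated in this document.

```python
# (baseline, fine-tuned) loss per role, copied from the report above
REPORT = {
    "STRUCTURAL": (5.3298, 0.0002),
    "KEY": (0.4736, 0.0001),
    "ENUM_VALUE": (0.3313, 0.3029),
    "BOOLEAN": (0.4568, 1.0498),
    "FREE_TEXT": (1.3287, 0.6289),
    "TOTAL": (0.5544, 0.1742),
}


def percent_change(baseline, finetuned):
    """Relative change in loss, in percent (negative means improvement)."""
    return (finetuned - baseline) / baseline * 100.0


changes = {role: percent_change(b, f) for role, (b, f) in REPORT.items()}

# Assumption: a role counts as regressed when its loss increased at all.
regressed = [role for role, pct in changes.items()
             if role != "TOTAL" and pct > 0]

for role, pct in changes.items():
    print(f"{role:<12}{pct:+8.1f}%")
print("Regressed roles:", regressed)
```

Only BOOLEAN ends up in `regressed`, even though the TOTAL row shows a large improvement.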

Install

pip install slotloss

Usage

Command Line

# Compare baseline vs fine-tuned
slotloss --model Qwen/Qwen2.5-7B-Instruct \
    --checkpoint my_lora/ \
    --schema schema.json \
    --data test.jsonl \
    --device cuda

# Baseline only
slotloss --model Qwen/Qwen2.5-7B-Instruct \
    --schema schema.json \
    --data test.jsonl

# Save JSON report
slotloss --model Qwen/Qwen2.5-7B-Instruct \
    --checkpoint my_lora/ \
    --schema schema.json \
    --data test.jsonl \
    --output report.json

The exit code is 1 if regressions are detected and 0 otherwise, so the command can gate CI/CD pipelines.
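One way to use the exit code as a CI gate from Python is sketched below. It relies only on the flags shown above; `build_command` and `gate` are hypothetical helper names, not part of slotloss. Note that this sketch treats any non-zero exit code as a failure, so a crash fails the build the same way a regression does.

```python
import subprocess


def build_command(model, schema, data, checkpoint=None, output=None):
    """Assemble a slotloss command line from the flags shown above."""
    cmd = ["slotloss", "--model", model, "--schema", schema, "--data", data]
    if checkpoint is not None:
        cmd += ["--checkpoint", checkpoint]
    if output is not None:
        cmd += ["--output", output]
    return cmd


def gate(cmd, run=subprocess.run):
    """Return True when the command exits with code 0 (no regressions)."""
    result = run(cmd)
    return result.returncode == 0


def main():
    """Example entry point for a CI step; exits non-zero on regression."""
    cmd = build_command(
        "Qwen/Qwen2.5-7B-Instruct",
        "schema.json",
        "test.jsonl",
        checkpoint="my_lora/",
        output="report.json",
    )
    raise SystemExit(0 if gate(cmd) else 1)
```

The `run` parameter exists so the gate can be exercised in tests without loading a model.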

Python API

from slotloss import analyze

report = analyze(
    model_name="Qwen/Qwen2.5-7B-Instruct",
    checkpoint="my_lora/",
    schema="schema.json",
    data="test.jsonl",
    device="cuda",
)

print(report)  # formatted report with regression warnings

# Programmatic access
for comp in report.comparisons:
    print(f"{comp.role}: {comp.baseline_loss:.4f} -> {comp.finetuned_loss:.4f} ({comp.status})")

if report.regressions:
    print(f"REGRESSIONS: {[r.role for r in report.regressions]}")

Low-Level API

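import json

from slotloss import GrammarRole, assign_grammar_roles

# Assumption: schema is passed as the parsed JSON Schema (see Data Format)
with open("schema.json") as f:
    schema = json.load(f)
# Assign a grammar role to each character of any JSON string
roles = assign_grammar_roles('{"city": "NYC", "cuisine": "Italian"}', schema)
# [STRUCTURAL, QUOTE, KEY, KEY, KEY, KEY, QUOTE, STRUCTURAL, ...]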

Data Format

Test data is JSONL. Each line has a prompt field and a target_json field, where target_json holds the expected output as a JSON-encoded string:

{"prompt": "Extract restaurant info...", "target_json": "{\"city\": \"NYC\"}"}
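Because target_json is a string that itself contains JSON, each record is encoded twice. The following sketch shows one way to write and read such records with the standard library; `make_record` and `read_records` are hypothetical helper names.

```python
import json


def make_record(prompt, target):
    """Serialize one test example; the target dict is JSON-encoded twice."""
    return json.dumps({"prompt": prompt, "target_json": json.dumps(target)})


def read_records(lines):
    """Parse JSONL lines and decode the nested target_json string."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        row = json.loads(line)
        yield row["prompt"], json.loads(row["target_json"])


line = make_record("Extract restaurant info...", {"city": "NYC"})
print(line)
```

The printed line matches the example record above, including the escaped inner quotes.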

Schema is standard JSON Schema:

{
  "type": "object",
  "properties": {
    "city": {"type": "string"},
    "cuisine": {"type": "string", "enum": ["Mexican", "Italian"]},
    "has_wifi": {"type": "string", "enum": ["True", "False"]}
  }
}

Grammar Roles

Role         Description              Examples
----------------------------------------------------
STRUCTURAL   JSON syntax              { } [ ] : ,
QUOTE        String delimiters        "
KEY          Object key characters    city, cuisine
ENUM_VALUE   Categorical values       Italian, Economy
BOOLEAN      Boolean strings          True, False
NUMBER       Numeric characters       42, 3.14
FREE_TEXT    Non-categorical content  names, addresses
WHITESPACE   Formatting               spaces, newlines
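To make the role table concrete, here is a minimal sketch of character-level role assignment for flat JSON objects. It is not slotloss's implementation: the `Role` enum, `value_role`, and `tag_flat_object` are hypothetical names, and the mapping from schema to value role (enums of "True"/"False" become BOOLEAN, other enums ENUM_VALUE, numeric types NUMBER, everything else FREE_TEXT) is an assumption inferred from the table.

```python
from enum import Enum


class Role(Enum):
    """Hypothetical stand-in for the roles in the table above."""
    STRUCTURAL = "STRUCTURAL"
    QUOTE = "QUOTE"
    KEY = "KEY"
    ENUM_VALUE = "ENUM_VALUE"
    BOOLEAN = "BOOLEAN"
    NUMBER = "NUMBER"
    FREE_TEXT = "FREE_TEXT"
    WHITESPACE = "WHITESPACE"


def value_role(prop_schema):
    """Assumed mapping from a property's schema to the role of its value."""
    enum = prop_schema.get("enum")
    if enum:
        # Enums restricted to "True"/"False" are treated as booleans.
        is_bool = set(enum) <= {"True", "False"}
        return Role.BOOLEAN if is_bool else Role.ENUM_VALUE
    if prop_schema.get("type") in ("number", "integer"):
        return Role.NUMBER
    return Role.FREE_TEXT


def tag_flat_object(text, schema):
    """Return one Role per character of a flat JSON object string.

    Limitations of this sketch: no nested objects, arrays, or escapes.
    """
    props = schema.get("properties", {})
    roles = []
    in_string = False
    is_key = True             # strings are keys until a ':' is seen
    key_chars = []
    current = Role.FREE_TEXT  # role of the value currently being read
    for ch in text:
        if in_string:
            if ch == '"':
                in_string = False
                roles.append(Role.QUOTE)
            elif is_key:
                key_chars.append(ch)
                roles.append(Role.KEY)
            else:
                roles.append(current)
        elif ch == '"':
            in_string = True
            roles.append(Role.QUOTE)
        elif ch in "{}[]:,":
            if ch == ":":
                # The key is complete: look up the role of its value.
                is_key = False
                current = value_role(props.get("".join(key_chars), {}))
            elif ch == ",":
                is_key = True
                key_chars = []
            roles.append(Role.STRUCTURAL)
        elif ch.isspace():
            roles.append(Role.WHITESPACE)
        else:
            roles.append(current)  # bare value such as 42
    return roles


schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "cuisine": {"type": "string", "enum": ["Mexican", "Italian"]},
        "has_wifi": {"type": "string", "enum": ["True", "False"]},
    },
}
roles = tag_flat_object('{"city": "NYC", "cuisine": "Italian"}', schema)
print([r.name for r in roles[:8]])
```

Once every character has a role, per-role loss is a matter of grouping token losses by the role of the characters each token covers. How slotloss aligns tokens to characters is not described here.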

Why Regressions Happen

Fine-tuning on small datasets biases the model toward training-set patterns. Structural tokens (trivial decisions) improve massively, dominating the aggregate gradient. Constrained fields like booleans and enums (genuine decisions) can overfit to majority values. Aggregate loss improves because the large gains on trivial roles outweigh the regression on substantive roles.
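The mechanism in the paragraph above can be illustrated with a small numerical sketch. All numbers below are hypothetical: a boolean field that is True 70% of the time at test time, a fine-tuned model that has become overconfident in the majority value, and made-up token counts per role. The aggregate is assumed to be a token-weighted mean of per-role losses.

```python
import math


def cross_entropy(p_true_actual, p_true_model):
    """Expected loss in nats on a True/False field."""
    return -(p_true_actual * math.log(p_true_model)
             + (1 - p_true_actual) * math.log(1 - p_true_model))


# Hypothetical: the field is True 70% of the time in the test set.
calibrated = cross_entropy(0.7, 0.7)       # about 0.61
overconfident = cross_entropy(0.7, 0.99)   # about 1.39, majority memorized


def aggregate(losses, counts):
    """Token-weighted mean loss across roles."""
    total = sum(counts.values())
    return sum(losses[role] * counts[role] for role in counts) / total


# Hypothetical token counts: trivial structural tokens dominate.
counts = {"STRUCTURAL": 60, "BOOLEAN": 10, "FREE_TEXT": 30}
baseline = {"STRUCTURAL": 1.0, "BOOLEAN": calibrated, "FREE_TEXT": 1.2}
finetuned = {"STRUCTURAL": 0.01, "BOOLEAN": overconfident, "FREE_TEXT": 0.6}

print(f"BOOLEAN:   {baseline['BOOLEAN']:.3f} -> {finetuned['BOOLEAN']:.3f}")
print(f"aggregate: {aggregate(baseline, counts):.3f} -> "
      f"{aggregate(finetuned, counts):.3f}")
```

The BOOLEAN loss more than doubles, yet the aggregate still falls because the structural and free-text gains carry most of the weight.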

The regression emerges at scale: larger pretrained models have stronger existing competencies that fine-tuning can disrupt. The better the base model already is at a grammar role, the more it stands to lose from fine-tuning.

Paper

Baldwin (2026), "Valid JSON, Wrong Answer: Fine-Tuning Degrades Grammar-Role Performance at Scale Despite Improved Aggregate Loss."

License

MIT
