slotloss
Per-grammar-role loss decomposition for fine-tuned structured JSON output.
Fine-tuning your LLM for JSON output? Your aggregate metrics might be lying to you.
slotloss decomposes fine-tuning loss by grammar role (structural tokens, schema keys, enum values, booleans, free text) and compares baseline vs fine-tuned performance. It reveals per-role regressions that aggregate metrics hide.
The Problem
Standard LoRA fine-tuning + grammar-constrained decoding produces valid JSON at all scales. Aggregate loss improves. Everything looks great.
But at 32B parameters, fine-tuning can degrade specific grammar roles while aggregate loss improves:
slotloss: Per-Grammar-Role Loss Report
======================================================================
Role          Baseline    Fine-tuned    Change     Status
----------------------------------------------------------------------
STRUCTURAL      5.3298        0.0002    -100.0%    OK (-100%)
KEY             0.4736        0.0001    -100.0%    OK (-100%)
ENUM_VALUE      0.3313        0.3029      -8.6%    OK (-9%)
BOOLEAN         0.4568        1.0498    +129.8%    !! REGRESSION (+130%)
FREE_TEXT       1.3287        0.6289     -52.7%    OK (-53%)
----------------------------------------------------------------------
TOTAL           0.5544        0.1742     -68.6%

WARNING: 1 grammar role(s) REGRESSED after fine-tuning:
  BOOLEAN: 0.4568 -> 1.0498 (+130%)
  Your model may be memorizing majority values for constrained fields.
Aggregate loss improved by 69%. BOOLEAN loss rose by 130%. An aggregate-only metric would never show it.
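The percentages in the report can be reproduced by hand. The sketch below recomputes the Change column from the per-role losses printed above and flags any role whose loss increased. The rule "any increase counts as a regression" is an assumption made for illustration, since the threshold slotloss actually applies is not stated here, and `percent_change` is a hypothetical helper rather than part of the package.

```python
# Per-role mean losses copied from the report above: (baseline, fine-tuned).
losses = {
    "STRUCTURAL": (5.3298, 0.0002),
    "KEY": (0.4736, 0.0001),
    "ENUM_VALUE": (0.3313, 0.3029),
    "BOOLEAN": (0.4568, 1.0498),
    "FREE_TEXT": (1.3287, 0.6289),
}


def percent_change(baseline: float, finetuned: float) -> float:
    """Relative change in loss, in percent; positive means the loss got worse."""
    return (finetuned - baseline) / baseline * 100.0


changes = {role: percent_change(b, f) for role, (b, f) in losses.items()}

# Assumption: a role counts as regressed when its loss increases at all.
regressions = [role for role, change in changes.items() if change > 0]

for role, change in changes.items():
    flag = "REGRESSION" if role in regressions else "OK"
    print(f"{role:<12}{change:+8.1f}%  {flag}")
```

Only BOOLEAN ends up in `regressions`, which matches the warning in the report.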
Install
pip install slotloss
Usage
Command Line
# Compare baseline vs fine-tuned
slotloss --model Qwen/Qwen2.5-7B-Instruct \
    --checkpoint my_lora/ \
    --schema schema.json \
    --data test.jsonl \
    --device cuda

# Baseline only
slotloss --model Qwen/Qwen2.5-7B-Instruct \
    --schema schema.json \
    --data test.jsonl

# Save JSON report
slotloss --model Qwen/Qwen2.5-7B-Instruct \
    --checkpoint my_lora/ \
    --schema schema.json \
    --data test.jsonl \
    --output report.json
The exit code is 1 if regressions are detected and 0 otherwise, so the command can gate a CI/CD pipeline.
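One way to wire the exit code into a pipeline step is a small wrapper script. This is a minimal sketch, not something shipped with slotloss: `passes_gate`, `main`, and `SLOTLOSS_CMD` are hypothetical names, and the paths are the placeholders from the usage examples above. It treats every non-zero exit as a failure, which also covers crashes and bad arguments.

```python
import subprocess
import sys

# Flags are copied from the usage examples above; the paths are placeholders.
SLOTLOSS_CMD = [
    "slotloss",
    "--model", "Qwen/Qwen2.5-7B-Instruct",
    "--checkpoint", "my_lora/",
    "--schema", "schema.json",
    "--data", "test.jsonl",
    "--output", "report.json",
]


def passes_gate(command: list[str]) -> bool:
    """Run the command and report whether it exited with code 0."""
    completed = subprocess.run(command, check=False)
    return completed.returncode == 0


def main() -> None:
    # Any non-zero exit fails the build, including exit code 1 for regressions.
    if not passes_gate(SLOTLOSS_CMD):
        sys.exit("slotloss exited non-zero: failing the build")
```

Calling `main()` from the pipeline's entry script makes the job fail whenever the comparison finds a regressed role.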
Python API
from slotloss import analyze

report = analyze(
    model_name="Qwen/Qwen2.5-7B-Instruct",
    checkpoint="my_lora/",
    schema="schema.json",
    data="test.jsonl",
    device="cuda",
)
print(report)  # formatted report with regression warnings

# Programmatic access
for comp in report.comparisons:
    print(f"{comp.role}: {comp.baseline_loss:.4f} -> {comp.finetuned_loss:.4f} ({comp.status})")

if report.regressions:
    print(f"REGRESSIONS: {[r.role for r in report.regressions]}")
Low-Level API
from slotloss import GrammarRole, assign_grammar_roles

# Assign grammar roles to any JSON string.
# `schema` is the JSON Schema for the target and must be defined before this call.
roles = assign_grammar_roles('{"city": "NYC", "cuisine": "Italian"}', schema)
# [STRUCTURAL, QUOTE, KEY, KEY, KEY, KEY, QUOTE, STRUCTURAL, ...]
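To make the per-character labeling concrete, here is a simplified stand-in written from scratch. It is not slotloss's implementation: `tag_characters` and `value_roles` are hypothetical names, roles are plain strings rather than `GrammarRole` members, and the sketch only handles flat objects whose values are all strings with no escape sequences.

```python
def tag_characters(text: str, value_roles: dict[str, str]) -> list[str]:
    """Label each character of a flat JSON object whose values are all strings.

    value_roles maps a key to the role of its value, e.g. {"cuisine": "ENUM_VALUE"}.
    Keys that are not listed default to FREE_TEXT. Escapes are not handled.
    """
    roles: list[str] = []
    in_string = False      # currently between an opening and a closing quote
    expecting_key = True   # the next string is a key (True) or a value (False)
    current_key = ""       # text of the most recently read key

    for ch in text:
        if ch == '"':
            roles.append("QUOTE")
            if not in_string and expecting_key:
                current_key = ""  # opening quote of a new key
            in_string = not in_string
        elif in_string:
            if expecting_key:
                current_key += ch
                roles.append("KEY")
            else:
                roles.append(value_roles.get(current_key, "FREE_TEXT"))
        elif ch.isspace():
            roles.append("WHITESPACE")
        else:
            roles.append("STRUCTURAL")
            if ch == ":":
                expecting_key = False  # a value follows the colon
            elif ch in ",{":
                expecting_key = True   # a key follows a comma or opening brace
    return roles


roles = tag_characters(
    '{"city": "NYC", "cuisine": "Italian"}',
    {"cuisine": "ENUM_VALUE"},
)
print(roles[:8])
```

The first eight labels come out as STRUCTURAL, QUOTE, four KEY entries, QUOTE, STRUCTURAL, the same prefix shown in the comment above.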
Data Format
Test data is JSONL, one object per line, with prompt and target_json fields. target_json holds the expected output as a JSON-encoded string:
{"prompt": "Extract restaurant info...", "target_json": "{\"city\": \"NYC\"}"}
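Because target_json is a string that itself contains JSON, each row is encoded twice. The sketch below builds a row with the standard `json` module and reads it back; `make_row` is a hypothetical helper, not part of slotloss.

```python
import json


def make_row(prompt: str, target: dict) -> str:
    """Serialize one test example; target_json holds the target as a JSON string."""
    return json.dumps({"prompt": prompt, "target_json": json.dumps(target)})


row = make_row("Extract restaurant info...", {"city": "NYC"})
print(row)

# Reading a row back takes two decoding steps: first the line, then the target.
record = json.loads(row)
target = json.loads(record["target_json"])
```

Writing one such row per line produces a file in the format shown above.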
Schema is standard JSON Schema:
{
  "type": "object",
  "properties": {
    "city": {"type": "string"},
    "cuisine": {"type": "string", "enum": ["Mexican", "Italian"]},
    "has_wifi": {"type": "string", "enum": ["True", "False"]}
  }
}
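The schema is what separates constrained fields from free text. As an illustration, the sketch below guesses a value role for each property of the example schema. The classification rules are assumptions for this example (an enum limited to "True" and "False" is treated as BOOLEAN, any other enum as ENUM_VALUE) and may differ from what slotloss does internally; `value_role` is a hypothetical helper.

```python
import json

SCHEMA = json.loads("""
{
  "type": "object",
  "properties": {
    "city": {"type": "string"},
    "cuisine": {"type": "string", "enum": ["Mexican", "Italian"]},
    "has_wifi": {"type": "string", "enum": ["True", "False"]}
  }
}
""")


def value_role(prop: dict) -> str:
    """Guess the grammar role of a property's value from its schema entry."""
    enum = prop.get("enum")
    if enum:
        # Assumption: an enum limited to "True"/"False" is a boolean field.
        if set(enum) <= {"True", "False"}:
            return "BOOLEAN"
        return "ENUM_VALUE"
    if prop.get("type") == "boolean":
        return "BOOLEAN"
    if prop.get("type") in ("number", "integer"):
        return "NUMBER"
    return "FREE_TEXT"


value_roles = {name: value_role(prop) for name, prop in SCHEMA["properties"].items()}
print(value_roles)
```

With the example schema, city maps to FREE_TEXT, cuisine to ENUM_VALUE, and has_wifi to BOOLEAN.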
Grammar Roles
| Role | Description | Examples |
|---|---|---|
| STRUCTURAL | JSON syntax | { } [ ] : , |
| QUOTE | String delimiters | " |
| KEY | Object key characters | city, cuisine |
| ENUM_VALUE | Categorical values | Italian, Economy |
| BOOLEAN | Boolean strings | True, False |
| NUMBER | Numeric characters | 42, 3.14 |
| FREE_TEXT | Non-categorical content | names, addresses |
| WHITESPACE | Formatting | spaces, newlines |
Why Regressions Happen
Fine-tuning on small datasets biases the model toward training-set patterns. Structural tokens (trivial decisions) improve massively, dominating the aggregate gradient. Constrained fields like booleans and enums (genuine decisions) can overfit to majority values. Aggregate loss improves because the large gains on trivial roles outweigh the regression on substantive roles.
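The masking effect is plain averaging arithmetic. The sketch below uses invented per-token losses (eight structural tokens, two boolean tokens), not measurements from any model, to show the aggregate mean falling while the BOOLEAN mean doubles. `mean_loss_by_role` is a hypothetical helper.

```python
from collections import defaultdict


def mean_loss_by_role(token_losses: list[float], token_roles: list[str]) -> dict[str, float]:
    """Average per-token losses within each grammar role."""
    if len(token_losses) != len(token_roles):
        raise ValueError("token_losses and token_roles must have the same length")
    grouped: dict[str, list[float]] = defaultdict(list)
    for loss, role in zip(token_losses, token_roles):
        grouped[role].append(loss)
    return {role: sum(vals) / len(vals) for role, vals in grouped.items()}


# Invented toy numbers: structural tokens dominate the token count.
token_roles = ["STRUCTURAL"] * 8 + ["BOOLEAN"] * 2
baseline = [2.0] * 8 + [0.5] * 2
finetuned = [0.125] * 8 + [1.0] * 2

aggregate_before = sum(baseline) / len(baseline)
aggregate_after = sum(finetuned) / len(finetuned)
per_role_before = mean_loss_by_role(baseline, token_roles)
per_role_after = mean_loss_by_role(finetuned, token_roles)

print(f"aggregate: {aggregate_before:.3f} -> {aggregate_after:.3f}")
for role in per_role_before:
    print(f"{role}: {per_role_before[role]:.3f} -> {per_role_after[role]:.3f}")
```

The aggregate is a token-weighted mean, so roles with many easy tokens can outweigh a regression on a role with few tokens.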
The regression emerges at scale: larger pretrained models have stronger existing competencies that fine-tuning can disrupt. The better the base model already is at a grammar role, the more fine-tuning has to lose.
Paper
Baldwin (2026), "Valid JSON, Wrong Answer: Fine-Tuning Degrades Grammar-Role Performance at Scale Despite Improved Aggregate Loss."
License
MIT