Skip to main content

Restore heading hierarchy in markdown documents using a fine-tuned Qwen3-0.6B model

Project description

md-reheader

Restore heading hierarchy in markdown documents with a fine-tuned 0.6B LLM.

PyPI Python 3.12+ Apache 2.0 HuggingFace Model HuggingFace Dataset GitHub stars


The problem

PDF-to-markdown tools like MinerU, Docling, and Marker do great text extraction — then collapse your document structure. Every heading becomes # or ##. TOCs break. RAG chunking breaks. Navigation breaks.

md-reheader fixes it. A 0.6B-parameter Qwen3 fine-tune reads the document and predicts the correct H1–H6 level for every heading in a single forward pass.

Source PDF → md-reheader → hierarchy restored vs. flat PDF-parser output


Quick start

CLI

pip install md-reheader
rehead --input flat.md --output fixed.md

Auto-detects CUDA. Use --cpu or --gpu to override. Omit --output to stream to stdout — pipe-friendly for integration with other CLIs.

rehead -i flat.md | tee fixed.md               # pipe
rehead -i flat.md --gpu -o out/fixed.md        # creates nested dirs
rehead -i flat.md --force -o existing.md       # overwrite
rehead --help                                   # all flags

Python API

from md_reheader.inference.predict import load_model, reheader_document

model, tokenizer = load_model("joelbarmettler/md-reheader")

flat = open("document.md").read()
fixed = reheader_document(flat, model, tokenizer)

The package handles preprocessing (flattening + body stripping) and postprocessing (applying predicted levels back to the original document) automatically.

Direct transformers usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("joelbarmettler/md-reheader")
model = AutoModelForCausalLM.from_pretrained(
    "joelbarmettler/md-reheader",
    dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a markdown document structure expert. Given a markdown document with incorrect or flattened heading levels, output each heading with its correct markdown prefix (# for level 1, ## for level 2, etc.), one per line."},
    {"role": "user", "content": "# Introduction\n\nSome text...\n\n# Background\n\nMore text...\n\n# Methods"},
]

input_text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False,
)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=False)

generated = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(generated, skip_special_tokens=True))

Important: pass enable_thinking=False to apply_chat_template. Without it, the model enters a repetition loop because training used the non-thinking chat format.

Self-host with vLLM

pip install vllm
vllm serve joelbarmettler/md-reheader --dtype bfloat16 --max-model-len 8192

Higher throughput than raw transformers and drop-in OpenAI-compatible clients. On <10 GB cards add --enforce-eager --gpu-memory-utilization 0.70 to skip CUDA-graph allocations.

Remote inference (any OpenAI-compatible endpoint)

Once a server is running, use md-reheader as a thin client — no local weights needed.

rehead -i flat.md -o fixed.md --endpoint http://localhost:8000/v1
rehead -i flat.md -o fixed.md --endpoint https://api.example.com/v1 --api-key sk-xxx
# or set MD_REHEADER_API_KEY in the environment
from md_reheader.inference.remote import reheader_document_remote

fixed = reheader_document_remote(
    open("flat.md").read(),
    endpoint="http://localhost:8000/v1",
    model="joelbarmettler/md-reheader",
)

Identical output to local inference. Preprocessing (flatten + strip) happens client-side; the server just runs the chat completion with chat_template_kwargs={"enable_thinking": false} to match training.


How it works

flat markdown  ──►  flatten headings to #  ──►  strip body to 128+128 tokens
                                                         │
                                                         ▼
        restored markdown  ◄──  apply predicted levels  ◄── Qwen3-0.6B (fine-tuned)
  1. Extract headings with markdown-it-py — correctly skips code blocks.
  2. Flatten every heading to # — the model ignores input levels.
  3. Strip each section's body to its first 128 + last 128 tokens — preserves structural cues, kills context bloat.
  4. Qwen3-0.6B predicts the correct # prefix per heading.
  5. Levels get mapped back to the original document.

Evaluation

Benchmarked on 7,321 held-out documents from GitHub markdown and Wikipedia.

Metric All-H1 baseline Heuristic md-reheader
Exact match 0.0% 14.5% 56.1%
Per-heading accuracy 13.1% 49.1% 80.6%
Hierarchy preservation 61.3% 68.6% 91.0%
Mean absolute error 1.38 0.62 0.22

Per-level accuracy

H1 H2 H3 H4 H5 H6
Accuracy 77% 85% 78% 68% 45% 50%

H1–H3 land in the 77–85% band; H5/H6 drop but still beat baselines. Most deep-level errors are off-by-one — the relative structure survives.

By document depth

Max depth Exact match Per-heading accuracy Hierarchy
Depth 2 83% 91% 95%
Depth 3 54% 82% 90%
Depth 4 32% 70% 88%
Depth 5-6 33% 65% 89%

By source

Source Exact match Per-heading accuracy
GitHub markdown 49.5% 74.0%
Wikipedia 71.3% 95.5%

Speed

Document size RTX 4090 (BF16) CPU (fp32)
< 1k tokens 0.4s 5s
1k–2k tokens 0.8s 10s
2k–4k tokens 1.4s ~20s
4k–8k tokens 3.4s ~60s

Documents longer than ~8k tokens (after stripping) are truncated from the tail.


Limitations

  • Deep nesting (H5/H6) — accuracy drops to 45–50%. Relative structure is preserved; absolute depth gets compressed by 1–2 levels.
  • Ambiguous structure — heading levels are subjective. The model learns common conventions; it can't resolve genuine ambiguity.
  • Long documents — >8k tokens (after stripping) get truncated. Headings past the cutoff retain their input levels.
  • English-centric — trained primarily on English content.

Reproducing training

git clone https://github.com/joelbarmettlerUZH/md-reheader.git
cd md-reheader

uv sync --extra train    # install training dependencies
make download            # download raw data (~150k documents)
make prepare             # strip, flatten, oversample, format
make train               # train on 2x GPU with Axolotl
make eval                # evaluate on test set

The model is a fine-tune of Qwen/Qwen3-0.6B trained on ~197k markdown documents:

  • codeparrot/github-code — ~105k markdown files from GitHub repositories
  • euirim/goodwiki — ~45k Wikipedia articles
  • Deep documents (depth 4+) oversampled 2–8× for class balance

Trained with Axolotl on 2× RTX 4090 using DDP, BF16, 8k sequence length with sample packing.


License

Code and model weights: Apache 2.0. Training data includes Wikipedia content (CC BY-SA 4.0) and GitHub repositories (various open-source licenses).


Citation

@software{barmettler2026mdreheader,
  author = {Barmettler, Joel},
  title  = {md-reheader: Restoring Heading Hierarchy in Markdown Documents},
  year   = {2026},
  url    = {https://github.com/joelbarmettlerUZH/md-reheader}
}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

md_reheader-0.2.2.tar.gz (440.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

md_reheader-0.2.2-py3-none-any.whl (21.4 kB view details)

Uploaded Python 3

File details

Details for the file md_reheader-0.2.2.tar.gz.

File metadata

  • Download URL: md_reheader-0.2.2.tar.gz
  • Upload date:
  • Size: 440.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for md_reheader-0.2.2.tar.gz
Algorithm Hash digest
SHA256 3a0585635158f692ae34a587f085af56fed5ce9c7c9c1809a8356834e7aee4dc
MD5 0f9a01fe7c2f2a483c7e853d297d8a66
BLAKE2b-256 466b7ab0c024e87429fc8f526181641384e36dbcbe69f729ea8cd613785573ba

See more details on using hashes here.

Provenance

The following attestation bundles were made for md_reheader-0.2.2.tar.gz:

Publisher: release.yml on joelbarmettlerUZH/md-reheader

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file md_reheader-0.2.2-py3-none-any.whl.

File metadata

  • Download URL: md_reheader-0.2.2-py3-none-any.whl
  • Upload date:
  • Size: 21.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for md_reheader-0.2.2-py3-none-any.whl
Algorithm Hash digest
SHA256 7b7af9d334e9d1dbb0bd9915eec2b1735417dd5161f9c1040d01355e90a90055
MD5 8521855955937400c8ebb337ff0be331
BLAKE2b-256 da90add26bfcb7c9686150b32e5f03c2faebfe716362cb27b467f1c3d0fb36ce

See more details on using hashes here.

Provenance

The following attestation bundles were made for md_reheader-0.2.2-py3-none-any.whl:

Publisher: release.yml on joelbarmettlerUZH/md-reheader

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page