Restore heading hierarchy in markdown documents using a fine-tuned Qwen3-0.6B model

md-reheader

Restore heading hierarchy in markdown documents using a fine-tuned 0.6B parameter LLM.

PDF-to-markdown tools like MinerU, Docling, and Marker often flatten heading structure. You get this:

# API Reference
# Authentication
# Endpoints
# Users
# List Users
# Get User by ID
# Projects
# List Projects
# Error Handling

md-reheader restores the correct hierarchy:

# API Reference
## Authentication
## Endpoints
### Users
#### List Users
#### Get User by ID
### Projects
#### List Projects
## Error Handling

Installation

pip install md-reheader

Requires Python 3.12+ and PyTorch. Works on both CPU and GPU.

Quick Start

From a file

from pathlib import Path
from md_reheader.inference.predict import load_model, reheader_document

model, tokenizer = load_model("joelbarmettlerUZH/md-reheader")

markdown = Path("document.md").read_text()
fixed = reheader_document(markdown, model, tokenizer)
Path("document_fixed.md").write_text(fixed)

From a string

from md_reheader.inference.predict import load_model, reheader_document

model, tokenizer = load_model("joelbarmettlerUZH/md-reheader")

flat_markdown = """\
# Introduction
# Background
# Related Work
# Methods
# Data Collection
# Preprocessing
# Model Architecture
# Results
# Discussion
# Conclusion
"""

fixed = reheader_document(flat_markdown, model, tokenizer)
print(fixed)
# # Introduction
# ## Background
# ### Related Work
# ## Methods
# ### Data Collection
# ### Preprocessing
# ### Model Architecture
# ## Results
# ## Discussion
# ## Conclusion

Post-processing MinerU / Docling output

# After running MinerU or Docling on a PDF:
from md_reheader.inference.predict import load_model, reheader_document

model, tokenizer = load_model("joelbarmettlerUZH/md-reheader")

# MinerU outputs markdown with flat headings
with open("output/paper.md", encoding="utf-8") as f:
    mineru_output = f.read()

# Fix the heading hierarchy
fixed = reheader_document(mineru_output, model, tokenizer)

with open("output/paper_fixed.md", "w", encoding="utf-8") as f:
    f.write(fixed)
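
For a whole directory of converted files, the same call can be wrapped in a loop so the model is loaded only once. This is a minimal sketch: `reheader_directory` and its `fix` parameter are hypothetical names and not part of md-reheader. `fix` is any function that maps a markdown string to a markdown string.

```python
from pathlib import Path
from typing import Callable, List


def reheader_directory(src: Path, dst: Path, fix: Callable[[str], str]) -> List[Path]:
    """Apply `fix` to every .md file under `src` and mirror the result under `dst`."""
    written = []
    # sorted() materialises the file list first, so files written during the
    # loop are never picked up again in the same run.
    for path in sorted(src.rglob("*.md")):
        target = dst / path.relative_to(src)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(fix(path.read_text(encoding="utf-8")), encoding="utf-8")
        written.append(target)
    return written
```

With the model loaded as above, `fix` could be `lambda text: reheader_document(text, model, tokenizer)`. Keeping `dst` outside `src` avoids reprocessing already fixed files on a second run.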

GPU vs CPU

from md_reheader.inference.predict import load_model

# GPU (recommended for batch processing)
model, tokenizer = load_model("joelbarmettlerUZH/md-reheader", device="cuda")

# CPU (no GPU required, slower)
model, tokenizer = load_model("joelbarmettlerUZH/md-reheader", device="cpu")

Speed

Benchmarked on a single NVIDIA RTX 4090 (BF16) and CPU (float32):

Document size	GPU (RTX 4090)	CPU
< 1k tokens	0.4s	5s
1k-2k tokens	0.8s	10s
2k-4k tokens	1.4s	~20s
4k-8k tokens	3.4s	~60s

The model processes documents up to 8k tokens (after preprocessing). Longer documents are automatically truncated.
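
For planning a batch job, the table can be turned into a rough lookup. The sketch below is only an estimate under stated assumptions: the figures come from the single benchmark machine above, "8k" is treated as 8,000 tokens, and `BENCHMARK` and `estimate_seconds` are hypothetical names.

```python
# Upper token bound of each bucket, then GPU seconds and CPU seconds,
# copied from the benchmark table (the "~" values are taken at face value).
BENCHMARK = [
    (1_000, 0.4, 5.0),
    (2_000, 0.8, 10.0),
    (4_000, 1.4, 20.0),
    (8_000, 3.4, 60.0),
]


def estimate_seconds(token_counts, device="cuda"):
    """Sum the per-document bucket time for a batch of documents."""
    column = 1 if device == "cuda" else 2
    total = 0.0
    for n in token_counts:
        n = min(n, 8_000)  # longer documents are truncated to the limit
        total += next(row[column] for row in BENCHMARK if n <= row[0])
    return total
```

Actual times will differ on other hardware, so the result is best read as an order of magnitude.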

Evaluation

Evaluated on 7,321 test documents from GitHub markdown files and Wikipedia articles:

Metric	All-H1 baseline	Heuristic	md-reheader
Exact match	0.0%	14.5%	56.1%
Per-heading accuracy	13.1%	49.1%	80.6%
Hierarchy preservation	61.3%	68.6%	91.0%
Mean absolute error	1.38	0.62	0.22
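
The README does not spell out how the four metrics are computed. The sketch below shows one plausible per-document reading, with heading levels given as integers from 1 to 6. `heading_metrics` is a hypothetical helper, and the pairwise definition of hierarchy preservation in particular is an assumption.

```python
def _sign(x):
    return (x > 0) - (x < 0)


def heading_metrics(gold, pred):
    """Compare predicted heading levels with reference levels for one document."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    n = len(gold)
    correct = sum(g == p for g, p in zip(gold, pred))
    pairs = n - 1
    # Assumed definition: a consecutive pair is "preserved" when the step
    # between the two headings (deeper, same, shallower) has the same direction.
    preserved = sum(
        _sign(gold[i + 1] - gold[i]) == _sign(pred[i + 1] - pred[i])
        for i in range(pairs)
    )
    return {
        "exact_match": float(correct == n),
        "per_heading_accuracy": correct / n,
        "hierarchy_preservation": preserved / pairs if pairs else 1.0,
        "mean_absolute_error": sum(abs(g - p) for g, p in zip(gold, pred)) / n,
    }
```

Under these definitions, a prediction that shifts every heading one level deeper scores 0% per-heading accuracy and a mean absolute error of 1.0, yet 100% hierarchy preservation.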

Per-level accuracy

	H1	H2	H3	H4	H5	H6
Accuracy	77%	85%	78%	68%	45%	50%

The model is strongest on H1-H3 headings (77-85% accuracy) and still significantly outperforms the baselines on deeper levels, where accuracy drops to 45-68%. Most errors on H4-H6 are off by one, so the relative structure is preserved even when the absolute level is shifted.
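
A per-level breakdown and the share of off-by-one errors could be computed along these lines. This is a sketch with hypothetical helper names, not the project's evaluation code; levels are integers from 1 to 6.

```python
from collections import Counter


def per_level_accuracy(gold, pred):
    """Accuracy grouped by the reference heading level."""
    total, correct = Counter(), Counter()
    for g, p in zip(gold, pred):
        total[g] += 1
        correct[g] += int(g == p)
    return {level: correct[level] / total[level] for level in sorted(total)}


def off_by_one_share(gold, pred):
    """Fraction of wrong predictions that miss the reference level by exactly one."""
    errors = [abs(g - p) for g, p in zip(gold, pred) if g != p]
    return sum(e == 1 for e in errors) / len(errors) if errors else 0.0
```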

By document depth

Max heading depth	Exact match	Per-heading accuracy	Hierarchy
Depth 2 (flat)	83%	91%	95%
Depth 3	54%	82%	90%
Depth 4	32%	70%	88%
Depth 5-6	33%	65%	89%

By source

Source	Exact match	Per-heading accuracy
GitHub markdown	49.5%	74.0%
Wikipedia	71.3%	95.5%

How It Works

1. Extract headings from the document using markdown-it-py (a CommonMark parser that correctly skips code blocks).
2. Flatten all headings to # (level 1), so the model cannot rely on the input heading levels.
3. Strip body text to the first 128 and last 128 tokens per section, which preserves structural cues without bloating the context.
4. Predict heading levels using the fine-tuned Qwen3-0.6B model.
5. Apply the predicted levels back to the original document.
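
The first three steps can be approximated without any dependencies. md-reheader itself uses markdown-it-py; the sketch below substitutes a simple regex and fence tracker, counts whitespace-separated words instead of model tokens, and marks the removed middle with `...`. All function names and the output layout are hypothetical, and indented code blocks and text before the first heading are ignored.

```python
import re

HEADING = re.compile(r"^ {0,3}(#{1,6})[ \t]+(.*?)[ \t]*$")
FENCE = re.compile(r"^ {0,3}(`{3,}|~{3,})")


def split_sections(markdown):
    """Return (heading_text, body_lines) pairs, skipping '#' lines inside code fences."""
    sections = []
    fence = None  # fence character currently open, if any
    for line in markdown.splitlines():
        m = FENCE.match(line)
        if m:
            marker = m.group(1)[0]
            if fence is None:
                fence = marker
            elif fence == marker:
                fence = None
        heading = HEADING.match(line) if fence is None else None
        if heading:
            sections.append((heading.group(2), []))
        elif sections:
            sections[-1][1].append(line)
    return sections


def strip_body(lines, keep=128):
    """Keep the first and last `keep` words of a section body (words, not model tokens)."""
    words = " ".join(lines).split()
    if len(words) <= 2 * keep:
        return " ".join(words)
    return " ".join(words[:keep] + ["..."] + words[-keep:])


def build_model_input(markdown, keep=128):
    """Flatten every heading to level 1 and shorten each section body."""
    parts = []
    for title, body in split_sections(markdown):
        parts.append("# " + title)
        stripped = strip_body(body, keep)
        if stripped:
            parts.append(stripped)
    return "\n".join(parts)
```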

The model outputs headings with correct # prefixes (e.g., ## Methods, ### Data Collection), leveraging pretraining knowledge about heading semantics.
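
Mapping the model output back onto the document can be sketched as follows. This assumes the model returns one `#`-prefixed line per heading in document order. `parse_levels` and `apply_levels` are hypothetical helpers, and the regex-based heading detection is a simplification of what a CommonMark parser does. Headings without a prediction keep their original level, mirroring the truncation behaviour described under Limitations.

```python
import re

HEADING = re.compile(r"^( {0,3})(#{1,6})([ \t]+.*)$")
FENCE = re.compile(r"^ {0,3}(`{3,}|~{3,})")


def parse_levels(model_output):
    """Read the heading level from each '#'-prefixed line of the model output."""
    levels = []
    for line in model_output.splitlines():
        m = HEADING.match(line)
        if m:
            levels.append(len(m.group(2)))
    return levels


def apply_levels(markdown, levels):
    """Set the i-th heading outside code fences to levels[i]; extra headings stay as-is."""
    remaining = iter(levels)
    fence = None  # fence character currently open, if any
    out = []
    for line in markdown.splitlines():
        f = FENCE.match(line)
        if f:
            marker = f.group(1)[0]
            if fence is None:
                fence = marker
            elif fence == marker:
                fence = None
        m = HEADING.match(line) if fence is None else None
        if m:
            level = next(remaining, None)
            if level is not None:
                line = m.group(1) + "#" * level + m.group(3)
        out.append(line)
    # splitlines() drops the final newline, so restore it if it was there
    return "\n".join(out) + ("\n" if markdown.endswith("\n") else "")
```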

Limitations

Deep nesting (H5/H6): Accuracy drops to 45-50%. The model preserves relative structure but tends to compress deep hierarchies by 1-2 levels.
Ambiguous structure: Heading levels are inherently subjective. The model learns common conventions but cannot resolve genuine ambiguity.
Long documents: Documents exceeding ~8k tokens after stripping are truncated from the end. Headings beyond the cutoff retain their original levels.

Training

The model is a fine-tuned Qwen/Qwen3-0.6B trained on ~197k markdown documents, drawn from ~150k raw documents:

codeparrot/github-code: ~105k markdown files from GitHub repositories
euirim/goodwiki: ~45k Wikipedia articles
Deep documents (depth 4+) oversampled 2-8x to address class imbalance
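
Depth-based oversampling of this kind can be sketched in a few lines. The README only states that documents of depth 4 or more were repeated 2-8x; the exact mapping in `REPEAT` below is an assumption chosen for illustration, and `oversample` is a hypothetical helper that represents each document by its list of heading levels.

```python
# Hypothetical repeat factors keyed by maximum heading depth.
# Only the 2-8x range for depth 4+ comes from the README.
REPEAT = {4: 2, 5: 4, 6: 8}


def oversample(docs):
    """Repeat each document according to the factor for its deepest heading."""
    result = []
    for levels in docs:
        depth = max(levels, default=1)
        result.extend([levels] * REPEAT.get(depth, 1))
    return result
```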

Trained with Axolotl on 2x RTX 4090 using DDP, BF16, 8k sequence length with sample packing.

Reproducing

git clone https://github.com/joelbarmettlerUZH/md-reheader.git
cd md-reheader

uv sync --extra train    # install training dependencies
make download             # download raw data (~150k documents)
make prepare              # strip, flatten, oversample, format
make train                # train on 2x GPU
make eval                 # evaluate on test set

License

Code and model weights: Apache 2.0

Training data includes Wikipedia content (CC BY-SA 4.0) and GitHub repositories (various open-source licenses).

Citation

@software{barmettler2026mdreheader,
  author = {Barmettler, Joel},
  title = {md-reheader: Restoring Heading Hierarchy in Markdown Documents},
  year = {2026},
  url = {https://github.com/joelbarmettlerUZH/md-reheader}
}

Release history

Version	Release date
0.2.2	Apr 17, 2026
0.2.1	Apr 17, 2026
0.1.3	Apr 5, 2026
0.1.2	Apr 5, 2026
0.1.1	Apr 5, 2026
0.1.0 (this version)	Apr 5, 2026

Download files

Source Distribution

md_reheader-0.1.0.tar.gz (17.1 MB)

Uploaded Apr 5, 2026 Source

Built Distribution

md_reheader-0.1.0-py3-none-any.whl (13.3 kB)

Uploaded Apr 5, 2026 Python 3

File details

Details for the file md_reheader-0.1.0.tar.gz.

File metadata

Download URL: md_reheader-0.1.0.tar.gz
Upload date: Apr 5, 2026
Size: 17.1 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.2 (uv publish, on Pop!_OS 22.04)

File hashes

Hashes for md_reheader-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`0e57ceb7d06bfe38607130e72d3ede5e3c7c55010426a6af5d74eaba546780bb`
MD5	`2f0dd8aba3d3d8920c9eec2e69883300`
BLAKE2b-256	`48f9263d269e197f0687449b262dd31ad521d4942394c94e963da75b5c920e28`

File details

Details for the file md_reheader-0.1.0-py3-none-any.whl.

File metadata

Download URL: md_reheader-0.1.0-py3-none-any.whl
Upload date: Apr 5, 2026
Size: 13.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.2 (uv publish, on Pop!_OS 22.04)

File hashes

Hashes for md_reheader-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7a7be3c0c723c760d22c7bdf6288e937b1971edf6a975addfe37f1e27a82ef1a`
MD5	`0396cf11e9df5d68fb2daa6a5b65d1b7`
BLAKE2b-256	`790bd1a6ce313cbe9adf015420aff2311d118f20019d2476c507b32446c4cc75`
