A tool for cleaning and formatting markdown documents

markdowncleaner

A simple Python tool for cleaning and formatting markdown documents. The default configuration ships with regex patterns tuned for academic papers that have been converted from PDF to markdown.

I use this myself in a workflow that processes academic PDFs using docling or olmOCR. The default configuration fits that use case.

Description

markdowncleaner removes unwanted content such as:

References, bibliographies, and citations (including heuristic detection of bibliographic lines)
Footnotes and endnote references in text
Copyright notices and legal disclaimers
Acknowledgements and funding information
Author information and contact details
Specific patterns like DOIs, URLs, and email addresses
Short lines and excessive whitespace
Duplicate headlines (for example, because paper title and author names were reprinted on every page of a PDF)
Erroneous line breaks from PDF conversion
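
To make the duplicate-headline case concrete, here is a minimal sketch of the idea. It is not markdowncleaner's implementation: the helper name `drop_repeated_headlines` is hypothetical, and the choice to drop every copy of a repeated headline (rather than keep one) is an assumption, since this README does not say which the tool does.

```python
from collections import Counter


def drop_repeated_headlines(text: str, threshold: int = 2) -> str:
    """Drop markdown headlines that occur at least `threshold` times.

    Hypothetical helper for illustration only; it removes every
    occurrence of a repeated headline, which is an assumption.
    """
    lines = text.split("\n")
    # Count only lines that look like ATX headlines ("# ...", "## ...").
    counts = Counter(line.strip() for line in lines if line.lstrip().startswith("#"))
    repeated = {headline for headline, n in counts.items() if n >= threshold}
    return "\n".join(line for line in lines if line.strip() not in repeated)


sample = (
    "## Paper Title\nFirst page text.\n"
    "## Paper Title\nSecond page text.\n"
    "## Methods\nDetails."
)
print(drop_repeated_headlines(sample))
```

With the sample above, the running title that was reprinted on both pages disappears while the single "Methods" headline survives.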

Installation

Requires Python 3.10 or higher.

pip install markdowncleaner

Usage

Python API

Basic Usage

from markdowncleaner import MarkdownCleaner
from pathlib import Path

# Create a cleaner with default patterns
cleaner = MarkdownCleaner()

# Clean a markdown file
result_path = cleaner.clean_markdown_file(Path("input.md"))

# Clean a markdown string
text = "# Title\nSome content here. [1]\n\nReferences\n1. Citation"
cleaned_text = cleaner.clean_markdown_string(text)
print(cleaned_text)

Customizing Cleaning Options

from markdowncleaner import MarkdownCleaner, CleanerOptions

# Create custom options
options = CleanerOptions()
options.remove_short_lines = True
options.min_line_length = 50  # custom minimum line length
options.remove_duplicate_headlines = False
options.remove_footnotes_in_text = True
options.contract_empty_lines = True
options.fix_encoding_mojibake = True
options.normalize_quotation_symbols = True

# Initialize cleaner with custom options
cleaner = MarkdownCleaner(options=options)

# Use the cleaner as before

Custom Cleaning Patterns

You can also provide custom cleaning patterns:

from markdowncleaner import MarkdownCleaner
from markdowncleaner.config.loader import CleaningPatterns
from pathlib import Path

# Load custom patterns from a YAML file
custom_patterns = CleaningPatterns.from_yaml(Path("my_patterns.yaml"))

# Initialize cleaner with custom patterns
cleaner = MarkdownCleaner(patterns=custom_patterns)

Command Line Interface

Clean a single markdown file using the CLI:

# Basic usage - creates a new file with "_cleaned" suffix
markdowncleaner input.md

# Specify output file
markdowncleaner input.md -o output.md

# Specify output directory
markdowncleaner input.md --output-dir cleaned_files/

# Use custom configuration
markdowncleaner input.md --config my_patterns.yaml

# Enable encoding fixes and quotation normalization
markdowncleaner input.md --fix-encoding --normalize-quotation

# Customize line length threshold
markdowncleaner input.md --min-line-length 50

# Disable specific cleaning operations
markdowncleaner input.md --keep-short-lines --keep-sections --keep-footnotes

# Disable replacements and inline pattern removal
markdowncleaner input.md --no-replacements --keep-inline-patterns

# Disable formatting operations
markdowncleaner input.md --no-crimping --keep-empty-lines

# Keep references (disable heuristic reference detection)
markdowncleaner input.md --keep-references

Available CLI Options:

-o, --output: Path to save the cleaned markdown file
--output-dir: Directory to save the cleaned file
--config: Path to custom YAML configuration file
--fix-encoding: Fix encoding mojibake issues
--normalize-quotation: Normalize quotation symbols to standard ASCII
--keep-short-lines: Don't remove lines shorter than minimum length
--min-line-length: Minimum line length to keep (default: 70)
--keep-bad-lines: Don't remove lines matching bad line patterns
--keep-sections: Don't remove sections like References, Acknowledgements
--keep-duplicate-headlines: Don't remove duplicate headlines
--keep-footnotes: Don't remove footnote references in text
--no-replacements: Don't perform text replacements
--keep-inline-patterns: Don't remove inline patterns like citations
--keep-empty-lines: Don't contract consecutive empty lines
--no-crimping: Don't crimp linebreaks (crimping repairs line break errors from PDF conversion)
--keep-references: Don't heuristically detect and remove bibliographic reference lines
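
If you drive the CLI from Python, a small wrapper can assemble the argument list from the flags listed above. This is only a sketch: `build_command` is a hypothetical helper, not part of markdowncleaner, and it covers just a few of the documented flags.

```python
from pathlib import Path


def build_command(input_file, output=None, min_line_length=None, keep_references=False):
    """Assemble an argv list for the markdowncleaner CLI (hypothetical helper)."""
    cmd = ["markdowncleaner", str(input_file)]
    if output is not None:
        cmd += ["-o", str(output)]
    if min_line_length is not None:
        cmd += ["--min-line-length", str(min_line_length)]
    if keep_references:
        cmd.append("--keep-references")
    return cmd


cmd = build_command(Path("input.md"), output=Path("output.md"), min_line_length=50)
print(cmd)

# To run it, markdowncleaner must be installed and on PATH:
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.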

Batch Processing Script

For processing multiple markdown files in a folder and its subfolders, use the included batch processing script:

# Basic usage - will prompt for confirmation
python scripts/clean_mds_in_folder.py documents/

# Skip confirmation prompt
python scripts/clean_mds_in_folder.py documents/ --yes

# Use 8 parallel workers (default is your CPU count)
python scripts/clean_mds_in_folder.py documents/ --workers 8

# Use custom cleaning patterns
python scripts/clean_mds_in_folder.py documents/ --config my_patterns.yaml

# Combine options
python scripts/clean_mds_in_folder.py documents/ --yes --workers 4

Features:

Recursively finds all .md files in the specified folder and subfolders
Processes files in parallel using multiple CPU cores for faster processing
Shows real-time progress bar with tqdm
Cleans files in-place (modifies original files)
Asks for confirmation before processing (unless --yes is used)
Continues processing even if some files fail
Reports all successful and failed files at the end
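
The behaviour listed above (recursive discovery, parallel workers, in-place cleaning, continuing after failures, a final report) can be sketched roughly as follows. This is not the code of scripts/clean_mds_in_folder.py. The `clean_text` function is a stand-in for the real cleaning step, all function names are hypothetical, the progress bar is omitted, and threads are used instead of processes to keep the example short.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


def clean_text(text: str) -> str:
    """Stand-in for the real cleaning step: strip trailing spaces per line."""
    return "\n".join(line.rstrip() for line in text.split("\n"))


def clean_in_place(path: Path) -> Path:
    """Read a file, clean it, and overwrite the original."""
    path.write_text(clean_text(path.read_text(encoding="utf-8")), encoding="utf-8")
    return path


def clean_folder(folder: Path, workers: int = 4):
    """Clean every .md file under `folder`; return (succeeded, failed) lists."""
    files = sorted(folder.rglob("*.md"))  # includes subfolders
    succeeded, failed = [], []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(clean_in_place, f): f for f in files}
        for future in as_completed(futures):
            try:
                succeeded.append(future.result())
            except Exception as exc:  # keep going if one file fails
                failed.append((futures[future], exc))
    return succeeded, failed


# Demo on a throwaway folder so no real files are modified.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "a.md").write_text("alpha   \n", encoding="utf-8")
    (root / "sub" / "b.md").write_text("beta  \n", encoding="utf-8")
    ok, bad = clean_folder(root, workers=2)
    result = (root / "sub" / "b.md").read_text(encoding="utf-8")
    print(len(ok), "cleaned,", len(bad), "failed")
```

Because the files are overwritten, it is worth running a batch job like this on a copy of the folder first.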

Script Options:

folder: Path to folder containing markdown files (required)
-y, --yes: Skip confirmation prompt and proceed immediately
-w, --workers: Number of parallel workers (default: CPU count)
--config: Path to custom YAML configuration file

Note: Requires tqdm for the progress bar:

pip install tqdm

Configuration

The default cleaning patterns are defined in default_cleaning_patterns.yaml and include:

Sections to Remove: Acknowledgements, References, Bibliography, etc.
Bad Inline Patterns: Citations, figure references, etc.
Bad Lines Patterns: Copyright notices, DOIs, URLs, etc.
Footnote Patterns: Footnote references in text, that is, numbers directly following a period, as in ".1" or ".23"
Replacements: Various character replacements for PDF parsing errors
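
As an illustration of the footnote pattern, the sketch below removes a number that directly follows a sentence-ending period. The regex is an assumption written for this example and is not taken from default_cleaning_patterns.yaml; `strip_footnote_refs` is a hypothetical name.

```python
import re

# A period that is not preceded by a digit, followed by 1-3 digits and then
# whitespace or the end of the text. The digit guard keeps decimals like 3.14.
FOOTNOTE_REF = re.compile(r"(?<!\d)\.\d{1,3}(?=\s|$)")


def strip_footnote_refs(text: str) -> str:
    """Replace '.12'-style footnote markers with a plain period."""
    return FOOTNOTE_REF.sub(".", text)


cleaned = strip_footnote_refs("This is well known.12 Meanwhile, pi is about 3.14 here.")
print(cleaned)
```

A pattern this simple will also strip a bare decimal such as ".5" when nothing precedes the period, so real patterns usually need more context.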

Options

All available CleanerOptions:

fix_encoding_mojibake: Fix encoding issues and mojibake using ftfy (default: False)
normalize_quotation_symbols: Normalize various quotation marks to standard ASCII quotes (default: False)
remove_short_lines: Remove lines shorter than min_line_length (default: True)
min_line_length: Minimum line length to keep when remove_short_lines is enabled (default: 70)
remove_whole_lines: Remove lines matching specific patterns (default: True)
remove_sections: Remove entire sections based on section headings (default: True)
remove_duplicate_headlines: Remove duplicate headlines based on threshold (default: True)
remove_duplicate_headlines_threshold: Number of occurrences needed to consider a headline duplicate (default: 2)
remove_footnotes_in_text: Remove footnote references like ".1" or ".23" (default: True)
replace_within_lines: Replace specific patterns within lines (default: True)
remove_within_lines: Remove specific patterns within lines (default: True)
contract_empty_lines: Reduce multiple consecutive empty lines to one (default: True)
crimp_linebreaks: Fix line break errors from PDF conversion (default: True)
remove_references_heuristically: Heuristically detect and remove bibliographic reference lines by scoring lines based on bibliographic patterns (default: True)
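
The idea behind scoring lines on bibliographic patterns can be sketched like this. The signals, the threshold of 3, and the function names are all assumptions made for illustration; the patterns and weights that markdowncleaner actually uses are not described in this README.

```python
import re

# Hypothetical signals: each pattern that matches adds one point to a line.
SIGNALS = [
    re.compile(r"\b(?:19|20)\d{2}\b"),             # publication year
    re.compile(r"\b\d+\(\d+\)"),                   # volume(issue), e.g. 12(3)
    re.compile(r"\b\d+\s*[-\u2013]\s*\d+\b"),      # page range, e.g. 45-67
    re.compile(r"\b[A-Z][a-z]+,\s+[A-Z]\."),       # "Surname, A."
    re.compile(r"\bet al\.|\bdoi\b", re.IGNORECASE),
]


def reference_score(line: str) -> int:
    """Count how many bibliographic signals appear in a line."""
    return sum(1 for pattern in SIGNALS if pattern.search(line))


def drop_reference_lines(text: str, min_score: int = 3) -> str:
    """Remove lines whose score reaches `min_score` (assumed threshold)."""
    return "\n".join(
        line for line in text.split("\n") if reference_score(line) < min_score
    )


reference = "Smith, J. A., & Jones, B. (2019). Deep cleaning. Journal of Text, 12(3), 45-67."
prose = "In 2019 we collected 45 samples from three sites."
print(reference_score(reference), reference_score(prose))
print(drop_reference_lines(prose + "\n" + reference))
```

Requiring several signals at once is what keeps ordinary sentences that happen to mention a year from being removed.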

License

MIT License

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Project details

Maintainers

jhimmelreich

Release history

0.3.1 (this version): Dec 29, 2025
0.3.0: Oct 23, 2025
0.2.0: Mar 3, 2025
0.1.1: Mar 3, 2025

Download files

Source Distribution

markdowncleaner-0.3.1.tar.gz (27.1 kB)

Uploaded Dec 29, 2025 Source

Built Distribution

markdowncleaner-0.3.1-py3-none-any.whl (17.8 kB)

Uploaded Dec 29, 2025 Python 3

File details

Details for the file markdowncleaner-0.3.1.tar.gz.

File metadata

Download URL: markdowncleaner-0.3.1.tar.gz
Upload date: Dec 29, 2025
Size: 27.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for markdowncleaner-0.3.1.tar.gz
Algorithm	Hash digest
SHA256	`a1b0216bf3921c99e82c99ea314a3b2b2c86ac937df9ae9fc6290a792e3395a5`
MD5	`87ac6186be0b08881c7a6d922119663a`
BLAKE2b-256	`8bd5f130d61b69e51f05def692e70cf6fd04a2dc71a9adea988aed3f24e9b47e`
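
To check a downloaded file against the SHA256 value in the table above, a short script along these lines should work. It is a sketch using only the standard library; the helper names are hypothetical, and the demo runs on a temporary file rather than the real archive.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


# Self-contained demo; for a real check, pass the downloaded archive and
# the SHA256 value published for it.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.bin"
    demo.write_bytes(b"hello")
    demo_ok = matches_published_digest(demo, hashlib.sha256(b"hello").hexdigest())
    print(demo_ok)
```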

Provenance

The following attestation bundles were made for markdowncleaner-0.3.1.tar.gz:

Publisher: python-publish.yml on josk0/markdowncleaner

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: markdowncleaner-0.3.1.tar.gz
- Subject digest: a1b0216bf3921c99e82c99ea314a3b2b2c86ac937df9ae9fc6290a792e3395a5
- Sigstore transparency entry: 781039306
- Sigstore integration time: Dec 29, 2025
Source repository:
- Permalink: josk0/markdowncleaner@22b7e26dde18c721237098e3d5c31482ce0143a8
- Branch / Tag: refs/tags/v0.3.1
- Owner: https://github.com/josk0
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@22b7e26dde18c721237098e3d5c31482ce0143a8
- Trigger Event: release

File details

Details for the file markdowncleaner-0.3.1-py3-none-any.whl.

File metadata

Download URL: markdowncleaner-0.3.1-py3-none-any.whl
Upload date: Dec 29, 2025
Size: 17.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for markdowncleaner-0.3.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`63306f27d5f2e00729ce2e5af8d2067b2c87c19c7867d6382fddc59e0788689b`
MD5	`b9133109cd733b6b1d0066374b9b7462`
BLAKE2b-256	`e09d095bedfaee2c30ed5e82077436d5f3ca9d1a8b55cba12ee7c4fd8bfd5860`

Provenance

The following attestation bundles were made for markdowncleaner-0.3.1-py3-none-any.whl:

Publisher: python-publish.yml on josk0/markdowncleaner

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: markdowncleaner-0.3.1-py3-none-any.whl
- Subject digest: 63306f27d5f2e00729ce2e5af8d2067b2c87c19c7867d6382fddc59e0788689b
- Sigstore transparency entry: 781039308
- Sigstore integration time: Dec 29, 2025
Source repository:
- Permalink: josk0/markdowncleaner@22b7e26dde18c721237098e3d5c31482ce0143a8
- Branch / Tag: refs/tags/v0.3.1
- Owner: https://github.com/josk0
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@22b7e26dde18c721237098e3d5c31482ce0143a8
- Trigger Event: release
