markdowncleaner
A simple Python tool for cleaning and formatting markdown documents. The default configuration ships with regex patterns tuned for academic papers that have been converted from PDF to markdown.
Description
markdowncleaner helps you clean up markdown files by removing unwanted content such as:
- References, bibliographies, and citations
- Footnotes and endnote references in text
- Copyright notices and legal disclaimers
- Acknowledgements and funding information
- Author information and contact details
- Specific patterns like DOIs, URLs, and email addresses
- Short lines and excessive whitespace
- Duplicate headlines (for example, when the paper title and author names are repeated on every page of a PDF)
This tool is particularly useful for processing academic papers, books, or any markdown document that needs formatting cleanup.
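To make two of the cleanups listed above concrete, here is a minimal standard-library sketch of short-line removal and whitespace contraction. It does not use markdowncleaner itself: the function bodies and the choice to keep headings regardless of length are assumptions made for illustration, and the package's own rules may differ.

```python
import re


def drop_short_lines(text: str, min_line_length: int = 70) -> str:
    """Drop non-empty lines shorter than min_line_length.

    Assumption of this sketch: blank lines and markdown headings are kept.
    """
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#") or len(stripped) >= min_line_length:
            kept.append(line)
    return "\n".join(kept)


def contract_empty_lines(text: str) -> str:
    """Collapse any run of blank (or whitespace-only) lines into one blank line."""
    return re.sub(r"\n\s*\n+", "\n\n", text)
```

Applying `drop_short_lines` first and `contract_empty_lines` second avoids leaving gaps where short lines used to be.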
Installation
pip install markdowncleaner
Usage
Basic Usage
from markdowncleaner import MarkdownCleaner
from pathlib import Path
# Create a cleaner with default patterns
cleaner = MarkdownCleaner()
# Clean a markdown file
result_path = cleaner.clean_markdown_file(Path("input.md"))
# Clean a markdown string
text = "# Title\nSome content here. [1]\n\nReferences\n1. Citation"
cleaned_text = cleaner.clean_markdown_string(text)
print(cleaned_text)
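For a folder of converted papers, the string-based call above can be wrapped in a small batch helper. The sketch below is an assumption-laden illustration: `clean_tree` is a hypothetical helper, not part of markdowncleaner, and it accepts any `str -> str` function so it can be tried without the package installed.

```python
from pathlib import Path
from typing import Callable


def clean_tree(src: Path, dst: Path, clean: Callable[[str], str]) -> list[Path]:
    """Apply `clean` to every .md file under `src`, mirroring the tree under `dst`."""
    written = []
    # sorted() materializes the file list before anything is written
    for md_file in sorted(src.rglob("*.md")):
        out_file = dst / md_file.relative_to(src)
        out_file.parent.mkdir(parents=True, exist_ok=True)
        cleaned = clean(md_file.read_text(encoding="utf-8"))
        out_file.write_text(cleaned, encoding="utf-8")
        written.append(out_file)
    return written
```

With the package installed, passing `MarkdownCleaner().clean_markdown_string` as `clean` should apply the default patterns to every file.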
Customizing Cleaning Options
from markdowncleaner import MarkdownCleaner, CleanerOptions
# Create custom options
options = CleanerOptions()
options.remove_short_lines = True
options.min_line_length = 50 # custom minimum line length
options.remove_duplicate_headlines = False
options.remove_footnotes_in_text = True
options.contract_empty_lines = True
# Initialize cleaner with custom options
cleaner = MarkdownCleaner(options=options)
# Use the cleaner as before
Custom Cleaning Patterns
You can also provide custom cleaning patterns:
from markdowncleaner import MarkdownCleaner, CleaningPatterns
from pathlib import Path
# Load custom patterns from a YAML file
custom_patterns = CleaningPatterns.from_yaml(Path("my_patterns.yaml"))
# Initialize cleaner with custom patterns
cleaner = MarkdownCleaner(patterns=custom_patterns)
Configuration
The default cleaning patterns are defined in default_cleaning_patterns.yaml and include:
- Sections to Remove: Acknowledgements, References, Bibliography, etc.
- Bad Inline Patterns: Citations, figure references, etc.
- Bad Lines Patterns: Copyright notices, DOIs, URLs, etc.
- Footnote Patterns: Footnote references in text that match the pattern '.1', that is, a number directly following a period
- Replacements: Various character replacements for PDF parsing errors
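As an illustration of the footnote pattern category, the sketch below strips footnote numbers that directly follow punctuation. The regular expression is an assumption written for this example, not the pattern shipped in default_cleaning_patterns.yaml; it requires a letter or closing parenthesis before the punctuation so that decimals such as 3.14 are left alone.

```python
import re

# Hypothetical pattern: 1-3 digits right after punctuation that itself
# follows a letter or ")", and that are followed by whitespace or end of text.
FOOTNOTE_AFTER_PUNCT = re.compile(r"(?<=[A-Za-z)][.,;:!?])\d{1,3}(?=\s|$)")


def strip_footnote_markers(text: str) -> str:
    """Remove in-text footnote numbers such as the '1' in 'claim.1 Next'."""
    return FOOTNOTE_AFTER_PUNCT.sub("", text)


print(strip_footnote_markers("This is a claim.1 Next sentence,23 and more."))
# prints: This is a claim. Next sentence, and more.
```

A pattern this simple will still miss markers that follow closing quotation marks, which is one reason a real pattern set needs tuning against sample documents.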
Options
- remove_short_lines: Remove lines shorter than min_line_length (default: 70 characters)
- remove_whole_lines: Remove lines matching specific patterns
- remove_sections: Remove entire sections based on section headings
- remove_duplicate_headlines: Remove duplicate headlines based on threshold
- remove_duplicate_headlines_threshold: Threshold for duplicate headline removal
- remove_footnotes_in_text: Remove footnote references
- replace_within_lines: Replace specific patterns within lines
- remove_within_lines: Remove specific patterns within lines
- contract_empty_lines: Normalize whitespace
- crimp_linebreaks: Improve line break formatting
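How a threshold might drive duplicate headline removal can be sketched with the standard library. This is one plausible reading, not the package's implementation: here a headline counts as duplicated when it occurs at least `threshold` times, and the sketch keeps its first occurrence while dropping the repeats. Both choices are assumptions.

```python
from collections import Counter


def remove_duplicate_headlines(text: str, threshold: int = 2) -> str:
    """Drop repeated markdown headlines that occur at least `threshold` times.

    Assumption of this sketch: the first occurrence is kept.
    """
    lines = text.splitlines()
    counts = Counter(l.strip() for l in lines if l.lstrip().startswith("#"))
    seen = set()
    kept = []
    for line in lines:
        key = line.strip()
        if key.startswith("#") and counts[key] >= threshold:
            if key in seen:
                continue  # a repeat of a frequent headline
            seen.add(key)
        kept.append(line)
    return "\n".join(kept)
```

Raising the threshold makes the filter more conservative, which helps when a document legitimately reuses a heading such as "Summary" in several chapters.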
License
MIT License
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.