A tool for cleaning and formatting markdown documents

markdowncleaner

A simple Python tool for cleaning and formatting markdown documents. The default configuration ships with regex patterns tuned for academic papers that were converted from PDF to markdown.

Description

markdowncleaner helps you clean up markdown files by removing unwanted content such as:

  • References, bibliographies, and citations
  • Footnotes and endnote references in text
  • Copyright notices and legal disclaimers
  • Acknowledgements and funding information
  • Author information and contact details
  • Specific patterns like DOIs, URLs, and email addresses
  • Short lines and excessive whitespace
  • Duplicate headlines (for example, because paper title and author names were reprinted on every page of a PDF)

This tool is particularly useful for processing academic papers, books, or any markdown document that needs formatting cleanup.
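
To make the duplicate-headline case concrete, here is a minimal standalone sketch of the idea using only the standard library. It is not markdowncleaner's implementation: the function name `drop_repeated_headlines`, the meaning of `threshold`, and the choice to keep the first occurrence are all assumptions made for illustration.

```python
from collections import Counter


def drop_repeated_headlines(text: str, threshold: int = 2) -> str:
    """Keep only the first copy of any headline that occurs more than `threshold` times.

    Hypothetical helper for illustration; the library's own logic may differ.
    """
    lines = text.split("\n")
    # Count how often each headline (a line starting with '#') appears.
    counts = Counter(line.strip() for line in lines if line.lstrip().startswith("#"))
    seen = set()
    kept = []
    for line in lines:
        key = line.strip()
        if key.startswith("#") and counts[key] > threshold:
            if key in seen:
                continue  # a repeat of a headline that occurs too often
            seen.add(key)
        kept.append(line)
    return "\n".join(kept)


pages = "# Paper Title\nbody one\n# Paper Title\nbody two\n# Paper Title\nbody three"
print(drop_repeated_headlines(pages))
```

Headlines that appear only a few times (at or below the threshold) are left alone, so legitimately repeated section names are not affected in this sketch.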

Installation

pip install markdowncleaner

Usage

Basic Usage

from markdowncleaner import MarkdownCleaner
from pathlib import Path

# Create a cleaner with default patterns
cleaner = MarkdownCleaner()

# Clean a markdown file
result_path = cleaner.clean_markdown_file(Path("input.md"))

# Clean a markdown string
text = "# Title\nSome content here. [1]\n\nReferences\n1. Citation"
cleaned_text = cleaner.clean_markdown_string(text)
print(cleaned_text)

Customizing Cleaning Options

from markdowncleaner import MarkdownCleaner, CleanerOptions

# Create custom options
options = CleanerOptions()
options.remove_short_lines = True
options.min_line_length = 50  # custom minimum line length
options.remove_duplicate_headlines = False
options.remove_footnotes_in_text = True
options.contract_empty_lines = True

# Initialize cleaner with custom options
cleaner = MarkdownCleaner(options=options)

# Use the cleaner as before

Custom Cleaning Patterns

You can also provide custom cleaning patterns:

from markdowncleaner import MarkdownCleaner, CleaningPatterns
from pathlib import Path

# Load custom patterns from a YAML file
custom_patterns = CleaningPatterns.from_yaml(Path("my_patterns.yaml"))

# Initialize cleaner with custom patterns
cleaner = MarkdownCleaner(patterns=custom_patterns)

Configuration

The default cleaning patterns are defined in default_cleaning_patterns.yaml and include:

  • Sections to Remove: Acknowledgements, References, Bibliography, etc.
  • Bad Inline Patterns: Citations, figure references, etc.
  • Bad Lines Patterns: Copyright notices, DOIs, URLs, etc.
  • Footnote Patterns: Footnote references in text that fit the pattern '.1'
  • Replacements: Various character replacements for PDF parsing errors
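
As an illustration of what a footnote pattern of the '.1' kind can look like, the sketch below uses only Python's `re` module. The regex and the name `strip_footnote_markers` are hypothetical; they are not copied from default_cleaning_patterns.yaml, and the shipped patterns may be stricter or looser.

```python
import re

# Hypothetical pattern (an assumption, not the library's): one to three digits
# directly after sentence punctuation and before whitespace or end of text.
# The negative lookbehind skips numbers such as "3.1", where a digit precedes
# the punctuation mark.
FOOTNOTE_AFTER_PUNCT = re.compile(r"(?<!\d[.,])(?<=[.,;:!?])\d{1,3}(?=\s|$)")


def strip_footnote_markers(text: str) -> str:
    """Remove footnote numbers that are glued to sentence punctuation."""
    return FOOTNOTE_AFTER_PUNCT.sub("", text)


sample = "Prior work disagrees.12 We revisit the claim in Section 3.1 below."
print(strip_footnote_markers(sample))
```

The trade-off in this sketch is that a footnote marker following a sentence that ends in a number (for example "in 1999.3") is left in place, which is why patterns of this kind are worth testing against your own documents.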

Options

  • remove_short_lines: Remove lines shorter than min_line_length (default: 70 characters)
  • remove_whole_lines: Remove lines matching specific patterns
  • remove_sections: Remove entire sections based on section headings
  • remove_duplicate_headlines: Remove duplicate headlines based on threshold
  • remove_duplicate_headlines_threshold: Threshold for duplicate headline removal
  • remove_footnotes_in_text: Remove footnote references
  • replace_within_lines: Replace specific patterns within lines
  • remove_within_lines: Remove specific patterns within lines
  • contract_empty_lines: Normalize whitespace by contracting runs of empty lines
  • crimp_linebreaks: Improve line break formatting
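
The effect of two of these options can be approximated in a few lines of plain Python. This is a rough sketch under stated assumptions, not the library's code: the function names are hypothetical, and keeping headings when short lines are dropped is a choice made here for illustration.

```python
import re


def drop_short_lines(text: str, min_line_length: int = 70) -> str:
    """Drop non-empty lines shorter than min_line_length.

    Blank lines and headings are kept (an assumption of this sketch).
    """
    kept = []
    for line in text.split("\n"):
        stripped = line.strip()
        if not stripped or stripped.startswith("#") or len(stripped) >= min_line_length:
            kept.append(line)
    return "\n".join(kept)


def contract_empty_lines(text: str) -> str:
    """Collapse any run of blank (or whitespace-only) lines into one blank line."""
    return re.sub(r"\n(?:[ \t]*\n)+", "\n\n", text)


sample = "# Title\n\n\n\n12\nA line that is long enough to keep.\n"
print(contract_empty_lines(drop_short_lines(sample, min_line_length=20)))
```

Dropping short lines first and contracting empty lines afterwards means that gaps left behind by removed lines are also tidied up.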

License

MIT License

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
