pdfcleanerx

Convert messy PDF-extracted text into clean Markdown files.

  • Fully offline – no AI APIs, no cloud services
  • Modular pipeline – swap or extend any layer
  • CLI + Python SDK
  • Production-ready – structured logging, custom exceptions, type annotations

Installation

pip install pdfcleaner

From source

git clone https://github.com/namanbhola1888/pdfcleaner
cd pdfcleaner
pip install -e ".[dev]"

Quick Start

CLI

# Single file → writes output/report.md
pdfcleaner convert report.pdf

# Named output
pdfcleaner convert report.pdf --output clean.md

# Print to stdout
pdfcleaner convert report.pdf --stdout

# Batch (glob)
pdfcleaner convert docs/*.pdf --output-dir ./markdown

# Verbose / debug
pdfcleaner convert report.pdf --verbose
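
The batch command above has a rough SDK equivalent. The sketch below relies only on the `convert_to_file(source, destination)` method shown in the Python SDK section; `batch_convert` is a hypothetical helper, not part of the package, and it takes the converter as an argument so it works with any object exposing that method.

```python
from pathlib import Path


def batch_convert(converter, source_dir, output_dir):
    """Convert every *.pdf in source_dir to a .md file in output_dir.

    `converter` is any object with a convert_to_file(source, destination)
    method, for example pdfcleanerx.Converter (signature assumed from the
    SDK example in this README).
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    # Sort so the processing order is stable across platforms.
    for pdf in sorted(Path(source_dir).glob("*.pdf")):
        target = out / f"{pdf.stem}.md"
        converter.convert_to_file(str(pdf), str(target))
        written.append(target)
    return written
```

With the package installed, this might be called as `batch_convert(Converter(), "docs", "markdown")`.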

Python SDK

from pdfcleanerx import Converter

# Default: full pipeline → Markdown string
converter = Converter()
markdown = converter.convert("report.pdf")
print(markdown)

# Write directly to file
written_path = converter.convert_to_file("report.pdf", "report.md")

Dependency injection (custom pipeline)

from pdfcleanerx import Converter
from pdfcleanerx.cleaners import CleanerPipeline, WhitespaceCleaner, PageNumberCleaner

# Only run two cleaners
pipeline = CleanerPipeline([PageNumberCleaner(), WhitespaceCleaner()])
converter = Converter(pipeline=pipeline)
markdown = converter.convert("report.pdf")

Configuration

Copy .env.example to .env and edit:

cp .env.example .env

Key settings:

Variable                      Default    Description
LOG_LEVEL                     INFO       DEBUG / INFO / WARNING / ERROR
OUTPUT_DIR                    ./output   Default output directory
HEADING_FONT_SIZE_THRESHOLD   1.2        Font size multiplier for heading detection
CLEANER_PAGE_NUMBERS          true       Enable/disable page-number removal
CLEANER_LINE_WRAP             true       Enable/disable line-wrap merging
CLEANER_WHITESPACE            true       Enable/disable whitespace normalisation
CLEANER_HEADINGS              true       Enable/disable heading detection
FOOTER_REPEAT_THRESHOLD       3          Pages a footer must repeat on to be stripped
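
To make the defaults and value types concrete, here is a minimal sketch of how these variables could be read once they are exported into the environment. `Settings` and `load_settings` are hypothetical names for illustration; the package's own settings loader and its `.env` parsing are not shown in this README and may differ.

```python
import os
from dataclasses import dataclass

# Strings treated as "enabled" for the boolean flags (an assumption).
_TRUTHY = {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    log_level: str = "INFO"
    output_dir: str = "./output"
    heading_font_size_threshold: float = 1.2
    cleaner_page_numbers: bool = True
    cleaner_line_wrap: bool = True
    cleaner_whitespace: bool = True
    cleaner_headings: bool = True
    footer_repeat_threshold: int = 3


def load_settings(env=None):
    """Build Settings from a mapping of variables (defaults to os.environ)."""
    env = os.environ if env is None else env
    d = Settings()

    def flag(name, default):
        raw = env.get(name)
        return default if raw is None else raw.strip().lower() in _TRUTHY

    return Settings(
        log_level=env.get("LOG_LEVEL", d.log_level).upper(),
        output_dir=env.get("OUTPUT_DIR", d.output_dir),
        heading_font_size_threshold=float(
            env.get("HEADING_FONT_SIZE_THRESHOLD", d.heading_font_size_threshold)
        ),
        cleaner_page_numbers=flag("CLEANER_PAGE_NUMBERS", d.cleaner_page_numbers),
        cleaner_line_wrap=flag("CLEANER_LINE_WRAP", d.cleaner_line_wrap),
        cleaner_whitespace=flag("CLEANER_WHITESPACE", d.cleaner_whitespace),
        cleaner_headings=flag("CLEANER_HEADINGS", d.cleaner_headings),
        footer_repeat_threshold=int(
            env.get("FOOTER_REPEAT_THRESHOLD", d.footer_repeat_threshold)
        ),
    )
```

Passing a plain dict instead of `os.environ` keeps the function easy to test.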

Architecture

CLI (Typer)
    │
    ▼
Converter                  ← SDK entry point (dependency injection)
    │
    ├── BaseExtractor       ← pdfplumber (swappable)
    │       └── Document (pages → blocks + font metadata)
    │
    ├── CleanerPipeline     ← ordered chain
    │       ├── PageNumberCleaner
    │       ├── WhitespaceCleaner
    │       ├── HeadingDetector
    │       └── LineWrapCleaner
    │
    └── BaseFormatter       ← MarkdownFormatter (swappable)
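
The "ordered chain" in the diagram can be illustrated with a small self-contained sketch. Everything here is a stand-in written for this example: `Block`, `Page`, `Document`, `Pipeline`, and the two cleaners only mimic the shape implied by the diagram (a document of pages, each holding text blocks, passed through cleaners that expose `clean`). The package's real classes carry more data, such as font metadata, and may behave differently.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Block:
    text: str


@dataclass
class Page:
    blocks: list = field(default_factory=list)


@dataclass
class Document:
    pages: list = field(default_factory=list)


class DropPageNumbers:
    """Remove blocks that are only a page number, e.g. '12' or 'Page 12'."""

    _pattern = re.compile(r"^\s*(page\s+)?\d+\s*$", re.IGNORECASE)

    def clean(self, document):
        for page in document.pages:
            page.blocks = [b for b in page.blocks if not self._pattern.match(b.text)]
        return document


class CollapseWhitespace:
    """Collapse runs of whitespace inside each block to single spaces."""

    def clean(self, document):
        for page in document.pages:
            for block in page.blocks:
                block.text = " ".join(block.text.split())
        return document


class Pipeline:
    """Apply cleaners in the order given; each receives the previous output."""

    def __init__(self, cleaners):
        self.cleaners = list(cleaners)

    def run(self, document):
        for cleaner in self.cleaners:
            document = cleaner.clean(document)
        return document
```

Because each cleaner receives the output of the previous one, order matters: a heading detector that relies on line structure, for instance, would need to run before a cleaner that merges wrapped lines.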

Adding a custom cleaner

from pdfcleanerx.cleaners.base import BaseCleaner
from pdfcleanerx.models import Document

class MyCustomCleaner(BaseCleaner):
    def clean(self, document: Document) -> Document:
        for page in document.pages:
            for block in page.blocks:
                block.text = block.text.replace("©", "")
        return document

# Inject into Converter
from pdfcleanerx.cleaners import CleanerPipeline, PageNumberCleaner, WhitespaceCleaner
from pdfcleanerx import Converter

pipeline = CleanerPipeline([PageNumberCleaner(), WhitespaceCleaner(), MyCustomCleaner()])
converter = Converter(pipeline=pipeline)
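
As a second, slightly larger illustration, the sketch below drops a footer that repeats across pages, in the spirit of the FOOTER_REPEAT_THRESHOLD setting. It is a hypothetical example, not the package's own footer logic: it treats the last block of each page as the footer candidate and assumes only the `document.pages`, `page.blocks`, and `block.text` attributes used above. It is written without the `BaseCleaner` base class so it runs on its own; in a real project it would subclass `BaseCleaner`.

```python
from collections import Counter


class RepeatedFooterCleaner:
    """Drop a page's last block when the same text ends >= threshold pages."""

    def __init__(self, threshold=3):
        self.threshold = threshold

    @staticmethod
    def _normalise(text):
        # Ignore whitespace differences between otherwise identical footers.
        return " ".join(text.split())

    def clean(self, document):
        # Count how many pages end with each normalised text.
        counts = Counter(
            self._normalise(page.blocks[-1].text)
            for page in document.pages
            if page.blocks
        )
        for page in document.pages:
            if not page.blocks:
                continue
            if counts[self._normalise(page.blocks[-1].text)] >= self.threshold:
                page.blocks = page.blocks[:-1]
        return document
```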

Adding a custom formatter (e.g. HTML)

from html import escape

from pdfcleanerx import Converter
from pdfcleanerx.formatter.base import BaseFormatter
from pdfcleanerx.models import Document

class HtmlFormatter(BaseFormatter):
    def format(self, document: Document) -> str:
        parts = ["<html><body>"]
        for page in document.pages:
            for block in page.blocks:
                # Escape &, <, > and quotes so extracted text cannot break the markup
                text = escape(block.text)
                level = getattr(block, "_heading_level", 0)
                if level:
                    parts.append(f"<h{level}>{text}</h{level}>")
                else:
                    parts.append(f"<p>{text}</p>")
        parts.append("</body></html>")
        return "\n".join(parts)

converter = Converter(formatter=HtmlFormatter())

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run with coverage
pytest --cov=pdfcleanerx --cov-report=html

# Lint
ruff check src/ tests/

# Type check
mypy src/

Build & Publish to PyPI

Prerequisites

pip install build twine

Build

python -m build
# Creates dist/pdfcleanerx-0.1.0.tar.gz and dist/pdfcleanerx-0.1.0-py3-none-any.whl

Test on TestPyPI first (recommended)

twine upload --repository testpypi dist/*
pip install --index-url https://test.pypi.org/simple/ pdfcleanerx

Publish to PyPI

twine upload dist/*

Twine prompts for your PyPI API token. To skip the prompt, store the token in ~/.pypirc or export it as environment variables:

export TWINE_USERNAME=__token__
export TWINE_PASSWORD=pypi-<your-token>

Bump version

Edit version in pyproject.toml and src/pdfcleanerx/__init__.py, then rebuild.


License

MIT

Download files

Source Distribution

pdfcleaner-0.1.1.tar.gz (26.9 kB)

Uploaded Source

Built Distribution

pdfcleaner-0.1.1-py3-none-any.whl (23.2 kB)

Uploaded Python 3

File details

Details for the file pdfcleaner-0.1.1.tar.gz.

File metadata

  • Download URL: pdfcleaner-0.1.1.tar.gz
  • Upload date:
  • Size: 26.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.5

File hashes

Hashes for pdfcleaner-0.1.1.tar.gz
Algorithm Hash digest
SHA256 887a034d633faaf9b4a6af33880a215c6f4e3f96e7a40af8c1a1a641850ff14f
MD5 be0cd6073360f7ac2d1ec35934ee7a71
BLAKE2b-256 db59aba37db82e8cb33f53fdb621319c67ca22f4b4e02590f7a0c96a66e77faa
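
A downloaded file can be checked against the SHA256 digest listed above with the standard library. This is a generic sketch using `hashlib`; the function names are made up for this example.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's digest with a published hex digest (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For the source distribution, `expected_hex` would be the SHA256 value from the table above and `path` the location of the downloaded `pdfcleaner-0.1.1.tar.gz`.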

File details

Details for the file pdfcleaner-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: pdfcleaner-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 23.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.5

File hashes

Hashes for pdfcleaner-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 cb6e143aea163df36c7d3aabe05ae330227b38a8421b3cc2ba4cad7282259a9b
MD5 a5b3f917cd4ba0774c7111382d1ad125
BLAKE2b-256 58cfcc935e70bf1e1663c6ef6d0507715769dde997a385ad56a5a339ae7d1c4b
