pdfcleanerx

Convert messy PDF-extracted text into clean Markdown files.

  • Fully offline – no AI APIs, no cloud services
  • Modular pipeline – swap or extend any layer
  • CLI + Python SDK
  • Production-ready – structured logging, custom exceptions, type hints

Installation

pip install pdfcleaner

From source

git clone https://github.com/namanbhola1888/pdfcleaner
cd pdfcleaner
pip install -e ".[dev]"

Quick Start

CLI

# Single file → writes output/report.md
pdfcleaner convert report.pdf

# Named output
pdfcleaner convert report.pdf --output clean.md

# Print to stdout
pdfcleaner convert report.pdf --stdout

# Batch (glob)
pdfcleaner convert docs/*.pdf --output-dir ./markdown

# Verbose / debug
pdfcleaner convert report.pdf --verbose

Python SDK

from pdfcleanerx import Converter

# Default: full pipeline → Markdown string
converter = Converter()
markdown = converter.convert("report.pdf")
print(markdown)

# Write directly to file
written_path = converter.convert_to_file("report.pdf", "report.md")
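
For batch work from Python, the SDK calls above can be wrapped in a small loop. The sketch below is an illustration under stated assumptions, not part of the package: convert_many is a hypothetical helper, and it relies only on the object it receives having a convert(path) method that returns a Markdown string, as Converter.convert does in the example above.

```python
from pathlib import Path
from typing import Iterable, List


def convert_many(converter, pdf_paths: Iterable[str], output_dir: str) -> List[Path]:
    """Convert each PDF and write <stem>.md into output_dir.

    `converter` is any object with a convert(path) -> str method
    (hypothetical helper; not provided by the package itself).
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)  # create the target directory if missing

    written: List[Path] = []
    for pdf in pdf_paths:
        markdown = converter.convert(str(pdf))
        # report.pdf -> <output_dir>/report.md
        target = out / (Path(pdf).stem + ".md")
        target.write_text(markdown, encoding="utf-8")
        written.append(target)
    return written
```

With the real package installed, this would presumably be called as convert_many(Converter(), ["a.pdf", "b.pdf"], "markdown"). Passing the converter in as an argument keeps the helper easy to test with a stub.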

Dependency injection (custom pipeline)

from pdfcleanerx import Converter
from pdfcleanerx.cleaners import CleanerPipeline, WhitespaceCleaner, PageNumberCleaner

# Only run two cleaners
pipeline = CleanerPipeline([PageNumberCleaner(), WhitespaceCleaner()])
converter = Converter(pipeline=pipeline)
markdown = converter.convert("report.pdf")

Configuration

Copy .env.example to .env and edit:

cp .env.example .env

Key settings:

Variable                     Default   Description
LOG_LEVEL                    INFO      DEBUG / INFO / WARNING / ERROR
OUTPUT_DIR                   ./output  Default output directory
HEADING_FONT_SIZE_THRESHOLD  1.2       Font size multiplier for heading detection
CLEANER_PAGE_NUMBERS         true      Enable/disable page-number removal
CLEANER_LINE_WRAP            true      Enable/disable line-wrap merging
CLEANER_WHITESPACE           true      Enable/disable whitespace normalisation
CLEANER_HEADINGS             true      Enable/disable heading detection
FOOTER_REPEAT_THRESHOLD      3         Pages a footer must repeat on to be stripped
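
To make the types and defaults of these settings concrete, here is a minimal sketch of reading them from environment variables. It is not the package's own loader: Settings, load_settings and the accepted boolean spellings are assumptions made for illustration, and the sketch reads the process environment only (it does not parse a .env file).

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    # Defaults mirror the table above.
    log_level: str = "INFO"
    output_dir: str = "./output"
    heading_font_size_threshold: float = 1.2
    cleaner_page_numbers: bool = True
    cleaner_line_wrap: bool = True
    cleaner_whitespace: bool = True
    cleaner_headings: bool = True
    footer_repeat_threshold: int = 3


def _as_bool(value: str) -> bool:
    # Assumed spellings for "enabled"; anything else counts as disabled.
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from a mapping of environment variables."""
    d = Settings()

    def flag(name: str, default: bool) -> bool:
        return _as_bool(env[name]) if name in env else default

    return Settings(
        log_level=env.get("LOG_LEVEL", d.log_level).upper(),
        output_dir=env.get("OUTPUT_DIR", d.output_dir),
        heading_font_size_threshold=float(
            env.get("HEADING_FONT_SIZE_THRESHOLD", d.heading_font_size_threshold)
        ),
        cleaner_page_numbers=flag("CLEANER_PAGE_NUMBERS", d.cleaner_page_numbers),
        cleaner_line_wrap=flag("CLEANER_LINE_WRAP", d.cleaner_line_wrap),
        cleaner_whitespace=flag("CLEANER_WHITESPACE", d.cleaner_whitespace),
        cleaner_headings=flag("CLEANER_HEADINGS", d.cleaner_headings),
        footer_repeat_threshold=int(
            env.get("FOOTER_REPEAT_THRESHOLD", d.footer_repeat_threshold)
        ),
    )
```

Taking the mapping as a parameter means the defaults can be checked with load_settings({}) without touching the real environment.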

Architecture

CLI (Typer)
    │
    ▼
Converter                  ← SDK entry point (dependency injection)
    │
    ├── BaseExtractor       ← pdfplumber (swappable)
    │       └── Document (pages → blocks + font metadata)
    │
    ├── CleanerPipeline     ← ordered chain
    │       ├── PageNumberCleaner
    │       ├── WhitespaceCleaner
    │       ├── HeadingDetector
    │       └── LineWrapCleaner
    │
    └── BaseFormatter       ← MarkdownFormatter (swappable)
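
The "ordered chain" in the diagram can be illustrated with a self-contained sketch. Nothing below imports pdfcleanerx: Block, Page, Document, Pipeline, StripWhitespace and DropRepeatedFooters are simplified stand-ins written for this example, and the footer rule (drop a page's last block when the same text ends at least `threshold` pages) is only an assumed reading of FOOTER_REPEAT_THRESHOLD, not the library's actual algorithm.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Protocol


@dataclass
class Block:
    text: str


@dataclass
class Page:
    blocks: List[Block] = field(default_factory=list)


@dataclass
class Document:
    pages: List[Page] = field(default_factory=list)


class Cleaner(Protocol):
    def clean(self, document: Document) -> Document: ...


class StripWhitespace:
    """Collapse runs of whitespace inside every block."""

    def clean(self, document: Document) -> Document:
        for page in document.pages:
            for block in page.blocks:
                block.text = " ".join(block.text.split())
        return document


class DropRepeatedFooters:
    """Remove a page's last block if its text ends >= threshold pages."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold

    def clean(self, document: Document) -> Document:
        # Count how many pages end with each distinct text.
        endings = Counter(p.blocks[-1].text for p in document.pages if p.blocks)
        for page in document.pages:
            if page.blocks and endings[page.blocks[-1].text] >= self.threshold:
                page.blocks.pop()
        return document


class Pipeline:
    """Run cleaners in the order given; each receives the previous output."""

    def __init__(self, cleaners: List[Cleaner]) -> None:
        self.cleaners = list(cleaners)

    def run(self, document: Document) -> Document:
        for cleaner in self.cleaners:
            document = cleaner.clean(document)
        return document
```

Order matters in a chain like this: if three pages end with "ACME  Corp ", "ACME Corp" and " ACME Corp", the footer is only recognised as repeating when whitespace is normalised first.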

Adding a custom cleaner

from pdfcleanerx.cleaners.base import BaseCleaner
from pdfcleanerx.models import Document

class MyCustomCleaner(BaseCleaner):
    def clean(self, document: Document) -> Document:
        for page in document.pages:
            for block in page.blocks:
                block.text = block.text.replace("©", "")
        return document

# Inject into Converter
from pdfcleanerx.cleaners import CleanerPipeline, PageNumberCleaner, WhitespaceCleaner
from pdfcleanerx import Converter

pipeline = CleanerPipeline([PageNumberCleaner(), WhitespaceCleaner(), MyCustomCleaner()])
converter = Converter(pipeline=pipeline)

Adding a custom formatter (e.g. HTML)

from html import escape

from pdfcleanerx import Converter
from pdfcleanerx.formatter.base import BaseFormatter
from pdfcleanerx.models import Document

class HtmlFormatter(BaseFormatter):
    def format(self, document: Document) -> str:
        parts = ["<html><body>"]
        for page in document.pages:
            for block in page.blocks:
                # Escape the text so characters such as < and & stay valid HTML
                text = escape(block.text)
                # Blocks without a _heading_level attribute are treated as paragraphs
                level = getattr(block, "_heading_level", 0)
                if level:
                    parts.append(f"<h{level}>{text}</h{level}>")
                else:
                    parts.append(f"<p>{text}</p>")
        parts.append("</body></html>")
        return "\n".join(parts)

converter = Converter(formatter=HtmlFormatter())

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run with coverage
pytest --cov=pdfcleanerx --cov-report=html

# Lint
ruff check src/ tests/

# Type check
mypy src/

Build & Publish to PyPI

Prerequisites

pip install build twine

Build

python -m build
# Creates dist/pdfcleaner-0.1.0.tar.gz and dist/pdfcleaner-0.1.0-py3-none-any.whl

Test on TestPyPI first (recommended)

twine upload --repository testpypi dist/*
pip install --index-url https://test.pypi.org/simple/ pdfcleaner

Publish to PyPI

twine upload dist/*

You will be prompted for your PyPI API token. Store it in ~/.pypirc or set:

export TWINE_USERNAME=__token__
export TWINE_PASSWORD=pypi-<your-token>

Bump version

Edit version in pyproject.toml and src/pdfcleanerx/__init__.py, then rebuild.


License

MIT
