repulp

Python 3.10+ | License: MIT

Parallel batch document conversion, watch mode, and structured extraction — powered by MarkItDown.

repulp wraps Microsoft's MarkItDown with a production workflow layer: parallel batch processing, incremental caching, file watching, table extraction, and a rich CLI.

Why repulp?

MarkItDown converts files one at a time. repulp adds the workflow features that real-world document pipelines need on top of it:

| Feature | MarkItDown | repulp |
| --- | --- | --- |
| Single file conversion | Yes | Yes |
| Parallel batch conversion | No | Yes (ProcessPoolExecutor) |
| Incremental cache (skip unchanged) | No | Yes (SHA256 hashing) |
| Watch mode (auto-convert on save) | No | Yes (watchfiles) |
| Extract tables as DataFrames/CSV | No | Yes |
| CLI with progress bars | No | Yes (Rich + Typer) |
| Config files (.repulp.toml) | No | Yes |
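To make the parallel batch row concrete, here is a minimal sketch of fanning conversions out over a process pool. It is not repulp's actual engine: run_batch, resolve_workers, and the convert_one callable are hypothetical names, and the "CPU count - 1" rule is taken from the config comment later in this README.

```python
from __future__ import annotations

import os
from concurrent.futures import Executor, ProcessPoolExecutor, as_completed
from typing import Callable, Iterable


def resolve_workers(workers: int) -> int:
    """Map 0 to 'auto' (CPU count - 1, at least 1); keep explicit values."""
    if workers > 0:
        return workers
    return max(1, (os.cpu_count() or 2) - 1)


def run_batch(
    paths: Iterable[str],
    convert_one: Callable[[str], str],
    workers: int = 0,
    executor_cls: type[Executor] = ProcessPoolExecutor,
) -> tuple[dict[str, str], dict[str, str]]:
    """Convert paths in parallel; return (results, errors) keyed by path."""
    results: dict[str, str] = {}
    errors: dict[str, str] = {}
    with executor_cls(max_workers=resolve_workers(workers)) as pool:
        # Map each future back to its path so failures can be attributed.
        futures = {pool.submit(convert_one, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # one bad file should not stop the batch
                errors[path] = str(exc)
    return results, errors
```

With a process pool, convert_one has to be a picklable top-level function, and the call should sit under an `if __name__ == "__main__":` guard on platforms that spawn workers. Passing a ThreadPoolExecutor as executor_cls is a convenient way to try the same flow in tests.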

Supported Formats

PDF, DOCX, DOC, PPTX, PPT, XLSX, XLS, CSV, HTML, TXT, MD, RST, JSON, XML, YAML, images (JPEG, PNG, GIF, BMP, TIFF, WEBP), audio (MP3, WAV, FLAC), and more via MarkItDown.

Note: Formats like HTML, CSV, PDF, DOCX, PPTX, and XLSX produce the richest Markdown output. Plain text formats (TXT, JSON, YAML, XML, RST) are passed through with minimal transformation by the underlying MarkItDown engine.

Quick Start

pip install repulp

Or with uv:

uv add repulp

Convert a directory with parallel workers:

repulp convert ./documents --workers 4 --output ./markdown

Watch a folder and auto-convert on changes:

repulp watch ./incoming --output ./converted

Extract tables from a PDF as CSV:

repulp extract tables report.pdf --format csv --output ./tables

CLI Reference

repulp convert

Convert files, directories, or URLs to Markdown.

# Single file
repulp convert report.pdf

# Directory with parallel workers
repulp convert ./docs --workers 4 --output ./markdown

# Recursive with filters
repulp convert ./docs -r --include "*.pdf,*.docx" --exclude "*.tmp"

# Incremental (skip unchanged files, enabled by default)
repulp convert ./docs  # first run converts everything
repulp convert ./docs  # second run skips unchanged files

# Force reconvert all
repulp convert ./docs --no-cache

# URL
repulp convert https://example.com/page

# Stdin
cat file.html | repulp convert -

# Output to stdout
repulp convert report.pdf --stdout

# With frontmatter metadata
repulp convert report.pdf --frontmatter

# Different output formats
repulp convert report.pdf --format text
repulp convert report.pdf --format json

| Option | Short | Description |
| --- | --- | --- |
| --output | -o | Output directory |
| --recursive | -r | Scan subdirectories |
| --workers | -w | Parallel workers (0 = auto) |
| --no-cache | | Disable incremental cache |
| --include | -I | Glob patterns to include |
| --exclude | -E | Glob patterns to exclude |
| --stdout | -s | Print to stdout |
| --frontmatter | -f | Add YAML frontmatter |
| --format | -F | Output format: md, text, json |
| --no-clean | | Skip markdown post-processing |
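As an illustration of how --include and --exclude might combine, the sketch below filters paths with the standard library's fnmatch. The matching rules are assumptions, not confirmed repulp behavior: patterns are comma-separated, they are tested against the file name only, and exclude wins over include. split_patterns and is_selected are hypothetical helper names.

```python
from __future__ import annotations

from fnmatch import fnmatch
from pathlib import PurePath


def split_patterns(raw: str) -> list[str]:
    """Split a comma-separated pattern string such as "*.pdf,*.docx"."""
    return [part.strip() for part in raw.split(",") if part.strip()]


def is_selected(path: str, include: list[str], exclude: list[str]) -> bool:
    """Keep a file if it matches any include pattern and no exclude pattern."""
    name = PurePath(path).name
    if any(fnmatch(name, pattern) for pattern in exclude):
        return False
    # An empty include list means "everything not excluded".
    return not include or any(fnmatch(name, pattern) for pattern in include)
```

For example, `is_selected("docs/a.pdf", split_patterns("*.pdf,*.docx"), ["*.tmp"])` keeps the file, while the same call with `docs/a.tmp` drops it.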

repulp watch

Watch a directory and auto-convert on file changes.

repulp watch ./incoming --output ./converted
repulp watch ./docs --include "*.pdf" --debounce 1000
repulp watch ./docs --on-change "echo converted"

| Option | Description |
| --- | --- |
| --output / -o | Output directory |
| --include / -I | Glob patterns to include |
| --exclude / -E | Glob patterns to exclude |
| --no-clean | Skip markdown cleanup |
| --debounce | Debounce interval in ms (default: 500) |
| --on-change | Shell command after each conversion |
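The idea behind a debounce interval can be sketched without any file watcher: events that arrive close together are folded into one batch, so a burst of saves triggers a single conversion pass. This is only an illustration of the concept. How repulp and watchfiles implement it internally is not described here, and debounce_events is a hypothetical helper.

```python
from __future__ import annotations


def debounce_events(
    events: list[tuple[int, str]], interval_ms: int = 500
) -> list[set[str]]:
    """Group (timestamp_ms, path) events into batches.

    A new batch starts when an event arrives at least interval_ms after
    the previous one; otherwise the event joins the current batch.
    """
    batches: list[set[str]] = []
    last_ts: int | None = None
    for ts, path in sorted(events):
        if last_ts is None or ts - last_ts >= interval_ms:
            batches.append(set())
        batches[-1].add(path)
        last_ts = ts
    return batches
```

Three quick saves followed by a later one produce two batches, and the repeated path appears only once per batch.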

repulp extract

Extract structured elements from documents.

# Tables as CSV
repulp extract tables report.pdf --format csv

# Tables as JSON
repulp extract tables report.pdf --format json

# Save tables to files
repulp extract tables report.pdf --format csv --output ./tables

# Links
repulp extract links page.html
repulp extract links page.html --format json

# Headings
repulp extract headings report.pdf

# Images
repulp extract images document.docx
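Since extraction works on the converted Markdown (see the extractor.py entry under Architecture), heading and link extraction can be approximated with two regular expressions. The sketch below is an assumption about the general approach, not repulp's code: extract_headings, extract_links, and the returned dict keys are hypothetical, and it ignores edge cases such as `#` lines inside fenced code.

```python
from __future__ import annotations

import re

# ATX headings: one to six '#' characters, a space or tab, then the text.
HEADING_RE = re.compile(r"^(#{1,6})[ \t]+(.+?)[ \t]*$", re.MULTILINE)
# Inline links, skipping images (which start with '!').
LINK_RE = re.compile(r'(?<!!)\[([^\]]+)\]\(([^)\s]+)(?:\s+"[^"]*")?\)')


def extract_headings(markdown: str) -> list[dict[str, object]]:
    """Return headings in document order with their nesting level."""
    return [
        {"level": len(match.group(1)), "text": match.group(2)}
        for match in HEADING_RE.finditer(markdown)
    ]


def extract_links(markdown: str) -> list[dict[str, str]]:
    """Return inline links as {text, url} dicts; image links are ignored."""
    return [
        {"text": match.group(1), "url": match.group(2)}
        for match in LINK_RE.finditer(markdown)
    ]
```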

Python API

import repulp

# Convert a single file
result = repulp.convert("report.pdf")
print(result.markdown)

# Convert with options
result = repulp.convert("report.pdf", frontmatter=True, format="json")

# Batch convert a directory
result = repulp.batch("./documents", workers=4, recursive=True)
print(f"{result.succeeded}/{result.total} converted in {result.elapsed:.1f}s")

# Incremental batch (skip unchanged)
result = repulp.batch("./documents", incremental=True)
print(f"{result.skipped} skipped, {result.succeeded} converted")

# Extract tables as list of dicts
tables = repulp.extract_tables("report.pdf")
for table in tables:
    for row in table:
        print(row)

# Extract tables as pandas DataFrames
tables = repulp.extract_tables("report.pdf", format="dataframe")
df = tables[0]

# Extract tables as CSV strings
tables = repulp.extract_tables("report.pdf", format="csv")

# Watch a directory
repulp.watch("./incoming", output_dir="./converted")
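The list-of-dicts shape returned by extract_tables in the example above can be pictured as the result of parsing Markdown pipe tables. The following is a simplified sketch of that idea, not the library's parser: parse_tables and split_row are hypothetical names, and escaped pipes inside cells are not handled.

```python
from __future__ import annotations

import re

# Matches a header separator row such as "| --- | ---: |".
SEPARATOR_RE = re.compile(r"^\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)*\|?$")


def split_row(line: str) -> list[str]:
    """Split one pipe-delimited row into trimmed cell strings."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def parse_tables(markdown: str) -> list[list[dict[str, str]]]:
    """Return each pipe table as a list of {header: cell} row dicts."""
    tables: list[list[dict[str, str]]] = []
    lines = markdown.splitlines()
    i = 0
    while i < len(lines) - 1:
        header, separator = lines[i], lines[i + 1].strip()
        if "|" in header and "|" in separator and SEPARATOR_RE.match(separator):
            columns = split_row(header)
            rows: list[dict[str, str]] = []
            i += 2
            # Body rows continue until the first line without a pipe.
            while i < len(lines) and "|" in lines[i]:
                rows.append(dict(zip(columns, split_row(lines[i]))))
                i += 1
            tables.append(rows)
        else:
            i += 1
    return tables
```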

DataFrame Support

Install with the tables extra for pandas DataFrame support:

pip install "repulp[tables]"

import repulp

tables = repulp.extract_tables("financials.xlsx", format="dataframe")
df = tables[0]
print(df.describe())
df.to_csv("output.csv", index=False)
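If the pandas extra is not installed, the default list-of-dicts output can still be written out with the standard library. This sketch assumes each table is a list of row dicts sharing the same keys, as the loop in the Python API section suggests. rows_to_csv is a hypothetical helper, not part of repulp.

```python
from __future__ import annotations

import csv
import io


def rows_to_csv(rows: list[dict[str, str]]) -> str:
    """Serialize one table (a list of row dicts) to a CSV string."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)  # cells containing commas are quoted automatically
    return buffer.getvalue()
```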

Configuration

Create .repulp.toml in your project root:

[repulp]
output_dir = "./markdown"
recursive = true
clean = true
workers = 0          # 0 = auto (CPU count - 1)
include = ["*.pdf", "*.docx", "*.pptx"]
exclude = ["*.tmp"]

Or use [tool.repulp] in pyproject.toml:

[tool.repulp]
output_dir = "./markdown"
recursive = true

CLI flags override config file values.
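The override rule can be sketched as a two-step merge: read the [repulp] (or [tool.repulp]) table, then overlay any flags the user actually passed. This is an assumed model of the behavior, not repulp's config.py. load_section and merge_settings are hypothetical names, and treating None as "flag not given" is a convention chosen for this example.

```python
from __future__ import annotations

try:
    import tomllib  # standard library on Python 3.11+
except ModuleNotFoundError:  # Python 3.10 needs the tomli package
    import tomli as tomllib


def load_section(toml_text: str, pyproject: bool = False) -> dict:
    """Return the [repulp] table, or [tool.repulp] for pyproject.toml text."""
    data = tomllib.loads(toml_text)
    if pyproject:
        data = data.get("tool", {})
    return dict(data.get("repulp", {}))


def merge_settings(config: dict, cli_flags: dict) -> dict:
    """Overlay CLI flags on config values; None means 'flag not given'."""
    merged = dict(config)
    merged.update({key: value for key, value in cli_flags.items() if value is not None})
    return merged
```

For instance, a config with `workers = 0` merged with `{"workers": 4, "output_dir": None}` yields `workers = 4` while keeping the configured output directory.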

Architecture

src/repulp/
├── __init__.py       # Public API: convert(), batch(), extract_tables(), watch()
├── cli.py            # Typer CLI with convert, watch, extract subcommands
├── converter.py      # MarkItDown wrapper for single-file conversion
├── engine.py         # Parallel batch engine (ProcessPoolExecutor)
├── cache.py          # Incremental build cache (SHA256 file hashing)
├── watcher.py        # File watcher (watchfiles) for auto-conversion
├── extractor.py      # Table, link, heading, image extraction from Markdown
├── cleaner.py        # Markdown post-processing and cleanup
├── config.py         # TOML config file loading (.repulp.toml / pyproject.toml)
├── fetcher.py        # URL fetching via httpx
├── frontmatter.py    # YAML frontmatter injection
└── formatter.py      # Output format handling (md, text, json)
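To illustrate the idea behind cache.py, the sketch below skips files whose SHA256 digest matches a stored manifest. It is a simplified model under stated assumptions: the JSON manifest format, file_digest, and changed_files are all hypothetical, and a production cache would more likely record a digest only after the conversion succeeds.

```python
from __future__ import annotations

import hashlib
import json
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA256 of a file's contents, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(paths: list[Path], manifest_path: Path) -> list[Path]:
    """Return files whose hash differs from the manifest, then update it."""
    manifest = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    changed: list[Path] = []
    for path in paths:
        digest = file_digest(path)
        if manifest.get(str(path)) != digest:
            changed.append(path)
            manifest[str(path)] = digest
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return changed
```

Hashing content rather than comparing modification times means a file that is touched but not edited is still skipped.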

Libraries Used

repulp is built on top of these libraries:

| Library | Purpose |
| --- | --- |
| MarkItDown | Core document-to-Markdown conversion engine by Microsoft. Handles PDF, DOCX, PPTX, XLSX, HTML, CSV, images, audio, and more. |
| Typer | CLI framework built on Click. Provides argument parsing, help generation, and shell completion. |
| Rich | Terminal formatting: progress bars, tables, panels, colored output. |
| watchfiles | Rust-backed file watcher. Used for the watch command to detect file changes with low latency. |
| httpx | HTTP client for URL fetching. Used when converting URLs to Markdown. |
| pandas (optional) | DataFrame support for structured table extraction. Install with pip install repulp[tables]. |
| tomli | TOML parser for .repulp.toml config files. Only needed on Python < 3.11 (3.11+ has tomllib in stdlib). |

Build & Dev Tools

| Tool | Purpose |
| --- | --- |
| hatchling | Build backend for packaging |
| uv | Fast Python package manager |
| pytest | Test framework (162 tests) |

Contributing

Contributions are welcome! Here's how to get started:

Setup

git clone https://github.com/5unnykum4r/repulp.git
cd repulp
uv sync --group dev

Running Tests

uv run pytest tests/ -v

Project Conventions

  • Python 3.10+ — uses from __future__ import annotations for modern type hints
  • No vague comments — code should be self-documenting; comments explain why, not what
  • Tests live in tests/ — mirror the source structure (e.g., test_engine.py tests engine.py)
  • Incremental commits — one logical change per commit

How to Contribute

  1. Fork the repository
  2. Create a feature branch (git checkout -b feat/my-feature)
  3. Write tests for your changes
  4. Make sure all tests pass (uv run pytest tests/ -v)
  5. Commit your changes with a descriptive message
  6. Push to your fork and open a Pull Request

Areas for Contribution

  • Adding support for new output formats
  • Performance improvements to the batch engine
  • Better error messages and diagnostics
  • Documentation improvements
  • New extraction types (e.g., code blocks, footnotes)

Samples

The samples/ directory contains example files (HTML, CSV) that demonstrate repulp's conversion capabilities:

# Convert all samples
repulp convert samples/ --output samples/converted --workers 4 --no-cache

# Extract tables from a sample
repulp extract tables samples/architecture.html --format json

License

MIT — Sunny Kumar
