Parallel batch document conversion, watch mode, and structured extraction — powered by MarkItDown.

repulp

repulp wraps Microsoft's MarkItDown with a production workflow layer: parallel batch processing, incremental caching, file watching, table extraction, and a rich CLI.

Why repulp?

MarkItDown converts files one at a time. repulp adds the pieces that real-world document pipelines need on top of it:

Feature	MarkItDown	repulp
Single file conversion	Yes	Yes
Parallel batch conversion	No	Yes (ProcessPoolExecutor)
Incremental cache (skip unchanged)	No	Yes (SHA256 hashing)
Watch mode (auto-convert on save)	No	Yes (watchfiles)
Extract tables as DataFrames/CSV	No	Yes
CLI with progress bars	No	Yes (Rich + Typer)
Config files (`.repulp.toml`)	No	Yes
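
The incremental cache row above relies on SHA256 hashing to skip unchanged files. As a rough illustration of that idea (not repulp's actual `cache.py`), the sketch below hashes each file and compares it with a stored manifest. The function names and the JSON manifest layout are assumptions made for this example.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large documents are never fully loaded."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(paths: list[Path], manifest_path: Path) -> list[Path]:
    """Return files whose hash differs from the manifest, then update it."""
    # Hypothetical manifest format: a JSON object mapping path -> last hash.
    manifest = {}
    if manifest_path.exists():
        manifest = json.loads(manifest_path.read_text())

    changed = []
    for path in paths:
        current = file_sha256(path)
        if manifest.get(str(path)) != current:
            changed.append(path)
            manifest[str(path)] = current

    manifest_path.write_text(json.dumps(manifest, indent=2))
    return changed
```

Hashing content rather than comparing modification times means a file that is re-saved without changes is still skipped, at the cost of reading every file once per run.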

Supported Formats

PDF, DOCX, DOC, PPTX, PPT, XLSX, XLS, CSV, HTML, TXT, MD, RST, JSON, XML, YAML, images (JPEG, PNG, GIF, BMP, TIFF, WEBP), audio (MP3, WAV, FLAC), and more via MarkItDown.

Note: Formats like HTML, CSV, PDF, DOCX, PPTX, and XLSX produce the richest Markdown output. Plain text formats (TXT, JSON, YAML, XML, RST) are passed through with minimal transformation by the underlying MarkItDown engine.

Quick Start

pip install repulp

Or with uv:

uv add repulp

Convert a directory with parallel workers:

repulp convert ./documents --workers 4 --output ./markdown

Watch a folder and auto-convert on changes:

repulp watch ./incoming --output ./converted

Extract tables from a PDF as CSV:

repulp extract tables report.pdf --format csv --output ./tables

CLI Reference

`repulp convert`

Convert files, directories, or URLs to Markdown.

# Single file
repulp convert report.pdf

# Directory with parallel workers
repulp convert ./docs --workers 4 --output ./markdown

# Recursive with filters
repulp convert ./docs -r --include "*.pdf,*.docx" --exclude "*.tmp"

# Incremental (skip unchanged files, enabled by default)
repulp convert ./docs
repulp convert ./docs  # second run skips unchanged files

# Force reconvert all
repulp convert ./docs --no-cache

# URL
repulp convert https://example.com/page

# Stdin
cat file.html | repulp convert -

# Output to stdout
repulp convert report.pdf --stdout

# With frontmatter metadata
repulp convert report.pdf --frontmatter

# Different output formats
repulp convert report.pdf --format text
repulp convert report.pdf --format json

Option	Short	Description
`--output`	`-o`	Output directory
`--recursive`	`-r`	Scan subdirectories
`--workers`	`-w`	Parallel workers (0 = auto)
`--no-cache`		Disable incremental cache
`--include`	`-I`	Glob patterns to include
`--exclude`	`-E`	Glob patterns to exclude
`--stdout`	`-s`	Print to stdout
`--frontmatter`	`-f`	Add YAML frontmatter
`--format`	`-F`	Output format: md, text, json
`--no-clean`		Skip markdown post-processing
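
To make the `--include` / `--exclude` semantics concrete, here is a minimal sketch of glob-based file selection using the standard library. It assumes patterns are matched against the file name only and that an empty include list means "everything"; repulp's real matching rules may differ, and both function names are hypothetical.

```python
from fnmatch import fnmatch
from pathlib import Path


def parse_patterns(value: str) -> list[str]:
    """Split a comma-separated flag value such as "*.pdf,*.docx"."""
    return [part.strip() for part in value.split(",") if part.strip()]


def select_files(
    root: Path,
    include: list[str],
    exclude: list[str],
    recursive: bool = False,
) -> list[Path]:
    """Return files matching any include pattern and no exclude pattern."""
    candidates = root.rglob("*") if recursive else root.glob("*")
    selected = []
    for path in candidates:
        if not path.is_file():
            continue
        name = path.name
        # Assumption: an empty include list accepts every file.
        if include and not any(fnmatch(name, p) for p in include):
            continue
        if any(fnmatch(name, p) for p in exclude):
            continue
        selected.append(path)
    return sorted(selected)
```

For example, `select_files(Path("./docs"), parse_patterns("*.pdf,*.docx"), ["*.tmp"], recursive=True)` mirrors the "Recursive with filters" command shown earlier.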

`repulp watch`

Watch a directory and auto-convert on file changes.

repulp watch ./incoming --output ./converted
repulp watch ./docs --include "*.pdf" --debounce 1000
repulp watch ./docs --on-change "echo converted"

Option	Description
`--output` / `-o`	Output directory
`--include` / `-I`	Glob patterns to include
`--exclude` / `-E`	Glob patterns to exclude
`--no-clean`	Skip markdown cleanup
`--debounce`	Debounce interval in ms (default: 500)
`--on-change`	Shell command after each conversion
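
The `--debounce` option exists because editors often emit several change events for a single save. The sketch below shows one way such debouncing could work: events for the same path that arrive within the interval are merged into one burst. It is an illustration only, operating on a pre-recorded list of `(timestamp_ms, path)` pairs; the function name is hypothetical and repulp may delegate this to its file-watching library instead.

```python
def debounce_events(
    events: list[tuple[float, str]],
    interval_ms: float = 500.0,
) -> list[tuple[float, str]]:
    """Collapse bursts of change events per path.

    `events` must be sorted by timestamp. The result holds one
    (last_timestamp, path) entry per burst, ordered by burst start.
    """
    last_seen: dict[str, float] = {}
    open_burst: dict[str, int] = {}  # path -> index of its burst in `bursts`
    bursts: list[tuple[float, str]] = []

    for timestamp, path in events:
        previous = last_seen.get(path)
        if previous is not None and timestamp - previous < interval_ms:
            # Still inside the quiet window: extend the existing burst.
            bursts[open_burst[path]] = (timestamp, path)
        else:
            # First event for this path, or the quiet window has elapsed.
            open_burst[path] = len(bursts)
            bursts.append((timestamp, path))
        last_seen[path] = timestamp

    return bursts
```

With the default 500 ms interval, three quick saves of the same file trigger a single conversion, while a save two seconds later starts a new burst.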

`repulp extract`

Extract structured elements from documents.

# Tables as CSV
repulp extract tables report.pdf --format csv

# Tables as JSON
repulp extract tables report.pdf --format json

# Save tables to files
repulp extract tables report.pdf --format csv --output ./tables

# Links
repulp extract links page.html
repulp extract links page.html --format json

# Headings
repulp extract headings report.pdf

# Images
repulp extract images document.docx
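
Since extraction works on the converted Markdown, table extraction amounts to parsing pipe tables. The following sketch shows the general technique with the standard library only. It is not repulp's `extractor.py`: the function names are hypothetical, and it ignores edge cases such as escaped pipes inside cells.

```python
import csv
import io


def _split_row(line: str) -> list[str]:
    """Split "| a | b |" into ["a", "b"]."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def _is_separator(line: str) -> bool:
    """Detect the header separator row, e.g. "|---|:--:|"."""
    cells = _split_row(line)
    return all(cell and "-" in cell and set(cell) <= set("-: ") for cell in cells)


def parse_markdown_tables(markdown: str) -> list[list[dict[str, str]]]:
    """Return one list of row dicts per pipe table found in the text."""
    lines = markdown.splitlines()
    tables = []
    i = 0
    while i < len(lines) - 1:
        starts_table = (
            "|" in lines[i]
            and "|" in lines[i + 1]
            and _is_separator(lines[i + 1])
        )
        if not starts_table:
            i += 1
            continue
        header = _split_row(lines[i])
        rows = []
        i += 2  # skip the header and separator rows
        while i < len(lines) and "|" in lines[i]:
            rows.append(dict(zip(header, _split_row(lines[i]))))
            i += 1
        tables.append(rows)
    return tables


def table_to_csv(rows: list[dict[str, str]]) -> str:
    """Serialize one parsed table as a CSV string."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Each table comes back as a list of dicts keyed by column header, which matches the shape the Python API section describes for `extract_tables`.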

Python API

import repulp

# Convert a single file
result = repulp.convert("report.pdf")
print(result.markdown)

# Convert with options
result = repulp.convert("report.pdf", frontmatter=True, format="json")

# Batch convert a directory
result = repulp.batch("./documents", workers=4, recursive=True)
print(f"{result.succeeded}/{result.total} converted in {result.elapsed:.1f}s")

# Incremental batch (skip unchanged)
result = repulp.batch("./documents", incremental=True)
print(f"{result.skipped} skipped, {result.succeeded} converted")

# Extract tables as list of dicts
tables = repulp.extract_tables("report.pdf")
for table in tables:
    for row in table:
        print(row)

# Extract tables as pandas DataFrames
tables = repulp.extract_tables("report.pdf", format="dataframe")
df = tables[0]

# Extract tables as CSV strings
tables = repulp.extract_tables("report.pdf", format="csv")

# Watch a directory
repulp.watch("./incoming", output_dir="./converted")

DataFrame Support

Install with the tables extra for pandas DataFrame support:

pip install "repulp[tables]"

import repulp

tables = repulp.extract_tables("financials.xlsx", format="dataframe")
df = tables[0]
print(df.describe())
df.to_csv("output.csv", index=False)

Configuration

Create `.repulp.toml` in your project root:

[repulp]
output_dir = "./markdown"
recursive = true
clean = true
workers = 0          # 0 = auto (CPU count - 1)
include = ["*.pdf", "*.docx", "*.pptx"]
exclude = ["*.tmp"]

Or use `[tool.repulp]` in `pyproject.toml`:

[tool.repulp]
output_dir = "./markdown"
recursive = true

CLI flags override config file values.
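
The precedence rule can be sketched as a layered merge. The example below is an illustration, not repulp's `config.py`: the default values, the function name, and the choice to let `.repulp.toml` win over `pyproject.toml` are all assumptions. Only "CLI flags override config file values" comes from the text above.

```python
from typing import Any, Optional

try:
    import tomllib  # Python 3.11+
except ModuleNotFoundError:  # Python 3.10 needs the tomli backport
    import tomli as tomllib

# Hypothetical defaults, used only for this illustration.
DEFAULTS: dict[str, Any] = {"recursive": False, "clean": True, "workers": 0}


def merge_settings(
    repulp_toml: str = "",
    pyproject_toml: str = "",
    cli_flags: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Merge defaults, config file contents, and CLI flags (lowest to highest)."""
    settings = dict(DEFAULTS)

    # pyproject.toml keeps its settings under [tool.repulp].
    pyproject = tomllib.loads(pyproject_toml)
    settings.update(pyproject.get("tool", {}).get("repulp", {}))

    # .repulp.toml uses a top-level [repulp] table.
    # Assumption: it takes priority over pyproject.toml.
    settings.update(tomllib.loads(repulp_toml).get("repulp", {}))

    # Flags left as None were not given on the command line.
    flags = cli_flags or {}
    settings.update({key: value for key, value in flags.items() if value is not None})
    return settings
```

Treating `None` as "flag not passed" lets a CLI layer forward every option unconditionally without clobbering values from the config file.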

Architecture

src/repulp/
├── __init__.py       # Public API: convert(), batch(), extract_tables(), watch()
├── cli.py            # Typer CLI with convert, watch, extract subcommands
├── converter.py      # MarkItDown wrapper for single-file conversion
├── engine.py         # Parallel batch engine (ProcessPoolExecutor)
├── cache.py          # Incremental build cache (SHA256 file hashing)
├── watcher.py        # File watcher (watchfiles) for auto-conversion
├── extractor.py      # Table, link, heading, image extraction from Markdown
├── cleaner.py        # Markdown post-processing and cleanup
├── config.py         # TOML config file loading (.repulp.toml / pyproject.toml)
├── fetcher.py        # URL fetching via httpx
├── frontmatter.py    # YAML frontmatter injection
└── formatter.py      # Output format handling (md, text, json)
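
To show how a batch engine built on `ProcessPoolExecutor` might fit together, here is a minimal sketch. It is not the code in `engine.py`: `BatchSummary`, `run_batch`, and `resolve_workers` are hypothetical names, and the "0 = auto" rule simply follows the comment in the configuration example (CPU count minus one).

```python
import os
import time
from concurrent.futures import Executor, ProcessPoolExecutor, as_completed
from dataclasses import dataclass, field
from typing import Callable


def resolve_workers(requested: int) -> int:
    """Map 0 to "auto": CPU count minus one, but never fewer than one."""
    if requested > 0:
        return requested
    return max(1, (os.cpu_count() or 2) - 1)


@dataclass
class BatchSummary:
    total: int = 0
    succeeded: int = 0
    failed: dict[str, str] = field(default_factory=dict)  # path -> error text
    elapsed: float = 0.0


def run_batch(
    paths: list[str],
    convert_one: Callable[[str], object],
    workers: int = 0,
    executor_cls: type[Executor] = ProcessPoolExecutor,
) -> BatchSummary:
    """Convert every path in parallel and collect per-file outcomes."""
    summary = BatchSummary(total=len(paths))
    started = time.perf_counter()
    with executor_cls(max_workers=resolve_workers(workers)) as pool:
        futures = {pool.submit(convert_one, path): path for path in paths}
        for future in as_completed(futures):
            try:
                future.result()
                summary.succeeded += 1
            except Exception as exc:
                # One unreadable file should not abort the whole batch.
                summary.failed[futures[future]] = str(exc)
    summary.elapsed = time.perf_counter() - started
    return summary
```

With the default process pool, `convert_one` has to be a picklable top-level function, and the call to `run_batch` should sit under an `if __name__ == "__main__":` guard on platforms that spawn worker processes. Passing `ThreadPoolExecutor` as `executor_cls` is a convenient way to exercise the same logic in tests.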

Libraries Used

repulp is built on top of these libraries:

Library	Purpose
MarkItDown	Core document-to-Markdown conversion engine by Microsoft. Handles PDF, DOCX, PPTX, XLSX, HTML, CSV, images, audio, and more.
Typer	CLI framework built on Click. Provides argument parsing, help generation, and shell completion.
Rich	Terminal formatting — progress bars, tables, panels, colored output.
watchfiles	Rust-backed file watcher. Used for the `watch` command to detect file changes with low latency.
httpx	HTTP client for URL fetching. Used when converting URLs to Markdown.
pandas	(optional) DataFrame support for structured table extraction. Install with `pip install repulp[tables]`.
tomli	TOML parser for `.repulp.toml` config files. Only needed on Python < 3.11 (3.11+ has `tomllib` in stdlib).

Build & Dev Tools

Tool	Purpose
hatchling	Build backend for packaging
uv	Fast Python package manager
pytest	Test framework (162 tests)

Contributing

Contributions are welcome! Here's how to get started:

Setup

git clone https://github.com/5unnykum4r/repulp.git
cd repulp
uv sync --group dev

Running Tests

uv run pytest tests/ -v

Project Conventions

- Python 3.10+: uses `from __future__ import annotations` for modern type hints
- No vague comments: code should be self-documenting; comments explain why, not what
- Tests live in `tests/`: mirror the source structure (e.g., `test_engine.py` tests `engine.py`)
- Incremental commits: one logical change per commit

How to Contribute

1. Fork the repository
2. Create a feature branch (`git checkout -b feat/my-feature`)
3. Write tests for your changes
4. Make sure all tests pass (`uv run pytest tests/ -v`)
5. Commit your changes with a descriptive message
6. Push to your fork and open a Pull Request

Areas for Contribution

- Adding support for new output formats
- Performance improvements to the batch engine
- Better error messages and diagnostics
- Documentation improvements
- New extraction types (e.g., code blocks, footnotes)

Samples

The `samples/` directory contains example files (HTML, CSV) that demonstrate repulp's conversion capabilities:

# Convert all samples
repulp convert samples/ --output samples/converted --workers 4 --no-cache

# Extract tables from a sample
repulp extract tables samples/architecture.html --format json

License

MIT — Sunny Kumar

This version

0.1.0

Mar 4, 2026

Download files

Source Distribution

repulp-0.1.0.tar.gz (167.5 kB)

Uploaded Mar 4, 2026 Source

Built Distribution

repulp-0.1.0-py3-none-any.whl (25.6 kB)

Uploaded Mar 4, 2026 Python 3

File details

Details for the file repulp-0.1.0.tar.gz.

File metadata

Download URL: repulp-0.1.0.tar.gz
Upload date: Mar 4, 2026
Size: 167.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for repulp-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`68fb5108e2c8d7676711b70ec6afe0810a8d9c04ff1c5dc656b978a2e6797b2f`
MD5	`c666ed07cf62453c1407789d73afc3dd`
BLAKE2b-256	`510ab565cd30ad4828e7582bd354ffd608b4433881b9613d4a54a1b1d2a1405b`

Provenance

The following attestation bundles were made for repulp-0.1.0.tar.gz:

Publisher: publish.yml on 5unnykum4r/repulp

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: repulp-0.1.0.tar.gz
- Subject digest: 68fb5108e2c8d7676711b70ec6afe0810a8d9c04ff1c5dc656b978a2e6797b2f
- Sigstore transparency entry: 1025584356
- Sigstore integration time: Mar 4, 2026
Source repository:
- Permalink: 5unnykum4r/repulp@2f154e6f2e4c4c2cc36d7a9751b15108ad30d707
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/5unnykum4r
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@2f154e6f2e4c4c2cc36d7a9751b15108ad30d707
- Trigger Event: release

File details

Details for the file repulp-0.1.0-py3-none-any.whl.

File metadata

Download URL: repulp-0.1.0-py3-none-any.whl
Upload date: Mar 4, 2026
Size: 25.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for repulp-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`07eefb9425e430cab41a4c05a544133a926c541c9941f6254a02240ce12dfa7f`
MD5	`8df90ab950e4a67b22b933f964d6e70c`
BLAKE2b-256	`66f2e0f35053d491eed1b1b1695e8e817b72ad8dede86789a80d7e1d1537a5fe`

Provenance

The following attestation bundles were made for repulp-0.1.0-py3-none-any.whl:

Publisher: publish.yml on 5unnykum4r/repulp

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: repulp-0.1.0-py3-none-any.whl
- Subject digest: 07eefb9425e430cab41a4c05a544133a926c541c9941f6254a02240ce12dfa7f
- Sigstore transparency entry: 1025584426
- Sigstore integration time: Mar 4, 2026
Source repository:
- Permalink: 5unnykum4r/repulp@2f154e6f2e4c4c2cc36d7a9751b15108ad30d707
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/5unnykum4r
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@2f154e6f2e4c4c2cc36d7a9751b15108ad30d707
- Trigger Event: release
