Parallel batch document conversion, watch mode, and structured extraction — powered by MarkItDown.
repulp
repulp wraps Microsoft's MarkItDown with a production workflow layer: parallel batch processing, incremental caching, file watching, table extraction, and a rich CLI.
Why repulp?
MarkItDown converts files one at a time. repulp adds the workflow features that real-world document pipelines typically need:
| Feature | MarkItDown | repulp |
|---|---|---|
| Single file conversion | Yes | Yes |
| Parallel batch conversion | No | Yes (ProcessPoolExecutor) |
| Incremental cache (skip unchanged) | No | Yes (SHA256 hashing) |
| Watch mode (auto-convert on save) | No | Yes (watchfiles) |
| Extract tables as DataFrames/CSV | No | Yes |
| CLI with progress bars | No | Yes (Rich + Typer) |
| Config files (`.repulp.toml`) | No | Yes |
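The incremental cache row can be illustrated with a short sketch. This is not repulp's actual `cache.py`; it only shows the general idea of skipping files whose SHA256 digest has not changed. The function names `file_digest` and `changed_files` and the JSON manifest layout are assumptions made for this example.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(paths: list[Path], manifest_path: Path) -> list[Path]:
    """Return files whose digest differs from the manifest, then update the manifest.

    The manifest is a hypothetical JSON file mapping path strings to digests.
    """
    manifest = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    changed = []
    for path in paths:
        digest = file_digest(path)
        if manifest.get(str(path)) != digest:
            changed.append(path)
            manifest[str(path)] = digest
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return changed
```

Hashing content rather than comparing modification times means a file that is re-saved without changes is still skipped, at the cost of reading every file once per run.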
Supported Formats
PDF, DOCX, DOC, PPTX, PPT, XLSX, XLS, CSV, HTML, TXT, MD, RST, JSON, XML, YAML, images (JPEG, PNG, GIF, BMP, TIFF, WEBP), audio (MP3, WAV, FLAC), and more via MarkItDown.
Note: Formats like HTML, CSV, PDF, DOCX, PPTX, and XLSX produce the richest Markdown output. Plain text formats (TXT, JSON, YAML, XML, RST) are passed through with minimal transformation by the underlying MarkItDown engine.
Quick Start
pip install repulp
Or with uv:
uv add repulp
Convert a directory with parallel workers:
repulp convert ./documents --workers 4 --output ./markdown
Watch a folder and auto-convert on changes:
repulp watch ./incoming --output ./converted
Extract tables from a PDF as CSV:
repulp extract tables report.pdf --format csv --output ./tables
CLI Reference
repulp convert
Convert files, directories, or URLs to Markdown.
# Single file
repulp convert report.pdf
# Directory with parallel workers
repulp convert ./docs --workers 4 --output ./markdown
# Recursive with filters
repulp convert ./docs -r --include "*.pdf,*.docx" --exclude "*.tmp"
# Incremental (skip unchanged files, enabled by default)
repulp convert ./docs
repulp convert ./docs # second run skips unchanged files
# Force reconvert all
repulp convert ./docs --no-cache
# URL
repulp convert https://example.com/page
# Stdin
cat file.html | repulp convert -
# Output to stdout
repulp convert report.pdf --stdout
# With frontmatter metadata
repulp convert report.pdf --frontmatter
# Different output formats
repulp convert report.pdf --format text
repulp convert report.pdf --format json
| Option | Short | Description |
|---|---|---|
| `--output` | `-o` | Output directory |
| `--recursive` | `-r` | Scan subdirectories |
| `--workers` | `-w` | Parallel workers (0 = auto) |
| `--no-cache` | | Disable incremental cache |
| `--include` | `-I` | Glob patterns to include |
| `--exclude` | `-E` | Glob patterns to exclude |
| `--stdout` | `-s` | Print to stdout |
| `--frontmatter` | `-f` | Add YAML frontmatter |
| `--format` | `-F` | Output format: md, text, json |
| `--no-clean` | | Skip markdown post-processing |
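To make the `--include` and `--exclude` options more concrete, here is a minimal sketch of comma-separated glob filtering using the standard library. It assumes patterns are matched against the file name only and that an empty include list means "accept everything"; repulp's real matching rules may differ. The helper names `parse_patterns` and `is_selected` are hypothetical.

```python
from fnmatch import fnmatch
from pathlib import Path


def parse_patterns(value: str) -> list[str]:
    """Split a comma-separated option value such as "*.pdf,*.docx" into patterns."""
    return [part.strip() for part in value.split(",") if part.strip()]


def is_selected(path: Path, include: list[str], exclude: list[str]) -> bool:
    """Keep a file if it matches an include pattern (or none are given) and no exclude pattern."""
    name = path.name
    if include and not any(fnmatch(name, pattern) for pattern in include):
        return False
    return not any(fnmatch(name, pattern) for pattern in exclude)
```

In this sketch exclusion is checked after inclusion, so a file that matches both lists is dropped.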
repulp watch
Watch a directory and auto-convert on file changes.
repulp watch ./incoming --output ./converted
repulp watch ./docs --include "*.pdf" --debounce 1000
repulp watch ./docs --on-change "echo converted"
| Option | Description |
|---|---|
| `--output` / `-o` | Output directory |
| `--include` / `-I` | Glob patterns to include |
| `--exclude` / `-E` | Glob patterns to exclude |
| `--no-clean` | Skip markdown cleanup |
| `--debounce` | Debounce interval in ms (default: 500) |
| `--on-change` | Shell command after each conversion |
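The idea behind a debounce interval can be sketched without a real file watcher. The function below groups timestamped change events into batches separated by quiet gaps, so a burst of saves leads to one conversion pass instead of many. It illustrates the concept only: `debounce_batches` is a hypothetical name, and the exact semantics of repulp's `--debounce` (which relies on watchfiles) may differ.

```python
def debounce_batches(events: list[tuple[int, str]], debounce_ms: int = 500) -> list[set[str]]:
    """Group (timestamp_ms, path) events into batches separated by quiet gaps.

    A new batch starts when an event arrives more than `debounce_ms`
    after the previous event.
    """
    batches: list[set[str]] = []
    last_time = None
    for timestamp, path in sorted(events):
        if last_time is None or timestamp - last_time > debounce_ms:
            batches.append(set())
        batches[-1].add(path)  # repeated saves of one file collapse into one entry
        last_time = timestamp
    return batches
```

With the default of 500 ms, three saves within 300 ms form a single batch, while a save two seconds later starts a new one.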
repulp extract
Extract structured elements from documents.
# Tables as CSV
repulp extract tables report.pdf --format csv
# Tables as JSON
repulp extract tables report.pdf --format json
# Save tables to files
repulp extract tables report.pdf --format csv --output ./tables
# Links
repulp extract links page.html
repulp extract links page.html --format json
# Headings
repulp extract headings report.pdf
# Images
repulp extract images document.docx
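Since extraction operates on the converted Markdown, a rough sketch of the approach is to scan for pipe tables and inline links with regular expressions. The code below is an illustration under that assumption, not repulp's `extractor.py`; `parse_markdown_tables`, `extract_links`, and `_split_row` are hypothetical names, and it handles only simple pipe tables without escaped `|` characters.

```python
import re

# Divider row such as |---|---| or | :--- | ---: |
_DIVIDER = re.compile(r"\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)*\|?")


def _split_row(line: str) -> list[str]:
    """Split one pipe-table row into stripped cell strings."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def parse_markdown_tables(markdown: str) -> list[list[dict[str, str]]]:
    """Return each pipe table as a list of row dicts keyed by header text."""
    tables = []
    lines = markdown.splitlines()
    i = 0
    while i < len(lines) - 1:
        header, divider = lines[i].strip(), lines[i + 1].strip()
        if header.startswith("|") and _DIVIDER.fullmatch(divider):
            columns = _split_row(header)
            rows = []
            i += 2
            while i < len(lines) and lines[i].strip().startswith("|"):
                rows.append(dict(zip(columns, _split_row(lines[i]))))
                i += 1
            tables.append(rows)
        else:
            i += 1
    return tables


def extract_links(markdown: str) -> list[tuple[str, str]]:
    """Return (text, url) pairs for inline links, skipping images (![alt](src))."""
    return re.findall(r"(?<!!)\[([^\]]+)\]\(([^)\s]+)\)", markdown)
```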
Python API
import repulp
# Convert a single file
result = repulp.convert("report.pdf")
print(result.markdown)
# Convert with options
result = repulp.convert("report.pdf", frontmatter=True, format="json")
# Batch convert a directory
result = repulp.batch("./documents", workers=4, recursive=True)
print(f"{result.succeeded}/{result.total} converted in {result.elapsed:.1f}s")
# Incremental batch (skip unchanged)
result = repulp.batch("./documents", incremental=True)
print(f"{result.skipped} skipped, {result.succeeded} converted")
# Extract tables as list of dicts
tables = repulp.extract_tables("report.pdf")
for table in tables:
    for row in table:
        print(row)
# Extract tables as pandas DataFrames
tables = repulp.extract_tables("report.pdf", format="dataframe")
df = tables[0]
# Extract tables as CSV strings
tables = repulp.extract_tables("report.pdf", format="csv")
# Watch a directory
repulp.watch("./incoming", output_dir="./converted")
DataFrame Support
Install with the tables extra for pandas DataFrame support:
pip install repulp[tables]
import repulp
tables = repulp.extract_tables("financials.xlsx", format="dataframe")
df = tables[0]
print(df.describe())
df.to_csv("output.csv", index=False)
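Without the pandas extra, rows in the list-of-dicts shape can still be written out with the standard library. The sketch below assumes every row shares the keys of the first row; `rows_to_csv` is a hypothetical helper for illustration and not part of repulp's API.

```python
import csv
import io


def rows_to_csv(rows: list[dict[str, str]]) -> str:
    """Serialize row dicts to a CSV string, using the first row's keys as the header."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

The `csv` module quotes cells that contain commas, so values such as `"pear, ripe"` survive a round trip.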
Configuration
Create .repulp.toml in your project root:
[repulp]
output_dir = "./markdown"
recursive = true
clean = true
workers = 0 # 0 = auto (CPU count - 1)
include = ["*.pdf", "*.docx", "*.pptx"]
exclude = ["*.tmp"]
Or use [tool.repulp] in pyproject.toml:
[tool.repulp]
output_dir = "./markdown"
recursive = true
CLI flags override config file values.
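The precedence rule can be pictured as a layered dictionary merge: defaults first, then the config file, then any flags passed on the command line. The sketch below is an assumption about how such a merge might look, not repulp's `config.py`. The `DEFAULTS` values and the `load_settings` name are hypothetical, and unset CLI flags are assumed to arrive as `None`.

```python
try:
    import tomllib  # standard library on Python 3.11+
except ModuleNotFoundError:  # Python 3.10 needs the tomli backport
    import tomli as tomllib

# Hypothetical defaults, chosen for illustration only.
DEFAULTS = {"output_dir": None, "recursive": False, "clean": True, "workers": 0}


def load_settings(toml_text: str, cli_flags: dict) -> dict:
    """Merge defaults, the [repulp] table, and CLI flags; later sources win."""
    file_values = tomllib.loads(toml_text).get("repulp", {})
    # Only flags the user actually passed (not None) override the file.
    passed = {key: value for key, value in cli_flags.items() if value is not None}
    return {**DEFAULTS, **file_values, **passed}
```

For example, with `workers = 0` in the file and `--workers 4` on the command line, the merged settings carry `workers == 4`, while `recursive` keeps the file's value when the flag is not passed.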
Architecture
src/repulp/
├── __init__.py # Public API: convert(), batch(), extract_tables(), watch()
├── cli.py # Typer CLI with convert, watch, extract subcommands
├── converter.py # MarkItDown wrapper for single-file conversion
├── engine.py # Parallel batch engine (ProcessPoolExecutor)
├── cache.py # Incremental build cache (SHA256 file hashing)
├── watcher.py # File watcher (watchfiles) for auto-conversion
├── extractor.py # Table, link, heading, image extraction from Markdown
├── cleaner.py # Markdown post-processing and cleanup
├── config.py # TOML config file loading (.repulp.toml / pyproject.toml)
├── fetcher.py # URL fetching via httpx
├── frontmatter.py # YAML frontmatter injection
└── formatter.py # Output format handling (md, text, json)
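As a rough picture of what a parallel batch engine built on `ProcessPoolExecutor` can look like, the sketch below fans a per-file function out over a pool and collects results and errors separately. It is not the code in `engine.py`: `run_batch`, `resolve_workers`, and the `executor_cls` parameter are hypothetical, and clamping the automatic worker count to at least one is an assumption added here.

```python
import os
from concurrent.futures import Executor, ProcessPoolExecutor, as_completed
from typing import Callable


def resolve_workers(workers: int) -> int:
    """Treat 0 as "auto": CPU count minus one, but never fewer than one worker."""
    if workers > 0:
        return workers
    return max(1, (os.cpu_count() or 2) - 1)


def run_batch(
    paths: list[str],
    convert_one: Callable[[str], str],
    workers: int = 0,
    executor_cls: Callable[..., Executor] = ProcessPoolExecutor,
) -> tuple[dict[str, str], dict[str, str]]:
    """Run convert_one over paths in a pool; return (results, errors) keyed by path."""
    results: dict[str, str] = {}
    errors: dict[str, str] = {}
    with executor_cls(max_workers=resolve_workers(workers)) as pool:
        futures = {pool.submit(convert_one, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # one bad file should not abort the batch
                errors[path] = str(exc)
    return results, errors
```

With a process pool, `convert_one` has to be a picklable module-level function, and on platforms that spawn workers the call should sit under `if __name__ == "__main__":`. Passing `ThreadPoolExecutor` as `executor_cls` is a convenient way to exercise the same logic in tests.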
Libraries Used
repulp is built on top of these libraries:
| Library | Purpose |
|---|---|
| MarkItDown | Core document-to-Markdown conversion engine by Microsoft. Handles PDF, DOCX, PPTX, XLSX, HTML, CSV, images, audio, and more. |
| Typer | CLI framework built on Click. Provides argument parsing, help generation, and shell completion. |
| Rich | Terminal formatting — progress bars, tables, panels, colored output. |
| watchfiles | Rust-backed file watcher. Used for the watch command to detect file changes with low latency. |
| httpx | HTTP client for URL fetching. Used when converting URLs to Markdown. |
| pandas | (optional) DataFrame support for structured table extraction. Install with pip install repulp[tables]. |
| tomli | TOML parser for .repulp.toml config files. Only needed on Python < 3.11 (3.11+ has tomllib in stdlib). |
Build & Dev Tools
| Tool | Purpose |
|---|---|
| hatchling | Build backend for packaging |
| uv | Fast Python package manager |
| pytest | Test framework (162 tests) |
Contributing
Contributions are welcome! Here's how to get started:
Setup
git clone https://github.com/5unnykum4r/repulp.git
cd repulp
uv sync --group dev
Running Tests
uv run pytest tests/ -v
Project Conventions
- Python 3.10+: uses `from __future__ import annotations` for modern type hints
- No vague comments: code should be self-documenting; comments explain why, not what
- Tests live in `tests/`: mirror the source structure (e.g., `test_engine.py` tests `engine.py`)
- Incremental commits: one logical change per commit
How to Contribute
- Fork the repository
- Create a feature branch (`git checkout -b feat/my-feature`)
- Write tests for your changes
- Make sure all tests pass (`uv run pytest tests/ -v`)
- Commit your changes with a descriptive message
- Push to your fork and open a Pull Request
Areas for Contribution
- Adding support for new output formats
- Performance improvements to the batch engine
- Better error messages and diagnostics
- Documentation improvements
- New extraction types (e.g., code blocks, footnotes)
Samples
The samples/ directory contains example files (HTML, CSV) that demonstrate repulp's conversion capabilities:
# Convert all samples
repulp convert samples/ --output samples/converted --workers 4 --no-cache
# Extract tables from a sample
repulp extract tables samples/architecture.html --format json
License
MIT — Sunny Kumar