Convert PDF documents to Markdown and structured JSON for RAG and LLM pipelines
PDF2MJ
Convert PDF documents to Markdown and structured JSON for RAG pipelines, LLM preprocessing, and knowledge bases.
Installation (For Users)
PyPI: pdf2mj is not published on PyPI yet. Until it is, install from source (see Development Setup).
When available on PyPI:
pip install pdf2mj
With OCR support:
pip install "pdf2mj[ocr]"
OCR Requirements
OCR is optional and requires:
- Tesseract OCR installed on your system
- OCR extras installed via:
pip install "pdf2mj[ocr]"
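Before enabling OCR, it can help to confirm that Tesseract is reachable. The following is a minimal sketch, not part of pdf2mj: it assumes the Tesseract binary is named `tesseract` and is on `PATH`, and the helper name `tesseract_available` is hypothetical.

```python
import shutil


def tesseract_available() -> bool:
    """Return True if a `tesseract` executable is found on PATH.

    Assumption: the binary is called `tesseract`; pdf2mj itself may
    locate Tesseract differently.
    """
    return shutil.which("tesseract") is not None


if __name__ == "__main__":
    if tesseract_available():
        print("Tesseract found on PATH")
    else:
        print("Tesseract not found on PATH; install it before using --ocr")
```

The `pdf2mj doctor` command is described as performing its own OCR check, so this sketch is only useful when pdf2mj is not installed yet.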
First Run
On the first pdf2mj invocation (no arguments), a Rich-powered welcome screen is shown once. State is stored in:
- Linux/macOS: ~/.config/pdf2mj/config.json
- Windows: %APPDATA%\pdf2mj\config.json
pdf2mj welcome # show the welcome screen again
pdf2mj doctor # verify dependencies and environment
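To locate the state file from a script (for example, to delete it and see the welcome screen again), the documented locations can be computed roughly as follows. This is a sketch under assumptions: the function name `config_path` is hypothetical, the fallback used when `APPDATA` is unset is a guess, and pdf2mj's own lookup logic may differ.

```python
import os
import sys
from pathlib import Path


def config_path(platform: str = sys.platform, env=None) -> Path:
    """Return the documented pdf2mj config location for a platform.

    `platform` follows sys.platform conventions ("win32", "linux", "darwin").
    `env` defaults to os.environ and exists mainly to make testing easy.
    """
    env = os.environ if env is None else env
    if platform.startswith("win"):
        # Documented as %APPDATA%\pdf2mj\config.json.
        # Assumption: fall back to ~/AppData/Roaming if APPDATA is unset.
        base = Path(env.get("APPDATA", str(Path.home() / "AppData" / "Roaming")))
    else:
        # Documented as ~/.config/pdf2mj/config.json on Linux and macOS.
        base = Path.home() / ".config"
    return base / "pdf2mj" / "config.json"


if __name__ == "__main__":
    print(config_path())
```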
Quick Start
Convert a PDF to Markdown and JSON:
pdf2mj document.pdf
Output files are generated next to the source PDF:
document.md
document.json
Specify an output directory:
pdf2mj document.pdf --output ./output
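When wiring pdf2mj into a pipeline, downstream steps need to know where the results land. The sketch below derives the expected paths from the naming shown above. It assumes the output keeps the PDF's stem when `--output` is used, which the examples suggest but do not state outright; `expected_outputs` is a hypothetical helper, not part of pdf2mj.

```python
from pathlib import Path


def expected_outputs(pdf, output_dir=None):
    """Return the (Markdown, JSON) paths pdf2mj is expected to write.

    Without output_dir, files sit next to the source PDF.
    Assumption: with output_dir, the file stem is preserved.
    """
    pdf = Path(pdf)
    target = Path(output_dir) if output_dir else pdf.parent
    return target / f"{pdf.stem}.md", target / f"{pdf.stem}.json"


if __name__ == "__main__":
    md_path, json_path = expected_outputs("document.pdf", "./output")
    print(md_path, json_path)
```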
Common Examples
Generate all outputs:
pdf2mj document.pdf --all --output ./output
Extract images:
pdf2mj document.pdf --extract-images
Generate RAG chunks:
pdf2mj document.pdf --chunk-size 1000
Use OCR for scanned PDFs:
pdf2mj document.pdf --ocr
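The examples above convert one file at a time. For a folder of PDFs, the CLI can be driven from Python. This is a sketch only: it uses the documented `--output`, `--ocr`, and `--chunk-size` flags, assumes they can be combined in a single invocation, and the helper names `build_command` and `convert_folder` are hypothetical.

```python
import subprocess
from pathlib import Path


def build_command(pdf, output_dir, ocr=False, chunk_size=None):
    """Build the pdf2mj argument list for a single PDF."""
    cmd = ["pdf2mj", str(pdf), "--output", str(output_dir)]
    if ocr:
        cmd.append("--ocr")
    if chunk_size is not None:
        cmd += ["--chunk-size", str(chunk_size)]
    return cmd


def convert_folder(folder, output_dir, **options):
    """Run pdf2mj once per *.pdf file in `folder` (non-recursive)."""
    for pdf in sorted(Path(folder).glob("*.pdf")):
        # check=True raises CalledProcessError if pdf2mj exits non-zero.
        subprocess.run(build_command(pdf, output_dir, **options), check=True)


if __name__ == "__main__":
    convert_folder("./pdfs", "./output", chunk_size=1000)
```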
CLI Options
| Flag | Description |
|---|---|
| --markdown / --no-markdown | Generate Markdown (default: on) |
| --json / --no-json | Generate structured JSON (default: on) |
| --ocr | OCR scanned pages |
| --extract-images | Extract embedded images |
| --figures | Alias for --extract-images |
| --chunk-size N | Generate RAG chunks |
| --chunk-overlap N | Chunk overlap (default: 200) |
| --output, -o | Output directory |
| --verbose, -v | Detailed logging |
| --metadata | Export metadata JSON |
| --tables / --no-tables | Extract tables |
| --all | Enable all supported outputs |
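To illustrate how a chunk size and a chunk overlap interact, here is a minimal sliding-window sketch. It is not pdf2mj's chunker: it assumes sizes are measured in characters (the unit is not stated above), and `chunk_text` is a hypothetical name.

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into windows of chunk_size that share `overlap` characters.

    Assumption: sizes are in characters; pdf2mj may count tokens or
    split on structural boundaries instead.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = "abcdefghij" * 250  # 2500 characters
    for i, chunk in enumerate(chunk_text(sample)):
        print(i, len(chunk))
```

With a size of 1000 and an overlap of 200, each window starts 800 characters after the previous one, so the tail of one chunk repeats at the head of the next.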
Utility Commands
| Command | Description |
|---|---|
| pdf2mj welcome | Show the onboarding welcome screen |
| pdf2mj doctor | Check Python, dependencies, OCR, and write access |
Development Setup (For Contributors)
Prerequisites
- Python 3.12+
- Git
- Optional: Tesseract OCR
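A quick way to check the required prerequisites before cloning is sketched below. It is illustrative only: `check_prerequisites` is a hypothetical helper, and it assumes Git is exposed as `git` on `PATH`.

```python
import shutil
import sys


def check_prerequisites(version_info=sys.version_info):
    """Report whether the required contributor prerequisites are met."""
    return {
        "python>=3.12": tuple(version_info[:2]) >= (3, 12),
        # Assumption: Git is installed as `git` and is on PATH.
        "git": shutil.which("git") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```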
Clone the Repository
git clone https://github.com/Ronit-Pai/pdf2mj.git
cd pdf2mj
Create a Development Environment
Using pip:
python -m venv .venv
source .venv/bin/activate # Linux/macOS
# .venv\Scripts\activate # Windows
pip install -e ".[dev]"
Using uv:
uv venv
source .venv/bin/activate
uv pip install -e ".[dev]"
With OCR support:
pip install -e ".[dev,ocr]"
Running Tests
pytest
Coverage:
pytest --cov=pdf2mj --cov-report=html
Project Structure
src/pdf2mj/
cli.py
config.py
welcome.py
doctor.py
converter.py
models.py
markdown.py
json_export.py
metadata.py
table_extractor.py
image_extractor.py
ocr.py
chunker.py
console_util.py
tests/
sample_pdfs/
Local Development
Run directly from source:
pdf2mj sample.pdf
or
python -m pdf2mj sample.pdf