Convert PDF documents to Markdown and structured JSON for RAG and LLM pipelines

PDF2MJ

Convert PDF documents to Markdown and structured JSON for RAG pipelines, LLM preprocessing, and knowledge bases.

Installation (For Users)

Install from PyPI (to install from source instead, see Development Setup):

pip install pdf2mj

With OCR support:

pip install "pdf2mj[ocr]"

OCR Requirements

OCR is optional and requires:

Tesseract OCR installed on your system
OCR extras installed via:

pip install "pdf2mj[ocr]"

First Run

The first time you run pdf2mj with no arguments, it shows a Rich-powered welcome screen once. Its state is stored in:

Linux/macOS: ~/.config/pdf2mj/config.json
Windows: %APPDATA%\pdf2mj\config.json

pdf2mj welcome   # show the welcome screen again
pdf2mj doctor    # verify dependencies and environment
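
The config locations listed above can be resolved per platform roughly as follows. This is a minimal sketch, not pdf2mj's actual code: the function name `config_path` and the fallback used when `APPDATA` is unset are assumptions made for illustration.

```python
import os
import sys
from pathlib import Path
from typing import Mapping, Optional


def config_path(
    platform: str = sys.platform,
    env: Optional[Mapping[str, str]] = None,
    home: Optional[Path] = None,
) -> Path:
    """Return the per-user config file location for the given platform."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else home
    if platform.startswith("win"):
        # Assumption: fall back to the usual Roaming folder if APPDATA is unset.
        base = Path(env.get("APPDATA") or home / "AppData" / "Roaming")
    else:
        # Linux and macOS both use ~/.config according to the list above.
        base = home / ".config"
    return base / "pdf2mj" / "config.json"


print(config_path())
```

Passing `platform`, `env`, and `home` explicitly keeps the function easy to test without touching the real environment.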

Quick Start

Convert a PDF to Markdown and JSON:

pdf2mj document.pdf

Output files are generated next to the source PDF:

document.md
document.json

Specify an output directory:

pdf2mj document.pdf --output ./output
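
If a script needs to locate the generated files, the naming rule described above (same base name, next to the source or inside the output directory) can be mirrored in a few lines. This is a sketch under that assumption; `output_paths` is a hypothetical helper, not part of pdf2mj.

```python
from pathlib import Path
from typing import Optional, Tuple, Union

PathLike = Union[str, Path]


def output_paths(pdf: PathLike, output_dir: Optional[PathLike] = None) -> Tuple[Path, Path]:
    """Return the expected (Markdown, JSON) paths for a source PDF."""
    pdf = Path(pdf)
    # Without an output directory, files land next to the source PDF.
    target = pdf.parent if output_dir is None else Path(output_dir)
    return target / f"{pdf.stem}.md", target / f"{pdf.stem}.json"


md_path, json_path = output_paths("document.pdf", "./output")
print(md_path, json_path)
```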

Common Examples

Generate all outputs:

pdf2mj document.pdf --all --output ./output

Extract images:

pdf2mj document.pdf --extract-images

Generate RAG chunks:

pdf2mj document.pdf --chunk-size 1000

Use OCR for scanned PDFs:

pdf2mj document.pdf --ocr
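
The commands above handle one file at a time. To convert a whole folder, one option is to drive the CLI from a short Python script. The sketch below uses only the `--output` and `--ocr` flags shown in this README; the helper names are hypothetical, and it assumes pdf2mj is on your PATH and exits with a non-zero status on failure.

```python
import subprocess
from pathlib import Path
from typing import List


def build_command(pdf: Path, output_dir: Path, ocr: bool = False) -> List[str]:
    """Build the pdf2mj argument list for a single PDF."""
    cmd = ["pdf2mj", str(pdf), "--output", str(output_dir)]
    if ocr:
        cmd.append("--ocr")
    return cmd


def convert_all(source_dir: Path, output_dir: Path, ocr: bool = False) -> None:
    """Run pdf2mj once per PDF found directly inside source_dir."""
    for pdf in sorted(source_dir.glob("*.pdf")):
        # check=True raises CalledProcessError if the command fails (assumed exit-code behavior).
        subprocess.run(build_command(pdf, output_dir, ocr), check=True)


print(build_command(Path("document.pdf"), Path("output"), ocr=True))
```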

CLI Options

Flag	Description
`--markdown` / `--no-markdown`	Generate Markdown (default: on)
`--json` / `--no-json`	Generate structured JSON (default: on)
`--ocr`	Run OCR on scanned pages
`--extract-images`	Extract embedded images
`--figures`	Alias for `--extract-images`
`--chunk-size N`	Generate RAG chunks of size N
`--chunk-overlap N`	Chunk overlap (default: 200)
`--output`, `-o`	Output directory
`--verbose`, `-v`	Detailed logging
`--metadata`	Export metadata JSON
`--tables` / `--no-tables`	Extract tables
`--all`	Enable all supported outputs
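
To illustrate how `--chunk-size` and `--chunk-overlap` interact, here is a generic fixed-size chunker with overlap. It is a sketch of the general technique only: this README does not say whether pdf2mj measures chunks in characters or tokens, or how it picks boundaries, so the character-based splitting and the name `chunk_text` are assumptions.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Split text into character windows that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # distance between consecutive chunk starts
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


for chunk in chunk_text("abcdefghij", chunk_size=4, overlap=1):
    print(chunk)
```

With a chunk size of 1000 and an overlap of 200, each new chunk in this sketch starts 800 characters after the previous one, so neighboring chunks repeat 200 characters of context.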

Utility Commands

Command	Description
`pdf2mj welcome`	Show the onboarding welcome screen
`pdf2mj doctor`	Check Python, dependencies, OCR, and write access
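
The kinds of checks `pdf2mj doctor` describes can be approximated with the standard library. This is not the tool's implementation, only a sketch: the function name and report keys are hypothetical, and the Python 3.12 threshold is taken from the contributor prerequisites in this README.

```python
import os
import shutil
import sys
from pathlib import Path
from typing import Dict


def environment_report(output_dir: Path) -> Dict[str, bool]:
    """Collect a few basic environment checks as name -> pass/fail."""
    return {
        # Assumption: 3.12 is the minimum, as listed under Prerequisites.
        "python_3_12_plus": sys.version_info >= (3, 12),
        # OCR needs the tesseract binary to be discoverable on PATH.
        "tesseract_on_path": shutil.which("tesseract") is not None,
        "output_dir_writable": os.access(output_dir, os.W_OK),
    }


for name, ok in environment_report(Path.cwd()).items():
    print(f"{name}: {'ok' if ok else 'missing'}")
```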

Development Setup (For Contributors)

Prerequisites

Python 3.12+
Git
Optional: Tesseract OCR

Clone the Repository

git clone https://github.com/Ronit-Pai/pdf2mj.git
cd pdf2mj

Create a Development Environment

Using pip:

python -m venv .venv
source .venv/bin/activate  # Linux/macOS
# .venv\Scripts\activate   # Windows

pip install -e ".[dev]"

Using uv:

uv venv
source .venv/bin/activate

uv pip install -e ".[dev]"

With OCR support:

pip install -e ".[dev,ocr]"

Running Tests

pytest

Coverage:

pytest --cov=pdf2mj --cov-report=html

Project Structure

src/pdf2mj/
  cli.py
  config.py
  welcome.py
  doctor.py
  converter.py
  models.py
  markdown.py
  json_export.py
  metadata.py
  table_extractor.py
  image_extractor.py
  ocr.py
  chunker.py
  console_util.py

tests/
sample_pdfs/

Local Development

Run directly from source:

pdf2mj sample.pdf

Or invoke the module:

python -m pdf2mj sample.pdf

Release history

0.1.1 (Jun 1, 2026)
0.1.0 (Jun 1, 2026, this version)

Download files

Source Distribution

pdf2mj-0.1.0.tar.gz (17.8 kB)

Uploaded Jun 1, 2026 Source

Built Distribution

pdf2mj-0.1.0-py3-none-any.whl (18.2 kB)

Uploaded Jun 1, 2026 Python 3

File details

Details for the file pdf2mj-0.1.0.tar.gz.

File metadata

Download URL: pdf2mj-0.1.0.tar.gz
Upload date: Jun 1, 2026
Size: 17.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for pdf2mj-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`a792e7953dd988009d4b33bc03ab8a82001bef2dccf7c110422f02577b8de225`
MD5	`4d87d312757ca48715fd1cc4d1995ebe`
BLAKE2b-256	`afc55f6dbc7ab983f22a1b57f0502c9981c6037a84129640f4e569dfc2983ed4`

File details

Details for the file pdf2mj-0.1.0-py3-none-any.whl.

File metadata

Download URL: pdf2mj-0.1.0-py3-none-any.whl
Upload date: Jun 1, 2026
Size: 18.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for pdf2mj-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`574ad86f3589b48514b822f117d8da1b8c0eb308cabc0b026e90a7427d9bd1a8`
MD5	`b49b58d927efddc198c4b6221f63ffa7`
BLAKE2b-256	`7afaa2224f465df6bad0f1fd5a863a36de6595ce9cca8659de66acfe7d233c2a`
