Convert PDF documents to Markdown and structured JSON for RAG and LLM pipelines

PDF2MJ

Convert PDF documents to Markdown and structured JSON for RAG pipelines, LLM preprocessing, and knowledge bases.

Installation (For Users)

PyPI: pdf2mj may not be available on PyPI yet. If the commands below fail, install from source (see Development Setup).

When available on PyPI:

pip install pdf2mj

With OCR support:

pip install "pdf2mj[ocr]"

OCR Requirements

OCR is optional and requires:

  • Tesseract OCR installed on your system
  • OCR extras installed via:
pip install "pdf2mj[ocr]"
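As a rough pre-flight check before using --ocr, the sketch below looks for the Tesseract executable on PATH with the standard library. It is an independent illustration, not how pdf2mj or pdf2mj doctor performs the check; the helper names find_tesseract and ocr_hint are hypothetical.

```python
import shutil
from typing import Callable, Optional


def find_tesseract(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the path of the `tesseract` executable, or None if it is not on PATH.

    The lookup function is injectable so the check can be exercised without
    Tesseract being installed.
    """
    return which("tesseract")


def ocr_hint(path: Optional[str]) -> str:
    """Turn the lookup result into a short status line."""
    if path is None:
        return "Tesseract not found on PATH; install it before using --ocr."
    return f"Tesseract found at {path}"


# Report the status for the current machine.
print(ocr_hint(find_tesseract()))
```

This only confirms that the binary is discoverable; it says nothing about whether the OCR extras are installed, which pdf2mj doctor is documented to check.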

First Run

The first time you run pdf2mj with no arguments, it shows a Rich-powered welcome screen. The screen appears only once; the state that records this is stored in:

  • Linux/macOS: ~/.config/pdf2mj/config.json
  • Windows: %APPDATA%\pdf2mj\config.json

pdf2mj welcome   # show the welcome screen again
pdf2mj doctor    # verify dependencies and environment
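The two locations above can be resolved programmatically, for example to inspect or reset the first-run state in a script. This is a minimal sketch that follows the paths exactly as documented; the function name config_path is hypothetical, and it assumes no other environment variables influence the location.

```python
import os
import platform
from pathlib import PurePath, PurePosixPath, PureWindowsPath
from typing import Mapping


def config_path(system: str, env: Mapping[str, str], home: str) -> PurePath:
    """Return the documented pdf2mj config location for the given platform.

    `system` is the value of platform.system() ("Windows", "Linux", "Darwin").
    """
    if system == "Windows":
        # Documented as %APPDATA%\pdf2mj\config.json
        return PureWindowsPath(env["APPDATA"]) / "pdf2mj" / "config.json"
    # Documented as ~/.config/pdf2mj/config.json on Linux and macOS
    return PurePosixPath(home) / ".config" / "pdf2mj" / "config.json"


# Resolve the path for the machine running this script.
print(config_path(platform.system(), os.environ, os.path.expanduser("~")))
```

Passing the platform, environment, and home directory as arguments keeps the function easy to test on any operating system.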

Quick Start

Convert a PDF to Markdown and JSON:

pdf2mj document.pdf

Output files are generated next to the source PDF:

document.md
document.json

Specify an output directory:

pdf2mj document.pdf --output ./output
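A downstream script often needs to know where the outputs will land. The sketch below derives the expected paths from the naming shown above (document.pdf produces document.md and document.json). The helper expected_outputs is hypothetical, and it assumes that files written with --output keep the same base name inside that directory, which the text above does not state explicitly.

```python
from pathlib import Path
from typing import Optional, Tuple


def expected_outputs(pdf: str, output_dir: Optional[str] = None) -> Tuple[Path, Path]:
    """Return the (Markdown, JSON) paths expected for a source PDF.

    Without output_dir the files sit next to the PDF, as documented.
    With output_dir the same base name is assumed (an assumption).
    """
    source = Path(pdf)
    target_dir = Path(output_dir) if output_dir is not None else source.parent
    return target_dir / f"{source.stem}.md", target_dir / f"{source.stem}.json"


markdown_path, json_path = expected_outputs("reports/document.pdf", "./output")
print(markdown_path, json_path)
```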

Common Examples

Generate all outputs:

pdf2mj document.pdf --all --output ./output

Extract images:

pdf2mj document.pdf --extract-images

Generate RAG chunks:

pdf2mj document.pdf --chunk-size 1000
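To illustrate how a chunk size interacts with a chunk overlap (--chunk-overlap, default 200), here is a generic sliding-window chunker. It is not pdf2mj's chunking algorithm, which is not described here; in particular, measuring size in characters is an assumption, and chunk_text is a hypothetical name.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, overlap: int = 200) -> List[str]:
    """Split text into windows of chunk_size characters.

    Consecutive windows share `overlap` characters, so each new window
    starts chunk_size - overlap characters after the previous one.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# 2600 characters, windows of 1000 with 200 shared: starts at 0, 800, 1600.
print([len(chunk) for chunk in chunk_text("x" * 2600, 1000, 200)])
```

The overlap means text near a boundary appears in two chunks, which helps retrieval when a relevant passage straddles the boundary, at the cost of some duplicated storage.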

Use OCR for scanned PDFs:

pdf2mj document.pdf --ocr

CLI Options

Flag                        Description
--markdown / --no-markdown  Generate Markdown (default: on)
--json / --no-json          Generate structured JSON (default: on)
--ocr                       OCR scanned pages
--extract-images            Extract embedded images
--figures                   Alias for --extract-images
--chunk-size N              Generate RAG chunks
--chunk-overlap N           Chunk overlap (default: 200)
--output, -o                Output directory
--verbose, -v               Detailed logging
--metadata                  Export metadata JSON
--tables / --no-tables      Extract tables
--all                       Enable all supported outputs
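When pdf2mj is driven from a Python pipeline, the flags listed above can be assembled into an argument list and passed to subprocess. The sketch below uses only flags from this section; build_command and convert are hypothetical wrapper names, not part of pdf2mj, and the meaning of the exit code is not documented here.

```python
import subprocess
from typing import List, Optional


def build_command(
    pdf: str,
    *,
    output: Optional[str] = None,
    ocr: bool = False,
    extract_images: bool = False,
    chunk_size: Optional[int] = None,
    chunk_overlap: Optional[int] = None,
    markdown: bool = True,
    json_output: bool = True,
) -> List[str]:
    """Translate keyword options into a pdf2mj argument list."""
    cmd = ["pdf2mj", pdf]
    if not markdown:
        cmd.append("--no-markdown")
    if not json_output:
        cmd.append("--no-json")
    if ocr:
        cmd.append("--ocr")
    if extract_images:
        cmd.append("--extract-images")
    if chunk_size is not None:
        cmd += ["--chunk-size", str(chunk_size)]
    if chunk_overlap is not None:
        cmd += ["--chunk-overlap", str(chunk_overlap)]
    if output is not None:
        cmd += ["--output", output]
    return cmd


def convert(pdf: str, **options) -> int:
    """Run pdf2mj (it must be on PATH) and return its exit code."""
    return subprocess.run(build_command(pdf, **options), check=False).returncode


print(build_command("document.pdf", ocr=True, output="./output"))
```

Building the list separately from running it avoids shell quoting problems with file names that contain spaces and makes the command easy to log or test.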

Utility Commands

Command         Description
pdf2mj welcome  Show the onboarding welcome screen
pdf2mj doctor   Check Python, dependencies, OCR, and write access

Development Setup (For Contributors)

Prerequisites

  • Python 3.12+
  • Git
  • Optional: Tesseract OCR

Clone the Repository

git clone https://github.com/Ronit-Pai/pdf2mj.git
cd pdf2mj

Create a Development Environment

Using pip:

python -m venv .venv
source .venv/bin/activate  # Linux/macOS
# .venv\Scripts\activate   # Windows

pip install -e ".[dev]"

Using uv:

uv venv
source .venv/bin/activate

uv pip install -e ".[dev]"

With OCR support (use uv pip install instead if you created the environment with uv):

pip install -e ".[dev,ocr]"

Running Tests

pytest

Coverage:

pytest --cov=pdf2mj --cov-report=html

Project Structure

src/pdf2mj/
  cli.py
  config.py
  welcome.py
  doctor.py
  converter.py
  models.py
  markdown.py
  json_export.py
  metadata.py
  table_extractor.py
  image_extractor.py
  ocr.py
  chunker.py
  console_util.py

tests/
sample_pdfs/

Local Development

With the editable install from Development Setup active, run the CLI from your working tree:

pdf2mj sample.pdf

Or invoke the package as a module:

python -m pdf2mj sample.pdf

Download files

Source Distribution

pdf2mj-0.1.1.tar.gz (17.8 kB view details)

Uploaded Source

Built Distribution

pdf2mj-0.1.1-py3-none-any.whl (18.2 kB view details)

Uploaded Python 3

File details

Details for the file pdf2mj-0.1.1.tar.gz.

File metadata

  • Download URL: pdf2mj-0.1.1.tar.gz
  • Upload date:
  • Size: 17.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for pdf2mj-0.1.1.tar.gz

Algorithm    Hash digest
SHA256       18e83f6ac526a68f5ff72557880af004818a4c70ec26c0bfb515a5749b5d8af4
MD5          5bb2bcc6d750a1a43683426b2bcc9730
BLAKE2b-256  ec277f3451bdfc71eb74939bc54c95469405744ac24b043c039f2e7b122d0164
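A downloaded archive can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch: the helper names are hypothetical, and the local file name in the usage note assumes the archive was saved under its published name in the current directory.

```python
import hashlib
from typing import Iterable, Iterator

# SHA256 digest listed above for pdf2mj-0.1.1.tar.gz
EXPECTED_SHA256 = "18e83f6ac526a68f5ff72557880af004818a4c70ec26c0bfb515a5749b5d8af4"


def read_blocks(path: str, block_size: int = 65536) -> Iterator[bytes]:
    """Yield the file's contents in fixed-size blocks to keep memory use low."""
    with open(path, "rb") as handle:
        while block := handle.read(block_size):
            yield block


def sha256_hex(blocks: Iterable[bytes]) -> str:
    """Return the hexadecimal SHA256 digest of the concatenated blocks."""
    digest = hashlib.sha256()
    for block in blocks:
        digest.update(block)
    return digest.hexdigest()


def verify(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with the expected hex string."""
    return sha256_hex(read_blocks(path)) == expected_hex.strip().lower()
```

Usage, once the file is present locally: verify("pdf2mj-0.1.1.tar.gz", EXPECTED_SHA256) returns True only if the bytes on disk match the published digest.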

File details

Details for the file pdf2mj-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: pdf2mj-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 18.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for pdf2mj-0.1.1-py3-none-any.whl

Algorithm    Hash digest
SHA256       4fd83d1bb5879876c6ad8d971f32c24c6431cf4f8ce2aff84e00c9e83d11960d
MD5          a959079e0ca79d8f26e165c70067286b
BLAKE2b-256  0c260418fa54bcfa6e573102d6e9e24839ce663a017b39653b9d9852645a8e5a
