PDF2MJ

Convert PDF documents to Markdown and structured JSON for RAG pipelines, LLM preprocessing, and knowledge bases.

Installation (For Users)

PyPI: pdf2mj is not published on PyPI yet. Install from source (see Development Setup) or publish the package first.

When available on PyPI:

pip install pdf2mj

With OCR support:

pip install "pdf2mj[ocr]"

OCR Requirements

OCR is optional and requires:

  • Tesseract OCR installed on your system
  • OCR extras installed via:
pip install "pdf2mj[ocr]"

First Run

On the first pdf2mj invocation (no arguments), a Rich-powered welcome screen is shown once. State is stored in:

  • Linux/macOS: ~/.config/pdf2mj/config.json
  • Windows: %APPDATA%\pdf2mj\config.json

Related commands:

pdf2mj welcome   # show the welcome screen again
pdf2mj doctor    # verify dependencies and environment
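
If a script needs to locate or reset that state file, the documented paths can be resolved roughly as follows. This is a sketch, not pdf2mj's own code: `config_path` is a hypothetical helper, and the fallback used when `APPDATA` is unset is an assumption.

```python
import os
import sys
from pathlib import Path


def config_path(platform: str = sys.platform, env=os.environ) -> Path:
    """Return the documented config.json location for the given platform."""
    if platform.startswith("win"):
        # Assumption: fall back to ~/AppData/Roaming when APPDATA is unset.
        base = Path(env.get("APPDATA", Path.home() / "AppData" / "Roaming"))
    else:
        # Linux/macOS location as listed above.
        base = Path.home() / ".config"
    return base / "pdf2mj" / "config.json"


if __name__ == "__main__":
    path = config_path()
    print(path, "(exists)" if path.exists() else "(not created yet)")
```

Deleting the file at that path would presumably make the welcome screen appear again on the next bare invocation, although the source does not state this.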

Quick Start

Convert a PDF to Markdown and JSON:

pdf2mj document.pdf

Output files are generated next to the source PDF:

document.md
document.json

Specify an output directory:

pdf2mj document.pdf --output ./output
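
For downstream scripts that need to find the generated files, the naming rule above can be sketched in Python. `expected_outputs` is a hypothetical helper; it assumes that with --output the files keep the PDF's base name inside the chosen directory, which the examples above suggest but do not confirm.

```python
from pathlib import Path
from typing import Optional


def expected_outputs(pdf: str, output_dir: Optional[str] = None) -> dict:
    """Predict where the Markdown and JSON files for a PDF should appear."""
    src = Path(pdf)
    # Documented default: outputs are written next to the source PDF.
    # Assumption: with --output, the files keep the PDF's stem as their name.
    target = Path(output_dir) if output_dir else src.parent
    return {
        "markdown": target / f"{src.stem}.md",
        "json": target / f"{src.stem}.json",
    }


if __name__ == "__main__":
    for kind, path in expected_outputs("document.pdf", "./output").items():
        print(kind, path)
```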

Common Examples

Generate all outputs:

pdf2mj document.pdf --all --output ./output

Extract images:

pdf2mj document.pdf --extract-images

Generate RAG chunks:

pdf2mj document.pdf --chunk-size 1000

Use OCR for scanned PDFs:

pdf2mj document.pdf --ocr
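
To run the same conversion over a folder of PDFs, the CLI can be driven from Python. This is a minimal sketch using only the flags shown above; `build_command` and `convert_folder` are hypothetical helpers, and it assumes that `pdf2mj` is on PATH, that these flags can be combined, and that a failed conversion exits with a non-zero status.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def build_command(pdf: Path, output: Path, ocr: bool = False,
                  chunk_size: Optional[int] = None) -> List[str]:
    """Assemble a pdf2mj command line from the documented flags."""
    cmd = ["pdf2mj", str(pdf), "--output", str(output)]
    if ocr:
        cmd.append("--ocr")
    if chunk_size is not None:
        cmd += ["--chunk-size", str(chunk_size)]
    return cmd


def convert_folder(folder: Path, output: Path, **options) -> None:
    """Convert every *.pdf directly inside folder, one process per file."""
    for pdf in sorted(folder.glob("*.pdf")):
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(build_command(pdf, output, **options), check=True)


if __name__ == "__main__":
    convert_folder(Path("sample_pdfs"), Path("output"), ocr=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.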

CLI Options

Flag                          Description
--markdown / --no-markdown    Generate Markdown (default: on)
--json / --no-json            Generate structured JSON (default: on)
--ocr                         OCR scanned pages
--extract-images              Extract embedded images
--figures                     Alias for --extract-images
--chunk-size N                Generate RAG chunks
--chunk-overlap N             Chunk overlap (default: 200)
--output, -o                  Output directory
--verbose, -v                 Detailed logging
--metadata                    Export metadata JSON
--tables / --no-tables        Extract tables
--all                         Enable all supported outputs
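
To illustrate how a chunk size and a chunk overlap interact, here is a sketch of a plain sliding-window splitter. It is not pdf2mj's chunker: it assumes both values are counted in characters, whereas the tool's actual unit and boundary handling are not documented here. `chunk_text` is a hypothetical name.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Split text into windows of chunk_size that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    pieces = chunk_text("a" * 2500, chunk_size=1000, overlap=200)
    print([len(p) for p in pieces])  # [1000, 1000, 900]
```

With the values from the examples above (size 1000, overlap 200), each window advances by 800, so consecutive chunks repeat 200 units of context.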

Utility Commands

Command           Description
pdf2mj welcome    Show the onboarding welcome screen
pdf2mj doctor     Check Python, dependencies, OCR, and write access
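
The kinds of checks listed for `pdf2mj doctor` can be approximated with the standard library. This sketch is not the tool's implementation: `run_checks` is a hypothetical function, the Python 3.12 threshold is taken from the contributor prerequisites, and probing for the `rich` package and the `tesseract` executable is an assumption about what "dependencies" and "OCR" cover.

```python
import importlib.util
import os
import shutil
import sys


def run_checks(output_dir: str = ".") -> dict:
    """Return a name -> bool map of simple environment checks."""
    return {
        # Threshold borrowed from the contributor prerequisites (Python 3.12+).
        "python_3_12_plus": sys.version_info >= (3, 12),
        # Assumption: Rich is a dependency, since the welcome screen uses it.
        "rich_installed": importlib.util.find_spec("rich") is not None,
        # Assumption: OCR needs the tesseract executable on PATH.
        "tesseract_on_path": shutil.which("tesseract") is not None,
        "output_dir_writable": os.access(output_dir, os.W_OK),
    }


if __name__ == "__main__":
    for name, ok in run_checks().items():
        print(f"{'OK  ' if ok else 'FAIL'} {name}")
```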

Development Setup (For Contributors)

Prerequisites

  • Python 3.12+
  • Git
  • Optional: Tesseract OCR

Clone the Repository

git clone https://github.com/Ronit-Pai/pdf2mj.git
cd pdf2mj

Create a Development Environment

Using pip:

python -m venv .venv
source .venv/bin/activate  # Linux/macOS
# .venv\Scripts\activate   # Windows

pip install -e ".[dev]"

Using uv:

uv venv
source .venv/bin/activate

uv pip install -e ".[dev]"

With OCR support:

pip install -e ".[dev,ocr]"

Running Tests

pytest

Coverage:

pytest --cov=pdf2mj --cov-report=html

Project Structure

src/pdf2mj/
  cli.py
  config.py
  welcome.py
  doctor.py
  converter.py
  models.py
  markdown.py
  json_export.py
  metadata.py
  table_extractor.py
  image_extractor.py
  ocr.py
  chunker.py
  console_util.py

tests/
sample_pdfs/

Local Development

Run directly from source:

pdf2mj sample.pdf

or

python -m pdf2mj sample.pdf
