OCR model that converts documents to markdown, HTML, or JSON.


Chandra

Chandra is a highly accurate OCR model that converts images and PDFs into structured HTML/Markdown/JSON while preserving layout information.

Features

Convert documents to Markdown, HTML, or JSON with detailed layout information
Good handwriting support
Reconstructs forms accurately, including checkboxes
Good support for tables, math, and complex layouts
Extracts images and diagrams, with captions and structured data
Support for 40+ languages
Two inference modes: local (HuggingFace) and remote (vLLM server)

Hosted API

A hosted API for Chandra is also available; it includes additional accuracy improvements and document workflows.
A free playground lets you try it out without installing anything.

Quickstart

The easiest way to start is with the CLI tools:

pip install chandra-ocr

# With vLLM: launch the server, then run the CLI against it
chandra_vllm
chandra input.pdf ./output

# With HuggingFace
chandra input.pdf ./output --method hf

# Interactive Streamlit app
chandra_app

Benchmarks

Overall scores on the olmOCR benchmark are listed per model in the benchmark table below.

Examples

Type	Name	Link
Tables	Water Damage Form	View
Tables	10K Filing	View
Forms	Handwritten Form	View
Forms	Lease Agreement	View
Handwriting	Doctor Note	View
Handwriting	Math Homework	View
Books	Geography Textbook	View
Books	Exercise Problems	View
Math	Attention Diagram	View
Math	Worksheet	View
Math	EGA Page	View
Newspapers	New York Times	View
Newspapers	LA Times	View
Other	Transcript	View
Other	Flowchart	View

Community

Discord is where we discuss future development.

Installation

Package

pip install chandra-ocr

If you plan to use the HuggingFace method, we also recommend installing Flash Attention.

From Source

git clone https://github.com/datalab-to/chandra.git
cd chandra
uv sync
source .venv/bin/activate

Usage

CLI

Process single files or entire directories:

# Single file, with a vLLM server (see "vLLM Server" below for how to launch it)
chandra input.pdf ./output --method vllm

# Process all files in a directory with local model
chandra ./documents ./output --method hf

CLI Options:

--method [hf|vllm]: Inference method (default: vllm)
--page-range TEXT: Page range for PDFs (e.g., "1-5,7,9-12")
--max-output-tokens INTEGER: Max tokens per page
--max-workers INTEGER: Parallel workers for vLLM
--include-images/--no-images: Extract and save images (default: include)
--include-headers-footers/--no-headers-footers: Include page headers/footers (default: exclude)
--batch-size INTEGER: Pages per batch (default: 1)
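
If you drive the CLI from a script, it can help to validate a page range before launching a run. The sketch below is not part of chandra-ocr: `parse_page_range` and `build_chandra_command` are hypothetical helpers that only use the flags listed above. The parser expands a spec such as "1-5,7,9-12" exactly as written; whether the CLI counts pages from 0 or 1 is not stated here, so check that before relying on it.

```python
import shlex
from typing import List, Optional


def parse_page_range(spec: str) -> List[int]:
    """Expand a spec such as "1-5,7,9-12" into a sorted list of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            raise ValueError(f"empty entry in page range: {spec!r}")
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


def build_chandra_command(
    input_path: str,
    output_dir: str,
    method: str = "vllm",
    page_range: Optional[str] = None,
    include_images: bool = True,
) -> List[str]:
    """Assemble an argv list for the `chandra` CLI (hypothetical helper)."""
    if method not in ("hf", "vllm"):
        raise ValueError(f"unsupported method: {method!r}")
    cmd = ["chandra", input_path, output_dir, "--method", method]
    if page_range is not None:
        parse_page_range(page_range)  # raises ValueError on malformed input
        cmd += ["--page-range", page_range]
    if not include_images:
        cmd.append("--no-images")
    return cmd


command = build_chandra_command(
    "input.pdf", "./output", method="hf", page_range="1-5,7,9-12"
)
print(shlex.join(command))
# To run it for real (requires chandra-ocr to be installed):
# import subprocess; subprocess.run(command, check=True)
```

Passing the command as a list rather than a shell string avoids quoting problems with paths that contain spaces.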

Output Structure:

Each processed file creates a subdirectory with:

<filename>.md - Markdown output
<filename>.html - HTML output
<filename>_metadata.json - Metadata (page info, token count, etc.)
images/ - Extracted images from the document
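
To consume these files from Python, something like the following may be enough. It is a minimal sketch, not an API provided by the package: `load_chandra_output` is a hypothetical helper, and it assumes the subdirectory is named after the input file and that `<filename>` means the input name without its extension. Adjust the paths if your output differs.

```python
import json
from pathlib import Path
from typing import Any, Dict


def load_chandra_output(output_dir: str, filename: str) -> Dict[str, Any]:
    """Read the files described above for one processed document.

    Assumes the layout <output_dir>/<filename>/<filename>.md and so on;
    the subdirectory name is an assumption, not documented behavior.
    """
    doc_dir = Path(output_dir) / filename
    metadata_path = doc_dir / f"{filename}_metadata.json"
    images_dir = doc_dir / "images"
    if images_dir.is_dir():
        images = sorted(p.name for p in images_dir.iterdir())
    else:
        images = []  # e.g. when the run used --no-images
    return {
        "markdown": (doc_dir / f"{filename}.md").read_text(encoding="utf-8"),
        "html": (doc_dir / f"{filename}.html").read_text(encoding="utf-8"),
        "metadata": json.loads(metadata_path.read_text(encoding="utf-8")),
        "images": images,
    }


# Example call, after running `chandra input.pdf ./output`:
# result = load_chandra_output("./output", "input")
# print(result["metadata"])
```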

Streamlit Web App

Launch the interactive demo for single-page processing:

chandra_app

vLLM Server (Optional)

For production deployments or batch processing, use the vLLM server:

chandra_vllm

This launches a Docker container with optimized inference settings. Configure via environment variables:

VLLM_API_BASE: Server URL (default: http://localhost:8000/v1)
VLLM_MODEL_NAME: Model name for the server (default: chandra)
VLLM_GPUS: GPU device IDs (default: 0)

You can also start your own vLLM server with the datalab-to/chandra model.
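
Before sending documents, you may want to confirm that the server is reachable and serving the expected model name. The sketch below assumes the server exposes vLLM's OpenAI-compatible API (the `/v1` suffix in the default URL suggests this, but it is not stated here), where `GET <base>/models` lists the served models. `models_url` and `list_served_models` are hypothetical helpers.

```python
import json
import os
import urllib.error
import urllib.request
from typing import List


def models_url(api_base: str) -> str:
    """Build the OpenAI-style model listing URL from an API base."""
    return api_base.rstrip("/") + "/models"


def list_served_models(api_base: str, timeout: float = 5.0) -> List[str]:
    """Return the model ids the server reports, or [] if it cannot be reached."""
    try:
        with urllib.request.urlopen(models_url(api_base), timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        return []
    return [entry["id"] for entry in payload.get("data", [])]


# Defaults mirror the documented environment variables.
api_base = os.environ.get("VLLM_API_BASE", "http://localhost:8000/v1")
expected_model = os.environ.get("VLLM_MODEL_NAME", "chandra")
print(models_url(api_base))

# Uncomment once a server is running:
# print(expected_model in list_served_models(api_base))
```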

Configuration

Settings can be configured via environment variables or a local.env file:

# Model settings
MODEL_CHECKPOINT=datalab-to/chandra
MAX_OUTPUT_TOKENS=8192

# vLLM settings
VLLM_API_BASE=http://localhost:8000/v1
VLLM_MODEL_NAME=chandra
VLLM_GPUS=0
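
For illustration, here is one way a script could read the same settings. This is a sketch, not the package's own loader: `parse_env_file`, `ChandraSettings`, and `load_settings` are hypothetical names, the rule that environment variables override the file is an assumption, and the fallbacks for `MODEL_CHECKPOINT` and `MAX_OUTPUT_TOKENS` are simply the example values shown above.

```python
import os
from dataclasses import dataclass
from typing import Dict, Mapping


def parse_env_file(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        values[key.strip()] = value.strip()
    return values


@dataclass(frozen=True)
class ChandraSettings:
    model_checkpoint: str
    max_output_tokens: int
    vllm_api_base: str
    vllm_model_name: str
    vllm_gpus: str


def load_settings(
    file_values: Mapping[str, str],
    env: Mapping[str, str] = os.environ,
) -> ChandraSettings:
    """Merge file values with the environment (environment wins: an assumption)."""
    merged = {**file_values, **env}
    return ChandraSettings(
        model_checkpoint=merged.get("MODEL_CHECKPOINT", "datalab-to/chandra"),
        max_output_tokens=int(merged.get("MAX_OUTPUT_TOKENS", "8192")),
        vllm_api_base=merged.get("VLLM_API_BASE", "http://localhost:8000/v1"),
        vllm_model_name=merged.get("VLLM_MODEL_NAME", "chandra"),
        vllm_gpus=merged.get("VLLM_GPUS", "0"),
    )


sample = """
# vLLM settings
VLLM_API_BASE=http://localhost:8000/v1
VLLM_GPUS=0
"""
settings = load_settings(parse_env_file(sample), env={})
print(settings)
```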

Commercial usage

The code is licensed under Apache 2.0, and the model weights use a modified OpenRAIL-M license (free for research, personal use, and startups under $2M in funding/revenue; the weights cannot be used competitively with our API). To remove the OpenRAIL license requirements, or for broader commercial licensing, see our pricing page.

Benchmark table

Model	ArXiv	Old Scans Math	Tables	Old Scans	Headers and Footers	Multi column	Long tiny text	Base	Overall	Source
Datalab Chandra v0.1.0	82.2	80.3	88.0	50.4	90.8	81.2	92.3	99.9	83.1 ± 0.9	Own benchmarks
Datalab Marker v1.10.0	83.8	69.7	74.8	32.3	86.6	79.4	85.7	99.6	76.5 ± 1.0	Own benchmarks
Mistral OCR API	77.2	67.5	60.6	29.3	93.6	71.3	77.1	99.4	72.0 ± 1.1	olmocr repo
Deepseek OCR	75.2	72.3	79.7	33.3	96.1	66.7	80.1	99.7	75.4 ± 1.0	Own benchmarks
GPT-4o (Anchored)	53.5	74.5	70.0	40.7	93.8	69.3	60.6	96.8	69.9 ± 1.1	olmocr repo
Gemini Flash 2 (Anchored)	54.5	56.1	72.1	34.2	64.7	61.5	71.5	95.6	63.8 ± 1.2	olmocr repo
Qwen 3 VL 8B	70.2	75.1	45.6	37.5	89.1	62.1	43.0	94.3	64.6 ± 1.1	Own benchmarks
olmOCR v0.3.0	78.6	79.9	72.9	43.9	95.1	77.3	81.2	98.9	78.5 ± 1.1	olmocr repo
dots.ocr	82.1	64.2	88.3	40.9	94.1	82.4	81.2	99.5	79.1 ± 1.0	dots.ocr repo
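
The Overall column looks consistent with an unweighted mean of the eight category columns. That is an inference from the numbers, not something the table states, and the ± margins are not reproduced here. A quick check on three rows copied from the table:

```python
from statistics import mean

# Category scores in table order: ArXiv, Old Scans Math, Tables, Old Scans,
# Headers and Footers, Multi column, Long tiny text, Base.
CATEGORY_SCORES = {
    "Datalab Chandra v0.1.0": [82.2, 80.3, 88.0, 50.4, 90.8, 81.2, 92.3, 99.9],
    "Datalab Marker v1.10.0": [83.8, 69.7, 74.8, 32.3, 86.6, 79.4, 85.7, 99.6],
    "dots.ocr": [82.1, 64.2, 88.3, 40.9, 94.1, 82.4, 81.2, 99.5],
}

# Overall values as reported in the table (without the ± margin).
REPORTED_OVERALL = {
    "Datalab Chandra v0.1.0": 83.1,
    "Datalab Marker v1.10.0": 76.5,
    "dots.ocr": 79.1,
}

for model, scores in CATEGORY_SCORES.items():
    computed = mean(scores)
    matches = abs(computed - REPORTED_OVERALL[model]) < 0.05
    print(f"{model}: mean={computed:.4f} reported={REPORTED_OVERALL[model]} match={matches}")
```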

Credits

Thank you to the following open source projects:

Release history

0.2.0: Mar 18, 2026
0.1.8: Oct 26, 2025 (this version)
0.1.7: Oct 22, 2025
0.1.6: Oct 21, 2025
0.1.3: Oct 21, 2025
0.1.2: Oct 21, 2025
0.1.1: Oct 21, 2025
0.1.0: Oct 21, 2025

Download files

Source Distribution

chandra_ocr-0.1.8.tar.gz (26.0 kB)

Uploaded Oct 26, 2025 Source

Built Distribution

chandra_ocr-0.1.8-py3-none-any.whl (27.4 kB)

Uploaded Oct 26, 2025 Python 3

File details

Details for the file chandra_ocr-0.1.8.tar.gz.

File metadata

Download URL: chandra_ocr-0.1.8.tar.gz
Upload date: Oct 26, 2025
Size: 26.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.5.22

File hashes

Hashes for chandra_ocr-0.1.8.tar.gz
Algorithm	Hash digest
SHA256	`70c9c17b0d2e64563ecc9a1ec671ca8e6da28ecacbdd3beb714fd315ba5331cd`
MD5	`f8f8215ef5d998999b1f300d36c38f87`
BLAKE2b-256	`1763abe04c333fc1199f8c71ea1da8554cdd51a571f385ba774c161110e2d11d`

File details

Details for the file chandra_ocr-0.1.8-py3-none-any.whl.

File metadata

Download URL: chandra_ocr-0.1.8-py3-none-any.whl
Upload date: Oct 26, 2025
Size: 27.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.5.22

File hashes

Hashes for chandra_ocr-0.1.8-py3-none-any.whl
Algorithm	Hash digest
SHA256	`90accaa621da62994d35ce37ae00b203813ba3660a615fabb277a2ec2c6d8d21`
MD5	`05907c8c8a331a17877d8f78a26e2be5`
BLAKE2b-256	`958942b95cd88911ed255debb2e35fbcc38bcd176e62e1a137414dd974eaa6b0`
