OCR model that converts documents to markdown, HTML, or JSON.
Chandra
Chandra is a highly accurate OCR model that converts images and PDFs into structured HTML/Markdown/JSON while preserving layout information.
Features
- Convert documents to markdown, html, or json with detailed layout information
- Good handwriting support
- Reconstructs forms accurately, including checkboxes
- Good support for tables, math, and complex layouts
- Extracts images and diagrams, with captions and structured data
- Support for 40+ languages
- Two inference modes: local (HuggingFace) and remote (vLLM server)
Hosted API
- A hosted API for Chandra is available; it also includes other accuracy improvements and document workflows.
- A free playground is available if you want to try Chandra without installing anything.
Quickstart
The easiest way to start is with the CLI tools:
pip install chandra-ocr
# With VLLM
chandra_vllm
chandra input.pdf ./output
# With HuggingFace
chandra input.pdf ./output --method hf
# Interactive streamlit app
chandra_app
Benchmarks
Chandra is evaluated on the olmocr bench. See the full scores in the benchmark table below.
Examples
| Type | Name | Link |
|---|---|---|
| Tables | Water Damage Form | View |
| Tables | 10K Filing | View |
| Forms | Handwritten Form | View |
| Forms | Lease Agreement | View |
| Handwriting | Doctor Note | View |
| Handwriting | Math Homework | View |
| Books | Geography Textbook | View |
| Books | Exercise Problems | View |
| Math | Attention Diagram | View |
| Math | Worksheet | View |
| Math | EGA Page | View |
| Newspapers | New York Times | View |
| Newspapers | LA Times | View |
| Other | Transcript | View |
| Other | Flowchart | View |
Community
Discord is where we discuss future development.
Installation
Package
pip install chandra-ocr
If you plan to use the Hugging Face method (--method hf), we also recommend installing flash attention.
From Source
git clone https://github.com/datalab-to/chandra.git
cd chandra
uv sync
source .venv/bin/activate
Usage
CLI
Process single files or entire directories:
# Single file, with vllm server (see below for how to launch vllm)
chandra input.pdf ./output --method vllm
# Process all files in a directory with local model
chandra ./documents ./output --method hf
CLI Options:
- `--method [hf|vllm]`: Inference method (default: vllm)
- `--page-range TEXT`: Page range for PDFs (e.g., "1-5,7,9-12")
- `--max-output-tokens INTEGER`: Max tokens per page
- `--max-workers INTEGER`: Parallel workers for vLLM
- `--include-images/--no-images`: Extract and save images (default: include)
- `--include-headers-footers/--no-headers-footers`: Include page headers/footers (default: exclude)
- `--batch-size INTEGER`: Pages per batch (default: 1)
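To drive the CLI from a script, the options above can be assembled into an argument list. The following is a minimal sketch and not part of Chandra itself: `build_chandra_command` and `run_chandra` are hypothetical helpers, and they use only the flags listed above.

```python
import subprocess
from typing import List, Optional


def build_chandra_command(
    input_path: str,
    output_dir: str,
    method: str = "vllm",
    page_range: Optional[str] = None,
    max_workers: Optional[int] = None,
    include_images: bool = True,
    include_headers_footers: bool = False,
) -> List[str]:
    """Assemble an argument list for the `chandra` CLI (hypothetical helper)."""
    if method not in ("hf", "vllm"):
        raise ValueError(f"method must be 'hf' or 'vllm', got {method!r}")

    cmd = ["chandra", input_path, output_dir, "--method", method]
    if page_range is not None:
        cmd += ["--page-range", page_range]
    if max_workers is not None:
        cmd += ["--max-workers", str(max_workers)]
    # Both boolean options are on/off flag pairs; pass the choice explicitly.
    cmd.append("--include-images" if include_images else "--no-images")
    cmd.append(
        "--include-headers-footers"
        if include_headers_footers
        else "--no-headers-footers"
    )
    return cmd


def run_chandra(cmd: List[str]) -> None:
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(cmd, check=True)
```

Calling `run_chandra(build_chandra_command("input.pdf", "./output", page_range="1-5,7,9-12"))` would then process the selected pages with the default vLLM method, provided the `chandra` executable is on the PATH.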
Output Structure:
Each processed file creates a subdirectory with:
- `<filename>.md` - Markdown output
- `<filename>.html` - HTML output
- `<filename>_metadata.json` - Metadata (page info, token count, etc.)
- `images/` - Extracted images from the document
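When post-processing results, it can be handy to compute where each artifact should be. The sketch below is illustrative only: `expected_outputs` and `missing_outputs` are hypothetical helpers, and the assumption that the subdirectory is named after the input file's stem is not stated in the description above.

```python
from pathlib import Path
from typing import Dict, List


def expected_outputs(output_dir: str, input_file: str) -> Dict[str, Path]:
    """Map each documented artifact to the path where it is expected.

    Assumption: the per-file subdirectory is named after the input file's
    stem (e.g. `report.pdf` -> `<output_dir>/report/`).
    """
    stem = Path(input_file).stem
    base = Path(output_dir) / stem
    return {
        "markdown": base / f"{stem}.md",
        "html": base / f"{stem}.html",
        "metadata": base / f"{stem}_metadata.json",
        "images": base / "images",
    }


def missing_outputs(output_dir: str, input_file: str) -> List[str]:
    """Return the names of expected artifacts that do not exist on disk."""
    return [
        name
        for name, path in expected_outputs(output_dir, input_file).items()
        if not path.exists()
    ]
```

An empty list from `missing_outputs("./output", "input.pdf")` would indicate that all four artifacts are present under the assumed layout.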
Streamlit Web App
Launch the interactive demo for single-page processing:
chandra_app
vLLM Server (Optional)
For production deployments or batch processing, use the vLLM server:
chandra_vllm
This launches a Docker container with optimized inference settings. Configure via environment variables:
- `VLLM_API_BASE`: Server URL (default: `http://localhost:8000/v1`)
- `VLLM_MODEL_NAME`: Model name for the server (default: `chandra`)
- `VLLM_GPUS`: GPU device IDs (default: `0`)

You can also start your own vLLM server with the datalab-to/chandra model.
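Before sending a batch, it can help to confirm that the server is reachable and serving the expected model. This is a minimal sketch, assuming the server exposes the OpenAI-compatible `/models` listing under `VLLM_API_BASE` (as vLLM's OpenAI-compatible server does); `models_url`, `settings_from_env`, and `server_is_up` are hypothetical helper names, not part of Chandra.

```python
import json
import os
import urllib.error
import urllib.request
from typing import Tuple


def models_url(api_base: str) -> str:
    """Build the model-listing URL from the API base."""
    return api_base.rstrip("/") + "/models"


def settings_from_env() -> Tuple[str, str]:
    """Read the documented variables, falling back to the documented defaults."""
    api_base = os.environ.get("VLLM_API_BASE", "http://localhost:8000/v1")
    model_name = os.environ.get("VLLM_MODEL_NAME", "chandra")
    return api_base, model_name


def server_is_up(api_base: str, model_name: str, timeout: float = 5.0) -> bool:
    """Return True if the server answers and lists `model_name`."""
    try:
        with urllib.request.urlopen(models_url(api_base), timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Unreachable server, malformed URL, or a non-JSON response.
        return False
    if not isinstance(payload, dict):
        return False
    return any(
        isinstance(m, dict) and m.get("id") == model_name
        for m in payload.get("data", [])
    )
```

A script could call `server_is_up(*settings_from_env())` and only start a long run when it returns `True`.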
Configuration
Settings can be configured via environment variables or a local.env file:
# Model settings
MODEL_CHECKPOINT=datalab-to/chandra
MAX_OUTPUT_TOKENS=8192
# vLLM settings
VLLM_API_BASE=http://localhost:8000/v1
VLLM_MODEL_NAME=chandra
VLLM_GPUS=0
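The file uses plain `KEY=VALUE` lines with `#` comments. The sketch below shows how such a file could be read; it illustrates the format only and is not Chandra's actual settings loader. `parse_env_text` and `resolve_setting` are hypothetical names, and the precedence order (process environment first, then file, then default) is an assumption.

```python
import os
from typing import Dict, Optional


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve_setting(
    name: str, file_values: Dict[str, str], default: Optional[str] = None
) -> Optional[str]:
    """Assumed precedence: process environment, then file, then default."""
    return os.environ.get(name, file_values.get(name, default))


SAMPLE = """\
# Model settings
MODEL_CHECKPOINT=datalab-to/chandra
MAX_OUTPUT_TOKENS=8192
# vLLM settings
VLLM_API_BASE=http://localhost:8000/v1
"""

settings = parse_env_text(SAMPLE)
max_tokens = int(settings["MAX_OUTPUT_TOKENS"])  # values are strings until converted
```

Values come back as strings, so numeric settings such as `MAX_OUTPUT_TOKENS` need an explicit conversion.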
Commercial usage
The code is licensed under Apache 2.0, and the model weights use a modified OpenRAIL-M license: free for research, personal use, and startups under $2M in funding/revenue, but not for use that competes with our API. To remove the OpenRAIL license requirements, or for broader commercial licensing, see our pricing page.
Benchmark table
| Model | ArXiv | Old Scans Math | Tables | Old Scans | Headers and Footers | Multi column | Long tiny text | Base | Overall | Source |
|---|---|---|---|---|---|---|---|---|---|---|
| Datalab Chandra v0.1.0 | 82.2 | 80.3 | 88.0 | 50.4 | 90.8 | 81.2 | 92.3 | 99.9 | 83.1 ± 0.9 | Own benchmarks |
| Datalab Marker v1.10.0 | 83.8 | 69.7 | 74.8 | 32.3 | 86.6 | 79.4 | 85.7 | 99.6 | 76.5 ± 1.0 | Own benchmarks |
| Mistral OCR API | 77.2 | 67.5 | 60.6 | 29.3 | 93.6 | 71.3 | 77.1 | 99.4 | 72.0 ± 1.1 | olmocr repo |
| Deepseek OCR | 75.2 | 72.3 | 79.7 | 33.3 | 96.1 | 66.7 | 80.1 | 99.7 | 75.4 ± 1.0 | Own benchmarks |
| GPT-4o (Anchored) | 53.5 | 74.5 | 70.0 | 40.7 | 93.8 | 69.3 | 60.6 | 96.8 | 69.9 ± 1.1 | olmocr repo |
| Gemini Flash 2 (Anchored) | 54.5 | 56.1 | 72.1 | 34.2 | 64.7 | 61.5 | 71.5 | 95.6 | 63.8 ± 1.2 | olmocr repo |
| Qwen 3 VL 8B | 70.2 | 75.1 | 45.6 | 37.5 | 89.1 | 62.1 | 43.0 | 94.3 | 64.6 ± 1.1 | Own benchmarks |
| olmOCR v0.3.0 | 78.6 | 79.9 | 72.9 | 43.9 | 95.1 | 77.3 | 81.2 | 98.9 | 78.5 ± 1.1 | olmocr repo |
| dots.ocr | 82.1 | 64.2 | 88.3 | 40.9 | 94.1 | 82.4 | 81.2 | 99.5 | 79.1 ± 1.0 | dots.ocr repo |
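The Overall column is consistent with an unweighted mean of the eight category scores, although the table does not say so explicitly. The sketch below recomputes it for three rows copied from the table; the averaging rule is an inference from the numbers, not a documented definition, and `matches_reported` is a hypothetical helper.

```python
from statistics import mean
from typing import List

# Category scores in table order (ArXiv ... Base) and the reported Overall,
# copied from the benchmark table.
ROWS = {
    "Datalab Chandra v0.1.0": ([82.2, 80.3, 88.0, 50.4, 90.8, 81.2, 92.3, 99.9], 83.1),
    "Datalab Marker v1.10.0": ([83.8, 69.7, 74.8, 32.3, 86.6, 79.4, 85.7, 99.6], 76.5),
    "dots.ocr": ([82.1, 64.2, 88.3, 40.9, 94.1, 82.4, 81.2, 99.5], 79.1),
}


def matches_reported(scores: List[float], reported: float, tolerance: float = 0.051) -> bool:
    """Check that the unweighted mean agrees with the reported value
    to within one-decimal rounding (plus a small float margin)."""
    return abs(mean(scores) - reported) <= tolerance


for model, (scores, reported) in ROWS.items():
    print(f"{model}: mean={mean(scores):.4f} reported={reported} "
          f"consistent={matches_reported(scores, reported)}")
```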
Credits
Thank you to the following open source projects: