
Document OCR models for multilingual text detection and recognition

Project description

Surya

Surya is a multilingual document OCR toolkit. It can do:

  • Accurate line-level text detection
  • Text recognition (coming soon)
  • Table and chart detection (coming soon)

It works on a range of documents and languages (see usage and benchmarks for more details).

New York Times Article Example

Surya is named after the Hindu sun god, who has universal vision.

Community

Discord is where we discuss future development.

Examples

Text detection examples:

  • New York Times article
  • Japanese
  • Chinese
  • Hindi
  • Presentation
  • Scientific paper
  • Scanned document
  • Scanned form

Installation

You'll need Python 3.9+ and PyTorch. If you're not on a Mac or a machine with a GPU, you may need to install the CPU version of torch first. See here for more details.

Install with:

pip install surya-ocr

Model weights will automatically download the first time you run surya.

Usage

  • Inspect the settings in surya/settings.py. You can override any settings with environment variables.
  • Your torch device will be automatically detected, but you can override this. For example, TORCH_DEVICE=cuda. Note that the mps device has a bug (on the Apple side) that may prevent it from working properly.
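As a minimal sketch of overriding settings, environment variables can be set from Python before surya is imported (this assumes, as surya/settings.py suggests, that settings are read from the environment at import time):

```python
import os

# Override settings *before* importing surya; the settings module reads
# environment variables when it is first imported.
os.environ["TORCH_DEVICE"] = "cpu"        # e.g. "cuda" on a GPU machine
os.environ["DETECTOR_BATCH_SIZE"] = "4"   # smaller batches for limited VRAM

# Import surya modules only after the environment is configured, e.g.:
# from surya.model.segformer import load_model, load_processor
```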

Text line detection

You can detect text lines in an image, pdf, or folder of images/pdfs with the following command. This will write out a json file with the detected bboxes, and optionally save images of the pages with the bboxes.

surya_detect DATA_PATH --images
  • DATA_PATH can be an image, pdf, or folder of images/pdfs
  • --images will save images of the pages and detected text lines (optional)
  • --max specifies the maximum number of pages to process if you don't want to process everything
  • --results_dir specifies the directory to save results to instead of the default

The results.json file will contain these keys for each page of the input document(s):

  • polygons - polygons for each detected text line (these are more accurate than the bboxes) in (x1, y1), (x2, y2), (x3, y3), (x4, y4) format. The points are in clockwise order from the top left.
  • bboxes - axis-aligned rectangles for each detected text line in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner.
  • vertical_lines - vertical lines detected in the document in (x1, y1, x2, y2) format.
  • horizontal_lines - horizontal lines detected in the document in (x1, y1, x2, y2) format.
  • page_number - the page number of the document

Performance tips

Setting the DETECTOR_BATCH_SIZE env var properly will make a big difference when using a GPU. Each batch item will use 280MB of VRAM, so very high batch sizes are possible. The default batch size is 32, which uses about 9GB of VRAM.

Depending on your CPU core count, DETECTOR_BATCH_SIZE might make a difference there too - the default CPU batch size is 2.

You can adjust DETECTOR_NMS_THRESHOLD and DETECTOR_TEXT_THRESHOLD if you don't get good results. Try lowering them to detect more text, and vice versa.

From Python

You can also do text detection from code with:

from PIL import Image
from surya.detection import batch_inference
from surya.model.segformer import load_model, load_processor

image = Image.open(IMAGE_PATH)  # IMAGE_PATH: path to your input image
model, processor = load_model(), load_processor()

# predictions is a list of dicts, one per image
predictions = batch_inference([image], model, processor)
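Building on that, detected bboxes can be drawn onto the page with PIL. The prediction below is a hypothetical stand-in shaped like the detection output described above; the field name and values are assumptions for illustration, not guaranteed by the API:

```python
from PIL import Image, ImageDraw

# Hypothetical prediction for one image, shaped like the detection
# output described above (assumed field name, illustrative values).
prediction = {"bboxes": [(20, 20, 180, 40), (20, 60, 180, 80)]}

image = Image.new("RGB", (200, 100), "white")  # stand-in for a real page
draw = ImageDraw.Draw(image)
for x1, y1, x2, y2 in prediction["bboxes"]:
    draw.rectangle((x1, y1, x2, y2), outline=(255, 0, 0), width=2)

image.save("detected_lines.png")
```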

Text recognition

Coming soon.

Table and chart detection

Coming soon.

Manual install

If you want to develop surya, you can install it manually:

  • git clone https://github.com/VikParuchuri/surya.git
  • cd surya
  • poetry install # Installs main and dev dependencies

Limitations

  • This is specialized for document OCR. It will likely not work on photos or other images.
  • It is for printed text, not handwriting.
  • The model has been trained to ignore advertisements.
  • This has worked for every language I've tried, but languages with very different character sets may not work well.

Benchmarks

Text line detection

Benchmark chart

Model      Time (s)   Time per page (s)   Precision   Recall
surya      52.6892    0.205817            0.844426    0.937818
tesseract  74.4546    0.290838            0.631498    0.997694

Tesseract is CPU-based, and surya runs on CPU or GPU. I ran the benchmarks on a system with an A6000 GPU and a 32-core CPU. This was the resource usage:

  • tesseract - 32 CPU cores, or 8 workers using 4 cores each
  • surya - 32 batch size, for 9GB VRAM usage

Methodology

Surya predicts line-level bboxes, while tesseract and others predict word-level or character-level. It's also hard to find 100% correct datasets with line-level annotations. Merging bboxes can be noisy, so I chose not to use IoU as the metric for evaluation.

I instead used coverage, which calculates:

  • Precision - how well predicted bboxes cover ground truth bboxes
  • Recall - how well ground truth bboxes cover predicted bboxes

First calculate coverage for each bbox, then add a small penalty for double coverage, since we want the detection to have non-overlapping bboxes. Anything with a coverage of 0.5 or higher is considered a match.

Then we calculate precision and recall for the whole dataset.
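The metric can be sketched in a few lines of Python. This is a simplified reading of the description above: it sums pairwise intersections (so it over-counts overlapping predictions rather than applying the double-coverage penalty), and it treats precision and recall as the matched fraction of predicted and ground-truth boxes respectively:

```python
def box_area(box):
    x1, y1, x2, y2 = box
    return max(0, x2 - x1) * max(0, y2 - y1)

def intersection_area(a, b):
    # Overlap area of two axis-aligned (x1, y1, x2, y2) boxes.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    return box_area((x1, y1, x2, y2)) if x2 > x1 and y2 > y1 else 0

def coverage(box, others):
    # Fraction of `box` covered by `others`. Summing pairwise
    # intersections double-counts overlaps; the real benchmark
    # penalizes double coverage instead.
    covered = sum(intersection_area(box, o) for o in others)
    return min(covered / box_area(box), 1.0)

def precision_recall(preds, gts, threshold=0.5):
    # A box counts as a match when its coverage is 0.5 or higher.
    precision = sum(coverage(p, gts) >= threshold for p in preds) / len(preds)
    recall = sum(coverage(g, preds) >= threshold for g in gts) / len(gts)
    return precision, recall

# Two predicted lines that together tile one ground-truth line:
preds = [(0, 0, 100, 10), (0, 10, 100, 20)]
gts = [(0, 0, 100, 20)]
print(precision_recall(preds, gts))  # (1.0, 1.0)
```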

Running your own benchmarks

You can benchmark the performance of surya on your machine.

  • Follow the manual install instructions above.
  • poetry install --group dev # Installs dev dependencies

Text line detection

This will evaluate tesseract and surya for text line detection across a randomly sampled set of images from doclaynet.

python benchmark/detection.py --max 256
  • --max controls how many images to process for the benchmark
  • --debug will render images and detected bboxes
  • --pdf_path will let you specify a pdf to benchmark instead of the default data
  • --results_dir will let you specify a directory to save results to instead of the default one

Training

This was trained on 4x A6000s for about 3 days. It used a diverse set of images as training data. It was trained from scratch using a modified segformer architecture that reduces inference RAM requirements.

Commercial usage

Text detection

The text detection model was trained from scratch, so it's okay for commercial usage. The weights are licensed cc-by-nc-sa-4.0, but I will waive that for any organization under $5M USD in gross revenue in the most recent 12-month period.

If you want to remove the GPL license requirements for inference or use the weights commercially over the revenue limit, please contact me at surya@vikas.sh for dual licensing.

Thanks

This work would not have been possible without amazing open source AI work.

Thank you to everyone who makes open source AI possible.


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

surya_ocr-0.1.6.tar.gz (29.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

surya_ocr-0.1.6-py3-none-any.whl (30.8 kB view details)

Uploaded Python 3

File details

Details for the file surya_ocr-0.1.6.tar.gz.

File metadata

  • Download URL: surya_ocr-0.1.6.tar.gz
  • Upload date:
  • Size: 29.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.7.1 CPython/3.11.7 Linux/6.2.0-1018-azure

File hashes

Hashes for surya_ocr-0.1.6.tar.gz
Algorithm Hash digest
SHA256 beb1fe912e73767e48dc8f28e14c7c5a65f1bfd7c4858fffc5cc76dbdcf24c15
MD5 1873fb4b769cca73676125f46db041dc
BLAKE2b-256 0260089143551998c64ca3f06ccbdf1635f1a75b7fdedfba8671c38e7b169325

See more details on using hashes here.
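As a self-contained sketch of how such a hash check works (hashing an in-memory byte string here rather than the actual download):

```python
import hashlib

# In practice you would read the downloaded file instead, e.g.:
#   data = open("surya_ocr-0.1.6.tar.gz", "rb").read()
data = b"hello"

digest = hashlib.sha256(data).hexdigest()
expected = "2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824"
print(digest == expected)  # True
```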

File details

Details for the file surya_ocr-0.1.6-py3-none-any.whl.

File metadata

  • Download URL: surya_ocr-0.1.6-py3-none-any.whl
  • Upload date:
  • Size: 30.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.7.1 CPython/3.11.7 Linux/6.2.0-1018-azure

File hashes

Hashes for surya_ocr-0.1.6-py3-none-any.whl
Algorithm Hash digest
SHA256 ce69759b337b6c6e0c4f35b71fd193890e01578e5ae6d7b4c7201fd6642251b5
MD5 67c2f32634db4f5ba6f98de8622a0f0b
BLAKE2b-256 69b02718e378791e1e8f2cf8110b28db68fd4c143fd9f7eb9d00cd4746361571

See more details on using hashes here.
