
A reverse OCR tool that renders text to images with various configuration options. Supports huggingface datasets.


DeOCR

DeOCR (de-cor) is a reverse OCR tool that renders huggingface-compatible datasets into configurable images (e.g., a custom size such as 512x512, a black background, paddings, and margins). It can serve as the text-to-image data pre-processing component in pipelines such as DeepSeek-OCR.

---
title: DeOCR Usage in LLM Pipeline
---
flowchart LR
  TEXTDATA[/"some context in text form"/]
  MMDATA[/"Does this particular car <br/> &lt;image&gt; present in here &lt;image&gt; ?"/]
  HFDATASET[("huggingface dataset")] 
  subgraph DeOCR
    CSS1["cli --style red-text textit"]
    CSS2["cli --style default"]
    CSS3["cli --style default"]
    MAPPER["DeOCR Dataset Mapper"]
  end
  TEXTDATA --> CSS1 --> IMG1[["some context in img form"]]:::redText
  TEXTDATA --> CSS2 --> IMG2[["some context in img form"]]
  MMDATA --> CSS3 --> IMG3[["Does this particular car <br/> 🖼️🖼️🖼️🖼️🖼️🖼️🖼️<br/>🖼️🖼️🖼️🚗🖼️🖼️🖼️<br/>🖼️🖼️🖼️🖼️🖼️🖼️🖼️<br/> present in here <br/> 🖼️🖼️🖼️🖼️🖼️🖼️🖼️<br/>🖼️🖼️🖼️🖼️🖼️🖼️🖼️<br/>🖼️🖼️🖼️🖼️🖼️🖼️🖼️<br/>?"]]
  HFDATASET --> MAPPER --> DEOCRDATASET[("🖼️ imagified dataset")]
  DEOCRDATASET & IMG1 & IMG2 & IMG3 -.-> MODEL["LLMs or VLMs<br/> Evaluation"]
  classDef redText color:#ff0000,font-style:italic;
  IMG1 ~~~|"fa:fa-mobile-screen A screenshot of text <br/>w. special formatting"| IMG1
  IMG2 ~~~|"fa:fa-mobile-screen A plain screenshot of text"| IMG2
  IMG3 ~~~|"fa:fa-mobile-screen A screenshot of both text and images"| IMG3
Here is an example output, sized `512x512`, with a random string as the context:

a 512x512 example
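
To make the idea of rendering text into a fixed-size image concrete, here is a minimal sketch of the general technique using Playwright's Python API (the dependency installed in Quick Start). It is not DeOCR's own API: `build_page`, `render_png`, and all of their parameters are hypothetical names chosen for illustration, and the styling defaults are assumptions.

```python
import html


def build_page(text: str, size: int = 512, background: str = "black",
               color: str = "white", padding: int = 16) -> str:
    """Wrap plain text in a fixed-size HTML page styled with inline CSS."""
    style = (
        f"margin:0;width:{size}px;height:{size}px;box-sizing:border-box;"
        f"padding:{padding}px;background:{background};color:{color};"
        "font-family:monospace;white-space:pre-wrap;overflow:hidden;"
    )
    # Escape the text so characters like '<' are displayed, not parsed as HTML.
    return f'<html><body style="{style}">{html.escape(text)}</body></html>'


def render_png(text: str, path: str, size: int = 512) -> None:
    """Screenshot the page with headless Chromium (hypothetical helper)."""
    # Imported lazily so build_page() works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        # The viewport fixes the screenshot at size x size pixels.
        page = browser.new_page(viewport={"width": size, "height": size})
        page.set_content(build_page(text, size=size))
        page.screenshot(path=path)
        browser.close()
```

Calling `render_png("some context in text form", "out.png")` would write a 512x512 PNG, provided `playwright install chromium` has been run. Keeping the HTML construction separate from the browser call makes the styling easy to inspect without launching Chromium.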

Quick Start

pip install deocr[playwright,pymupdf]
# activate your python environment, then install playwright deps
playwright install chromium
Alternatively, install from source:
# uv
uv add "deocr[playwright,pymupdf] @ git+https://github.com/Moenupa/DeOCR.git"
# activate your python environment, then install playwright deps
playwright install chromium
For development

Please use uv to manage the environment:

git clone https://github.com/Moenupa/DeOCR.git
cd DeOCR
uv venv
uv sync --all-extras --all-groups
source .venv/bin/activate
playwright install chromium
pre-commit install
Known Issues

Performance

DeOCR gets most of its speed from asynchronous rendering and multiprocess dataset mapping. Rendering speed varies with the machine configuration and the complexity of the text being rendered. On a standard 32-core machine, DeOCR can render more than 1k images per second.
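
As an illustration of what asynchronous rendering with a page limit can look like, the following is a minimal sketch of bounded concurrency with `asyncio`. It is not DeOCR's implementation: `MAX_ASYNC_PAGES` is the environment variable used in the benchmark command in this section, but how DeOCR consumes it is an assumption, and `render_all` and `fake_render` are hypothetical names.

```python
import asyncio
import os

# MAX_ASYNC_PAGES is the variable name from the benchmark command; how DeOCR
# reads it internally is an assumption made for this sketch.
MAX_PAGES = int(os.environ.get("MAX_ASYNC_PAGES", "4"))


async def render_all(texts, render_one, limit=MAX_PAGES):
    """Run render_one(text) for every text, with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(text):
        # Wait here whenever `limit` renders are already running.
        async with semaphore:
            return await render_one(text)

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(bounded(t) for t in texts))


async def fake_render(text):
    """Stand-in for a real page render; only simulates I/O latency."""
    await asyncio.sleep(0.01)
    return f"<image of {text!r}>"


if __name__ == "__main__":
    print(asyncio.run(render_all(["a", "b", "c"], fake_render, limit=2)))
```

A semaphore keeps the number of open pages bounded so that memory use stays predictable, while still overlapping the waiting time of individual renders. Multiprocessing then multiplies this by running several such loops in parallel.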

Rendering speed on the GSM8K dataset (one 512x512 image per sample), measured on an Intel Xeon Gold 6430:

# increase MAX_ASYNC_PAGES for more cores
$ MAX_ASYNC_PAGES=1 python tests/dataset/manual_load.py
Map (num_proc=1): 100%|██████████████| 7473/7473 [02:48<00:00, 44.33 examples/s]
Map (num_proc=1): 100%|██████████████| 1319/1319 [00:27<00:00, 47.28 examples/s]
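
The single-process figures above can be related to the 32-core claim with a little arithmetic. The sketch below recomputes throughput from the sample counts and elapsed times in the log, then extrapolates under the assumption of perfectly linear scaling across processes, which is an idealized upper bound and not a measured result. The recomputed rates differ slightly from the logged ones because the progress bar shows elapsed time in whole seconds.

```python
def throughput(samples: int, minutes: int, seconds: int) -> float:
    """Samples per second for a run that took minutes:seconds."""
    return samples / (minutes * 60 + seconds)


train = throughput(7473, 2, 48)  # first split in the log: 7473 samples in 02:48
test = throughput(1319, 0, 27)   # second split in the log: 1319 samples in 00:27

# Idealized estimate: assumes linear scaling across 32 processes.
print(f"{train:.1f} {test:.1f} {32 * train:.0f}")
# prints "44.5 48.9 1423"
```

Under that assumption, roughly 44 examples per second per process is consistent with more than 1k images per second on 32 cores.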

