


DeOCR

DeOCR (de-cor) is a reverse OCR tool that renders Hugging Face-compatible datasets as configurable images (e.g., custom size `512x512`, black background, padding, margins). It can serve as a text-to-image data pre-processing component in pipelines such as DeepSeek-OCR.
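
To make the configurable options above concrete, here is a minimal sketch of how a size, background, padding, and margin could be expressed as an HTML page styled with CSS, which a headless browser could then screenshot. This is an illustration only and not DeOCR's actual API: `RenderStyle`, `build_page`, and every field name are hypothetical.

```python
from dataclasses import dataclass
from html import escape


@dataclass(frozen=True)
class RenderStyle:
    """Hypothetical rendering options; not DeOCR's real configuration schema."""

    width: int = 512
    height: int = 512
    background: str = "black"
    color: str = "white"
    padding_px: int = 16
    margin_px: int = 0


def build_page(text: str, style: RenderStyle = RenderStyle()) -> str:
    """Return a standalone HTML document that lays out `text` on a fixed-size canvas."""
    css = (
        "html, body { margin: 0; }\n"
        # border-box keeps the canvas at exactly width x height, padding included.
        f"#canvas {{ box-sizing: border-box; "
        f"width: {style.width}px; height: {style.height}px; "
        f"margin: {style.margin_px}px; padding: {style.padding_px}px; "
        f"background: {style.background}; color: {style.color}; "
        "overflow: hidden; white-space: pre-wrap; overflow-wrap: anywhere; }"
    )
    # escape() prevents the input text from being interpreted as markup.
    return (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="utf-8"><style>{css}</style></head>\n'
        f'<body><div id="canvas">{escape(text)}</div></body></html>'
    )


if __name__ == "__main__":
    print(build_page("some context in text form"))
```

Keeping the style in one immutable object means the same text can be rendered under several styles by swapping a single argument.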

---
title: DeOCR Usage in LLM Pipeline
---
flowchart LR
  TEXTDATA[/"some context in text form"/]
  MMDATA[/"Does this particular car <br/> &lt;image&gt; present in here &lt;image&gt; ?"/]
  HFDATASET[("huggingface dataset")] 
  subgraph DeOCR
    CSS1["cli --style red-text textit"]
    CSS2["cli --style default"]
    CSS3["cli --style default"]
    MAPPER["DeOCR Dataset Mapper"]
  end
  TEXTDATA --> CSS1 --> IMG1[["some context in text form"]]:::redText
  TEXTDATA --> CSS2 --> IMG2[["some context in text form"]]
  MMDATA --> CSS3 --> IMG3[["Does this particular car <br/> 🖼️🖼️🖼️🖼️🖼️🖼️🖼️<br/>🖼️🖼️🖼️🚗🖼️🖼️🖼️<br/>🖼️🖼️🖼️🖼️🖼️🖼️🖼️<br/> present in here <br/> 🖼️🖼️🖼️🖼️🖼️🖼️🖼️<br/>🖼️🖼️🖼️🖼️🖼️🖼️🖼️<br/>🖼️🖼️🖼️🖼️🖼️🖼️🖼️<br/>?"]]
  HFDATASET --> MAPPER --> DEOCRDATASET[("🖼️ imagified dataset")]
  DEOCRDATASET & IMG1 & IMG2 & IMG3 -.-> MODEL["LLMs or VLMs<br/> Evaluation"]
  classDef redText color:#ff0000,font-style:italic;
  IMG1 ~~~|"fa:fa-mobile-screen A screenshot of text <br/>w. special formatting"| IMG1
  IMG2 ~~~|"fa:fa-mobile-screen A plain screenshot of text"| IMG2
  IMG3 ~~~|"fa:fa-mobile-screen A screenshot of both text and images"| IMG3
Here is an example output, sized `512x512`, with a random string as the context:

[Image: a 512x512 example]
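
The diagram's multimodal input mixes text with `<image>` placeholders. A minimal sketch of how such a prompt could be split into ordered text and image segments before rendering is shown below. The `interleave` helper, the segment format, and the file names are assumptions for illustration and are not taken from DeOCR's code.

```python
from typing import List, Sequence, Tuple

IMAGE_TOKEN = "<image>"  # placeholder as written in the diagram's multimodal example

Segment = Tuple[str, str]  # (kind, value), where kind is "text" or "image"


def interleave(prompt: str, images: Sequence[str]) -> List[Segment]:
    """Replace each IMAGE_TOKEN in `prompt` with the next entry of `images`, in order."""
    parts = prompt.split(IMAGE_TOKEN)
    placeholders = len(parts) - 1
    if placeholders != len(images):
        raise ValueError(
            f"prompt has {placeholders} placeholder(s) but {len(images)} image(s) were given"
        )
    segments: List[Segment] = []
    for index, part in enumerate(parts):
        text = part.strip()
        if text:  # skip empty text between adjacent placeholders
            segments.append(("text", text))
        if index < len(images):  # every part except the last is followed by an image
            segments.append(("image", images[index]))
    return segments


if __name__ == "__main__":
    prompt = "Does this particular car <image> present in here <image> ?"
    for kind, value in interleave(prompt, ["car.png", "scene.png"]):
        print(kind, value)
```

Raising on a placeholder/image count mismatch surfaces malformed dataset rows early instead of silently dropping an image.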

Quick Start

pip install deocr[playwright,pymupdf]
# activate your python environment, then install playwright deps
playwright install chromium
Alternatively, install from source:
# uv
uv add "deocr[playwright,pymupdf] @ git+https://github.com/Moenupa/DeOCR.git"
# activate your python environment, then install playwright deps
playwright install chromium
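
After either install route, a quick sanity check can confirm that the package and its optional extras are importable. The sketch below uses only the standard library. The import names are assumptions: `deocr` is presumed to match the distribution name, and PyMuPDF is importable as `pymupdf` in recent releases (older releases expose it as `fitz`).

```python
from importlib.util import find_spec
from typing import Dict, Iterable

# Assumed top-level import names; adjust if your installed versions differ.
DEFAULT_MODULES = ("deocr", "playwright", "pymupdf")


def check_modules(names: Iterable[str] = DEFAULT_MODULES) -> Dict[str, bool]:
    """Map each top-level module name to whether it can be found, without importing it."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for module, available in check_modules().items():
        print(f"{module}: {'ok' if available else 'missing'}")
```

Note that this only checks Python packages; it does not verify that the Chromium browser fetched by `playwright install chromium` is present.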
For development

Please use uv to manage the environment:

git clone https://github.com/Moenupa/DeOCR.git
cd DeOCR
uv venv
uv sync --all-extras --all-groups
source .venv/bin/activate
playwright install chromium
pre-commit install
Known Issues
