
Towards OCR-2.0.


General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model

Haoran Wei*, Chenglong Liu*, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, Zheng Ge, Liang Zhao, Jianjian Sun, Yuang Peng, Chunrui Han, Xiangyu Zhang

Release

  • [2024/9/03]🔥🔥🔥 We open-source the code, weights, and benchmarks. The paper can be found in this repo. We have also submitted it to arXiv.
  • [2024/9/03]🔥🔥🔥 We release the OCR-2.0 model GOT!

Code License | Data License

Usage and License Notices: The data, code, and checkpoint are intended and licensed for research use only. They are further restricted to uses that comply with the license agreement of Vary.

Community contributions

We encourage everyone to develop GOT applications based on this repo. Thanks for the following contributions:

Colab of GOT ~ contributor: @Zizhe Wang

CPU version of GOT ~ contributor: @ElvisClaros

Contents

  • Install
  • GOT Weights
  • Demo
  • Train
  • Eval
  • Contact
  • Acknowledgement
  • Citation


Install

  1. Our environment is CUDA 11.8 + torch 2.0.1.
  2. Clone this repository and navigate to the GOT folder
git clone https://github.com/Ucas-HaoranWei/GOT-OCR2.0.git
cd 'the GOT folder'
  3. Install Package
conda create -n got python=3.10 -y
conda activate got
pip install -e .
  4. Install Flash-Attention
pip install ninja
pip install flash-attn --no-build-isolation
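After these steps, a quick sanity check (a minimal sketch, assuming it is run inside the got environment) can confirm that torch sees the GPU and that flash-attn built correctly:

# Sanity check for the install above: torch should report a CUDA device, and
# importing flash_attn without errors means the extension compiled successfully.
import torch
import flash_attn  # the import alone confirms the build
print(torch.__version__, torch.cuda.is_available())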

GOT Weights

Demo

  1. plain texts OCR:
python3 GOT/demo/run_ocr_2.0.py  --model-name  /GOT_weights/  --image-file  /an/image/file.png  --type ocr
  2. format texts OCR:
python3 GOT/demo/run_ocr_2.0.py  --model-name  /GOT_weights/  --image-file  /an/image/file.png  --type format
  3. fine-grained OCR:
python3 GOT/demo/run_ocr_2.0.py  --model-name  /GOT_weights/  --image-file  /an/image/file.png  --type format/ocr --box [x1,y1,x2,y2]
python3 GOT/demo/run_ocr_2.0.py  --model-name  /GOT_weights/  --image-file  /an/image/file.png  --type format/ocr --color red/green/blue
  4. multi-crop OCR:
python3 GOT/demo/run_ocr_2.0_crop.py  --model-name  /GOT_weights/ --image-file  /an/image/file.png 
  5. multi-page OCR (the image path contains multiple .png files):
python3 GOT/demo/run_ocr_2.0_crop.py  --model-name  /GOT_weights/ --image-file  /images/path/  --multi-page
  6. render the formatted OCR results:
python3 GOT/demo/run_ocr_2.0.py  --model-name  /GOT_weights/  --image-file  /an/image/file.png  --type format --render

Note: The rendered results are saved to /results/demo.html. Open demo.html in a browser to view them.
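If you prefer to open the rendered page from a script, a minimal Python sketch (assuming the command above was run from the repository root, so the output sits under ./results/) is:

# Open the rendered OCR results in the default browser.
import pathlib
import webbrowser
webbrowser.open(pathlib.Path("results/demo.html").resolve().as_uri())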

Train

  1. A train sample can be found here. Note that the '<image>' token in the 'conversations'-'human'-'value' field is necessary! (A minimal sketch of the sample format is given after the training command below.)
  2. This codebase only supports post-training (stage-2/stage-3) on top of our GOT weights.
  3. If you want to train from stage-1 as described in our paper, you need this repo.
deepspeed   /GOT-OCR-2.0-master/GOT/train/train_GOT.py \
 --deepspeed /GOT-OCR-2.0-master/zero_config/zero2.json    --model_name_or_path /GOT_weights/ \
 --use_im_start_end True   \
 --bf16 True   \
 --gradient_accumulation_steps 2    \
 --evaluation_strategy "no"   \
 --save_strategy "steps"  \
 --save_steps 200   \
 --save_total_limit 1   \
 --weight_decay 0.    \
 --warmup_ratio 0.001     \
 --lr_scheduler_type "cosine"    \
 --logging_steps 1    \
 --tf32 True     \
 --model_max_length 8192    \
 --gradient_checkpointing True   \
 --dataloader_num_workers 8    \
 --report_to none  \
 --per_device_train_batch_size 2    \
 --num_train_epochs 1  \
 --learning_rate 2e-5   \
 --datasets pdf-ocr+scence \
 --output_dir /your/output.path
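For reference, here is a minimal sketch of what one training sample may look like, written as a Python dict. Only the 'conversations'/'human'/'value' structure and the required '<image>' token come from the note above; the other field names and the prompt text are illustrative assumptions, so check the linked train sample for the real schema.

# Illustrative training sample; field names other than 'conversations', 'from', 'value'
# and the '<image>' token are assumptions.
sample = {
    "image": "path/to/page_0001.png",
    "conversations": [
        {"from": "human", "value": "<image>\nOCR with format: "},  # '<image>' is required
        {"from": "gpt", "value": "recognized or formatted text goes here"},
    ],
}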

Note:

  1. Change the corresponding data information in constant.py (a hypothetical sketch of such an entry is shown below).
  2. Change line 37 in conversation_dataset_qwen.py to your data_name.
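As a shape reference only, a dataset entry registered in constant.py might look roughly like the following. The variable and field names here are hypothetical; mirror whatever the existing entries in constant.py actually use.

# Hypothetical sketch of registering a dataset in constant.py; the real variable and
# field names may differ -- copy the pattern of the existing entries instead.
CONVERSATION_DATA = {
    "my-pdf-ocr": {  # this key is the data_name you pass via --datasets
        "images": "/path/to/images/",
        "annotations": "/path/to/annotations.json",
    },
}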

Eval

  1. We use the Fox and OneChart benchmarks, and other benchmarks can be found in the weights download link.
  2. The eval codes can be found in GOT/eval.
  3. You can use evaluate_GOT.py to run the eval. If you have 8 GPUs, --num-chunks can be set to 8.
python3 GOT/eval/evaluate_GOT.py --model-name /GOT_weights/ --gtfile_path xxxx.json --image_path  /image/path/ --out_path /data/eval_results/GOT_mathpix_test/ --num-chunks 8 --datatype OCR

Contact

If you are interested in this work or have questions about the code or the paper, please join our WeChat communication group.

Acknowledgement

  • Vary: the codebase we built upon!
  • Qwen: the LLM base model of Vary, which is good at both English and Chinese!

Citation

@article{wei2024general,
  title={General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model},
  author={Wei, Haoran and Liu, Chenglong and Chen, Jinyue and Wang, Jia and Kong, Lingyu and Xu, Yanming and Ge, Zheng and Zhao, Liang and Sun, Jianjian and Peng, Yuang and others},
  journal={arXiv preprint arXiv:2409.01704},
  year={2024}
}
@article{wei2023vary,
  title={Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models},
  author={Wei, Haoran and Kong, Lingyu and Chen, Jinyue and Zhao, Liang and Ge, Zheng and Yang, Jinrong and Sun, Jianjian and Han, Chunrui and Zhang, Xiangyu},
  journal={arXiv preprint arXiv:2312.06109},
  year={2023}
}
