Python Tamil OCR package
OCR Tamil - Easy, Accurate and Simple to use Tamil OCR
OCR Tamil extracts text from signboards, nameplates, storefronts and similar objects in natural scenes with high accuracy. It is much more robust to tilted text than Tesseract, Paddle OCR and Easy OCR, which are built primarily for document text rather than natural scenes. The model is a work in progress, so feel free to contribute!
It currently supports two languages (English and Tamil). Accuracy can be improved by adjusting the text detection model to suit your requirements. The model achieves over 95% accuracy (98% NED) on the validation set.
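The source does not say how the NED figure was computed. As a rough sketch, assuming NED refers to the normalized edit distance commonly reported for scene text recognition (as 1 - NED, so higher is better), a per-word score could be computed as follows. The function names `edit_distance` and `ned_score` are illustrative and are not part of the `ocr_tamil` package.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def ned_score(prediction: str, truth: str) -> float:
    """Return 1 - normalized edit distance (1.0 means an exact match).

    Note: Python strings are compared code point by code point, so a Tamil
    consonant plus its vowel sign counts as more than one symbol.
    """
    if not prediction and not truth:
        return 1.0
    longest = max(len(prediction), len(truth))
    return 1.0 - edit_distance(prediction, truth) / longest


# One wrong character out of ten still scores highly, whereas exact-match
# accuracy would count the whole word as wrong.
print(ned_score("Kodaikanal", "Kodaikanol"))
```

Averaging this score over a validation set gives a figure that is more forgiving than exact-match accuracy, which would be consistent with the NED value quoted above being higher than the accuracy value.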
Comparison between Tesseract OCR and OCR Tamil
Input Image | OCR Tamil | Tesseract |
---|---|---|
(image) | வாழ்கவளமுடன் | க் க்கஸாரகளள௮ஊகஎளமுடன் |
(image) | ரெடிமேட்ஸ் | NO OUTPUT |
(image) | கோபி | NO OUTPUT |
(image) | தாம்பரம் | NO OUTPUT |
(image) | நெடுஞ்சாலைத் | NO OUTPUT |
(image) | அண்ணாசாலை | NO OUTPUT |
The Tesseract results were obtained using the Hugging Face space with Tamil selected as the language.
How to Install and Use OCR Tamil
Tested using Python 3.10 on Windows & Linux (Ubuntu 22.04) Machines
Pip
- Install using pip
pip install ocr_tamil
Text Recognition
- Use the code below for word-level text recognition, setting image_path to the path of your image
from ocr_tamil.ocr import OCR
image_path = r"test_images\1.jpg"  # insert your own path here
ocr = OCR()  # recognition only, at word level
texts = ocr.predict(image_path)
# Write as UTF-8 so the Tamil characters are saved correctly
with open("output.txt", "w", encoding="utf-8") as f:
    f.write(texts)
# Output: நெடுஞ்சாலைத்
Text Detect + Recognition
- Use the code below for text detection and recognition, setting image_path to the path of your image
from ocr_tamil.ocr import OCR
image_path = r"test_images\0.jpg"  # insert your own path here
ocr = OCR(detect=True)  # detect text regions first, then recognize each one
texts = ocr.predict(image_path)
# Write as UTF-8 so the Tamil characters are saved correctly
with open("output.txt", "w", encoding="utf-8") as f:
    f.write(texts)
# Output: கொடைக்கானல் Kodaikanal
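The examples above write one prediction for one image. The following is a minimal sketch of saving predictions for several images into a single UTF-8 file. It assumes only that `predict` accepts an image path; because the source does not document the return type, the hypothetical helper `to_text` accepts either a string or a (possibly nested) list of strings. Neither `to_text` nor `save_predictions` is part of the `ocr_tamil` package.

```python
from pathlib import Path


def to_text(prediction) -> str:
    """Return a prediction as one string, whether it is a str or a list of strs."""
    if isinstance(prediction, str):
        return prediction
    return " ".join(to_text(item) for item in prediction)


def save_predictions(predict, image_paths, output_path):
    """Run predict on every image and write one 'path<TAB>text' line per image."""
    lines = []
    for image_path in image_paths:
        text = to_text(predict(str(image_path)))
        lines.append(f"{image_path}\t{text}")
    # UTF-8 keeps the Tamil characters intact regardless of the OS default encoding
    Path(output_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


# Intended use with the package (shown as a comment, not executed here):
#   from ocr_tamil.ocr import OCR
#   ocr = OCR(detect=True)
#   save_predictions(ocr.predict, ["test_images/0.jpg"], "output.txt")
```

Passing the prediction function in as an argument keeps the helper independent of how the OCR object is constructed, so the same code works for recognition only and for detection plus recognition.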
Github
- Clone the repository
- Install the required modules using pip
pip install -r requirements.txt
- Run the code below, setting image_path to the path of your image
Text Recognition
from ocr_tamil.ocr import OCR
image_path = r"test_images\1.jpg"  # insert your own path here
ocr = OCR()  # recognition only, at word level
texts = ocr.predict(image_path)
# Write as UTF-8 so the Tamil characters are saved correctly
with open("output.txt", "w", encoding="utf-8") as f:
    f.write(texts)
# Output: நெடுஞ்சாலைத்
Text Detect + Recognition
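from ocr_tamil.ocr import OCR
image_path = r"test_images\0.jpg"  # insert your own path here
ocr = OCR(detect=True)  # detect text regions first, then recognize each one
texts = ocr.predict(image_path)
# Write as UTF-8 so the Tamil characters are saved correctly
with open("output.txt", "w", encoding="utf-8") as f:
    f.write(texts)
# Output: கொடைக்கானல் Kodaikanal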
A detailed Medium tutorial can be found here.
A Hugging Face Spaces 🤗 demo can be found here.
Applications
- Navigating autonomous vehicles based on the signboards
- License plate recognition
Limitations
- Unable to read text that appears in rotated form
- Currently supports only English and Tamil
- Document text reading capability is limited. Automatic identification of paragraphs and lines is not supported, and the text detection model's inability to detect and crop Tamil text correctly reduces accuracy. (Workaround: use your own text detection model together with the OCR Tamil text recognition model.)
Cropped Text from Text detection Model
Character **இ** missing due to text detection model error
**?**யற்கை மூலிகைகளில் இருந்து ஈர்த்தெடுக்கக்கப்பட்ட விரிய உட்பொருட்களை உள்ளடக்கி எந்த இரசாயன சேர்க்கைகளும் **?**ல்லாமல் உருவாக்கப்பட்ட **?**ந்தியாவின் முதல் சித்த தயாரிப்பு
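When a custom text detection model is used as suggested in the workaround, one possible mitigation for clipped leading characters is to enlarge each detected box slightly before cropping. The sketch below is an assumption, not something the package is documented to do: the helper `pad_box`, the `(left, top, right, bottom)` box layout, and the default padding of 4 pixels are all illustrative.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def pad_box(box: Box, width: int, height: int, pad: int = 4) -> Box:
    """Grow a detected box by pad pixels on every side, clamped to the image."""
    left, top, right, bottom = box
    return (
        max(0, left - pad),
        max(0, top - pad),
        min(width, right + pad),
        min(height, bottom + pad),
    )


# A box that touches the right edge of a 52 x 100 image is grown on the other
# three sides and clamped on the right.
print(pad_box((10, 5, 50, 20), width=52, height=100))
```

The padded box can then be used to crop the image with whatever imaging library the detector already relies on, and the crop passed to the recognition model. Too much padding may pull in neighbouring words, so the value is worth tuning on your own images.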
Thanks to the contributors below for making awesome text detection and text recognition models
Text detection - CRAFT TEXT DETECTION
Text recognition - PARSEQ
@InProceedings{bautista2022parseq,
title={Scene Text Recognition with Permuted Autoregressive Sequence Models},
author={Bautista, Darwin and Atienza, Rowel},
booktitle={European Conference on Computer Vision},
pages={178--196},
month={10},
year={2022},
publisher={Springer Nature Switzerland},
address={Cham},
doi={10.1007/978-3-031-19815-1_11},
url={https://doi.org/10.1007/978-3-031-19815-1_11}
}
@inproceedings{baek2019character,
title={Character Region Awareness for Text Detection},
author={Baek, Youngmin and Lee, Bado and Han, Dongyoon and Yun, Sangdoo and Lee, Hwalsuk},
booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
pages={9365--9374},
year={2019}
}
CITATION
@InProceedings{GnanaPrasath,
title={Tamil OCR},
author={Gnana Prasath D},
month={01},
year={2024},
url={https://github.com/gnana70/tamil_ocr}
}