Skip to main content

Better text detection by combining OCR engines with LLM.

Project description

BetterOCR

PyPI MIT License

🔍 Better text detection by combining OCR engines with 🧠 LLM.

OCR still sucks! ... Especially when you're from the other side of the world (and face a significant lack of training data in your language) — or just not thrilled with noisy results.

BetterOCR combines results from multiple OCR engines with an LLM to correct & reconstruct the output.

  • 🔍 OCR Engines: Currently supports EasyOCR and Tesseract.
  • 🧠 LLM: Supports models from OpenAI.
  • 📒 Custom Context: Allows users to provide an optional context to use specific keywords such as proper nouns and product names. This assists in spelling correction and noise identification, ensuring accuracy even with rare or unconventional words.

Head over to 💯 Examples to view performace by languages (🇺🇸, 🇰🇷, 🇮🇳).

Coming Soon: improved interface, async support, box detection, and more.

Warning
This package is under rapid development 🛠

Architecture

🚀 Usage (WIP)

pip install betterocr
# pip3 install betterocr
import betterocr

text = betterocr.detect_text(
    "demo.png",
    ["ko", "en"], # language codes (from EasyOCR)
    context="", # (optional) context
    tesseract={
      # Tesseract options here
      "config": "--tessdata-dir ./tessdata"
    },
    openai={
      # OpenAI options here

      # `os.environ["OPENAI_API_KEY"]` is used by default
      "API_KEY": "sk-xxxxxxx",

      # rest are used to pass params to `client.chat.completions.create`
      # `{"model": "gpt-4"}` by default
      "model": "gpt-3.5-turbo",
    },
)
print(text)
  • image_path (str)
  • lang (list[str]): Use language codes from https://www.jaided.ai/easyocr, each represented as a string within the list.
  • context (str)
  • ...

💯 Examples

Note
Results may vary due to inherent variability and potential future updates to OCR engines or the OpenAI API.

Example 1 (English with Noise)

Source Text
EasyOCR CHAINSAWMANChapter 109:The Easy Way to Stop Bullying~BV-THTSUKIFUUIMUTU ETT
Tesseract A\ ira | LT ge a TE ay NS\nye SE F Pa Ce YI AIG 44\nopr See aC\n; a) Ny 7S =u |\n_ F2 SENN\n\ ZR\n3 ~ 1 A \ Ws —— “s 7 “A\n=) 24 4 = rt fl /1\n£72 7 a NS dA Chapter 109:77/ ¢ 4\nZz % = ~ oes os | \STheEasf Way.to Stop Bullying:\n© Wa) ROT\n\n
LLM 🤖 GPT-3.5
Result CHAINSAWMAN\nChapter 109: The Easy Way to Stop Bullying

Example 2 (Korean+English)

Source Text
EasyOCR JUST FOR YOU이런 분들께 추천드리는 퍼멘테이선 팬타인 아이켜어 크림매일매일 진해지논 다크서클올 개선하고 싶다면축축 처지논 피부름 탄력 잇게 바꾸고 싶다면나날이 늘어가는 눈가 주름올 완화하고 싶다면FERMENATION민감성 피부에도 사용할 수잇는 아이크림올 찾는다면얇고 예민한 눈가 주변 피부름 관리하고 싶다면
Tesseract 9051 508 \ㅇ4\n이런 분들께 추천드리는 퍼멘테이션 타인 아이케어 크림\n.매일매일 진해지는 다크서클을 개선하고 싶다면 "도\nㆍ축축 처지는 피부를 탄력 있게 바꾸고 싶다면 7\nㆍ나날이 늘어가는 눈가 주름을 완화하고 싶다면 /\n-민감성 피부에도 사용할 수 있는 아이크림을 찾는다면 (프\nㆍ않고 예민한 눈가 주변 피부를 관리하고 싶다면 밸\n\n
LLM 🤖 GPT-3.5
Result JUST FOR YOU 이런 분들께 추천드리는 퍼멘테이션 팬타인 아이케어 크림\n매일매일 진해지는 다크서클을 개선하고 싶다면\n축축 처지는 피부를 탄력 있게 바꾸고 싶다면\n나날이 늘어가는 눈가 주름을 완화하고 싶다면\n민감성 피부에도 사용할 수 있는 아이크림을 찾는다면\n얇고 예민한 눈가 주변 피부를 관리하고 싶다면

Example 3 (Korean with custom context)

Source Text
EasyOCR 바이오함보#세로모공존존세럼6글로우픽 설문단 100인이꼼꼼하게 평가햇어요"#누적 판매액 40억#제품만족도 1009
Tesseract 바이오힐보\n#세로모공폰폰세럼\n“글로 으피 석무다 1 00인이\n꼼꼼하게평가했어요”\n\n
Context [바이오힐보] 세로모공쫀쫀세럼으로 콜라겐 타이트닝! (6S)
LLM 🤖 GPT-4
Result 바이오힐보\n#세로모공쫀쫀세럼\n글로우픽 설문단 100인이 꼼꼼하게 평가했어요\n#누적 판매액 40억\n#제품만족도 100%

🧠 LLM Reasoning

Based on the given OCR results and the context, here is the combined and corrected result:

{
  "data": "바이오힐보\n#세로모공쫀쫀세럼\n글로우픽 설문단 100인이 꼼꼼하게 평가했어요\n#누적 판매액 40억\n#제품만족도 100%"
}
  • 바이오힐보 is the correct brand name, taken from [1] and the context.
  • #세로모공쫀쫀세럼 seems to be the product name and is derived from the context.
  • 글로우픽 설문단 100인이 꼼꼼하게 평가했어요 is extracted and corrected from both OCR results.
  • #누적 판매액 40억 is taken from [0].
  • #제품만족도 100% is corrected from [0].

Example 4 (Hindi)

Source Text
EasyOCR `७नवभारतटाइम्सतोक्यो ओलिंपिक के लिए भारतीय दलका थीम सॉन्ग लॉन्च कर दिया गयाबुधवार को इस सॉन्ग को किया गया लॉन्चसिंगर मोहित चौहान ने दी है आवाज7लखेल मंत्री किरण रिजिजू ने ट्विटर पर शेयरकिया थीम सॉन्ग का वीडियो0ब४0 २०२०गीत का नाम- '्लक्ष्य तेरा सामने है' , खेलमंत्री ने ५७ सेकंड का वीडियो किया शेयर
Tesseract '8ा.\nनवभोरत टैइम्स\n\nतोक्यो ओलिंपिक के लिंए भारतीय दल\n\nका थीम सॉन्ग लॉन्च कर दिया गया\n\nबुधवार को हस सॉन्ग को किया गया लॉन्च\nसिंगर मोहित चौहान ने दी है आवाज\n\nखेल मंत्री किरण रिजिजू ने द्विटर पर शेयर\nकिया थीम सॉन्ग का वीडियो\n\nपृ 0 (९ है 0 2 0 2 0 गीत का नाम- 'लक्ष्य तेरा सामने है', खेल\n\n(2 (9९) मंत्री ने 57 सेकंड का वीडियो किया शेयर\n\n
LLM 🤖 GPT-4
Result नवभारत टाइम्स\nतोक्यो ओलिंपिक के लिए भारतीय दल का थीम सॉन्ग लॉन्च कर दिया गया\nबुधवार को इस सॉन्ग को किया गया लॉन्च\nसिंगर मोहित चौहान ने दी है आवाज\n\nखेल मंत्री किरण रिजिजू ने ट्विटर पर शेयर किया थीम सॉन्ग का वीडियो\n2020 गीत का नाम- 'लक्ष्य तेरा सामने है', खेल मंत्री ने 57 सेकंड का वीडियो किया शेयर

License (Starmie!)

MIT © Junho Yeo

If you find this project interesting, please consider giving it a star(⭐) and following me on GitHub. I code 24/7 and ship mind-breaking things on a regular basis, so your support definitely won't be in vain!

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

betterocr-1.0.1.tar.gz (10.6 kB view details)

Uploaded Source

Built Distribution

betterocr-1.0.1-py3-none-any.whl (9.5 kB view details)

Uploaded Python 3

File details

Details for the file betterocr-1.0.1.tar.gz.

File metadata

  • Download URL: betterocr-1.0.1.tar.gz
  • Upload date:
  • Size: 10.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.6.1 CPython/3.11.6 Darwin/22.1.0

File hashes

Hashes for betterocr-1.0.1.tar.gz
Algorithm Hash digest
SHA256 d1c963817b81b64de6c767ca284e4ec98c42306b98e4f67fb15cca6f3ac49bc4
MD5 faf6d1c6a26eb1a36aa9d958e45e9851
BLAKE2b-256 7c7ed583d9e257b214325f8447b487b0acf8396683f00ca2da056382c31b1eca

See more details on using hashes here.

File details

Details for the file betterocr-1.0.1-py3-none-any.whl.

File metadata

  • Download URL: betterocr-1.0.1-py3-none-any.whl
  • Upload date:
  • Size: 9.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.6.1 CPython/3.11.6 Darwin/22.1.0

File hashes

Hashes for betterocr-1.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 8741714b3b43158bc031e231b83582af9701f412dd53708320cf38c867980cf1
MD5 9c3e69568c77d9c086b6928f5d797d50
BLAKE2b-256 2abda759d7277da89302069e15324affdd6dd52da92bdf58f8f8d099dd534f79

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page